Before starting, please review the [general project directions](../projects.md).

## Corrections/Clarifications

* Mar 5: A hint about HDFS environment variables added; a dataflow diagram added; some minor typos fixed.

## Introduction

You'll need to deploy a system including 6 docker containers like this:

<img src="arch.png" width=600>

The data flow roughly follows this:
<img src="dataflow.png" width=600>

We have provided the other components; you only need to complete the work within the gRPC server and its Dockerfile.

### Client

This project will use `docker exec -it` to run the client on the gRPC server's container. Usage of `client.py` is as follows:

## Part 1: `DbToHdfs` gRPC Call

In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).

**DbToHdfs:** To be more specific, you need to:

1. Connect to the SQL server, using `CS544` as the database name and `abc` as the password. There are two tables in the database: `loans` and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the `loan_type` column of the `loans` table to their corresponding loan types. There should be **447367** rows in table `loans`. It should look like this:
```mysql
mysql> show tables;
+-----------------+
| Tables_in_CS544 |
+-----------------+
| loan_types      |
| loans           |
+-----------------+
mysql> select count(*) from loans;
+----------+
| count(*) |
+----------+
|   447367 |
+----------+
```
2. What are the actual types for those loans?
   Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Filter all rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in the HDFS, with **2x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html). A sketch of these steps follows this list.
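
Here is that sketch. It is only a hedged outline, not the required implementation: the SQL user (`root`), the host names (`mysql` for the database, `boss` for the HDFS namenode), the ports, and the `loan_types` join column (`id`) are all assumptions; match them to your own compose file and schema.

```python
# Sketch of DbToHdfs. Host names, user, ports, and the loan_types key
# column are assumptions; only the database name, password, filter
# bounds, and output path come from the spec above.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy
from pyarrow import fs

def db_to_hdfs():
    # Step 1: connect to MySQL (database CS544, password abc).
    engine = sqlalchemy.create_engine(
        "mysql+mysqlconnector://root:abc@mysql:3306/CS544")

    # Steps 2-3: inner join to add loan_type_name, then filter loan_amount.
    df = pd.read_sql("""
        SELECT loans.*, loan_types.loan_type_name
        FROM loans
        INNER JOIN loan_types ON loans.loan_type_id = loan_types.id
        WHERE loan_amount > 30000 AND loan_amount < 800000
    """, engine)

    # Step 4: write the parquet file to HDFS with 2x replication and a
    # 1-MB block size.
    hdfs = fs.HadoopFileSystem("boss", 9000, replication=2,
                               default_block_size=1024 * 1024)
    with hdfs.open_output_stream("/hdma-wi-2021.parquet") as f:
        pq.write_table(pa.Table.from_pandas(df), f)
```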

## Part 2: `BlockLocations` gRPC Call

Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
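
For example, here is a minimal sketch of that WebHDFS call; the namenode host (`boss`) and HTTP port (`9870`, the Hadoop 3.x default) are assumptions, so use whatever your cluster actually exposes:

```python
# Sketch: ask WebHDFS which datanodes hold each block of the uploaded
# file. Host name and port are assumptions.
import requests

resp = requests.get(
    "http://boss:9870/webhdfs/v1/hdma-wi-2021.parquet",
    params={"op": "GETFILEBLOCKLOCATIONS"},
)
resp.raise_for_status()

# Each BlockLocation entry reports the hosts storing one block, plus
# its offset and length within the file.
for block in resp.json()["BlockLocations"]["BlockLocation"]:
    print(block["hosts"], block["offset"], block["length"])
```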

**Hint:** You have to set the `CLASSPATH` environment variable appropriately to access HDFS correctly. See the example [here](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/lec/18-hdfs/notebook.Dockerfile?ref_type=heads).
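
One common way to populate it (a sketch; the linked Dockerfile may instead bake the value in at build time) is to run `hdfs classpath --glob` before the first use of `pyarrow.fs.HadoopFileSystem`:

```python
# Sketch: set CLASSPATH before libhdfs is first loaded. Assumes the
# `hdfs` binary from the Hadoop distribution is on PATH in the container.
import os
import subprocess

os.environ["CLASSPATH"] = subprocess.check_output(
    ["hdfs", "classpath", "--glob"]).decode().strip()
```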

## Part 3: `CalcAvgLoan` gRPC Call

In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).