@@ -14,14 +14,18 @@ Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
* none yet
* Mar 5: Added a hint about HDFS environment variables and a dataflow diagram; fixed some minor typos.
## Introduction
You'll need to deploy a system including 5 docker containers like this:
You'll need to deploy a system including 6 docker containers like this:
The data flow roughly follows this:
We have provided the other components; you only need to complete the gRPC server and its Dockerfile.
### Client
This project will use `docker exec -it` to run the client on the gRPC server's container. Usage of `client.py` is as follows:
@@ -71,23 +75,23 @@ In this part, your task is to implement the `DbToHdfs` gRPC call (you can find t
**DbToHdfs:** To be more specific, you need to:
1. Connect to the SQL server, using the database name `CS544` and the password `abc`. There are two tables in the database: `loans` and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the `loan_type` column of the `loans` table to their corresponding loan types. There should be **447367** rows in the `loans` table. For example:
mysql> show tables;
| Tables_in_CS544 |
| loan_types |
| loans |
mysql> select count(*) from new_table;
| count(*) |
| 426716 |
2. What are the actual types for those loans?
Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, whose value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
mysql> show tables;
| Tables_in_CS544 |
| loan_types |
| loans |
mysql> select count(*) from loans;
| count(*) |
| 447367 |
2. What are the actual types for those loans?
Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, whose value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Keep only the rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, the table should have exactly **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in HDFS, with **2x** replication and a **1-MB** block size, using [PyArrow](https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html).
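
The join, filter, and upload steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the reference solution: an in-memory `sqlite3` database stands in for the MySQL server so the query can run anywhere, the `id` column of `loan_types` and the sample rows are made up, and the `host` and `port` parameters of the hypothetical `upload_to_hdfs` helper depend on your deployment.

```python
import sqlite3


def join_and_filter(conn):
    """Inner-join loans with loan_types and keep 30,000 < loan_amount < 800,000."""
    query = """
        SELECT l.*, t.loan_type_name
        FROM loans AS l
        INNER JOIN loan_types AS t ON l.loan_type_id = t.id  -- `id` is an assumed column name
        WHERE l.loan_amount > 30000 AND l.loan_amount < 800000
    """
    return conn.execute(query).fetchall()


def upload_to_hdfs(table, host, port):
    """Write a pyarrow.Table to HDFS with 2x replication and 1-MB blocks.

    `host` and `port` are placeholders for your NameNode address.
    Requires pyarrow and a correctly configured Hadoop client.
    """
    from pyarrow import fs
    import pyarrow.parquet as pq

    hdfs = fs.HadoopFileSystem(host, port, replication=2,
                               default_block_size=1024 * 1024)
    pq.write_table(table, "/hdma-wi-2021.parquet", filesystem=hdfs)


# Tiny stand-in dataset (made-up values) to exercise the query locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loan_types (id INTEGER PRIMARY KEY, loan_type_name TEXT)")
conn.execute("CREATE TABLE loans (loan_amount INTEGER, loan_type_id INTEGER)")
conn.executemany("INSERT INTO loan_types VALUES (?, ?)",
                 [(1, "Conventional"), (2, "FHA-insured")])
conn.executemany("INSERT INTO loans VALUES (?, ?)",
                 [(30000, 1), (150000, 1), (250000, 2), (800000, 2), (900000, 1)])

rows = join_and_filter(conn)
print(len(rows))  # both bounds are strict, so 30000 and 800000 are excluded
```

Against the real database, the same query should return the 426716 rows mentioned in step 3; checking that count before uploading is a cheap sanity test.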
@@ -124,6 +128,8 @@ The documents [here](https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/h
Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
**Hint:** You have to set the `CLASSPATH` environment variable appropriately to access HDFS correctly. See example [here](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/lec/18-hdfs/notebook.Dockerfile?ref_type=heads).
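
As a rough illustration of working with `GETFILEBLOCKLOCATIONS`, the sketch below builds the WebHDFS request URL and tallies block replicas per host from a parsed JSON response. The helper names, the NameNode address `nn:9870`, and the DataNode names in the sample response are all hypothetical; the response layout (`BlockLocations` → `BlockLocation` → `hosts`) follows the WebHDFS documentation, but verify it against what your cluster actually returns.

```python
from collections import Counter


def block_locations_url(namenode, path):
    """Build the WebHDFS URL for a GETFILEBLOCKLOCATIONS request.

    `namenode` is "host:port" of the NameNode web endpoint (an assumption
    about your deployment); `path` is an absolute HDFS path.
    """
    return f"http://{namenode}/webhdfs/v1{path}?op=GETFILEBLOCKLOCATIONS"


def blocks_per_host(response):
    """Count how many block replicas each host stores, given the parsed JSON."""
    counts = Counter()
    for block in response["BlockLocations"]["BlockLocation"]:
        for host in block["hosts"]:
            counts[host] += 1
    return dict(counts)


# Made-up response for a two-block file with 2x replication.
sample_response = {
    "BlockLocations": {
        "BlockLocation": [
            {"offset": 0, "length": 1048576, "hosts": ["dn1", "dn2"]},
            {"offset": 1048576, "length": 1048576, "hosts": ["dn1", "dn3"]},
        ]
    }
}

url = block_locations_url("nn:9870", "/hdma-wi-2021.parquet")
counts = blocks_per_host(sample_response)
print(url)
print(counts)
```

In a real call, you would fetch `url` with an HTTP client and pass the decoded JSON body to `blocks_per_host`.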
## Part 3: `CalcAvgLoan` gRPC Call
In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).