Commit dec05841 authored by WEICHU YANG

dataflow diagram added, minor typos fixed.

parent 8ad99870
......@@ -14,14 +14,18 @@ Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
-* none yet
+* Mar 5: A hint about HDFS environment variables added; a dataflow diagram added; some minor typos fixed.
## Introduction
-You'll need to deploy a system including 5 docker containers like this:
+You'll need to deploy a system including 6 docker containers like this:
<img src="arch.png" width=600>
+The data flow roughly follows this:
+<img src="dataflow.png" width=600>
We have provided the other components; you only need to complete the work within the gRPC server and its Dockerfile.
### Client
This project will use `docker exec -it` to run the client on the gRPC server's container. Usage of `client.py` is as follows:
......@@ -71,23 +75,23 @@ In this part, your task is to implement the `DbToHdfs` gRPC call (you can find t
**DbToHdfs:** To be more specific, you need to:
1. Connect to the SQL server, using `CS544` as the database name and `abc` as the password. There are two tables in the database: `loans` and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the `loan_type_id` column of the `loans` table to their corresponding loan types. There should be **447367** rows in table `loans`. For example:
-```mysql
-mysql> show tables;
-+-----------------+
-| Tables_in_CS544 |
-+-----------------+
-| loan_types |
-| loans |
-+-----------------+
-mysql> select count(*) from new_table;
-+----------+
-| count(*) |
-+----------+
-| 426716 |
-+----------+
-```
-2. What are the actual types for those loans?
-Perform an inner join on these two tables so that a new column `loan_type_name` added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
+```mysql
+mysql> show tables;
++-----------------+
+| Tables_in_CS544 |
++-----------------+
+| loan_types |
+| loans |
++-----------------+
+mysql> select count(*) from loans;
++----------+
+| count(*) |
++----------+
+| 447367 |
++----------+
+```
+2. What are the actual types for those loans?
+Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Filter all rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in the HDFS, with **2x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html).
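
The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reference solution: the key column `id` in `loan_types`, the `loan_id` column, the toy rows, and the helper names `fetch_filtered_loans`, `upload_to_hdfs`, and `demo` are all hypothetical. The query is exercised against an in-memory SQLite database as a stand-in for the MySQL server, and the HDFS host and port are left as parameters because the text does not give them.

```python
import sqlite3

# Steps 2-3: inner join plus range filter. `loan_type_id`, `loan_type_name`
# and `loan_amount` come from the text; `loan_types.id` is an assumed key name.
JOIN_FILTER_SQL = """
    SELECT loans.*, loan_types.loan_type_name
    FROM loans
    INNER JOIN loan_types ON loans.loan_type_id = loan_types.id
    WHERE loans.loan_amount > 30000 AND loans.loan_amount < 800000
"""

REPLICATION = 2            # 2x replication, as required in step 4
BLOCK_SIZE = 1024 * 1024   # 1 MB block size, expressed in bytes


def fetch_filtered_loans(conn):
    """Run the join/filter on any DB-API connection; return (columns, rows)."""
    cur = conn.cursor()
    cur.execute(JOIN_FILTER_SQL)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    cur.close()
    return columns, rows


def upload_to_hdfs(columns, rows, host, port, path="/hdma-wi-2021.parquet"):
    """Step 4: write the rows as Parquet to HDFS (needs pyarrow and a Hadoop client)."""
    # Imported lazily so the rest of the sketch runs without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(
        host, port, replication=REPLICATION, default_block_size=BLOCK_SIZE
    )
    # Build a column-oriented table from the row tuples.
    table = pa.table({name: [r[i] for r in rows] for i, name in enumerate(columns)})
    pq.write_table(table, path, filesystem=hdfs)


def demo():
    """Exercise the query on toy data in SQLite (a stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE loan_types (id INTEGER PRIMARY KEY, loan_type_name TEXT);
        CREATE TABLE loans (loan_id INTEGER, loan_type_id INTEGER, loan_amount INTEGER);
        INSERT INTO loan_types VALUES (1, 'Conventional'), (2, 'FHA-insured');
        INSERT INTO loans VALUES
            (1, 1, 30000),   -- dropped: not strictly greater than 30,000
            (2, 1, 150000),  -- kept
            (3, 2, 800000),  -- dropped: not strictly less than 800,000
            (4, 2, 799999),  -- kept
            (5, 9, 50000);   -- dropped by the inner join: no matching loan type
    """)
    result = fetch_filtered_loans(conn)
    conn.close()
    return result
```

Both bounds are strict, so rows at exactly 30,000 or 800,000 are excluded, and the inner join drops any loan whose `loan_type_id` has no match. Against the real database, the same function could be handed a MySQL DB-API connection, and the resulting row count compared with the expected **426716**.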
......
p4/dataflow.png

290 KiB
