Before starting, please review the [general project directions](../projects.md).

## Corrections/Clarifications

* Mar 5: A hint about HDFS environment variables added; a dataflow diagram added; some minor typos fixed.

## Introduction

You'll need to deploy a system including 6 docker containers like this:

<img src="arch.png" width=600>

The data flow roughly follows this:
<img src="dataflow.png" width=600>

We have provided the other components; you only need to complete the work within the gRPC server and its Dockerfile.

### Client

This project will use `docker exec -it` to run the client on the gRPC server's container. Usage of `client.py` is as follows:

## Part 1: `DbToHdfs` gRPC Call

In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).

**DbToHdfs:** To be more specific, you need to:

1. Connect to the SQL server, using `CS544` as the database name and `abc` as the password. There are two tables in the database: `loans` and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the `loan_type` column of the `loans` table to their corresponding loan types. There should be **447367** rows in table `loans`. It should look like this:
```mysql
mysql> show tables;
+-----------------+
| Tables_in_CS544 |
+-----------------+
| loan_types      |
| loans           |
+-----------------+
mysql> select count(*) from loans;
+----------+
| count(*) |
+----------+
|   447367 |
+----------+
```
2. What are the actual types for those loans?
   Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Filter all rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in the HDFS, with **2x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html). A sketch of these steps follows this list.
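
Here is that sketch. It is only a hedged outline, not the required implementation: the SQL user (`root`), the host names (`mysql` for the database, `boss` for the HDFS namenode), the ports, and the `loan_types` join column (`id`) are all assumptions; match them to your own compose file and schema.

```python
# Sketch of DbToHdfs. Host names, user, ports, and the loan_types key
# column are assumptions; only the database name, password, filter
# bounds, and output path come from the spec above.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy
from pyarrow import fs

def db_to_hdfs():
    # Step 1: connect to MySQL (database CS544, password abc).
    engine = sqlalchemy.create_engine(
        "mysql+mysqlconnector://root:abc@mysql:3306/CS544")

    # Steps 2-3: inner join to add loan_type_name, then filter loan_amount.
    df = pd.read_sql("""
        SELECT loans.*, loan_types.loan_type_name
        FROM loans
        INNER JOIN loan_types ON loans.loan_type_id = loan_types.id
        WHERE loan_amount > 30000 AND loan_amount < 800000
    """, engine)

    # Step 4: write the parquet file to HDFS with 2x replication and a
    # 1-MB block size.
    hdfs = fs.HadoopFileSystem("boss", 9000, replication=2,
                               default_block_size=1024 * 1024)
    with hdfs.open_output_stream("/hdma-wi-2021.parquet") as f:
        pq.write_table(pa.Table.from_pandas(df), f)
```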

## Part 2: `BlockLocations` gRPC Call

Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
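
For example, here is a minimal sketch of that WebHDFS call; the namenode host (`boss`) and HTTP port (`9870`, the Hadoop 3.x default) are assumptions, so use whatever your cluster actually exposes:

```python
# Sketch: ask WebHDFS which datanodes hold each block of the uploaded
# file. Host name and port are assumptions.
import requests

resp = requests.get(
    "http://boss:9870/webhdfs/v1/hdma-wi-2021.parquet",
    params={"op": "GETFILEBLOCKLOCATIONS"},
)
resp.raise_for_status()

# Each BlockLocation entry reports the hosts storing one block, plus
# its offset and length within the file.
for block in resp.json()["BlockLocations"]["BlockLocation"]:
    print(block["hosts"], block["offset"], block["length"])
```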

**Hint:** You have to set the `CLASSPATH` environment variable appropriately to access HDFS correctly. See the example [here](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/lec/18-hdfs/notebook.Dockerfile?ref_type=heads).
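
One common way to populate it (a sketch; the linked Dockerfile may instead bake the value in at build time) is to run `hdfs classpath --glob` before the first use of `pyarrow.fs.HadoopFileSystem`:

```python
# Sketch: set CLASSPATH before libhdfs is first loaded. Assumes the
# `hdfs` binary from the Hadoop distribution is on PATH in the container.
import os
import subprocess

os.environ["CLASSPATH"] = subprocess.check_output(
    ["hdfs", "classpath", "--glob"]).decode().strip()
```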

## Part 3: `CalcAvgLoan` gRPC Call

In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).