In this project, you will deploy a data system consisting of an SQL server, an HDFS cluster, a gRPC server, and a client. You will need to read and filter data from a SQL database and persist it to HDFS. Then, you will partition the data to enable efficient request processing. Additionally, you will need to write a fault-tolerant application that works even when an HDFS DataNode fails (we will test this scenario).
Learning objectives:
* communicate with the SQL Server using SQL queries
* use the WebHDFS API
* utilize PyArrow to interact with HDFS and process Parquet files
* handle data loss scenarios
Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
* none yet
## Introduction
...
...
You'll need to deploy a system including 5 Docker containers like this:
<img src="arch.png" width=600>
We have provided the other components; all you need to do is complete the work within the gRPC server and its Dockerfile.
### Client
This project will use `docker exec -it` to run the client in the gRPC server's container. Usage of `client.py` is as follows:
Take a look at the provided Docker Compose file. There are several services, including `datanodes` (with 3 replicas), a `namenode`, a `SQL server`, and a `gRPC Server`. The NameNode service will serve at the host `boss` within the Docker Compose network.
### gRPC
You are required to write a `server.py` and a `Dockerfile.server` based on the provided proto file (you may not modify the proto file).
The gRPC interfaces have already been defined (see `lender.proto` for details). There are no constraints on the return values of `DbToHdfs` and `PartitionByCounty`, so you may return what you think is helpful.
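If you have not written a gRPC server before, `server.py` will have roughly the shape sketched below. The module names `lender_pb2`/`lender_pb2_grpc`, the service name `Lender`, and the port are assumptions about what `grpcio-tools` generates from `lender.proto` and about what `client.py` expects; check the proto and the generated stubs for the real names.

```python
# Minimal sketch of server.py. Assumptions: the service in lender.proto is
# named Lender (so grpcio-tools generates lender_pb2_grpc.LenderServicer and
# add_LenderServicer_to_server), and the client connects on port 5000.
from concurrent import futures
import grpc
import lender_pb2
import lender_pb2_grpc

class LenderServicer(lender_pb2_grpc.LenderServicer):
    def DbToHdfs(self, request, context):
        # Part 1: read + filter the SQL data, then write a Parquet file to HDFS
        ...

    def PartitionByCounty(self, request, context):
        # partition the uploaded data by county for efficient request processing
        ...

    def BlockLocations(self, request, context):
        # Part 2: ask HDFS where the blocks of a file live
        ...

def main():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    lender_pb2_grpc.add_LenderServicer_to_server(LenderServicer(), server)
    server.add_insecure_port("[::]:5000")  # assumed port; match whatever client.py uses
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    main()
```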
### Docker image naming
...
...
export PROJECT=p4
```
**Hint 1:** The command `docker logs <container-name> -f` might be very useful for troubleshooting. It allows you to view real-time output from a specific container. If you don't want the output to occupy your terminal, you can drop `-f` to view the logs only up to the current point.
**Hint 2:** Think about whether a .sh script could help you quickly test code changes. For example, you may want it to rebuild your Dockerfiles, clean up the old Compose cluster, and deploy a new cluster.
**Hint 3:** If you're low on disk space, consider running `docker system prune -a --volumes -f`
<!--
**Hint 3:** You might find it really helpful to use the commands below to clean up the disk space occupied by Docker images/containers/networks/volumes during the development of this project.
```bash
docker image prune -a -f
docker container prune -f
...
...
docker network prune -f
docker volume prune -f
docker system prune -a --volumes -f  # equivalent to the combination of all commands above
```
-->
## Part 1: Load data from SQL Database to HDFS
...
...
In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).
2. What are the actual types for those loans?
Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Filter all rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in the HDFS, with **3x** replication, using PyArrow.
To check whether the upload was correct, you can use `docker exec -it` to enter the gRPC server's container and use the HDFS command `hdfs dfs -du -h <path>` to see the file size. The expected result is:
```
14.4 M 43.2 M hdfs://boss:9000/hdma-wi-2021.parquet
```
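Putting steps 2-4 together, a minimal sketch of the `DbToHdfs` logic might look like the following. The SQL hostname, credentials, and the join-key column in `loan_types` are placeholders/assumptions; only the database name `CS544`, the table and column names above, and the loan-amount filter come from this section. The retry loop and the HDFS connection setup are discussed further in the hints below.

```python
# Illustrative sketch of DbToHdfs -- not the only valid approach.
# Assumptions: the SQL host ("mysql"), user, and password are placeholders,
# and the loan_types join column ("id") may differ; check the schema (Hint 2).
import time
import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq
from sqlalchemy import create_engine

def db_to_hdfs():
    # the SQL server may still be loading data right after "docker compose up",
    # so retry the connection a few times (see Hint 3)
    for _ in range(10):
        try:
            engine = create_engine("mysql+mysqlconnector://root:PASSWORD@mysql/CS544")
            df = pd.read_sql("""
                SELECT l.*, lt.loan_type_name
                FROM loans l
                INNER JOIN loan_types lt ON l.loan_type_id = lt.id
                WHERE l.loan_amount > 30000 AND l.loan_amount < 800000
            """, engine)
            break
        except Exception:
            time.sleep(5)
    else:
        raise Exception("could not read from the SQL server")

    # write the filtered table to /hdma-wi-2021.parquet with 3x replication
    hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000, replication=3)
    pq.write_table(pa.Table.from_pandas(df), "/hdma-wi-2021.parquet", filesystem=hdfs)
```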
**Hint 1:** We used similar tables in lecture: https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/15-sql
**Hint 2:** To get more familiar with these tables, you can use SQL queries to print the table schema or retrieve sample data. A convenient way to do this is to use `docker exec -it` to enter the SQL Server's container, then run the mysql client with `mysql -p CS544` and perform queries.
**Hint 3:** After `docker compose up`, the SQL Server needs some time to load the data before it is ready. Therefore, you need to wait for a while, or preferably, add a retry mechanism for the SQL connection.
**Hint 4:** For PyArrow to be able to connect to HDFS, you'll need to configure some env variables carefully. Look at how Dockerfile.namenode does this for the start CMD, and do the same in your own Dockerfile for your server.
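For reference, PyArrow's `HadoopFileSystem` loads `libhdfs`, which in turn needs `JAVA_HOME`, `HADOOP_HOME`, and a `CLASSPATH` that contains the Hadoop jars (and sometimes `ARROW_LIBHDFS_DIR` pointing at `libhdfs.so`). Setting these in your Dockerfile, as the hint suggests, is the cleanest option; the snippet below is only an alternative sketch that fills in `CLASSPATH` from inside Python, assuming the `hdfs` CLI is available in your server container.

```python
# Sketch: populate CLASSPATH at runtime so PyArrow can load libhdfs.
# Assumption: the hdfs CLI (and thus a Hadoop install) exists in the container.
import os
import subprocess
import pyarrow.fs

os.environ["CLASSPATH"] = subprocess.check_output(
    ["hdfs", "classpath", "--glob"]).decode().strip()

hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000)
```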
## Part 2: BlockLocations
In this part, your task is to implement the `BlockLocations` gRPC call (you can find the interface definition in the proto file).
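Since the learning objectives mention the WebHDFS API, one building block you will likely need is WebHDFS's `GETFILEBLOCKLOCATIONS` operation, sketched below. The NameNode web port (`9870`, the Hadoop 3 default) and the per-DataNode counting are assumptions made for illustration; the proto file defines what `BlockLocations` must actually accept and return.

```python
# Sketch: asking the NameNode (over WebHDFS) where the blocks of a file live.
# Assumption: the NameNode's web interface is reachable at boss:9870.
import requests

def block_hosts(path):
    resp = requests.get(f"http://boss:9870/webhdfs/v1{path}",
                        params={"op": "GETFILEBLOCKLOCATIONS"})
    resp.raise_for_status()
    blocks = resp.json()["BlockLocations"]["BlockLocation"]

    # example use of the response: count how many blocks each DataNode host has
    counts = {}
    for block in blocks:
        for host in block["hosts"]:
            counts[host] = counts.get(host, 0) + 1
    return counts
```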