docker system prune -a --volumes -f  # Equivalent to the combination of all commands above
```
-->
## Part 1: `DbToHdfs` gRPC Call
In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).
...
...
@@ -106,8 +106,7 @@ In this part, your task is to implement the `DbToHdfs` gRPC call (you can find t
2. What are the actual types for those loans?
Perform an inner join on these two tables so that a new `loan_type_name` column is added to the `loans` table, whose value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Filter the table to keep only the rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows (see the sketch after this list).
4. Upload the generated table to `/hdma-wi-2021.parquet` in HDFS, with **3x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html); a rough upload sketch is given after the hints below.
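
For steps 2 and 3, one reasonable approach is a single SQL query run from your server. The sketch below is only illustrative: the connection URL (host `mysql`, the credentials, and the database name) and the `loan_types` key column `id` are placeholders for whatever your docker-compose setup and SQL schema actually define, and you could equally do the join and filter with pandas or PyArrow after loading the tables.

```python
import pandas as pd
import sqlalchemy

# Placeholder connection string -- substitute the host, user, password, and
# database name used by your own containers.
engine = sqlalchemy.create_engine("mysql+mysqlconnector://root:abc@mysql:3306/CS544")

# Inner join loans with loan_types to add loan_type_name, then filter by amount.
# Adjust "lt.id" if the loan_types key column is named differently in the schema.
query = """
SELECT l.*, lt.loan_type_name
FROM loans AS l
INNER JOIN loan_types AS lt ON l.loan_type_id = lt.id
WHERE l.loan_amount > 30000 AND l.loan_amount < 800000
"""
df = pd.read_sql(query, engine)  # should end up with 426716 rows
```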
To check whether the upload was correct, you can use `docker exec -it` to enter the gRPC server's container and run the HDFS command `hdfs dfs -du -h <path>` to see the file size. The expected result is:
```
...
```

...
**Hint 4:** For PyArrow to be able to connect to HDFS, you'll need to configure some env variables carefully. Look at how Dockerfile.namenode does this for the start CMD, and do the same in your own Dockerfile for your server.
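
Putting Hint 4 and step 4 together, the upload itself can be fairly short. This is only a sketch under assumptions: `boss:9000` is the NameNode address used elsewhere in this document, and `df` is the joined/filtered DataFrame from the earlier sketch (use whatever variable actually holds your table).

```python
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

# `df` is assumed to be the joined and filtered DataFrame built earlier.
table = pa.Table.from_pandas(df)

# Replication and block size are set on the filesystem object itself.
hdfs = pa.fs.HadoopFileSystem(
    host="boss",
    port=9000,
    replication=3,                    # 3x replication
    default_block_size=1024 * 1024,   # 1-MB block size
)

with hdfs.open_output_stream("/hdma-wi-2021.parquet") as f:
    pq.write_table(table, f)
```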
## Part 2: `BlockLocations` gRPC Call
In this part, your task is to implement the `BlockLocations` gRPC call (you can find the interface definition in the proto file).
**BlockLocations:** To be more specific, for a given file path, you need to return a Python dictionary (that is, a `map` in proto) recording how many blocks are stored by each DataNode (the key is the **DataNode location** and the value is the **number** of blocks on that node).
For example, running `docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet` should print a dictionary that maps each DataNode's container ID to the number of blocks stored on that node.

Note: the DataNode location is the randomly generated container ID of the container running the DataNode, so yours will be different, and the distribution of blocks across the nodes may also vary.

The documents [here](https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/WebHDFS.html) describe how we can interact with HDFS via web requests. Many [examples](https://requests.readthedocs.io/en/latest/user/quickstart/) show these web requests being made with the curl command, but you'll adapt those examples to use `requests.get`. By default, WebHDFS runs on port 9870, so use port 9870 instead of 9000 to access HDFS for this part.

Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.

**Hint 1:** Note that if `r` is a response object, then `r.content` will contain some bytes, which you could convert to a dictionary; alternatively, `r.json()` does this for you.
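
For illustration only, the pieces above (`requests.get`, `r.json()`, and `GETFILEBLOCKLOCATIONS`) can be combined roughly like this; the `boss:9870` address comes from the note above, while the helper name and the extra `GETFILESTATUS` call (used to get the file length so the request covers every block) are assumptions of this sketch.

```python
import requests

WEBHDFS = "http://boss:9870/webhdfs/v1"  # NameNode WebHDFS endpoint (assumed)

def count_blocks_per_datanode(path: str) -> dict:
    # GETFILESTATUS gives the file length, so the next call can cover the whole file.
    status = requests.get(f"{WEBHDFS}{path}", params={"op": "GETFILESTATUS"}).json()
    length = status["FileStatus"]["length"]

    r = requests.get(
        f"{WEBHDFS}{path}",
        params={"op": "GETFILEBLOCKLOCATIONS", "offset": 0, "length": length},
    )
    counts = {}
    for block in r.json()["BlockLocations"]["BlockLocation"]:
        for host in block["hosts"]:  # DataNodes holding a replica of this block
            counts[host] = counts.get(host, 0) + 1
    return counts
```

In this setup the reported hostnames should match the containers' randomly generated IDs mentioned in the note above.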
## Part 3: `PartitionByCounty` and `CalcAvgLoan` gRPC Calls
In this part, your task is to implement the `PartitionByCounty` and `CalcAvgLoan` gRPC calls (you can find the interface definition in the proto file).
...
...
The root directory on HDFS should now look like this:
```
19.3 M 19.3 M hdfs://boss:9000/partitioned
```
## Part 4: Fault Tolerance
In this part, your task is to modify the `CalcAvgLoan` gRPC call you implemented in Part 3.
Imagine a scenario where one (or even two) of the three DataNodes fails and goes offline. In this case, can the logic implemented in Part 3 still function correctly? If we still want our gRPC server to respond correctly to requests for calculating the average loan, what should we do?
Keep in mind that `/hdma-wi-2021.parquet` has a replication factor of 3, while `/partitioned` only has a replication factor of 1.
**CalcAvgLoan:** To be more specific, modify this gRPC call so that the server still responds correctly even when one (or even two) of the DataNodes is offline. Try reading the partitioned Parquet file first; if that fails, fall back to the original `hdma-wi-2021.parquet` file and complete the computation from it. In addition, you must return the response with the `source` field filled in (a rough sketch of this fallback pattern follows the list below):
1. "partitioned": calculation performed on parquet partitioned before
2. "unpartitioned": parquet partitioned before is lost, and calculation performed on the initial unpartitioned table
To simulate a DataNode failure, use `docker kill` to terminate a node, then wait until the `hdfs dfsadmin -fs <hdfs_path> -report` command shows that the number of live DataNodes has decreased.
## Submission
Read the directions [here](../projects.md) about how to create the
...
...
Please make sure you have `client.py` copied into p4-server-1; we will run `client.py` inside that container to test your code.