Commit 7fa2265d authored by TYLER CARAZA-HARTER

part 4

parent 58a86dfd
@@ -158,18 +158,18 @@ After a `DbToHdfs` call and a few `CalcAvgLoan` calls, your HDFS directory struc
## Part 4: Fault Tolerance
A "fault" is something that goes wrong, like a hard disk failing or an entire DataNode crashing. Fault-tolerant code continues functioning despite some kinds of faults.

In this part, your task is to make the `CalcAvgLoan` gRPC call you implemented in Part 3 tolerant to a single DataNode failure (we will kill one during testing!).

Recall that `CalcAvgLoan` sometimes uses small, county-specific Parquet files that have 1x replication, and sometimes it uses the big Parquet file (`hdma-wi-2021.parquet`) of all loan applications, which uses 2x replication. Your fault-tolerance strategy should be as follows:
1. `hdma-wi-2021.parquet`: if you created this with 2x replication earlier, you don't need to do anything else here, because HDFS can automatically handle a single DataNode failure for you.
2. `partitioned/<COUNTY_CODE>.parquet`: this data only has 1x replication, so HDFS might lose it when a DataNode fails. That's fine, because all the rows are still in the big Parquet file. You should write code to detect this scenario and recreate the lost/corrupted county-specific file by reading the big file again with the county filter. If you try to read an HDFS file with missing data using PyArrow, the client will retry for a while (perhaps 30 seconds or so), then raise an OSError exception, which you should catch and handle.

`CalcAvgLoan` should now use the "source" field in the return value to indicate how the average was computed: "create" (from the big file, because a county-specific file didn't already exist), "recreate" (from the big file, because a county-specific file was corrupted/lost), or "reuse" (there was a valid county-specific file that was used).
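To make the fallback logic concrete, here is a minimal sketch of the core computation only (not the full gRPC servicer, and not the official solution). The NameNode address `boss:9000`, the column names `county_code` and `loan_amount`, and the helper name are assumptions; adjust them to match your own schema and cluster setup.

```python
import pyarrow.compute as pc
import pyarrow.fs
import pyarrow.parquet as pq

hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000)  # assumed NameNode host/port
BIG = "/hdma-wi-2021.parquet"

def county_avg_loan(county_code: int):
    """Return (average, source) for one county, tolerating a lost county file."""
    part_path = f"/partitioned/{county_code}.parquet"

    def from_big_file():
        # read the 2x-replicated big file with a county filter, then
        # (re)write the 1x-replicated county-specific file
        t = pq.read_table(BIG, filesystem=hdfs,
                          filters=[("county_code", "=", county_code)])
        pq.write_table(t, part_path, filesystem=hdfs)
        return t

    info = hdfs.get_file_info(part_path)
    if info.type == pyarrow.fs.FileType.NotFound:
        table, source = from_big_file(), "create"    # county file never existed
    else:
        try:
            table, source = pq.read_table(part_path, filesystem=hdfs), "reuse"
        except OSError:                              # blocks lost with the dead DataNode
            table, source = from_big_file(), "recreate"

    return pc.mean(table["loan_amount"]).as_py(), source
```

Whether you re-save the county file after recomputing it is a design choice; the sketch does so that later calls can go back to the fast "reuse" path.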
**Hint:** To manually test DataNode failure, you should use `docker kill` to terminate a node and then wait until you confirm that the number of `live DataNodes` has decreased using the `hdfs dfsadmin -fs <hdfs_path> -report` command.
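If you prefer to script this check instead of eyeballing the report, the following rough sketch runs the same commands from the host. The DataNode container name and the `hdfs://boss:9000` address are assumptions; use `docker ps` and your docker-compose.yml to find the real names.

```python
# Kill one DataNode, then poll `hdfs dfsadmin -report` (run inside the
# p4-server-1 container via docker exec) until the live-DataNode count drops.
import subprocess
import time

def live_datanodes(hdfs_path="hdfs://boss:9000"):  # assumed NameNode address
    report = subprocess.check_output(
        ["docker", "exec", "p4-server-1",
         "hdfs", "dfsadmin", "-fs", hdfs_path, "-report"], text=True)
    # the report contains a line like "Live datanodes (3):"
    for line in report.splitlines():
        if line.startswith("Live datanodes"):
            return int(line.split("(")[1].split(")")[0])
    return 0

before = live_datanodes()
subprocess.run(["docker", "kill", "p4-dn-1"], check=True)  # hypothetical DataNode container name
while live_datanodes() >= before:
    time.sleep(5)  # HDFS takes a while to notice the failure
print("live DataNodes dropped from", before, "to", live_datanodes())
```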
## Submission
@@ -187,9 +187,9 @@
```
docker build . -f Dockerfile.server -t p4-server
docker compose up -d
```
Note that we will copy in the provided files (docker-compose.yml, client.py, lender.proto, hdma-wi-2021.sql.gz, etc.), overwriting anything you might have changed. Please do NOT push hdma-wi-2021.sql.gz to your repo because it is large, and we want to keep the repos small.
Please make sure you have `client.py` copied into the p4-server image. We will run client.py in the p4-server-1 container to test your code.
## Tester