Commit 64ff3002 authored by TYLER CARAZA-HARTER

examples

parent 7fa2265d
@@ -4,7 +4,7 @@
## Overview
-In this project, you will deploy a data system consisting of an SQL server, an HDFS cluster, a gRPC server, and a client. You will need to read and filter data from a SQL database and persist it to HDFS. Then, you will partition the data to enable efficient request processing. Additionally, you will need to write a fault-tolerant application that works even when an HDFS DataNode fails (we will test this scenario).
+In this project, you will deploy a data system consisting of an SQL server, an HDFS cluster, a gRPC server, and a client. You will need to read and filter data from a SQL database and persist it to HDFS. Additionally, you will write a fault-tolerant application that works even when an HDFS DataNode fails (we will test this scenario).
Learning objectives:
* communicate with the SQL Server using SQL queries
@@ -31,7 +31,6 @@ This project will use `docker exec -it` to run the client on the gRPC server's c
# Inside the server container
python3 client.py DbToHdfs
python3 client.py BlockLocations -f <file_path>
-python3 client.py PartitionByCounty
python3 client.py CalcAvgLoan -c <county_code>
```
@@ -43,8 +42,7 @@ Take a look at the provided Docker compose file. There are several services, inc
You are required to write a server.py and Dockerfile.server based on the provided proto file (you may not modify it).
-The gRPC interfaces have already been defined (see `lender.proto` for details). There are no constraints on the return values of `DbToHdfs` and `PartitionByCounty`, so you may return what you think is helpful.
+The gRPC interfaces have already been defined (see `lender.proto` for details). There are no constraints on the return value of `DbToHdfs`, so you may return what you think is helpful.
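For reference, a minimal server.py skeleton might look like the sketch below. It assumes the generated modules are named `lender_pb2`/`lender_pb2_grpc` and that the service is called `Lender` (consistent with `LenderStub` in client.py); the response message and field names shown are placeholders, since the real ones are defined in lender.proto.

```
# Sketch only: a possible server.py skeleton under the assumptions stated above.
from concurrent import futures

import grpc
import lender_pb2
import lender_pb2_grpc

class LenderService(lender_pb2_grpc.LenderServicer):
    def DbToHdfs(self, request, context):
        # TODO: read from SQL, filter, and write hdma-wi-2021.parquet to HDFS
        return lender_pb2.StatusString(status="ok")  # placeholder message/field names

    def CalcAvgLoan(self, request, context):
        # TODO: compute the average loan amount for request.county_code
        return lender_pb2.CalcAvgLoanResp(avg_loan=0, source="create")  # placeholders

def main():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    lender_pb2_grpc.add_LenderServicer_to_server(LenderService(), server)
    server.add_insecure_port("[::]:5000")  # assumed port; must match what client.py uses
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    main()
```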
### Docker image naming
You need to build the Docker image following this naming convention.
@@ -134,7 +132,7 @@ In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can fin
The call should read hdma-wi-2021.parquet, filtering to rows with the specified county code. One way to do this would be to pass a `("column", "=", ????)` tuple inside a `filters` list upon read: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
-The call should return the average loan amount from the filtered table.
+The call should return the average loan amount from the filtered table as an integer (rounding down if necessary).
As an optimization, your code should also write the filtered data to a file named `partitions/<county_code>.parquet`. If there are later calls for the same county_code, your code should use the smaller, county-specific Parquet file (instead of filtering the big Parquet file with all loan applications). The county-specific Parquet file should have 1x replication. When `CalcAvgLoan` returns the average, it should also use the "source" field to indicate whether the data came from the big Parquet file (`source="create"`, because a new county-specific file had to be created) or from a previously created county-specific file (`source="reuse"`).
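The create/reuse logic could be shaped roughly like the sketch below. It assumes an HDFS connection via `pyarrow.fs.HadoopFileSystem` with a hypothetical NameNode host/port, and uses placeholder column names (`"county_code"`, `"loan_amount"`) that should be replaced with the real schema names from the dataset; replication settings are not shown.

```
# Sketch only, under the assumptions stated above.
import pyarrow.fs
import pyarrow.parquet as pq

hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000)  # hypothetical host/port

def calc_avg_loan(county_code):
    path = f"/partitions/{county_code}.parquet"
    try:
        # Reuse the county-specific file if it already exists.
        table = pq.read_table(path, filesystem=hdfs)
        source = "reuse"
    except FileNotFoundError:
        # Otherwise, filter the big file down to one county...
        table = pq.read_table(
            "/hdma-wi-2021.parquet",
            filesystem=hdfs,
            filters=[("county_code", "=", county_code)],  # placeholder column name
        )
        # ...and cache the filtered rows for later calls.
        pq.write_table(table, path, filesystem=hdfs)
        source = "create"
    avg = int(table["loan_amount"].to_pandas().mean())  # placeholder column name
    return avg, source
```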
@@ -150,7 +148,7 @@ After a `DbToHdfs` call and a few `CalcAvgLoan` calls, your HDFS directory struc
```
├── hdma-wi-2021.parquet
-├── partitioned/
+├── partitions/
│ ├── 55001.parquet
│ ├── 55003.parquet
│ └── ...
@@ -165,7 +163,7 @@ In this part, your task is to make `CalcAvgLoan` tolerant to a single DataNode f
Recall that `CalcAvgLoan` sometimes uses small, county-specific Parquet files that have 1x replication, and sometimes it uses the big Parquet file (hdma-wi-2021.parquet) of all loan applications that uses 2x replication. Your fault tolerance strategy should be as follows:
1. hdma-wi-2021.parquet: if you created this with 2x replication earlier, you don't need to do anything else here, because HDFS can automatically handle a single DataNode failure for you
-2. partitioned/<COUNTY_CODE>.parquet: this data only has 1x replication, so HDFS might lose it when the DataNode fails. That's fine, because all the rows are still in the big Parquet file. You should write code to detect this scenario and recreate the lost/corrupted county-specific file by reading the big file again with the county filter. If you try to read an HDFS file with missing data using PyArrow, the client will retry for a while (perhaps 30 seconds or so), then raise an OSError exception, which you should catch and handle
+2. partitions/<COUNTY_CODE>.parquet: this data only has 1x replication, so HDFS might lose it when the DataNode fails. That's fine, because all the rows are still in the big Parquet file. You should write code to detect this scenario and recreate the lost/corrupted county-specific file by reading the big file again with the county filter. If you try to read an HDFS file with missing data using PyArrow, the client will retry for a while (perhaps 30 seconds or so), then raise an OSError exception, which you should catch and handle
CalcAvgLoan should now use the "source" field in the return value to indicate how the average was computed: "create" (from the big file, because a county-specific file didn't already exist), "recreate" (from the big file, because a county-specific file was corrupted/lost), or "reuse" (there was a valid county-specific file that was used).
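Concretely, the reuse path from the earlier sketch could be extended with a fallback along these lines (same hypothetical HDFS handle and placeholder schema names as before):

```
# Sketch only: distinguishes "no county file yet" from "county file lost",
# then rebuilds from the 2x-replicated big file in either case.
import pyarrow.fs
import pyarrow.parquet as pq

hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000)  # hypothetical host/port

def read_county_table(county_code):
    path = f"/partitions/{county_code}.parquet"
    try:
        return pq.read_table(path, filesystem=hdfs), "reuse"
    except FileNotFoundError:
        source = "create"    # no county-specific file has been written yet
    except OSError:
        source = "recreate"  # file existed, but its single replica was lost
    table = pq.read_table(
        "/hdma-wi-2021.parquet",
        filesystem=hdfs,
        filters=[("county_code", "=", county_code)],  # placeholder column name
    )
    pq.write_table(table, path, filesystem=hdfs)  # re-cache for later calls
    return table, source
```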
@@ -187,6 +185,14 @@ docker build . -f Dockerfile.server -t p4-server
docker compose up -d
```
+Then run the client like this:
+```
+docker exec -it p4-server-1 python3 /client.py DbToHdfs
+docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet
+docker exec -it p4-server-1 python3 /client.py CalcAvgLoan -c 55001
+```
Note that we will copy in the provided files (docker-compose.yml, client.py, lender.proto, hdma-wi-2021.sql.gz, etc.), overwriting anything you might have changed. Please do NOT push hdma-wi-2021.sql.gz to your repo because it is large, and we want to keep the repos small.
Please make sure you have `client.py` copied into the p4-server image. We will run client.py in the p4-server-1 container to test your code.
......
@@ -9,7 +9,7 @@ import pandas as pd
parser = argparse.ArgumentParser(description="argument parser for p4 client")
parser.add_argument("mode", help="which action to take", choices=["SqlQuery", "DbToHdfs","PartitionByCounty","BlockLocations","CalcAvgLoan"])
parser.add_argument("mode", help="which action to take", choices=["DbToHdfs","BlockLocations","CalcAvgLoan"])
parser.add_argument("-c", "--code", type=int, default=0, help="county code to query average loan amount in CalcAvgLoan mode")
parser.add_argument("-f", "--file", type=str, default="", help="file path for BlockLocation")
@@ -20,9 +20,6 @@ stub = lender_pb2_grpc.LenderStub(channel)
if args.mode == "DbToHdfs":
    resp = stub.DbToHdfs(lender_pb2.Empty())
    print(resp.status)
-elif args.mode == "PartitionByCounty":
-    resp = stub.PartitionByCounty(lender_pb2.Empty())
-    print(resp.status)
elif args.mode == "CalcAvgLoan":
    resp = stub.CalcAvgLoan(lender_pb2.CalcAvgLoanReq(county_code=args.code))
    if resp.error:
......