Take a look at the provided Docker Compose file. There are several services, including 3 `datanodes`, a `namenode`, a SQL server, and a gRPC server. The NameNode service will serve at the host `boss` within the Docker Compose network.
### gRPC
...
...
variable. You can set it to p4 in your environment:
```bash
export PROJECT=p4
```
**Hint 1:** The command `docker logs <container-name> -f` might be very useful for troubleshooting. It allows you to view real-time output from a specific container.
**Hint 2:** Consider writing a .sh script that helps you quickly test code changes. For example, you may want it to rebuild your Dockerfiles, clean up an old Compose cluster, and deploy a new cluster.
**Hint 3:** If you're low on disk space, consider running `docker system prune -a --volumes -f`
## Part 1: `DbToHdfs` gRPC Call
In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).
...
...
2. What are the actual types for those loans?
   Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table, matched on the `loan_type_id` in `loans`.
3. Filter the rows to those where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in HDFS, with **2x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html); a sketch of this step appears after the expected output below.
To check whether the upload succeeded, you can use `docker exec -it` to enter the gRPC server's container and run the HDFS command `hdfs dfs -du -h <path>` to see the file size. The expected result is:
```
15.3 M 30.5 M hdfs://boss:9000/hdma-wi-2021.parquet
```
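The sketch below shows one way to do step 4, assuming the joined and filtered result is already a PyArrow table; the SQL query in the comment and the key column on the `loan_types` side are illustrative guesses, so check the actual schema in the database.
```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Illustrative version of the join + filter from steps 2-3 (the exact column
# names on the loan_types side depend on the provided schema):
#   SELECT l.*, lt.loan_type_name
#   FROM loans l INNER JOIN loan_types lt ON l.loan_type_id = lt.id
#   WHERE l.loan_amount > 30000 AND l.loan_amount < 800000

def upload_to_hdfs(table: pa.Table) -> None:
    # Connect to the NameNode; the replication and block size configured here
    # apply to files written through this filesystem handle.
    hdfs = fs.HadoopFileSystem(
        "boss", 9000, replication=2, default_block_size=1024 * 1024
    )
    with hdfs.open_output_stream("/hdma-wi-2021.parquet") as f:
        pq.write_table(table, f)
```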
**Hint 1:** We used similar tables in lecture: https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/15-sql
...
...
In this part, your task is to implement the `BlockLocations` gRPC call (you can find the interface definition in the proto file).
For example, running `docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet` should show something like this:
Note: the DataNode location is the randomly generated container ID of the container running that DataNode, so yours will be different, and the distribution of blocks across the nodes will also likely vary.
The documents [here](https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/WebHDFS.html) describe how to interact with HDFS via web requests. Many [examples](https://requests.readthedocs.io/en/latest/user/quickstart/) show these web requests being made with the `curl` command, but you'll adapt those examples to use `requests.get`. By default, WebHDFS runs on port 9870, so use port 9870 (instead of 9000) to access HDFS for this part.
Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
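As a rough sketch (assuming the `boss` hostname and the file path from Part 1; the JSON layout follows the `BlockLocations` schema in the WebHDFS documentation linked above):
```python
import requests
from collections import Counter

# Ask WebHDFS (on the NameNode, port 9870) for the block locations of the
# Parquet file uploaded in Part 1.
resp = requests.get(
    "http://boss:9870/webhdfs/v1/hdma-wi-2021.parquet",
    params={"op": "GETFILEBLOCKLOCATIONS"},
)
resp.raise_for_status()
blocks = resp.json()["BlockLocations"]["BlockLocation"]

# Each block lists the DataNodes ("hosts") holding a replica; count how many
# blocks live on each DataNode.
counts = Counter(host for block in blocks for host in block["hosts"])
print(dict(counts))
```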
## Part 3: `CalcAvgLoan` gRPC Call
In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).
Imagine a scenario where there could be many queries differentiated by `county`, one of which is to get the average loan amount for a county. In this case, it might be much more efficient to generate a set of 1x-replicated Parquet files filtered by county, and then read data from these much smaller partitioned tables for computation.
The call should read hdma-wi-2021.parquet, filtering to rows with the specified county code. One way to do this would be to pass a `("column", "=", ????)` tuple inside a `filters` list upon read: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
The call should return the average loan amount from the filtered table.
As an optimization, your code should also write the filtered data to a file named `partitions/<county_code>.parquet`. If there are later calls for the same `county_code`, your code should use the smaller, county-specific Parquet file (instead of filtering the big Parquet file with all loan applications). The county-specific Parquet file should have 1x replication. When `CalcAvgLoan` returns the average, it should also use the `source` field to indicate whether the data came from the big Parquet file (`source="create"`, because a new county-specific file had to be created) or from a previously created county-specific file (`source="reuse"`).
One easy way to check whether the county-specific file already exists is to just try reading it with PyArrow. You should get a `FileNotFoundError` exception if it doesn't exist.
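Putting the pieces together, here is a minimal sketch of the reuse-or-create logic, assuming the filter column is named `county_code` and reusing the `boss:9000` connection details from Part 1 (both are assumptions; check the table schema and the proto definitions):
```python
import pyarrow.compute as pc
import pyarrow.parquet as pq
from pyarrow import fs

# Use 1x replication for the small, county-specific partition files.
hdfs_1x = fs.HadoopFileSystem("boss", 9000, replication=1)

def calc_avg_loan(county_code: int) -> tuple[int, str]:
    """Return (average loan amount, source), where source is "reuse" or "create"."""
    part_path = f"/partitions/{county_code}.parquet"
    try:
        # Fast path: a partition file for this county was created earlier.
        table = pq.read_table(part_path, filesystem=hdfs_1x)
        source = "reuse"
    except FileNotFoundError:
        # Slow path: filter the big file, then cache the result for next time.
        table = pq.read_table(
            "/hdma-wi-2021.parquet",
            filesystem=hdfs_1x,
            filters=[("county_code", "=", county_code)],  # column name is an assumption
        )
        hdfs_1x.create_dir("/partitions", recursive=True)  # no-op if it exists
        with hdfs_1x.open_output_stream(part_path) as f:
            pq.write_table(table, f)
        source = "create"
    return int(pc.mean(table["loan_amount"]).as_py()), source
```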
After a `DbToHdfs` call and a few `CalcAvgLoan` calls, your HDFS directory structure will look something like this:
```
├── hdma-wi-2021.parquet
├── partitions/
│ ├── 55001.parquet
│ ├── 55003.parquet
│ └── ...
```
## Part 4: Fault Tolerance
In this part, your task is to modify the `CalcAvgLoan` gRPC call you implemented in Part 3.