From 6ba89a5d84ef09ccae3c26ae06a6f735e4dd2a31 Mon Sep 17 00:00:00 2001
From: wyang338 <weichuyang777@gmail.com>
Date: Sun, 2 Mar 2025 21:28:51 -0600
Subject: [PATCH] Remove the format restriction on partitioned Parquet.

---
 p4/README.md | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/p4/README.md b/p4/README.md
index d8eedc2..b61ed0b 100644
--- a/p4/README.md
+++ b/p4/README.md
@@ -157,19 +157,10 @@ In this part, your task is to implement the `PartitionByCounty` and `CalcAvgLoan
 
 Imagine a scenario where there could be many queries differentiated by `county`, and one of them is to get the average loan amount for a county. In this case, it might be much more efficient to generate a set of 1x Parquet files filtered by county, and then read data from these partitioned, relatively much smaller tables for computation.
 
-**PartitionByCounty:** To be more specific, you need to categorize the contents of that parquet file just stored in HDFS using `county_id` as the key. For each `county_id`, create a new parquet file that records all entries under that county, and then save them with a **1x replication**. Files should be written into folder `/partitioned/` and name for each should be their `county_id`.
+**PartitionByCounty:** To be more specific, you need to categorize the contents of that parquet file just stored in HDFS using `county_id` as the key. For each `county_id`, create a new parquet file that records all entries under that county, and then save them with a **1x replication**. Files should be written into folder `/partitioned/`.
 
 **CalcAvgLoan:** To be more specific, for a given `county_id` , you need to return a int value, indicating the average `loan_amount` of that county. **Note:** You are required to perform this calculation based on the partitioned parquet files generated by `FilterByCounty`. `source` field in proto file can ignored in this part.
 
-The inside of the partitioned directory should look like this:
-
- ```
- ├── partitioned/
- │ ├── 55001.parquet
- │ ├── 55003.parquet
- │ └── ...
- ```
-
 The root directory on HDFS should now look like this:
 ```
 14.4 M 43.2 M hdfs://boss:9000/hdma-wi-2021.parquet
-- 
GitLab