Commit e7740eff authored by Jing Lan

Update P3 README

parent 00099229
@@ -20,7 +20,9 @@ Before starting, please review the [general project directions](../projects.md).
## Clarifications/Corrections
-* none yet
+* Feb 24: feel free to use different tools to implement Part 2.
+* Feb 24: clarify that `bigdata.py` will be used in tests.
+* Feb 24: add link to lecture notes on parquet file operations.
## Part 1: Communication (gRPC)
@@ -126,7 +128,9 @@ file (for example, you could add the path to some data structure, like a
list or dictionary).
-Your server should similarly write the same data to a parquet file
-somewhere, using pyarrow.
+Your server should similarly write the same data to a parquet file
+somewhere, using `pyarrow`, `pandas`, or any available tools. Refer to
+the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads)
+for a few examples of reading/writing parquet files.
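For reference, one minimal way this could look with `pyarrow`, assuming the uploaded data arrives as CSV bytes (the helper name and arguments are illustrative, not part of the project spec):

```python
# Minimal sketch: persist uploaded CSV bytes as a parquet file with pyarrow.
# The function name and arguments are hypothetical, not part of the spec.
import io

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

def csv_bytes_to_parquet(csv_bytes: bytes, path: str) -> None:
    table = pacsv.read_csv(io.BytesIO(csv_bytes))  # parse CSV into an Arrow table
    pq.write_table(table, path)                    # write that table as parquet
```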
## Part 3: Column Sum
@@ -174,22 +178,24 @@ be a performance difference depending on which format is used.
Parquet is a column-oriented format, so all the data in a single
column should be adjacent on disk. This means it should be possible to read
-a column of data without reading the whole file. See the `columns`
-parameter here:
-https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
+a column of data without reading the whole file. Check out the `columns`
+parameter of [`pyarrow.parquet.read_table`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).
+You can also find an example in the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads).
**Requirement:** when the server is asked to sum over the column of a
Parquet file, it should only read the data from that column, not other
columns.
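A sketch along these lines could satisfy that requirement, assuming `pyarrow`; the path and column name are placeholders:

```python
# Sketch: sum one column of a parquet file without loading the other columns.
import pyarrow.compute as pc
import pyarrow.parquet as pq

def parquet_column_sum(path: str, column: str):
    # columns=[...] limits the read to just the requested column's data
    table = pq.read_table(path, columns=[column])
    return pc.sum(table[column]).as_py()  # None if the column is empty/all-null
```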
-**Note:** we will run your server with a 512-MB limit on RAM. Any
+**Note 1:** we will run your server with a 512-MB limit on RAM. Any
individual files we upload will fit within that limit, but the total
size of the files uploaded will exceed that limit. That's why your
server will have to do sums by reading the files (instead of just
-keeping all table data in memory). If you want manually test your
-code with some bigger uploads, use the `bigdata.py` client. Instead
-of uploading files, it randomly generateds lots of CSV-formatted data
-and directly uploads it via gRPC.
+keeping all table data in memory).
+
+**Note 2:** the `bigdata.py` script randomly generates a large volume of
+CSV-formatted data and uploads it via gRPC. You are *required* to
+test your upload implementation with this script, and it will be used
+as part of our tests.
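For intuition, the generator side of such a client might look roughly like the sketch below; the stub class, RPC name, and request field are hypothetical placeholders for whatever your `.proto` actually defines:

```python
# Illustrative only: build a big random CSV payload and upload it over gRPC.
# table_pb2/table_pb2_grpc, TableStub, Upload, and csv_data are hypothetical
# names; substitute the ones generated from your own .proto file.
import random

import grpc
# import table_pb2, table_pb2_grpc  # generated by grpc_tools.protoc

def make_csv(rows: int = 1_000_000) -> bytes:
    lines = ["x,y,z"]  # header row with three example columns
    for _ in range(rows):
        lines.append(",".join(str(random.randint(0, 100)) for _ in range(3)))
    return "\n".join(lines).encode("utf-8")

# channel = grpc.insecure_channel("localhost:5440")
# stub = table_pb2_grpc.TableStub(channel)
# stub.Upload(table_pb2.UploadReq(csv_data=make_csv()))
```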
## Part 4: Locking
@@ -249,6 +255,7 @@ docker run --name=yournetid -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/input
docker exec yournetid python3 upload.py /inputs/simple.csv
docker exec yournetid python3 csvsum.py x
docker exec yournetid python3 parquetsum.py x
+docker exec yournetid python3 bigdata.py
```
Please do include the files built from the .proto (your Dockerfile
...