From e7740eff698c25ee11eaf0e2739c884b17b2cb14 Mon Sep 17 00:00:00 2001
From: Jing Lan <jlan25@cs544-jlan25.cs.wisc.edu>
Date: Mon, 24 Feb 2025 14:23:49 -0600
Subject: [PATCH] Update P3 README

---
 p3/README.md | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/p3/README.md b/p3/README.md
index 3ec78b2..471549b 100644
--- a/p3/README.md
+++ b/p3/README.md
@@ -20,7 +20,9 @@ Before starting, please review the [general project directions](../projects.md).
 
 ## Clarifications/Corrections
 
-* none yet
+* Feb 24: feel free to use different tools to implement Part 2.
+* Feb 24: clarify that `bigdata.py` will be used in tests.
+* Feb 24: add link to lecture notes on parquet file operations.
 
 ## Part 1: Communication (gRPC)
 
@@ -126,7 +128,9 @@ file (for example, you could add the path to some data structure, like a list
 or dictionary).
 
 Your server should similarly write the same data to a parquet file
-somewhere, using pyarrow.
+somewhere, using `pyarrow`, `pandas`, or other available tools. Refer to
+the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads)
+for a few examples of reading/writing parquet files.
 
 ## Part 3: Column Sum
 
@@ -174,22 +178,24 @@ be a performance difference depending on which format is used.
 
 Parquet is a column-oriented format, so all the data in a single
 file should be adjacent on disk. This means it should be possible to read
-a column of data without reading the whole file. See the `columns`
-parameter here:
-https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
+a column of data without reading the whole file. Check out the `columns`
+parameter of [`pyarrow.parquet.read_table`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).
+You can also find an example in the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads).
 
 **Requirement:** when the server is asked to sum over the column of a
 Parquet file, it should only read the data from that column, not
 other columns.
 
-**Note:** we will run your server with a 512-MB limit on RAM. Any
+**Note 1:** we will run your server with a 512-MB limit on RAM. Any
 individual files we upload will fit within that limit, but the total
 size of the files uploaded will exceed that limit. That's why your
 server will have to do sums by reading the files (instead of just
-keeping all table data in memory). If you want manually test your
-code with some bigger uploads, use the `bigdata.py` client. Instead
-of uploading files, it randomly generateds lots of CSV-formatted data
-and directly uploads it via gRPC.
+keeping all table data in memory).
+
+**Note 2:** the `bigdata.py` client randomly generates a large volume of
+CSV-formatted data and uploads it via gRPC. You are *required* to
+test your upload implementation with this script, and it will be used
+as part of our tests.
 
 ## Part 4: Locking
 
@@ -249,6 +255,7 @@ docker run --name=yournetid -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/input
 docker exec yournetid python3 upload.py /inputs/simple.csv
 docker exec yournetid python3 csvsum.py x
 docker exec yournetid python3 parquetsum.py x
+docker exec yournetid python3 bigdata.py
 ```
 
 Please do include the files built from the .proto (your Dockerfile
-- 
GitLab
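
The Part 2 hunk above says the server may use `pyarrow`, `pandas`, or other tools to write Parquet. A minimal sketch of the `pyarrow` route, assuming the server has buffered uploaded rows as a dict of column lists; the dict contents and the `uploads/data.parquet` path are illustrative stand-ins, not part of the project spec:

```python
# Sketch: persist uploaded tabular data as a Parquet file with pyarrow.
# The column names/values are hypothetical stand-ins for whatever the
# upload RPC actually received.
import os
import pyarrow as pa
import pyarrow.parquet as pq

columns = {"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}

os.makedirs("uploads", exist_ok=True)          # illustrative output dir
table = pa.table(columns)                      # build an in-memory Arrow table
pq.write_table(table, "uploads/data.parquet")  # write it out as Parquet
```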
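
Likewise, for the Part 3 requirement that a sum over a Parquet column reads only that column, a sketch using the `columns` parameter of `pyarrow.parquet.read_table`; the column name `x` mirrors the README's `parquetsum.py x` example, and the path assumes the file written above:

```python
# Sketch: sum one column of a Parquet file without loading the others,
# via the `columns` parameter of pyarrow.parquet.read_table.
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("uploads/data.parquet", columns=["x"])  # reads only column "x"
total = pc.sum(table.column("x")).as_py()  # Arrow-native sum -> Python value
print(total)
```

Reading only the requested column keeps peak memory proportional to a single column rather than the whole table, which is what makes this approach viable under the 512-MB RAM cap described in Note 1.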