Commit e7740eff authored by Jing Lan

Update P3 README

parent 00099229
@@ -20,7 +20,9 @@ Before starting, please review the [general project directions](../projects.md).
## Clarifications/Corrections
-* none yet
+* Feb 24: feel free to use different tools to implement Part 2.
+* Feb 24: clarify that `bigdata.py` will be used in tests.
+* Feb 24: add link to lecture notes on parquet file operations.
## Part 1: Communication (gRPC)
@@ -126,7 +128,9 @@ file (for example, you could add the path to some data structure, like a
list or dictionary).
Your server should similarly write the same data to a parquet file
-somewhere, using pyarrow.
+somewhere, using `pyarrow`, `pandas`, or any available tools. Refer to
+the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads)
+for a few examples of reading/writing parquet files.
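For reference, a minimal sketch of this write path, assuming the upload handler already holds the raw CSV bytes from the gRPC request; the function name and output path are illustrative, not part of the project starter code:

```python
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def save_upload(csv_bytes, parquet_path):
    # Parse the uploaded CSV bytes into a DataFrame.
    df = pd.read_csv(io.BytesIO(csv_bytes))
    # Convert to an Arrow table and write it to disk as Parquet.
    pq.write_table(pa.Table.from_pandas(df), parquet_path)
```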
## Part 3: Column Sum
@@ -174,22 +178,24 @@ be a performance depending on which format is used.
Parquet is a column-oriented format, so all the data in a single file
should be adjacent on disk. This means it should be possible to read
-a column of data without reading the whole file. See the `columns`
-parameter here:
-https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
+a column of data without reading the whole file. Check out the `columns`
+parameter of [`pyarrow.parquet.read_table`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).
+You can also find an example in the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads).
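As a quick sketch of such a single-column read (the file path and column name below are placeholders):

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Only the requested column is read from the Parquet file.
table = pq.read_table("uploads/simple.parquet", columns=["x"])
total = pc.sum(table.column("x")).as_py()
```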
**Requirement:** when the server is asked to sum over the column of a
Parquet file, it should only read the data from that column, not other
columns.
-**Note:** we will run your server with a 512-MB limit on RAM. Any
+**Note 1:** we will run your server with a 512-MB limit on RAM. Any
individual files we upload will fit within that limit, but the total
size of the files uploaded will exceed that limit. That's why your
server will have to do sums by reading the files (instead of just
-keeping all table data in memory). If you want manually test your
-code with some bigger uploads, use the `bigdata.py` client. Instead
-of uploading files, it randomly generateds lots of CSV-formatted data
-and directly uploads it via gRPC.
+keeping all table data in memory).
+**Note 2:** the `bigdata.py` client randomly generates a large volume of
+CSV-formatted data and uploads it via gRPC. You are *required* to
+test your upload implementation with this script, and it will be used
+as part of our tests.
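To illustrate why reading one file at a time keeps memory bounded, here is a rough sketch of a sum routine; the path lists and the format argument are assumptions for the example, not part of the starter code:

```python
import pandas as pd
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Bookkeeping assumed to be filled in by the upload handler, one entry per upload.
csv_paths = []
parquet_paths = []

def column_sum(col, fmt):
    """Sum one column across every upload, reading a single file at a time."""
    total = 0
    if fmt == "parquet":
        for path in parquet_paths:
            tbl = pq.read_table(path, columns=[col])       # read only this column
            total += pc.sum(tbl.column(col)).as_py() or 0  # None if file has no rows
    else:
        for path in csv_paths:
            total += pd.read_csv(path)[col].sum()          # one CSV in memory at a time
    return total
```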
## Part 4: Locking
@@ -249,6 +255,7 @@ docker run --name=yournetid -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/input
docker exec yournetid python3 upload.py /inputs/simple.csv
docker exec yournetid python3 csvsum.py x
docker exec yournetid python3 parquetsum.py x
+docker exec yournetid python3 bigdata.py
```
Please do include the files built from the .proto (your Dockerfile