Commit e7740eff authored by Jing Lan

Update P3 README

parent 00099229
@@ -20,7 +20,9 @@ Before starting, please review the [general project directions](../projects.md).
## Clarifications/Corrections
-* none yet
+* Feb 24: feel free to use different tools to implement Part 2.
+* Feb 24: clarify that `bigdata.py` will be used in tests.
+* Feb 24: add link to lecture notes on parquet file operations.
## Part 1: Communication (gRPC)
@@ -126,7 +128,9 @@ file (for example, you could add the path to some data structure, like a
list or dictionary).
-Your server should similarly write the same data to a parquet file
-somewhere, using pyarrow.
+Your server should similarly write the same data to a parquet file
+somewhere, using `pyarrow`, `pandas`, or any available tools. Refer to
+the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads)
+for a few examples of reading/writing parquet files.
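For reference, one minimal way this could look with `pyarrow`, assuming the uploaded data arrives as CSV bytes (the helper name and arguments are illustrative, not part of the project spec):

```python
# Minimal sketch: persist uploaded CSV bytes as a parquet file with pyarrow.
# The function name and arguments are hypothetical, not part of the spec.
import io

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

def csv_bytes_to_parquet(csv_bytes: bytes, path: str) -> None:
    table = pacsv.read_csv(io.BytesIO(csv_bytes))  # parse CSV into an Arrow table
    pq.write_table(table, path)                    # write that table as parquet
```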
## Part 3: Column Sum
@@ -174,22 +178,24 @@ be a performance difference depending on which format is used.
Parquet is a column-oriented format, so all the data in a single
column should be adjacent on disk. This means it should be possible to read
-a column of data without reading the whole file. See the `columns`
-parameter here:
-https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
+a column of data without reading the whole file. Check out the `columns`
+parameter of [`pyarrow.parquet.read_table`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).
+You can also find an example in the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads).
**Requirement:** when the server is asked to sum over the column of a
Parquet file, it should only read the data from that column, not other
columns.
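A sketch along these lines could satisfy that requirement, assuming `pyarrow`; the path and column name are placeholders:

```python
# Sketch: sum one column of a parquet file without loading the other columns.
import pyarrow.compute as pc
import pyarrow.parquet as pq

def parquet_column_sum(path: str, column: str):
    # columns=[...] limits the read to just the requested column's data
    table = pq.read_table(path, columns=[column])
    return pc.sum(table[column]).as_py()  # None if the column is empty/all-null
```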
-**Note:** we will run your server with a 512-MB limit on RAM. Any
+**Note 1:** we will run your server with a 512-MB limit on RAM. Any
individual files we upload will fit within that limit, but the total
size of the files uploaded will exceed that limit. That's why your
server will have to do sums by reading the files (instead of just
-keeping all table data in memory). If you want manually test your
-code with some bigger uploads, use the `bigdata.py` client. Instead
-of uploading files, it randomly generateds lots of CSV-formatted data
-and directly uploads it via gRPC.
+keeping all table data in memory).
+
+**Note 2:** the `bigdata.py` script randomly generates a large volume of
+CSV-formatted data and uploads it via gRPC. You are *required* to
+test your upload implementation with this script, and it will be used
+as part of our tests.
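For intuition, the generator side of such a client might look roughly like the sketch below; the stub class, RPC name, and request field are hypothetical placeholders for whatever your `.proto` actually defines:

```python
# Illustrative only: build a big random CSV payload and upload it over gRPC.
# table_pb2/table_pb2_grpc, TableStub, Upload, and csv_data are hypothetical
# names; substitute the ones generated from your own .proto file.
import random

import grpc
# import table_pb2, table_pb2_grpc  # generated by grpc_tools.protoc

def make_csv(rows: int = 1_000_000) -> bytes:
    lines = ["x,y,z"]  # header row with three example columns
    for _ in range(rows):
        lines.append(",".join(str(random.randint(0, 100)) for _ in range(3)))
    return "\n".join(lines).encode("utf-8")

# channel = grpc.insecure_channel("localhost:5440")
# stub = table_pb2_grpc.TableStub(channel)
# stub.Upload(table_pb2.UploadReq(csv_data=make_csv()))
```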
## Part 4: Locking
@@ -249,6 +255,7 @@ docker run --name=yournetid -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/input
docker exec yournetid python3 upload.py /inputs/simple.csv
docker exec yournetid python3 csvsum.py x
docker exec yournetid python3 parquetsum.py x
+docker exec yournetid python3 bigdata.py
```
Please do include the files built from the .proto (your Dockerfile
...