The client program should then be able to communicate with the server program.
```
docker compose up -d
docker ps
# should see:
CONTAINER ID   IMAGE       COMMAND                  CREATED         ...
fa8de65e0e7c   p3-client   "python3 -u /client.…"   2 seconds ago   ...
4c899de6e43f   p3-server   "python3 -u /server.…"   2 seconds ago   ...
```
**HINT 1:** Consider writing a `.sh` script that automates your rebuild workflow. Every time you modify the source code (`client.py`, `server.py`, or `benchmark.py`), you will likely want to rebuild the images, bring down the previous Docker cluster, and instantiate a new one.
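One possible shape for such a script (a sketch only; it assumes your compose file is named `docker-compose.yml` and sits in the project root — adjust to your layout):

```shell
#!/bin/sh
# rebuild.sh -- hypothetical helper: rebuild images and restart the cluster.
set -e
# Refuse to run outside the project directory.
[ -f docker-compose.yml ] || { echo "no docker-compose.yml here"; exit 0; }
docker compose down      # tear down the previous cluster
docker compose build     # rebuild images after editing client.py/server.py/benchmark.py
docker compose up -d     # bring up a fresh cluster in the background
docker ps                # confirm both containers are running
```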
## Part 2: Server Implementation
You will need to implement three RPC calls on the server side: `Upload`, `ColSum`, and `Purge`.

### Upload

This method should:
1. Read the table from the bytes provided by the RPC request
2. Write the table to a CSV file, and write the same table to another file in Parquet format

**HINT 1:** You are free to decide the names and locations of the stored files. However, the server must keep these records to process future queries (for instance, you can add the paths to a data structure like a list or dictionary).

**HINT 2:** Both `pandas` and `pyarrow` provide interfaces to write a table to a file.
### ColSum
Whenever your server receives a column summation request, it should loop over all the data files that have been uploaded, compute a local sum for each file, and finally return the total sum for the whole table.
For example, assume `sample1.csv` and `sample2.csv` contain these records:
```
x,y,z
...
4,5,6
```
And:
```
x,y
...
0,20
```
You should be able to upload the files and do sums with the following `workload` description:
```
u sample1.csv
u sample2.csv
s p x # should print 10
s p z # should print 9
s c w # should print 0
```
You can assume that any column you sum over contains only integers. The table does not have a fixed schema (i.e., a given column is not guaranteed to appear in every uploaded file), so you should skip a file if it lacks the target column (e.g., `z` and `w` in the example above).

The server should sum over either the Parquet files or the CSV files according to the input `format` (not both). Querying the same column in both formats should produce the same output.
### Purge

This method facilitates testing and subsequent benchmarking. It should:
1. Remove all local files previously uploaded via `Upload()`
2. Reset all associated server state (e.g., counters, paths, etc.)
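Assuming the path-list bookkeeping sketched above, `Purge` reduces to deleting the files and clearing the state (a sketch; the list names are assumptions):

```python
import os

# Hypothetical bookkeeping state, populated by the Upload handler.
csv_paths = []
parquet_paths = []

def purge() -> None:
    """Delete every stored file and reset the server's bookkeeping state."""
    for path in csv_paths + parquet_paths:
        if os.path.exists(path):
            os.remove(path)
    csv_paths.clear()
    parquet_paths.clear()
```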
## Part 3: Multi-threading Client
Parquet is a column-oriented format, so all the data in a single column should be adjacent on disk. This means it should be possible to read a column of data without reading the whole file. See the `columns`