The client program should then be able to communicate with the server program:
```
docker compose up -d
docker ps
# should see:
CONTAINER ID   IMAGE       COMMAND                  CREATED ...
fa8de65e0e7c   p3-client   "python3 -u /client.…"   2 seconds ago ...
4c899de6e43f   p3-server   "python3 -u /server.…"   2 seconds ago ...
```
**HINT 1:** Consider writing a .sh script that helps you apply code changes. Every time you modify the source code (`client.py`, `server.py`, or `benchmark.py`), you will likely want to rebuild the images, bring down the previous Docker cluster, and instantiate a new one.
## Part 2: Server Implementation
You will need to implement three RPC calls on the server side:

### Upload

This method should:
1. Read the table from the bytes provided in the RPC request
2. Write the table to a CSV file and write the same table to another file in Parquet format
**HINT 1:** You are free to decide the names and locations of the stored files. However, the server must remember these paths in order to process future queries (for instance, you can keep them in a data structure like a list or dictionary).
**HINT 2:** Both `pandas` and `pyarrow` provide interfaces for writing a table to a file; see the sketch below.
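Here is a minimal sketch of the Upload logic, kept separate from the generated gRPC classes so it stays self-contained. The raw-bytes argument, the `/data` directory, and the file-naming scheme are illustrative assumptions, not part of the provided starter code:

```python
import io
import threading

import pandas as pd

class UploadState:
    """Hypothetical bookkeeping the server keeps between RPC calls."""
    def __init__(self, data_dir="/data"):
        self.data_dir = data_dir
        self.lock = threading.Lock()
        self.paths = []  # list of (csv_path, parquet_path) pairs to remember

    def upload(self, csv_bytes: bytes):
        df = pd.read_csv(io.BytesIO(csv_bytes))       # 1. read table from bytes
        with self.lock:                               # guard shared state
            file_id = len(self.paths)
            csv_path = f"{self.data_dir}/{file_id}.csv"
            parquet_path = f"{self.data_dir}/{file_id}.parquet"
            self.paths.append((csv_path, parquet_path))
        df.to_csv(csv_path, index=False)              # 2a. write the CSV copy
        df.to_parquet(parquet_path)                   # 2b. write the Parquet copy
                                                      #     (pandas delegates to pyarrow)
        return csv_path, parquet_path
```

Inside the real servicer, `upload()` would be called from the `Upload()` RPC handler with the request's bytes, and the path pairs recorded here are what the later queries iterate over.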
### ColSum
Whenever your server receives a column summation request, it should loop over all the data files that have been uploaded, compute a local sum for each such file, and finally return a total sum for the whole table.
For example, assume sample1.csv and sample2.csv contain these records:
```
x,y,z
1,2,3
4,5,6
```

And:

```
x,y
5,10
0,20
```
You should be able to upload the files and do sums with the following `workload` description:
```
u sample1.csv
u sample2.csv
s p x # should print 10
s p z # should print 9
s c w # should print 0
```
You can assume columns contain only integers. The table does not have a fixed schema (i.e., a given column is not guaranteed to appear in every uploaded file). You should skip a file if it lacks the target column (e.g., z and w in the above example).

The server should sum over either the Parquet files or the CSV files, according to the input `format` (not both). Querying the same column in either format should produce the same output.
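Below is a minimal sketch of the ColSum logic under the same assumptions as the Upload sketch above (a list of `(csv_path, parquet_path)` pairs); the `fmt` and `col` parameter names are illustrative. Because Parquet is column-oriented, the `columns` parameter of `pyarrow.parquet.read_table` lets the server read just the requested column rather than the whole file:

```python
import pandas as pd
import pyarrow.parquet as pq

def col_sum(paths, fmt, col):
    """Sum `col` across all uploaded files, reading format `fmt` ("csv" or "parquet")."""
    total = 0
    for csv_path, parquet_path in paths:
        if fmt == "parquet":
            if col not in pq.read_schema(parquet_path).names:
                continue  # skip files lacking the target column
            table = pq.read_table(parquet_path, columns=[col])  # read only this column
            total += table.column(col).to_pandas().sum()
        else:
            df = pd.read_csv(csv_path)
            if col not in df.columns:
                continue  # skip files lacking the target column
            total += df[col].sum()
    return int(total)
```

With the example files above, `col_sum(paths, "parquet", "x")` and `col_sum(paths, "csv", "x")` should both return 10.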
### Purge

This method facilitates testing and subsequent benchmarking (see the sketch below). The method should:
1. Remove all local files previously uploaded via `Upload()`
2. Reset all associated server state (e.g., counters, paths, etc.)
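A sketch of this logic, continuing the hypothetical `UploadState` bookkeeping from the sketches above:

```python
import os

def purge(state):
    """Delete every uploaded file and reset the server's bookkeeping."""
    with state.lock:
        for csv_path, parquet_path in state.paths:
            for path in (csv_path, parquet_path):
                if os.path.exists(path):
                    os.remove(path)   # remove files written by Upload
        state.paths.clear()           # reset associated state (paths, counters)
```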
## Part 3: Multi-threading Client
## Part 4: Benchmarking the System