The client program should then be able to communicate with the server program.

```
docker compose up -d
docker ps
# should see:
CONTAINER ID IMAGE COMMAND CREATED ...
fa8de65e0e7c p3-client "python3 -u /client.…" 2 seconds ago ...
4c899de6e43f p3-server "python3 -u /server.…" 2 seconds ago ...
```
**HINT 1:** Consider writing a .sh script that helps you apply code changes. Every time you modify the source code (`client.py`, `server.py`, or `benchmark.py`), you may want to rebuild the images, bring down the previous Docker cluster, and instantiate a new one, as in the sketch below.
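A minimal `rebuild.sh` sketch (assuming your compose file defines how both images are built; adjust to your own setup):

```
#!/bin/bash
# Tear down the old cluster, rebuild the images, start a fresh cluster
docker compose down
docker compose build
docker compose up -d
```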
## Part 2: Server Implementation
You will need to implement three RPC calls on the server side:
### Upload
This method should:
1. Read the table from the bytes provided by the RPC request
2. Write the table to a CSV file, and write the same table to another file in Parquet format
**HINT 1:** You are free to decide the names and locations of the stored files. However, the server must keep track of them to process future queries (for instance, you can add the paths to a data structure like a list or dictionary).
**HINT 2:** Both `pandas` and `pyarrow` provide interfaces to write a table to a file.
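For illustration, here is a minimal sketch of this flow, assuming the request exposes the raw CSV bytes; the function name, file naming scheme, and the `uploaded` list are placeholders, not a required design:

```
import io
import uuid

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

uploaded = []  # (csv_path, parquet_path) pairs the server remembers

def handle_upload(csv_bytes):
    # Step 1: read the table from the bytes in the RPC request
    df = pd.read_csv(io.BytesIO(csv_bytes))
    # Step 2: write the same table once as CSV and once as Parquet
    name = uuid.uuid4().hex
    csv_path, parquet_path = f"{name}.csv", f"{name}.parquet"
    df.to_csv(csv_path, index=False)
    pq.write_table(pa.Table.from_pandas(df), parquet_path)
    uploaded.append((csv_path, parquet_path))  # remember for later queries
```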
### ColSum
Whenever your server receives a column summation request, it should loop over all the data files that have been uploaded, compute a local sum for each such file, and finally return a total sum for the whole table.
For example, assume sample1.csv and sample2.csv contain these records:
```
x,y,z
1,2,3
4,5,6
```
And:
```
x,y
5,10
0,20
```
You should be able to upload the files and do the sums with the following `workload` description:
```
u sample1.csv
u sample2.csv
s p x # should print 10
s p z # should print 9
s c w # should print 0
```
Parquet is a column-oriented format, so all the data in a single column should be adjacent on disk. This means it should be possible to read a column of data without reading the whole file. See the `columns` parameter here:
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
**Requirement:** when the server is asked to sum over the column of a
Parquet file, it should only read the data from that column, not other
columns.
**Note:** we will run your server with a 512-MB limit on RAM. Any
individual files we upload will fit within that limit, but the total
size of the files uploaded will exceed that limit. That's why your
server will have to do sums by reading the files (instead of just
keeping all table data in memory).
You can assume columns contain only integers. The table does not have a fixed schema (i.e., a given column is not guaranteed to appear in every uploaded file). You should skip a file if it lacks the target column (e.g., z and w in the example above).
The server should sum over either the Parquet or the CSV files according to the input `format` (not both). Querying the same column in either format should produce the same output.
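Putting these requirements together, a ColSum handler might look like the following sketch (it reuses the hypothetical `uploaded` list from the Upload sketch above; names and structure are illustrative):

```
import pandas as pd
import pyarrow.compute as pc
import pyarrow.parquet as pq

def handle_colsum(fmt, column):
    total = 0
    for csv_path, parquet_path in uploaded:
        if fmt == "csv":
            df = pd.read_csv(csv_path)
            if column in df.columns:  # skip files lacking the column
                total += int(df[column].sum())
        else:  # "parquet": read only the requested column from disk
            if column in pq.read_schema(parquet_path).names:
                t = pq.read_table(parquet_path, columns=[column])
                total += pc.sum(t.column(column)).as_py() or 0
    return total
```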
### Purge
This method facilitates testing and subsequent benchmarking. The method should:
1. Remove all local files previously uploaded via `Upload()`
2. Reset all associated server state (e.g., counters, paths)
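A minimal sketch, again assuming the hypothetical `uploaded` list from the Upload sketch:

```
import os

def handle_purge():
    for csv_path, parquet_path in uploaded:
        for path in (csv_path, parquet_path):
            if os.path.exists(path):
                os.remove(path)  # delete the stored file
    uploaded.clear()  # reset server state
```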
## Part 3: Multi-threading Client
## Part 4: Benchmarking the System