The client program should then be able to communicate with the server program.
```
docker compose up -d
docker ps
# should see:
CONTAINER ID   IMAGE       COMMAND                  CREATED         ...
fa8de65e0e7c   p3-client   "python3 -u /client.…"   2 seconds ago   ...
4c899de6e43f   p3-server   "python3 -u /server.…"   2 seconds ago   ...
```
**HINT 1:** Consider writing a `.sh` script that automates your rebuild workflow. Every time you modify the source code (`client.py`, `server.py`, or `benchmark.py`), you will likely want to rebuild the images, bring down the previous Docker cluster, and instantiate a new one.
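One possible shape for such a script (a sketch only; it assumes your compose file is named `docker-compose.yml` and sits in the project root — adjust to your layout):

```shell
#!/bin/sh
# rebuild.sh -- hypothetical helper: rebuild images and restart the cluster.
set -e
# Refuse to run outside the project directory.
[ -f docker-compose.yml ] || { echo "no docker-compose.yml here"; exit 0; }
docker compose down      # tear down the previous cluster
docker compose build     # rebuild images after editing client.py/server.py/benchmark.py
docker compose up -d     # bring up a fresh cluster in the background
docker ps                # confirm both containers are running
```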
## Part 2: Server Implementation
You will need to implement three RPC calls on the server side: `Upload`, `ColSum`, and `Purge`.

### Upload

This method should:
1. Read the table from the bytes provided by the RPC request
2. Write the table to a CSV file, and write the same table to another file in Parquet format

**HINT 1:** You are free to decide the names and locations of the stored files. However, the server must keep these records to process future queries (for instance, you can add the paths to a data structure like a list or dictionary).

**HINT 2:** Both `pandas` and `pyarrow` provide interfaces to write a table to a file.
### ColSum
Whenever your server receives a column summation request, it should loop over all the data files that have been uploaded, compute a local sum for each file, and finally return the total sum for the whole table.
For example, assume `sample1.csv` and `sample2.csv` contain these records:
```
x,y,z
...
4,5,6
```
And:
```
x,y
...
0,20
```
You should be able to upload the files and do sums with the following `workload` description:
```
u sample1.csv
u sample2.csv
s p x # should print 10
s p z # should print 9
s c w # should print 0
```
You can assume that any column you sum over contains only integers. The table does not have a fixed schema (i.e., a given column is not guaranteed to appear in every uploaded file), so you should skip a file if it lacks the target column (e.g., `z` and `w` in the example above).

The server should sum over either the Parquet files or the CSV files according to the input `format` (not both). Querying the same column in both formats should produce the same output.
### Purge

This method facilitates testing and subsequent benchmarking. It should:
1. Remove all local files previously uploaded via `Upload()`
2. Reset all associated server state (e.g., counters, paths, etc.)
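Assuming the path-list bookkeeping sketched above, `Purge` reduces to deleting the files and clearing the state (a sketch; the list names are assumptions):

```python
import os

# Hypothetical bookkeeping state, populated by the Upload handler.
csv_paths = []
parquet_paths = []

def purge() -> None:
    """Delete every stored file and reset the server's bookkeeping state."""
    for path in csv_paths + parquet_paths:
        if os.path.exists(path):
            os.remove(path)
    csv_paths.clear()
    parquet_paths.clear()
```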
## Part 3: Multi-threading Client
Parquet is a column-oriented format, so all the data in a single column should be adjacent on disk. This means it should be possible to read a column of data without reading the whole file. See the `columns`