Commit ef2f0b3b authored by Jing Lan's avatar Jing Lan

p3 draft update

parent 1b732592
# P3 (4% of grade): Large, Thread-Safe Tables
# DRAFT: DO NOT START
In this project, you'll build a server that handles the uploading of CSV files.
The server will write two files for each uploaded CSV file: one in CSV format and another in Parquet (i.e., they are two copies of the table in different formats). Clients that we provide will communicate with your server via RPC calls.
### Workflow and Format Walkthrough
In P3, our `client.py` takes in a batch of operation commands stored in file `workload.txt` and executes them line by line. There are two types of commands you can put into `workload.txt` to control the client behavior. First, each *upload* command:
```
u file.csv
```
will instruct the client to read a CSV data file named `file.csv` as binary bytes and use the corresponding RPC call to upload it to the server. Next, you can use a subsequent *sum* command to compute the sum of one specified column of the table. For example:
```
s p x
```
asks the client to send an RPC request instructing the server to return the total sum of column `x`. Since there are two copies of the same table, in `CSV` and `Parquet` format, the `p` in the command asks the server to read column data only from `Parquet` files. Below is a minimal example. Assume that the server has received two uploaded files, `file1.csv` and `file2.csv`, which contain these records respectively:
```
x,y,z
1,2,3
4,5,6
```
And:
```
x,y
5,10
0,20
10,15
```
You can assume columns contain only integers. You should be able to upload the files and compute sums with the following `workload.txt`:
```
u file1.csv
u file2.csv
s p x
s p z
s c w
```
The expected output would be:
```
20
9
0
```
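For intuition, the expected output above can be reproduced with plain Python over in-memory stand-ins for the two uploaded files (the real server must instead read from its stored CSV or Parquet copies):

```python
import csv
import io

# In-memory stand-ins for the two uploaded files from the example.
FILES = {
    "file1.csv": "x,y,z\n1,2,3\n4,5,6\n",
    "file2.csv": "x,y\n5,10\n0,20\n10,15\n",
}

def col_sum(column):
    # Sum `column` across all files; a file lacking the column
    # is skipped, so it contributes nothing to the total.
    total = 0
    for text in FILES.values():
        for row in csv.DictReader(io.StringIO(text)):
            if column in row:
                total += int(row[column])
    return total

print(col_sum("x"), col_sum("z"), col_sum("w"))  # 20 9 0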
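For intuition, the expected output above can be reproduced with plain Python over in-memory stand-ins for the two uploaded files (the real server must instead read from its stored CSV or Parquet copies):

```python
import csv
import io

# In-memory stand-ins for the two uploaded files from the example.
FILES = {
    "file1.csv": "x,y,z\n1,2,3\n4,5,6\n",
    "file2.csv": "x,y\n5,10\n0,20\n10,15\n",
}

def col_sum(column):
    # Sum `column` across all files; a file lacking the column
    # is skipped, so it contributes nothing to the total.
    total = 0
    for text in FILES.values():
        for row in csv.DictReader(io.StringIO(text)):
            if column in row:
                total += int(row[column])
    return total

print(col_sum("x"), col_sum("z"), col_sum("w"))  # 20 9 0
```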
Inspect both the `workload.txt` file content and the client code (i.e., `read_workload_file()`) to understand how each text command leads to one `gRPC` call. A separate `purge.txt` workload file is provided and *should not be modified*. The client can use an RPC call, `Purge()`, to reset the server and remove all files stored by the remote peer.
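The command-to-RPC mapping can be sketched as follows (the tuple return values are purely illustrative; the actual client issues `Upload`, `ColSum`, and `Purge` gRPC calls directly):

```python
def parse_command(line):
    # One workload line -> one RPC: "p" purges, "u FILE" uploads,
    # "s p/c COL" sums COL over Parquet ("p") or CSV ("c") files.
    parts = line.strip().split()
    if parts == ["p"]:
        return ("Purge",)
    if len(parts) == 2 and parts[0] == "u":
        return ("Upload", parts[1])
    if len(parts) == 3 and parts[0] == "s":
        fmt = "parquet" if parts[1] == "p" else "csv"
        return ("ColSum", fmt, parts[2])
    raise ValueError(f"bad command: {line!r}")

print(parse_command("u file1.csv"))  # ('Upload', 'file1.csv')
print(parse_command("s p x"))        # ('ColSum', 'parquet', 'x')
```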
Learning objectives:
* Implement logic for uploading and processing CSV and Parquet files.
* Perform computations like summing values from specific columns.
Before starting, please review the [general project directions](../projects.md).
## Clarifications/Corrections
* None yet
## Part 1: Communication (gRPC)
Now build the .proto on your VM. Install the tools like this:
```bash
python3 -m venv venv
source venv/bin/activate
pip3 install grpcio==1.70.0 grpcio-tools==1.70.0 protobuf==5.29.3
```
Then use `grpc_tools.protoc` to build your `.proto` file.
If communication is working correctly so far, you should be able to start a server and run the client like this:
```bash
python3 -u server.py &> log.txt &
python3 client.py workload.txt
# should see multiple "TODO"s
```
Create a `Dockerfile.server` to build an image that will also let you run your server in a container. It should be possible to build and run your server like this:
```bash
Whenever your server receives a column summation request, it should loop over all the data files that have been uploaded, compute a local sum for each such file, and finally return a total sum for the whole table.
The table does not have a fixed schema (i.e., it is not guaranteed that a column appears in any uploaded file). You should skip a file if it lacks the target column (e.g., z and w in the above example). The server should sum over either Parquet or CSV files according to the input `format` (not both). For a given column, the query results for format="parquet" should be the same as for format="csv", while performance may differ.
### Purge
With the Global Interpreter Lock (GIL), commonly-used CPython does not support parallel execution of Python threads.
### Client
More specifically, you will need to manually create *N* threads for `client.py` (with the thread management primitives that come with the `threading` module) to concurrently process the provided `workload.txt`. For example, each worker thread may repeatedly fetch one command line from `workload.txt` and process it. You can load all command strings into a list, then provide thread-safe access to it for all launched threads (how?).
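One way to provide that thread-safe access is a shared index guarded by a lock. A minimal sketch, where the command list and `exec_command` are placeholders rather than the provided client code:

```python
import threading

# Stand-ins for the lines of workload.txt and the per-command RPC work.
commands = [f"cmd-{i}" for i in range(100)]
processed = []
lock = threading.Lock()
next_i = 0

def exec_command(cmd):
    processed.append(cmd)  # placeholder for one real gRPC call

def worker():
    global next_i
    while True:
        with lock:                  # one thread claims a command at a time
            if next_i >= len(commands):
                return
            cmd = commands[next_i]
            next_i += 1
        exec_command(cmd)           # do the slow work outside the lock

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(processed))  # 100
```

Holding the lock only while claiming the next command keeps the threads from serializing on the actual RPC work.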
**HINT:** Before moving on to the `server`, test your multi-threaded client by running it with a single thread:
```bash
python3 client.py workload.txt 1 # set to use only 1 thread
```
### Server
Congratulations, you have implemented a minimal multi-threaded data system! Let's now measure how it performs.
### Driving the Client
`benchmark.py` should collect 4 pairs of data by running `client.py` with 1, 2, 4, and 8 thread(s). Wrap each `client.py` execution with a pair of timestamp collections, then calculate the execution time. Make sure you always reset the server before sending the workload, by issuing a `Purge()` command through `client.py`:
```bash
python3 client.py purge.txt
```
You may also want `benchmark.py` to wait a few seconds for the `server` to get ready.
**HINT 1:** You can get a timestamp with `time.time()`.
**HINT 2:** There are multiple tools to launch a Python program from within another. Examples are [`os.system()`](https://docs.python.org/3/library/os.html#os.system) and [`subprocess.run`](https://docs.python.org/3/library/subprocess.html#subprocess.run).
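Putting the two hints together, a minimal timing helper might look like this (the `client.py`/`workload.txt` invocations are only sketched in comments, since they depend on your own files):

```python
import subprocess
import sys
import time

def timed_run(cmd):
    # Wrap one program execution with a pair of timestamps (HINT 1),
    # launching it via subprocess.run (HINT 2).
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

# The benchmark loop would then look roughly like:
#   for n in (1, 2, 4, 8):
#       timed_run([sys.executable, "client.py", "purge.txt"])
#       results[n] = timed_run([sys.executable, "client.py", "workload.txt", str(n)])

elapsed = timed_run([sys.executable, "-c", "pass"])
```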
### Visualizing the Results
Plot a simple line graph with the execution time acquired by the previous step. Save the figure to a file called `plot.png`. Your figure must include at least 4 data points as mentioned above.
**HINT 1:** `benchmark.py` needs no more than 50 lines of code. Don't complicate your solution.
**HINT 2:** Feel free to use any tools to plot the figure (`matplotlib` and `pandas` are standard choices). Below is a minimal example that plots a dictionary with 2 data points:
```python
import pandas as pd
data = {1: 100, 2: 200}
series = pd.Series(data)
ax = series.plot.line()
ax.get_figure().savefig("plot.png")
```
## Submission
Deliverables should work with the `docker-compose.yaml` we provide:
1. `Dockerfile.client` must launch `benchmark.py` **(NOT `client.py`)**. To achieve this, you need to copy both `client.py` and the driver `benchmark.py` to the image, as well as `workload.txt`, `purge.txt`, and the input CSV files. It is sufficient to submit a minimal working set as we may test your code with different datasets and workloads.
2. `Dockerfile.server` must launch `server.py`.
**Requirement:** Do **NOT** submit the `venv` directory (e.g., use `.gitignore`).