# P3 (5% of grade): Large, Thread-Safe Tables

# DRAFT: DO NOT START

## Overview

In this project, you'll build a server that handles uploading CSV files, storing their contents, and answering queries over the data. The server maintains **only ONE** logical table. You should think of each uploaded CSV as containing a portion of that larger table, which grows with each upload.

The server will write two files for each uploaded CSV file: one in CSV format and another in Parquet (i.e., they are two copies of the table in different formats). Clients that we provide will communicate with your server via RPC calls.

Learning objectives:
* Implement logic for uploading and processing CSV and Parquet files.
* Perform computations like summing values from specific columns.
* Manage concurrency with locking in a multi-threaded server/client.
* Benchmark a server/client system and visualize the results.

Before starting, please review the [general project directions](../projects.md).

## Clarifications/Corrections



## Part 1: Communication (gRPC)

In this project, the client program `client.py` will communicate with a server, `server.py`, via gRPC. We provide starter code for the client program. Your job is to write a `.proto` file to generate a gRPC stub (used by our client) and a servicer class that you will inherit from in `server.py`.

Take a moment to look at the client code and answer the following questions:

* What are the names of the imported gRPC modules? This will determine what you name your `.proto` file.
* What methods are called on the stubs? This will determine the RPC definitions in your `.proto` file.
* What arguments are passed to the methods, and what values are extracted from the return values? This will determine the fields of the messages in your `.proto` file.
* What port number does the client use? This will determine the port that the gRPC server should expose.

Write a `.proto` file based on your above observations and run the `grpc_tools.protoc` compiler to generate stub code for our client and servicer code for your server. All field types will be strings, except `total` and `csv_data`, which should be `int64` and `bytes`, respectively.
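A hedged sketch of what such a `.proto` file might look like is shown below. The RPC names (`Upload`, `ColSum`, `Purge`) and the field names `csv_data`, `total`, `format`, and `error` come from this document; the service name, message names, and the `column` field are placeholders that you must check against what `client.py` actually imports and calls:

```protobuf
// Sketch only: service, message, and unconfirmed field names are
// placeholders -- derive the real ones from client.py.
syntax = "proto3";

service Table {
    rpc Upload (UploadReq) returns (UploadResp);
    rpc ColSum (ColSumReq) returns (ColSumResp);
    rpc Purge (PurgeReq) returns (PurgeResp);
}

message UploadReq  { bytes csv_data = 1; }
message UploadResp { string error = 1; }
message ColSumReq  { string column = 1; string format = 2; }
message ColSumResp { int64 total = 1; string error = 2; }
message PurgeReq   { }
message PurgeResp  { string error = 1; }
```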

Now build the .proto on your VM. Install the tools like this:

```bash
python3 -m venv venv
source venv/bin/activate
pip3 install grpcio==1.70.0 grpcio-tools==1.60.0 protobuf==5.29.3
```

Then use `grpc_tools.protoc` to compile your `.proto` file, for example: `python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. your_file.proto` (substituting your actual file name).

In your server, override the *three* RPC methods of the generated servicer. For now, each method should do nothing but return a message with the error field set to "TODO", leaving all other fields unspecified.

If communication is working correctly so far, you should be able to start a server and use a client to get back a "TODO" error message via gRPC:

```bash
python3 -u server.py &> log.txt &
python3 client.py workload
# should see multiple "TODO"s
```

In P3, `client.py` takes in a batch of operation commands stored in the file `workload` and executes them line by line. Inspect both the `workload` file content and the client code (i.e., `read_workload_file()`) to understand how each text command leads to one gRPC call. A separate `purge` workload file is provided and *should not be modified*. The client can use an RPC call, `Purge()`, to reset the server and remove all files stored by the remote peer.

Create a `Dockerfile.server` to build an image that lets you run your server in a container. It should be possible to build and run your server like this:

```bash
docker build . -f Dockerfile.server -t ${PROJECT}-server
docker run -d -m 512m -p 127.0.0.1:5440:5440 ${PROJECT}-server
```
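A minimal `Dockerfile.server` might look like the following sketch. The generated stub filenames (`table_pb2.py`, `table_pb2_grpc.py`) are assumptions that depend on your `.proto` name, and the extra packages (`pandas`, `pyarrow`) anticipate the file handling in Part 2:

```dockerfile
# Sketch only: the stub filenames below are assumptions.
FROM python:3.10
RUN pip3 install grpcio==1.70.0 grpcio-tools==1.60.0 protobuf==5.29.3 pandas pyarrow
COPY server.py table_pb2.py table_pb2_grpc.py /
CMD ["python3", "-u", "/server.py"]
```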

Like P2, the compose file assumes a `PROJECT` environment variable. You can set it to `p3` in your environment as follows (the autograder may use a different prefix for testing):

```bash
export PROJECT=p3
```

The client program should then be able to communicate with the server program the same way it did when the server ran outside a container. Once your client program successfully interacts with the dockerized server, you should similarly write a `Dockerfile.client` to build a container for `client.py`. Finally, test your setup with `docker compose`:

```bash
docker compose up -d
docker ps
# should see:
CONTAINER ID   IMAGE       COMMAND                  CREATED         ...
fa8de65e0e7c   p3-client   "python3 -u /client.…"   2 seconds ago   ...
4c899de6e43f   p3-server   "python3 -u /server.…"   2 seconds ago   ...
```

**HINT:** consider writing a `.sh` script that helps you redeploy code changes. Every time you modify `client.py`, `server.py`, or `benchmark.py`, you may want to rebuild the images, bring down the previous compose cluster, and instantiate a new one.

## Part 2: Server Implementation

You will need to implement three RPC calls on the server side:

### Upload

This method should:
1. Recover the uploaded CSV table from *binary* bytes carried by the RPC request message.
2. Write the table to a CSV file and write the same table to another file in Parquet format.

**Requirement:** Write two files to disk per upload. We will test your server with a 512MB memory limit. Do *NOT* keep the table data in memory.

**HINT 1:** You are free to decide the names and locations of the stored files. However, the server must keep track of these files to process future queries (for instance, you can add their paths to a data structure like a list or dictionary).

**HINT 2:** Both `pandas` and `pyarrow` provide interfaces to write a table to file.

### ColSum

Whenever your server receives a column-summation request, it should loop over all the data files that have been uploaded, compute a local sum for each file, and finally return the total sum for the whole table.

For example, assume that the server has received two uploaded files, sample1.csv and sample2.csv, which contain these records respectively:

```
x,y,z
1,2,3
4,5,6
```

And:

```
x,y
5,10
0,20
```

You should be able to upload the files and compute the sums with the following `workload` description:

```
u sample1.csv
u sample2.csv
s p x # should print 10
s p z # should print 9
s c w # should print 0
```

You can assume columns contain only integers. The table does not have a fixed schema (i.e., it is not guaranteed that a column appears in any uploaded file). You should skip a file if it lacks the target column (e.g., z and w in the above example).

The server should sum over either Parquet or CSV files according to the input `format` (not both). For a given column, the query results for format="parquet" should be the same as for format="csv", while performance may differ.

### Purge

This method facilitates testing and the subsequent benchmarking. The method should:
1. Remove all local files previously written by `Upload()`
2. Reset all associated server state (e.g., names, paths, etc.)

## Part 3: Multi-threading Server/Client

With the Global Interpreter Lock (GIL), the standard CPython interpreter does not support parallel multi-threaded execution. However, multi-threading can still boost the performance of our small system (why?). In Part 3, you are required to add threading support to `client.py`, then `server.py`.

### Client

More specifically, you will need to manually create *N* threads in `client.py` (with the thread-management primitives that come with the `threading` module) to concurrently process the provided `workload`. For example, each worker thread may repeatedly fetch one command line from the `workload` and process it. You can load all the command strings into a list, then provide thread-safe access to it from all launched threads (how?).
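A minimal sketch of this worker pattern, where names are illustrative and `execute` stands for whatever code issues one gRPC call per command line:

```python
import threading


def run_workload(commands, num_threads, execute):
    # Each worker pops the next command under a lock, then runs it unlocked.
    lock = threading.Lock()
    next_idx = [0]

    def worker():
        while True:
            with lock:  # thread-safe access to the shared command list
                if next_idx[0] >= len(commands):
                    return
                cmd = commands[next_idx[0]]
                next_idx[0] += 1
            execute(cmd)  # the RPC call happens outside the lock

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A `queue.Queue` of commands would work just as well as the index-plus-lock shown here.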

**HINT:** Before moving on to the `server`, test your multi-threaded client by running it with a single thread:

```bash
python3 client.py workload 1 # set to use only 1 thread
```

### Server

Now, with concurrent requests sent from `client.py`, you must protect your server from data races with `threading.Lock()`. Make sure only one thread at a time can modify the server state (e.g., lists of names or paths). Note that you don't need to explicitly create threads in `server.py`, as gRPC can do that for you. The following example code creates a thread pool with 8 threads:

```python
import grpc
from concurrent import futures

grpc.server(
    futures.ThreadPoolExecutor(max_workers=8),
    options=[("grpc.so_reuseport", 0)]
)
```

**Requirement 1:** The server should properly acquire and then release the lock. A single global lock is sufficient. The lock must be released even if an exception occurs (e.g., use a `with` statement or `try`/`finally`).

**Requirement 2:** The server *MUST NOT* hold the lock while reading or writing files. A thread should release the lock as soon as it is done accessing the shared data structure. How could this behavior affect performance?
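One sketch of this discipline, using `with` blocks so the lock is released even if an exception is raised (all names are illustrative):

```python
import threading

lock = threading.Lock()  # single global lock guarding shared state
uploads = []             # shared list of (csv_path, parquet_path) pairs


def register_upload(csv_path, parquet_path):
    # File writes happen before calling this, with no lock held.
    with lock:  # released automatically, even on exceptions
        uploads.append((csv_path, parquet_path))


def snapshot_uploads():
    # Copy the shared list under the lock; do any file reads after release.
    with lock:
        return list(uploads)
```

Because `snapshot_uploads()` returns a copy, a `ColSum` handler can iterate over it and read files without blocking concurrent uploads.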

## Part 4: Benchmarking the System

Congratulations, you have implemented a minimal multi-threaded data system! Let's write a small script to benchmark it at different scales (i.e., numbers of worker threads). Overall, the script is expected to perform the following tasks:

1. Run `client.py` multiple times with different threading parameters and record each execution time.
2. Plot the data to visualize the performance trend.

### Driving the Client

Each time it runs, `benchmark.py` should collect 4 (thread count, execution time) pairs by running `client.py` with 1, 2, 4, and 8 thread(s). Wrap each `client.py` execution with a pair of timestamps, then calculate the execution time. Make sure you always reset the server before sending the `workload` by issuing a `Purge()` command through `client.py`:

```bash
python3 client.py purge
# allow a few seconds (e.g., time.sleep(3) in benchmark.py) for the reset
# to complete before starting the timed run
```
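The timing wrapper in `benchmark.py` might be sketched as follows; the command paths and argument order in the commented driver loop are assumptions about your own scripts:

```python
import subprocess
import time


def timed_run(cmd):
    # Run a command to completion and return its wall-clock duration in seconds.
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start


# Hypothetical driver loop:
# results = {}
# for n in [1, 2, 4, 8]:
#     subprocess.run(["python3", "client.py", "purge"], check=True)
#     time.sleep(3)  # let the reset complete
#     results[n] = timed_run(["python3", "client.py", "workload", str(n)])
```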

You may also want `benchmark.py` to wait a few seconds at startup so the `server` is ready to accept client RPC requests.

**HINT 1:** You can get a timestamp with `time.time()`.

**HINT 2:** There are multiple ways to launch one Python program from within another, e.g., `os.system()` and `subprocess.run()`.

### Visualizing the Results

Plot a simple line graph of the execution times acquired in the previous step. Save the figure to a file called `plot.png`. Your figure must include at least the 4 data points mentioned above.
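A sketch of the plotting step with `matplotlib` (the function name and axis labels are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g., inside a container
import matplotlib.pyplot as plt


def plot_results(threads, seconds, out_path="plot.png"):
    # One line with a marker per (thread count, runtime) pair.
    plt.figure()
    plt.plot(threads, seconds, marker="o")
    plt.xlabel("client threads")
    plt.ylabel("workload execution time (s)")
    plt.savefig(out_path)
```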

**HINT 1:** `matplotlib` is a standard toolkit for visualizing your data.

**HINT 2:** `benchmark.py` should need no more than 50 lines of code. Don't overcomplicate your solution.

## Submission

Your deliverable should work with the `docker-compose.yaml` we provide:

1. `Dockerfile.client` must launch `benchmark.py` **(NOT `client.py`)**. To achieve this, you need to copy both `client.py` and the driver `benchmark.py` into the image, as well as `workload`, `purge`, and the input CSV files. Submitting a minimal working set is sufficient, as we may test your code with different datasets and workloads.
2. `Dockerfile.server` must launch `server.py`.

**Requirement:** Do **NOT** submit the `venv` directory (e.g., use `.gitignore`).

## Grading

TBD