P3 (4% of grade): Large, Thread-Safe Tables
DRAFT: DO NOT START
Overview
In this project, you'll build a server that handles the uploading of CSV files, storing their contents, and performing query operations on the data. The server maintains only ONE logical table. You should think of each uploaded CSV as containing a portion of that larger table, which grows with each upload.
The server will write two files for each uploaded CSV file: one in CSV format and another in Parquet (i.e., they are two copies of the table in different formats). Clients that we provide will communicate with your server via RPC calls.
Workflow and Format Walkthrough
In P3, our client.py takes in a batch of operation commands stored in a file, workload.txt, and executes them line by line. There are two types of commands you can put into workload.txt to control the client's behavior. First, each upload command:
u file.csv
will instruct the client to read a CSV data file named file.csv as binary bytes and use the corresponding RPC call to upload it to the server. Second, you can use a subsequent sum command to sum one specified column of the table. For example:
s p x
asks the client to send an RPC request instructing the server to return the total sum of column x. As there are two copies of the same table, in CSV and Parquet formats, the p in the command asks the server to read column data only from the Parquet files (c reads from the CSV files instead). Below is a minimal example. Assume that the client has uploaded two files, file1.csv and file2.csv, which contain these records respectively:
x,y,z
1,2,3
4,5,6
And:
x,y
5,10
0,20
10,15
You can assume columns contain only integers. You should be able to upload the files and compute sums with the following workload.txt:
u file1.csv
u file2.csv
s p x
s p z
s c w
The expected output would be:
20
9
0
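To see why these are the expected values: column x appears in both files (1+4+5+0+10 = 20), column z appears only in file1.csv (3+6 = 9), and summing a column that exists in neither file, like w, yields 0. These semantics can be sketched with a hypothetical helper (not the required server interface):

```python
import csv
import io

# Hypothetical helper mirroring the sum semantics: every upload appends
# rows to one logical table, and summing a missing column contributes 0.
def sum_column(uploads, col):
    total = 0
    for data in uploads:                      # each upload is raw CSV bytes
        reader = csv.DictReader(io.StringIO(data.decode()))
        for row in reader:
            val = row.get(col)                # None if this upload lacks col
            if val is not None:
                total += int(val)             # columns contain only integers
    return total

file1 = b"x,y,z\n1,2,3\n4,5,6\n"
file2 = b"x,y\n5,10\n0,20\n10,15\n"
print(sum_column([file1, file2], "x"))  # 20
print(sum_column([file1, file2], "z"))  # 9
print(sum_column([file1, file2], "w"))  # 0
```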
Inspect both the workload.txt file content and the client code (i.e., read_workload_file()) to understand how each text command leads to one gRPC call. A separate purge.txt workload file is provided and should not be modified. The client can use an RPC call, Purge(), to reset the server and remove all files stored by the remote peer.
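As a rough sketch of the kind of parsing read_workload_file() performs (the tuple shapes here are illustrative; the provided client may represent operations differently), a workload file can be turned into a list of operations like this:

```python
# Hypothetical workload parser: "u FILE" becomes an upload op,
# "s FMT COL" becomes a sum op over one column in one format.
def parse_workload(text):
    ops = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue                           # skip blank lines
        if parts[0] == "u" and len(parts) == 2:
            ops.append(("upload", parts[1]))
        elif parts[0] == "s" and len(parts) == 3:
            ops.append(("sum", parts[1], parts[2]))  # format ('p'/'c'), column
        else:
            raise ValueError(f"bad workload line: {line!r}")
    return ops

workload = "u file1.csv\nu file2.csv\ns p x\n"
print(parse_workload(workload))
# [('upload', 'file1.csv'), ('upload', 'file2.csv'), ('sum', 'p', 'x')]
```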
Learning objectives:
- Implement logic for uploading and processing CSV and Parquet files.
- Perform computations like summing values from specific columns.
- Manage concurrency with locking in a multi-threaded server/client.
- Benchmark a server/client system and visualize the results.
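On the concurrency objective: a gRPC server dispatches each request on a thread-pool worker, so upload and sum handlers can run at the same time and shared state needs a lock. A minimal stdlib sketch of the pattern (the real server guards its files and table metadata rather than an in-memory row list):

```python
import threading

# Sketch of guarding shared server state with a single lock.
class SharedTable:
    def __init__(self):
        self.lock = threading.Lock()
        self.rows = []

    def append_rows(self, rows):
        with self.lock:                 # writers hold the lock
            self.rows.extend(rows)

    def column_sum(self, col):
        with self.lock:                 # readers too, for a consistent view
            return sum(r.get(col, 0) for r in self.rows)

table = SharedTable()
threads = [threading.Thread(target=table.append_rows,
                            args=([{"x": i}],)) for i in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(table.column_sum("x"))  # 4950, no updates lost
```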
Before starting, please review the general project directions.
Clarifications/Corrections
- None yet
Part 1: Communication (gRPC)
In this project, the client program client.py will communicate with a server, server.py, via gRPC. We provide starter code for the client program. Your job is to write a .proto file to generate a gRPC stub (used by our client) and a servicer class that you will inherit from in server.py.
Take a moment to look at the client code and answer the following questions:
- What are the names of the imported gRPC modules? This will determine what you name your .proto file.
- What methods are called on the stubs? This will determine the RPC definitions in your .proto file.
- What arguments are passed to the methods, and what values are extracted from the return values? This will determine the fields in the messages in your .proto file.
- What port number does the client use? This will determine the port that the gRPC server should expose.
Write a .proto file based on your observations above and run the grpc_tools.protoc compiler to generate stub code for our client and servicer code for your server. All field types will be strings, except total and csv_data, which should be int64 and bytes, respectively.
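For illustration only, a .proto file matching this description might look like the sketch below. Every service, RPC, message, and field name here is an assumption; the real names must match what the provided client.py imports and calls:

```proto
syntax = "proto3";

// Illustrative names only -- derive the real ones from client.py.
service Table {
  rpc Upload (UploadReq) returns (UploadResp);
  rpc ColSum (ColSumReq) returns (ColSumResp);
  rpc Purge (PurgeReq) returns (PurgeResp);
}

message UploadReq  { bytes csv_data = 1; }
message UploadResp { string error = 1; }
message ColSumReq  { string column = 1; string format = 2; }  // "p" or "c"
message ColSumResp { int64 total = 1; string error = 2; }
message PurgeReq   { }
message PurgeResp  { string error = 1; }
```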
Now build the .proto on your VM. Install the tools like this:
python3 -m venv venv
source venv/bin/activate
pip3 install grpcio==1.70.0 grpcio-tools==1.70.0 protobuf==5.29.3
Then use the grpc_tools.protoc compiler to build your .proto file.
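Assuming the file is named table.proto (use whatever name the client's imports require), the invocation might look like:

```shell
# Writes table_pb2.py (messages) and table_pb2_grpc.py (stub + servicer)
# to the current directory.
python3 -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. table.proto
```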
In your server, override the three RPC methods of the generated servicer. For now, the methods should do nothing but return messages with the error field set to "TODO", leaving all other fields unspecified.
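A sketch of what this stage might look like, assuming generated modules named table_pb2/table_pb2_grpc and RPCs named Upload/ColSum/Purge (all of these names must instead match your .proto and the client):

```python
import grpc
from concurrent import futures

# Assumed module names -- they follow from your .proto file name.
import table_pb2
import table_pb2_grpc

class TableServicer(table_pb2_grpc.TableServicer):
    # Each method just reports "TODO" for now.
    def Upload(self, request, context):
        return table_pb2.UploadResp(error="TODO")

    def ColSum(self, request, context):
        return table_pb2.ColSumResp(error="TODO")

    def Purge(self, request, context):
        return table_pb2.PurgeResp(error="TODO")

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    table_pb2_grpc.add_TableServicer_to_server(TableServicer(), server)
    server.add_insecure_port("[::]:5440")  # port taken from the client code
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
```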
If communication is working correctly so far, you should be able to start a server and use a client to get back "TODO" error messages via gRPC:
python3 -u server.py &> log.txt &
python3 client.py workload.txt
# should see multiple "TODO"s
Create a Dockerfile.server
to build an image that will also let you run your server in a container. It should be possible to build and run your server like this:
docker build . -f Dockerfile.server -t ${PROJECT}-server
docker run -d -m 512m -p 127.0.0.1:5440:5440 ${PROJECT}-server
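A Dockerfile.server sketch under these assumptions (the base image, and pyarrow for the Parquet copies, are choices you may need to adjust for your setup):

```dockerfile
# Sketch only -- pin versions to match your VM environment.
FROM python:3.11-slim
WORKDIR /app
RUN pip3 install grpcio==1.70.0 grpcio-tools==1.70.0 protobuf==5.29.3 pyarrow
COPY . .
EXPOSE 5440
CMD ["python3", "-u", "server.py"]
```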
Like P2, the compose file assumes a "PROJECT" environment variable. You can set it to p3 in your environment with the following (the autograder may use another prefix for testing):
export PROJECT=p3
The client program should then be able to communicate with the server program the same way it did when the server ran outside a container. Once your client program successfully interacts with the dockerized server, you should similarly draft a Dockerfile.client to build a container for client.py. Finally, test your setup with docker compose: