
P3 (4% of grade): Large, Thread-Safe Tables

DRAFT: DO NOT START

Overview

In this project, you'll build a server that handles uploading CSV files, storing their contents, and answering queries over the data. The server maintains only ONE logical table. You should think of each uploaded CSV as containing a portion of that larger table, which grows with each upload.

The server will write two files for each uploaded CSV file: one in CSV format and another in Parquet (i.e., they are two copies of the table in different formats). Clients that we provide will communicate with your server via RPC calls.

Workflow and Format Walkthrough

In P3, our client.py takes in a batch of operation commands stored in file workload.txt and executes them line by line. There are two types of commands you can put into workload.txt to control the client behavior. First, each upload command:

u file.csv

will instruct the client to read a CSV data file named file.csv as binary bytes and upload it to the server with the corresponding RPC call. Next, you can use a subsequent sum command to ask the server to sum one specified column of the table. For example:

s p x

asks the client to send an RPC request instructing the server to return the total sum of column x. Because the server keeps two copies of the table (CSV and Parquet), the p in the command tells the server to read column data only from the Parquet files (c selects the CSV files instead). Below is a minimal example. Assume that a client has uploaded two files, file1.csv and file2.csv, which contain these records respectively:

x,y,z
1,2,3
4,5,6

And:

x,y
5,10
0,20
10,15

You can assume columns contain only integers. You should then be able to upload the files and compute sums with the following workload.txt:

u file1.csv
u file2.csv
s p x
s p z
s c w

The expected output would be:

20
9
0
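To make the arithmetic concrete: the first two sums read columns x (1+4+5+0+10 = 20) and z (3+6 = 9), while the third targets column w, which no upload contains, so the result is 0. Below is a minimal sketch of these sum semantics using in-memory rows and only the CSV side (the real server must also persist files, handle the Parquet copies, and deal with concurrency):

```python
import csv, io

def parse_csv(data: bytes):
    """Parse uploaded CSV bytes into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(data.decode())))

def column_sum(tables, column):
    """Sum `column` across all uploaded portions; rows lacking the column contribute 0."""
    total = 0
    for rows in tables:
        for row in rows:
            if column in row:
                total += int(row[column])
    return total

# the two example uploads from above
tables = [
    parse_csv(b"x,y,z\n1,2,3\n4,5,6\n"),
    parse_csv(b"x,y\n5,10\n0,20\n10,15\n"),
]
print(column_sum(tables, "x"))  # 20
print(column_sum(tables, "z"))  # 9
print(column_sum(tables, "w"))  # 0 (column absent from every upload)
```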

Inspect both the workload.txt file and the client code (i.e., read_workload_file()) to understand how each text command leads to one gRPC call. A separate purge.txt workload file is provided and should not be modified. The client can use an RPC call, Purge(), to reset the server and remove all files stored by the remote peer.
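The provided read_workload_file() is authoritative; purely as an illustration of the two command forms, a workload parser might look like this sketch:

```python
def parse_workload(text):
    """Parse workload lines into op tuples: ("u", path) or ("s", fmt, column)."""
    ops = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "u":
            ops.append(("u", parts[1]))          # upload a CSV file
        elif parts[0] == "s":
            ops.append(("s", parts[1], parts[2]))  # sum fmt ("p"/"c") over a column
        else:
            raise ValueError(f"unknown command: {line}")
    return ops

print(parse_workload("u file1.csv\ns p x"))
# [('u', 'file1.csv'), ('s', 'p', 'x')]
```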

Learning objectives:

  • Implement logic for uploading and processing CSV and Parquet files.
  • Perform computations like summing values from specific columns.
  • Manage concurrency with locking in a multi-threaded server/client.
  • Benchmark a server/client system and visualize the results.

Before starting, please review the general project directions.

Clarifications/Corrections

  • None yet

Part 1: Communication (gRPC)

In this project, the client program client.py will communicate with a server, server.py, via gRPC. We provide starter code for the client program. Your job is to write a .proto file to generate a gRPC stub (used by our client) and servicer class that you will inherit from in server.py.

Take a moment to look at the client code and answer the following questions:

  • What are the names of the imported gRPC modules? This will determine what you name your .proto file.
  • What methods are called on the stubs? This will determine the RPC definitions in your .proto file.
  • What arguments are passed to the methods, and what values are extracted from the return values? This will determine the fields in the messages in your .proto file.
  • What port number does the client use? This will determine the port that the gRPC server should expose.

Write a .proto file based on your observations above and run the grpc_tools.protoc compiler to generate stub code for our client and servicer code for your server. All field types will be strings, except total and csv_data, which should be int64 and bytes respectively.
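For illustration only, a .proto skeleton might look like the sketch below. The service, RPC, and message names here are guesses; the correct names must come from your answers to the questions above (the modules your client imports and the methods it calls), not from this example.

```protobuf
syntax = "proto3";

service Table {
    rpc Upload (UploadReq) returns (UploadResp);
    rpc ColSum (ColSumReq) returns (ColSumResp);
    rpc Purge  (PurgeReq)  returns (PurgeResp);
}

message UploadReq  { bytes csv_data = 1; }   // csv_data must be bytes
message UploadResp { string error = 1; }
message ColSumReq  { string column = 1; string format = 2; }
message ColSumResp { int64 total = 1; string error = 2; }  // total must be int64
message PurgeReq   { }
message PurgeResp  { string error = 1; }
```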

Now build the .proto on your VM. Install the tools like this:

python3 -m venv venv
source venv/bin/activate
pip3 install grpcio==1.70.0 grpcio-tools==1.70.0 protobuf==5.29.3

Then use grpc_tools.protoc to build your .proto file (e.g., python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. yourfile.proto).

In your server, override the three RPC methods of the generated servicer. For now, the methods should do nothing but return messages with the error field set to "TODO", leaving all other fields unspecified.
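The pattern for this stage can be sketched as follows. To keep the sketch self-contained, it uses a stand-in base class and response type; in your real server.py, the base class and message types come from your generated table_pb2 / table_pb2_grpc modules (names assumed), and the method names must match your .proto:

```python
from dataclasses import dataclass

@dataclass
class Resp:
    """Stand-in for a generated response message with an `error` field."""
    error: str = ""

class TableServicer:
    """Stand-in for the generated servicer base class.

    Each RPC method, for now, returns only error="TODO" and leaves
    every other field unspecified.
    """
    def Upload(self, request, context):
        return Resp(error="TODO")

    def ColSum(self, request, context):
        return Resp(error="TODO")

    def Purge(self, request, context):
        return Resp(error="TODO")

print(TableServicer().Upload(None, None).error)  # TODO
```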

If communication is working correctly so far, you should be able to start a server and use a client to get back a "TODO" error message via gRPC:

python3 -u server.py &> log.txt &
python3 client.py workload.txt
# should see multiple "TODO"s

Create a Dockerfile.server to build an image that will also let you run your server in a container. It should be possible to build and run your server like this:

docker build . -f Dockerfile.server -t ${PROJECT}-server
docker run -d -m 512m -p 127.0.0.1:5440:5440 ${PROJECT}-server
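A minimal Dockerfile.server might look like the sketch below. The base image, file names, and the choice to regenerate stubs inside the image are all assumptions; adjust them to match your repository layout:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
RUN pip3 install grpcio==1.70.0 grpcio-tools==1.70.0 protobuf==5.29.3
COPY server.py *.proto /app/
# regenerate the gRPC stubs inside the image so they match the installed protobuf
RUN python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. *.proto
EXPOSE 5440
CMD ["python3", "-u", "server.py"]
```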

Like P2, the compose file assumes a "PROJECT" environment variable. You can set it to p3 in your environment with the following (the autograder may use a different prefix for testing):

export PROJECT=p3

The client program should then be able to communicate with the server program the same way it did when the server ran outside of a container. Once your client successfully interacts with the dockerized server, similarly write a Dockerfile.client to build a container for client.py. Finally, test your setup with docker compose: