# P3 (5% of grade): Large, Thread-Safe Tables
# DRAFT: DO NOT START
## Overview
In this project, you'll build a server that handles the uploading of CSV files, storing their contents, and performing query operations on the data. The server maintains **only ONE** logical table. You should think of each uploaded CSV as containing a portion of that larger table, which grows with each upload.
The server will write two files for each uploaded CSV file: one in CSV format and another in Parquet (i.e., they are two copies of the table in different formats). Clients that we provide will communicate with your server via RPC calls.
Learning objectives:
* Implement logic for uploading and processing CSV and Parquet files.
* Perform computations like summing values from specific columns.
* Manage concurrency with locking in multi-threading server/client.
* Benchmark a server/client system and visualize the results.
Before starting, please review the [general project directions](../projects.md).
## Clarifications/Corrections
## Part 1: Communication (gRPC)
In this project, the client program `client.py` will communicate with a server, `server.py`, via gRPC. We provide starter code for the client program. Your job is to write a `.proto` file to generate a gRPC stub (used by our client) and a servicer class that you will inherit from in `server.py`.
Take a moment to look at the client code and answer the following questions:
* what are the names of the imported gRPC modules? This will determine what you name your `.proto` file.
* what methods are called on the stubs? This will determine the RPC definitions in your `.proto` file.
* what arguments are passed to the methods, and what values are extracted from the return values? This will determine the fields in the messages in your `.proto` file.
* what port number does the client use? This will determine the port that the gRPC server should expose.
Write a `.proto` file based on your above observations and run the `grpc_tools.protoc` compiler to generate stub code for our client and servicer code for your server. All field types will be strings, except `total` and `csv_data`, which should be `int64` and `bytes` respectively.
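If it helps to see the expected shape, here is a minimal sketch of such a file. Every service, RPC, message, and field name below is a placeholder guess; the real names must match what the provided `client.py` imports and calls.
```protobuf
// SKETCH ONLY: replace all names with the ones your client actually uses.
syntax = "proto3";

service Table {
    rpc Upload (UploadReq) returns (UploadResp);
    rpc ColSum (ColSumReq) returns (ColSumResp);
    rpc Purge (PurgeReq) returns (PurgeResp);
}

message UploadReq {
    bytes csv_data = 1;     // raw bytes of the uploaded CSV
}

message UploadResp {
    string error = 1;
}

message ColSumReq {
    string column = 1;      // which column to sum
    string format = 2;      // sum over the "csv" or "parquet" copies
}

message ColSumResp {
    int64 total = 1;
    string error = 2;
}

message PurgeReq {}

message PurgeResp {
    string error = 1;
}
```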
Now build the .proto on your VM. Install the tools like this:
```bash
python3 -m venv venv
source venv/bin/activate
pip3 install grpcio==1.66.1 grpcio-tools==1.66.1 protobuf==5.27.2
```
Then use `grpc_tools.protoc` to build your `.proto` file.
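For example, if your file were named `table.proto` (the actual name must match the modules the client imports), the build step would look like this:
```bash
python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. table.proto
```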
In your server, override the *three* RPC methods of the generated servicer. For now, the methods should do nothing but return messages with the error field set to "TODO", leaving any other fields unspecified.
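A rough sketch of that stubbed-out server follows; the `table_pb2`/`table_pb2_grpc` module names, message names, and servicer class name are all assumptions carried over from the hypothetical `table.proto` above, and the port is taken from the `docker run` mapping later on this page.
```python
# SKETCH ONLY: module, class, and message names must match your generated code.
from concurrent import futures
import grpc
import table_pb2, table_pb2_grpc

class TableServicer(table_pb2_grpc.TableServicer):
    def Upload(self, request, context):
        return table_pb2.UploadResp(error="TODO")

    def ColSum(self, request, context):
        return table_pb2.ColSumResp(error="TODO")

    def Purge(self, request, context):
        return table_pb2.PurgeResp(error="TODO")

if __name__ == "__main__":
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8),
                         options=[("grpc.so_reuseport", 0)])
    table_pb2_grpc.add_TableServicer_to_server(TableServicer(), server)
    server.add_insecure_port("0.0.0.0:5440")  # assumed port; see the docker run command below
    server.start()
    server.wait_for_termination()
```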
If communication is working correctly so far, you should be able to start a server and use a client to get back a "TODO" error message via gRPC:
```bash
python3 -u server.py &> log.txt &
python3 client.py workload
# should see multiple "TODO"s
```
In P3, `client.py` takes in a batch of operation commands stored in `workload` and executes them line by line. Inspect both the `workload` content and the client code (i.e., `read_workload_file()`) to understand how each text command leads to one gRPC call. A separate `purge` workload file is provided and *should not be modified*. The client can use an RPC call, `Purge()`, to reset the server and remove all files stored by the remote peer.
Create a `Dockerfile.server` to build an image that lets you run your server in a container. It should be possible to build and run your server like this:
```bash
docker build . -f Dockerfile.server -t ${PROJECT}-server
docker run -d -m 512m -p 127.0.0.1:5440:5440 ${PROJECT}-server
```
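A minimal `Dockerfile.server` along these lines could work; the base image, file layout, and generated-module names are assumptions you should adapt to your own repo.
```dockerfile
# SKETCH ONLY: adjust file names and dependencies to match your project.
FROM python:3.10-slim
RUN pip3 install grpcio==1.66.1 protobuf==5.27.2 pandas pyarrow==17.0.0
COPY server.py table_pb2.py table_pb2_grpc.py /
EXPOSE 5440
CMD ["python3", "-u", "/server.py"]
```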
Like P2, the compose file assumes a "PROJECT" environment variable. You can set it to p3 in your environment like this (the autograder may use another prefix for testing):
```bash
export PROJECT=p3
```
The client program should then be able to communicate with the server program the same way it did when the server ran outside of a container. Once your client program successfully interacts with the dockerized server, similarly write a `Dockerfile.client` to build a container for `client.py`. Finally, test your setup with `docker compose`:
```bash
docker compose up -d
docker ps
# should see:
CONTAINER ID   IMAGE       COMMAND                  CREATED         ...
fa8de65e0e7c   p3-client   "python3 -u /client.…"   2 seconds ago   ...
4c899de6e43f   p3-server   "python3 -u /server.…"   2 seconds ago   ...
```
**HINT 1:** consider writing a .sh script that helps you redeploy code changes. Every time you modify the source code (`client.py`, `server.py`, or `benchmark.py`), you may want to rebuild the images, bring down the previous Docker cluster, and start a new one.
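For instance, a small helper script along these lines (the file names and image tags are just the ones used elsewhere on this page):
```bash
#!/bin/bash
# rebuild.sh (hypothetical helper): rebuild images and restart the compose cluster.
set -e
export PROJECT=p3
docker compose down
docker build . -f Dockerfile.server -t ${PROJECT}-server
docker build . -f Dockerfile.client -t ${PROJECT}-client
docker compose up -d
```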
## Part 2: Server Implementation
You will need to implement three RPC calls on the server side: `Upload`, `ColSum`, and `Purge`.
### Upload
This method should:
1. Read the table from the bytes provided in the RPC request
2. Write the table to a CSV file, and write the same table to another file in Parquet format
**HINT 1:** You are free to decide the names and locations of the stored files. However, the server must keep these records to process future queries (for instance, you can add paths to a data structure like a list or dictionary).
**HINT 2:** Both `pandas` and `pyarrow` provide interfaces to write a table to file.
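For example, a sketch of the upload logic might look like this; the directory, file naming, and bookkeeping structure are all choices left to you, and this version assumes `pandas` is installed (one of the two options the hint mentions).
```python
# SKETCH ONLY: paths, naming scheme, and bookkeeping are assumptions.
import io, os, uuid
import pandas as pd

uploads = []  # (csv_path, parquet_path) pairs for every table portion received so far

def handle_upload(csv_bytes):
    df = pd.read_csv(io.BytesIO(csv_bytes))            # parse the uploaded bytes
    name = uuid.uuid4().hex                            # any unique name works
    os.makedirs("data", exist_ok=True)
    csv_path = os.path.join("data", f"{name}.csv")
    parquet_path = os.path.join("data", f"{name}.parquet")
    df.to_csv(csv_path, index=False)                   # copy 1: CSV
    df.to_parquet(parquet_path)                        # copy 2: Parquet (uses pyarrow)
    uploads.append((csv_path, parquet_path))           # remember it for ColSum/Purge
```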
### ColSum
Whenever your server receives a column summation request, it should loop over all the data files that have been uploaded, compute a local sum for each file, and return the total sum for the whole table.
For example, assume sample1.csv and sample2.csv contain these records:
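(Illustrative contents only; any values consistent with the expected sums below would do.)
```
# sample1.csv
x,y
1,5
4,8

# sample2.csv
x,z
2,3
3,6
```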
You should be able to upload the files and do sums with the following `workload` description:
```
u sample1.csv
u sample2.csv
s p x   # should print 10
s p z   # should print 9
s c w   # should print 0
```
You can assume columns contain only integers. The table does not have a fixed schema (i.e., a given column is not guaranteed to appear in every uploaded file). You should skip a file if it lacks the target column (e.g., z and w in the above example).
The server should sum over either the Parquet or the CSV files according to the input `format` (not both). Querying the same column via either format should produce the same result.
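A sketch of the per-file logic follows; the helper name and the "parquet"/"csv" format strings are assumptions (match them to what the client actually sends). The key detail is that the Parquet branch reads only the requested column.
```python
# SKETCH ONLY: integrate with your own bookkeeping and format strings.
import pandas as pd
import pyarrow.parquet as pq

def file_sum(path, column, fmt):
    """Sum `column` in one stored file, or return 0 if the column is absent."""
    if fmt == "parquet":
        if column not in pq.read_schema(path).names:
            return 0
        table = pq.read_table(path, columns=[column])   # read ONLY the needed column
        return int(table[column].to_pandas().sum())
    else:  # CSV
        df = pd.read_csv(path)
        if column not in df.columns:
            return 0
        return int(df[column].sum())
```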
### Purge
This method facilitates testing and subsequent benchmarking. The method should:
1. Remove all local files previously uploaded via the `Upload()` method
2. Reset all associated server state (e.g., counters, paths, etc.)
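For example, a sketch reusing the hypothetical `uploads` list from the Upload sketch above:
```python
# SKETCH ONLY: assumes the bookkeeping list from the Upload sketch.
import os

def handle_purge():
    for csv_path, parquet_path in uploads:   # remove both copies of every portion
        os.remove(csv_path)
        os.remove(parquet_path)
    uploads.clear()                          # reset associated server state
```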
## Part 3: Multi-threading Client
## Part 4: Benchmarking the System
You don't need to explicitly create threads using Python calls because
gRPC will do it for you. Set `max_workers` to 8 so that gRPC will
create 8 threads:
```python
grpc.server(
    futures.ThreadPoolExecutor(max_workers=????),
    options=[("grpc.so_reuseport", 0)]
)
```
Now that your server has multiple threads, your code should hold a
lock (https://docs.python.org/3/library/threading.html#threading.Lock)
whenever accessing any shared data structures, including the list(s)
of files (or whatever data structure you used). Use a single global
lock for everything. Ensure the lock is released properly, even when
there is an exception. Even if your chosen data structures provide any
guarantees related to thread-safe access, you must still hold the lock
when accessing them to gain practice protecting shared data.
**Requirement:** reading and writing files is a slow operation, so
your code must NOT hold the lock when doing file I/O.
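Here is a sketch of that pattern, assuming the server tracks uploaded files in a list guarded by one global lock: hold the lock just long enough to snapshot or update shared state, and do the slow reads and writes outside of it.
```python
# SKETCH ONLY: names are illustrative; file_sum/write_files stand in for your Part 2 logic.
import threading

lock = threading.Lock()
uploads = []  # shared: paths of the files uploaded so far

def col_sum(column, fmt):
    with lock:                       # `with` releases the lock even if an exception occurs
        paths = list(uploads)        # copy shared state while holding the lock
    total = 0
    for path in paths:               # slow file I/O happens WITHOUT the lock
        total += file_sum(path, column, fmt)
    return total

def upload(csv_bytes):
    path = write_files(csv_bytes)    # slow file I/O first, lock not held
    with lock:                       # then briefly lock to update shared state
        uploads.append(path)
```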
## Grading
Copy `autograde.py` to your working directory
then run `python3 -u autograde.py` to test your work.
This constitutes 75% of the total score. You can add the `-v` flag to get verbose output from the autograder.
If you want to manually test on a somewhat bigger dataset, run
`python3 bigdata.py`. This generates 100 million rows across 400
files and uploads them. The "x" column only contains 1's, so if you
sum over it, you should get 100000000.
The other 25% of the total score will be graded by us.
Locking and performance-related details are hard to automatically
test, so here's a checklist of things we'll be looking for:
- are there 8 threads?
- is the lock held whenever shared data structures are accessed?
- is the lock released while files are read or written?
- does the summation RPC use either Parquet or CSV files based on the passed argument?
- when a Parquet file is read, is the needed column the only one that is read?
## Submission
You have some flexibility in how you organize your project
files. However, we need to be able to easily run your code. In order
to be graded, please push everything necessary so that we'll
be able to run your client and server as follows:
```sh
git clone YOUR_REPO
cd YOUR_REPO
# copy in tester code and client programs...
python3 -m venv venv
source venv/bin/activate
pip3 install grpcio==1.66.1 grpcio-tools==1.66.1 numpy==2.1.1 protobuf==5.27.2 pyarrow==17.0.0 setuptools==75.1.0
# run server
docker build . -t p3
docker run -d -m 512m -p 127.0.0.1:5440:5440 p3
# run clients
python3 upload.py simple.csv
python3 csvsum.py x
python3 parquetsum.py x
```
Please do include the files built from the .proto. Do NOT include the venv directory.
After pushing your code to the designated GitLab repository,
you can also verify your submission.
To do so, simply copy `check_sub.py` to your working directory and run `python3 check_sub.py`.