Commit b027ad84 authored by TYLER CARAZA-HARTER
parents f6f2843c c9a29938
......@@ -20,7 +20,10 @@ Before starting, please review the [general project directions](../projects.md).
## Clarifications/Corrections
* none yet
* Feb 24: feel free to use different tools to implement Part 2.
* Feb 24: clarify that `bigdata.py` will be used in tests.
* Feb 24: add link to lecture notes on parquet file operations.
* Feb 24: remove port forwarding for `docker run` since we test server with `docker exec`
## Part 1: Communication (gRPC)
......@@ -79,7 +82,7 @@ server like this:
```
docker build . -t p3
docker run -d -m 512m -p 127.0.0.1:5440:5440 p3
docker run -d -m 512m p3
```
The client programs should then be able to communicate with the
......@@ -97,8 +100,8 @@ clients need to run. When we test your code, we will run the clients
in the same container as the server, like this:
```
docker run --name=server -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/inputs p3 # server
docker exec server python3 upload.py /inputs/test1.csv # client
docker run --name=server -d -m 512m -v ./inputs:/inputs p3 # server
docker exec server python3 upload.py /inputs/test1.csv # client
```
Note that you don't need to have an `inputs/test1.csv` file, as the
......@@ -114,7 +117,7 @@ to re-run your container with newer server.py code without rebuilding
first. Here's an example:
```
docker run --rm -m 512m -p 127.0.0.1:5440:5440 -v ./server.py:/server.py p3
docker run --rm -m 512m -v ./server.py:/server.py p3
```
## Part 2: Upload
......@@ -126,7 +129,9 @@ file (for example, you could add the path to some data structure, like a
list or dictionary).
Your server should similarly write the same data to a parquet file
somewhere, using pyarrow.
somewhere, using `pyarrow`, `pandas`, or any available tools. Refer to
the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads)
for a few examples of reading/writing parquet files.
## Part 3: Column Sum
......@@ -174,22 +179,24 @@ be a performance depending on which format is used.
Parquet is a column-oriented format, so all the data in a single file
should be adjacent on disk. This means it should be possible to read
a column of data without reading the whole file. See the `columns`
parameter here:
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
a column of data without reading the whole file. Check out the `columns`
parameter of [`pyarrow.parquet.read_table`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).
You can also find an example from the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads).
**Requirement:** when the server is asked to sum over the column of a
Parquet file, it should only read the data from that column, not other
columns.
**Note:** we will run your server with a 512-MB limit on RAM. Any
**Note 1:** we will run your server with a 512-MB limit on RAM. Any
individual files we upload will fit within that limit, but the total
size of the files uploaded will exceed that limit. That's why your
server will have to do sums by reading the files (instead of just
keeping all table data in memory). If you want manually test your
code with some bigger uploads, use the `bigdata.py` client. Instead
of uploading files, it randomly generateds lots of CSV-formatted data
and directly uploads it via gRPC.
keeping all table data in memory).
**Note 2:** the `bigdata.py` client randomly generates a large volume of
CSV-formatted data and uploads it via gRPC. You are *required* to
test your upload implementation with this script, and it will be used
as part of our tests.
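To illustrate Note 1, here is one possible way to sum a column across uploaded CSV files by reading them from disk one row at a time, so memory use stays small no matter how much data has been uploaded. This is only a sketch: the function name `csv_column_sum` is hypothetical, and it assumes integer values and that files without the column contribute nothing to the sum.

```python
import csv


def csv_column_sum(paths, column):
    """Sum one column across CSV files, streaming rows from disk (hypothetical helper)."""
    total = 0
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            # Assumption: a file without the column is simply skipped.
            if reader.fieldnames is None or column not in reader.fieldnames:
                continue
            # Rows are read one at a time, so the file is never fully in memory.
            for row in reader:
                total += int(row[column])
    return total
```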
## Part 4: Locking
......@@ -243,12 +250,13 @@ be able to run your client and server as follows:
docker build . -t p3
# run server in new container
docker run --name=yournetid -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/inputs p3
docker run --name=yournetid -d -m 512m -v ./inputs:/inputs p3
# run clients in same container
docker exec yournetid python3 upload.py /inputs/simple.csv
docker exec yournetid python3 csvsum.py x
docker exec yournetid python3 parquetsum.py x
docker exec yournetid python3 bigdata.py
```
Please do include the files built from the .proto (your Dockerfile
......
......@@ -173,9 +173,28 @@ pip install -r requirements.txt
## Submission
Whenever you push to `main`, we determine that as a "submission" and run `autobadger` on your `main` branch. We then push our results to your repository under `Issues`. This issue will contain the contents of `autobadger` as well as some other metadata and notes. This *should* have the same output as if you were to run it locally. If anything seems terribly wrong, please email your [assigned TA](https://docs.google.com/spreadsheets/d/1HwI0o3IE97AWe_P_sKRPrUITPPGEdvsLzfEKcrP8NrU/edit?usp=sharing) with a link to your GitLab issue.
Whenever you push to `main`, we run `autobadger` on your `main` branch. We then push our results to your repository under `Issues`.
> **NOTE**: Be carefull not to push after the deadline unless your intention is to submit late (see policy below).
This issue will contain the contents of `autobadger` as well as some other metadata and notes. This will almost always be your project's final grade, though we do manual reviews of your code as well to check against cheating and hardcoding. We also take the highest grade of all your submissions. In other words, if you get 100 on a GitLab issue, then you are done! :)
### IMPORTANT!
**It is important to note that it is *your responsibility* to verify**:
1. You receive a GitLab issue (within a reasonable amount of time, i.e. an hour, but normally much shorter than that)
2. The results you see align with what you expect.
If there is an issue with (1) or (2), double check your code, then give it some time before you push again or manually [rerun your GitLab pipeline](https://piazza.com/class/m64hzy9v23v398/post/85). If the issue is not resolved after a few attempts, then reach out to your [TA](https://tyler.caraza-harter.com/cs544/s25/messages.html?topic=ta) or visit us in office hours.
> **NOTE**: around or after the deadline, it is better to manually rerun the pipeline (if you suspect that your code is fine) than to push to `main` again. We keep track of your latest push to check against the project's deadline.
As such, it is _highly recommended_ to start early, push often, and not wait till the minutes before the deadline to submit! Give yourself a buffer against unexpected issues.
Since it is your responsibility to verify your GitLab issue (and your submission), we will not accept revision requests due to you not checking the status of your GitLab issues beforehand.
> **NOTE**: Be careful not to push after the deadline unless your intention is to submit late (see policy below).
### Miscellaneous
* projects have four parts; for notebooks, use big headers to divide your work into the four parts ("# Part 1: ...")
* for question based project work, (Q1, Q2, etc), include comments like ("# Q1: ...") before the answers
......