Merge branch 'main' of https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main

Commit b027ad84 authored 2 weeks ago by TYLER CARAZA-HARTER
--- a/p3/README.md
+++ b/p3/README.md
@@ -20,7 +20,10 @@ Before starting, please review the [general project directions](../projects.md).

 ## Clarifications/Corrections

-* none yet
+* Feb 24: feel free to use different tools to implement Part 2.
+* Feb 24: clarify that `bigdata.py` will be used in tests.
+* Feb 24: add link to lecture notes on parquet file operations.
+* Feb 24: remove port forwarding for `docker run` since we test the server with `docker exec`

 ## Part 1: Communication (gRPC)

@@ -79,7 +82,7 @@ server like this:

 ```
 docker build . -t p3
-docker run -d -m 512m -p 127.0.0.1:5440:5440 p3
+docker run -d -m 512m p3
 ```

 The client programs should then be able to communicate with the
@@ -97,8 +100,8 @@ clients need to run.  When we test your code, we will run the clients
 in the same container as the server, like this:

 ```
-docker run --name=server -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/inputs p3   # server
-docker exec server python3 upload.py /inputs/test1.csv                              # client
+docker run --name=server -d -m 512m -v ./inputs:/inputs p3   # server
+docker exec server python3 upload.py /inputs/test1.csv       # client
 ```

 Note that you don't need to have an `inputs/test1.csv` file, as the
@@ -114,7 +117,7 @@ to re-run your container with newer server.py code without rebuilding
 first.  Here's an example:

 ```
-docker run --rm -m 512m -p 127.0.0.1:5440:5440 -v ./server.py:/server.py p3
+docker run --rm -m 512m -v ./server.py:/server.py p3
 ```

 ## Part 2: Upload
@@ -126,7 +129,9 @@ file (for example, you could add the path to some data structure, like a
 list or dictionary).

 Your server should similarly write the same data to a parquet file
-somewhere, using pyarrow.
+somewhere, using `pyarrow`, `pandas`, or any available tools. Refer to
+the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads)
+for a few examples of reading/writing parquet files.

 ## Part 3: Column Sum

@@ -174,22 +179,24 @@ be a performance depending on which format is used.

 Parquet is a column-oriented format, so all the data in a single file
 should be adjacent on disk.  This means it should be possible to read
-a column of data without reading the whole file.  See the `columns`
-parameter here:
-https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
+a column of data without reading the whole file. Check out the `columns`
+parameter of [`pyarrow.parquet.read_table`](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html).
+You can also find an example in the [lecture notes](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/14-file-formats?ref_type=heads).

 **Requirement:** when the server is asked to sum over the column of a
 Parquet file, it should only read the data from that column, not other
 columns.

-**Note:** we will run your server with a 512-MB limit on RAM.  Any
+**Note 1:** we will run your server with a 512-MB limit on RAM.  Any
 individual files we upload will fit within that limit, but the total
 size of the files uploaded will exceed that limit.  That's why your
 server will have to do sums by reading the files (instead of just
-keeping all table data in memory).  If you want manually test your
-code with some bigger uploads, use the `bigdata.py` client.  Instead
-of uploading files, it randomly generateds lots of CSV-formatted data
-and directly uploads it via gRPC.
+keeping all table data in memory).
+
+**Note 2:** the `bigdata.py` client randomly generates a large volume of
+CSV-formatted data and uploads it via gRPC. You are *required* to
+test your upload implementation with this script, and it will be used
+as part of our tests.

 ## Part 4: Locking

@@ -243,12 +250,13 @@ be able to run your client and server as follows:
 docker build . -t p3

 # run server in new container
-docker run --name=yournetid -d -m 512m -p 127.0.0.1:5440:5440 -v ./inputs:/inputs p3
+docker run --name=yournetid -d -m 512m -v ./inputs:/inputs p3

 # run clients in same container
 docker exec yournetid python3 upload.py /inputs/simple.csv
 docker exec yournetid python3 csvsum.py x
 docker exec yournetid python3 parquetsum.py x
+docker exec yournetid python3 bigdata.py
 ```

 Please do include the files built from the .proto (your Dockerfile

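Note 1 in `p3/README.md` says the server must compute sums by reading files instead of keeping all table data in memory. A minimal sketch of that idea for the CSV case, using only the standard library, might look like the following. The function name `sum_csv_column` and the sample data are hypothetical, and the sketch assumes every value in the column is an integer; it is not the project's reference implementation.

```python
import csv
import io


def sum_csv_column(lines, column):
    """Sum one column of CSV data, holding only one row in memory at a time.

    `lines` is any iterable of text lines (an open file works).
    Assumption: every value in `column` parses as an integer.
    """
    total = 0
    for row in csv.DictReader(lines):
        total += int(row[column])
    return total


# In-memory stand-in for a previously uploaded CSV file (hypothetical data).
sample = io.StringIO("x,y\n1,10\n2,20\n3,30\n")
print(sum_csv_column(sample, "x"))
```

On a real server the same function could be passed a file opened with `open(path, newline="")`, so memory use stays bounded by the size of one row, not the size of the file.
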
--- a/projects.md
+++ b/projects.md
@@ -173,9 +173,28 @@ pip install -r requirements.txt

 ## Submission

-Whenever you push to `main`, we determine that as a "submission" and run `autobadger` on your `main` branch. We then push our results to your repository under `Issues`. This issue will contain the contents of `autobadger` as well as some other metadata and notes. This *should* have the same output as if you were to run it locally. If anything seems terribly wrong, please email your [assigned TA](https://docs.google.com/spreadsheets/d/1HwI0o3IE97AWe_P_sKRPrUITPPGEdvsLzfEKcrP8NrU/edit?usp=sharing) with a link to your GitLab issue.
+Whenever you push to `main`, we run `autobadger` on your `main` branch. We then push our results to your repository under `Issues`.

-> **NOTE**: Be carefull not to push after the deadline unless your intention is to submit late (see policy below).
+This issue will contain the contents of `autobadger` as well as some other metadata and notes. This will almost always be your project's final grade, though we do manual reviews of your code as well to check against cheating and hardcoding. We also take the highest grade of all your submissions. In other words, if you get 100 on a GitLab issue, then you are done! :)
+
+### IMPORTANT!
+
+**It is important to note that it is *your responsibility* to verify**:
+
+1. You receive a GitLab issue (within a reasonable amount of time, i.e. an hour, but normally much shorter than that)
+2. The results you see align with what you expect.
+
+If there is an issue with (1) or (2), double-check your code and give it some time before you push again or manually [rerun your GitLab pipeline](https://piazza.com/class/m64hzy9v23v398/post/85). If the issue is not resolved after a few attempts, then reach out to your [TA](https://tyler.caraza-harter.com/cs544/s25/messages.html?topic=ta) or visit us in office hours.
+
+> **NOTE**: in cases around/after the deadline, it is better to manually rerun the pipeline (if you suspect that your code is fine) than to push to `main` again. We keep track of your latest push to check against the project's deadline.
+
+As such, it is _highly recommended_ to start early, push often, and not wait till the minutes before the deadline to submit! Give yourself a buffer against unexpected issues.
+
+Since it is your responsibility to verify your GitLab issue (and your submission), we will not accept revision requests due to you not checking the status of your GitLab issues beforehand.
+
+> **NOTE**: Be careful not to push after the deadline unless your intention is to submit late (see policy below).
+
+### Miscellaneous

 * projects have four parts; for notebooks, use big headers to divide your work into the four parts ("# Part 1: ...")
 * for question based project work, (Q1, Q2, etc), include comments like ("# Q1: ...") before the answers