Corrections

Unverified Commit 65ac744b authored 1 month ago by Gabe
--- a/p5/README.md
+++ b/p5/README.md
@@ -17,7 +17,7 @@ Before starting, please review the [general project directions](../projects.md).

 ## Corrections/Clarifications

-* none yet
+- 3/15/2025: Fixed the `p5-base` Dockerfile and added a note not to include the datasets in your submission.

 ## Setup

@@ -46,6 +46,8 @@ mkdir -p nb/data

 Run the provided `get_data.py` script to download the [DeepMind CodeContests dataset](https://huggingface.co/datasets/deepmind/code_contests) and split it into `problems.jsonl` and `solutions.jsonl`.

+**NOTE:** Do NOT include the generated data in your submission. The `.gitignore` already excludes it for you.
+
 ### Docker Containers

 ```sh
@@ -213,14 +215,15 @@ Your numbers may vary significantly, but the final run should usually be the fas

 The dataset documentation for the `difficulty` field says: "For Codeforces problems, cf_rating is a more reliable measure of difficulty when available".

-For this part, you will attempt to estimate the cf_rating for Codeforces problems for which it is unknown.  To prepare, filter the problems to `CODEFORCES` problems, then further divide into three DataFrames:
-* train dataset: `cf_rating` is >0, and `problem_id` in an EVEN number
-* test dataset: `cf_rating` is >0, and `problem_id` in an ODD number
-* missing dataset: `cf_rating` is 0
+For this part, you will attempt to estimate the `cf_rating` for Codeforces problems for which it is unknown. To prepare, filter the problems to `CODEFORCES` problems, then further divide into three DataFrames:
+
+- train dataset: `cf_rating` is >0, and `problem_id` is an EVEN number
+- test dataset: `cf_rating` is >0, and `problem_id` is an ODD number
+- missing dataset: `cf_rating` is 0

 #### Q9: How well can a decision tree predict `cf_rating` based on `difficulty`, `time_limit`, and `memory_limit_bytes`?

-Create a Spark Pipeline model with VectorAssembler and DecisionTreeRegression stages.  The max tree depth should be 5.  Train it on the training data, then compute an R^2 score (`r2_score`) for predictions on the test data.  The R^2 score should be your answer for this question.
+Create a Spark Pipeline model with `VectorAssembler` and `DecisionTreeRegressor` stages. The max tree depth should be 5. Train it on the training data, then compute an R^2 score (`r2_score`) for predictions on the test data. The R^2 score should be your answer for this question.

 #### Q10: Do the problems with a missing `cf_rating` appear more or less challenging than other problems?

@@ -228,9 +231,9 @@ Use your model to predict the `cf_rating` in the dataset where it is missing.

 Answer with a tuple with 3 numbers:

-* average `cf_rating` in the training dataset
-* average `cf_rating` in the test dataset
-* average **prediction** of `cf_rating in the missing dataset
+- average `cf_rating` in the training dataset
+- average `cf_rating` in the test dataset
+- average **prediction** of `cf_rating` in the missing dataset

 For example:

@@ -241,8 +244,6 @@ For example:
 We should be able to run the following on your submission to directly create the mini cluster:

 ```
-# data setup...
-
 docker build . -f p5-base.Dockerfile -t p5-base
 docker build . -f notebook.Dockerfile -t p5-nb
 docker build . -f namenode.Dockerfile -t p5-nn

--- a/p5/p5-base.Dockerfile
+++ b/p5/p5-base.Dockerfile
@@ -12,11 +12,8 @@ RUN apt-get update && apt-get install -y \
    sudo


-# Copy requirements file
-COPY requirements.txt /requirements.txt
-
 # Install Python dependencies
-RUN pip3 install -r /requirements.txt
+RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 matplotlib==3.10.1

 # Download and extract Hadoop
 RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz && \
@@ -33,4 +30,3 @@ RUN wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
 ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
 ENV PATH="${PATH}:/hadoop-3.4.1/bin"
 ENV HADOOP_HOME=/hadoop-3.4.1
-
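
The train/test/missing split described in the README hunk above can be illustrated with a small sketch. This is plain Python rather than Spark, only to make the partition rule concrete; it assumes the rows have already been filtered to `CODEFORCES` problems and that each row carries integer `problem_id` and `cf_rating` fields. The function name `split_by_rating` and the sample rows are hypothetical and not part of the project.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]


def split_by_rating(rows: List[Row]) -> Tuple[List[Row], List[Row], List[Row]]:
    """Partition rows into (train, test, missing) using the README's rule.

    Assumes rows are already restricted to CODEFORCES problems.
    Rows with a negative cf_rating match none of the three rules and are dropped.
    """
    train: List[Row] = []
    test: List[Row] = []
    missing: List[Row] = []
    for row in rows:
        rating = row["cf_rating"]
        if rating == 0:
            # Rating unknown: this is the set the model will predict for.
            missing.append(row)
        elif rating > 0 and row["problem_id"] % 2 == 0:
            # Known rating, even problem_id: training data.
            train.append(row)
        elif rating > 0:
            # Known rating, odd problem_id: held-out test data.
            test.append(row)
    return train, test, missing


# Hypothetical sample rows, for illustration only.
sample = [
    {"problem_id": 10, "cf_rating": 1500},
    {"problem_id": 11, "cf_rating": 800},
    {"problem_id": 12, "cf_rating": 0},
    {"problem_id": 13, "cf_rating": 0},
]
train, test, missing = split_by_rating(sample)
```

In the actual assignment the same three predicates would presumably be expressed as DataFrame filters, so that each subset stays a Spark DataFrame.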