Commit 65ac744b authored by Gabe: Corrections
Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
- 3/15/2025: Fixed the p5-base Dockerfile and added a note not to include the datasets in your submission.
## Setup
```sh
mkdir -p nb/data
```
Run the provided `get_data.py` script to download the [DeepMind CodeContests dataset](https://huggingface.co/datasets/deepmind/code_contests) and split it into `problems.jsonl` and `solutions.jsonl`.
**NOTE:** Do NOT include the generated data in your submission. The `.gitignore` will do this for you.
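If you want to work with these files in Spark, a minimal loading sketch might look like the following. The paths are assumptions; point them at wherever `get_data.py` actually writes the files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical paths -- adjust to where get_data.py wrote the files.
# spark.read.json handles JSON-lines files directly.
problems = spark.read.json("nb/data/problems.jsonl")
solutions = spark.read.json("nb/data/solutions.jsonl")
```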
### Docker Containers
Your numbers may vary significantly, but the final run should usually be the fastest.
The dataset documentation for the `difficulty` field says: "For Codeforces problems, cf_rating is a more reliable measure of difficulty when available".
For this part, you will attempt to estimate the `cf_rating` for Codeforces problems for which it is unknown. To prepare, filter the problems to `CODEFORCES` problems, then further divide them into three DataFrames (see the sketch after this list):
- train dataset: `cf_rating` is >0, and `problem_id` is an EVEN number
- test dataset: `cf_rating` is >0, and `problem_id` is an ODD number
- missing dataset: `cf_rating` is 0
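A minimal sketch of this split, assuming `problems` is the DataFrame loaded earlier, `problem_id` is numeric, and Codeforces rows are identified by a `source` column (how that column encodes `CODEFORCES` is an assumption; check the dataset documentation):

```python
from pyspark.sql.functions import col

# Hypothetical filter -- replace with however your data marks CODEFORCES problems.
cf = problems.filter(col("source") == "CODEFORCES")

# Known ratings, split by problem_id parity.
train = cf.filter((col("cf_rating") > 0) & (col("problem_id") % 2 == 0))
test = cf.filter((col("cf_rating") > 0) & (col("problem_id") % 2 == 1))

# Unknown ratings, to be predicted later.
missing = cf.filter(col("cf_rating") == 0)
```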
#### Q9: How well can a decision tree predict `cf_rating` based on `difficulty`, `time_limit`, and `memory_limit_bytes`?
Create a Spark Pipeline model with VectorAssembler and DecisionTreeRegressor stages. The max tree depth should be 5. Train it on the training data, then compute an R^2 score (`r2_score`) for predictions on the test data. The R^2 score should be your answer for this question.
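One way this could look, as a sketch rather than the required implementation. It assumes the three feature columns are numeric and reuses the hypothetical `train`/`test` names from the split above:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Assemble the three numeric features into a single vector column.
assembler = VectorAssembler(
    inputCols=["difficulty", "time_limit", "memory_limit_bytes"],
    outputCol="features",
)
dt = DecisionTreeRegressor(featuresCol="features", labelCol="cf_rating", maxDepth=5)

model = Pipeline(stages=[assembler, dt]).fit(train)

# Collect test predictions locally and score them with sklearn's r2_score.
pred = model.transform(test).select("cf_rating", "prediction").toPandas()
q9 = r2_score(pred["cf_rating"], pred["prediction"])
```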
#### Q10: Do the problems with a missing `cf_rating` appear more or less challenging than other problems?
Use your model to predict the `cf_rating` in the dataset where it is missing.
Answer with a tuple of 3 numbers:
- average `cf_rating` in the training dataset
- average `cf_rating` in the test dataset
- average **prediction** of `cf_rating` in the missing dataset
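A sketch of how these three numbers might be computed, reusing the hypothetical `train`, `test`, `missing`, and `model` names from the earlier sketches:

```python
from pyspark.sql.functions import avg

q10 = (
    # Average known rating in each labeled split.
    train.agg(avg("cf_rating")).first()[0],
    test.agg(avg("cf_rating")).first()[0],
    # Average *predicted* rating where cf_rating is missing.
    model.transform(missing).agg(avg("prediction")).first()[0],
)
```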
We should be able to run the following on your submission to directly create the mini cluster:
```sh
# data setup...
docker build . -f p5-base.Dockerfile -t p5-base
docker build . -f notebook.Dockerfile -t p5-nb
docker build . -f namenode.Dockerfile -t p5-nn
# ...
```
The relevant portion of `p5-base.Dockerfile` after this change (the `COPY requirements.txt` / `pip3 install -r` pair was replaced by a single pinned `pip3 install`; elided lines are marked with `# ...`):

```dockerfile
RUN apt-get update && apt-get install -y \
    # ... (other packages elided) ...
    sudo

# Install Python dependencies directly, with pinned versions
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 matplotlib==3.10.1

# Download and extract Hadoop
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz && \
    # ... (extraction steps elided) ...

RUN wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
# ... (extraction steps elided) ...

ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.4.1/bin"
ENV HADOOP_HOME=/hadoop-3.4.1
```