@@ -17,7 +17,7 @@ Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
* none yet
- 3/15/2025: Fixed the p5-base Dockerfile and added a note not to include the datasets in your submission.
## Setup
...
...
@@ -46,6 +46,8 @@ mkdir -p nb/data
Run the provided `get_data.py` script to download the [DeepMind CodeContests dataset](https://huggingface.co/datasets/deepmind/code_contests) and split it into `problems.jsonl` and `solutions.jsonl`.
**NOTE:** Do NOT include the generated data in your submission. The `.gitignore` will do this for you.
### Docker Containers
```sh
...
...
@@ -213,14 +215,15 @@ Your numbers may vary significantly, but the final run should usually be the fas
The dataset documentation for the `difficulty` field says: "For Codeforces problems, cf_rating is a more reliable measure of difficulty when available".
For this part, you will attempt to estimate the `cf_rating` of Codeforces problems for which it is unknown. To prepare, filter the problems to `CODEFORCES` problems, then further divide them into three DataFrames:
* train dataset: `cf_rating` is >0, and `problem_id` is an EVEN number
* test dataset: `cf_rating` is >0, and `problem_id` is an ODD number
* missing dataset: `cf_rating` is 0
For this part, you will attempt to estimate the `cf_rating` of Codeforces problems for which it is unknown. To prepare, filter the problems to `CODEFORCES` problems, then further divide them into three DataFrames:
- train dataset: `cf_rating` is >0, and `problem_id` is an EVEN number
- test dataset: `cf_rating` is >0, and `problem_id` is an ODD number
- missing dataset: `cf_rating` is 0
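
The split rule above can be sketched in plain Python, independent of Spark, to make the three conditions concrete. This is only an illustration: the record layout (dicts with a string `source` field and an integer `problem_id`) and the helper name `split_codeforces` are assumptions, not part of the provided code. In the project itself, the same predicates would be expressed as Spark DataFrame filters.

```python
def split_codeforces(problems):
    """Split problem records into (train, test, missing) lists.

    Assumption: each record is a dict with a string "source",
    a numeric "cf_rating", and an integer "problem_id".
    """
    # Keep only Codeforces problems before applying the rating rules.
    cf = [p for p in problems if p["source"] == "CODEFORCES"]

    # Known rating, even id -> train; known rating, odd id -> test.
    train = [p for p in cf if p["cf_rating"] > 0 and p["problem_id"] % 2 == 0]
    test = [p for p in cf if p["cf_rating"] > 0 and p["problem_id"] % 2 == 1]

    # A rating of 0 marks problems whose rating is unknown.
    missing = [p for p in cf if p["cf_rating"] == 0]
    return train, test, missing


# Hypothetical sample records, made up for illustration only.
sample = [
    {"problem_id": 0, "source": "CODEFORCES", "cf_rating": 1200},
    {"problem_id": 1, "source": "CODEFORCES", "cf_rating": 1500},
    {"problem_id": 2, "source": "CODEFORCES", "cf_rating": 0},
    {"problem_id": 3, "source": "CODEFORCES", "cf_rating": 0},
    {"problem_id": 4, "source": "OTHER", "cf_rating": 900},
]
train, test, missing = split_codeforces(sample)
```

Note that the three sets are disjoint, and that a problem with rating 0 goes to the missing set regardless of whether its `problem_id` is even or odd.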
#### Q9: How well can a decision tree predict `cf_rating` based on `difficulty`, `time_limit`, and `memory_limit_bytes`?
Create a Spark Pipeline model with `VectorAssembler` and `DecisionTreeRegressor` stages. The max tree depth should be 5. Train it on the training data, then compute an R^2 score (`r2_score`) for predictions on the test data. The R^2 score should be your answer for this question.
Create a Spark Pipeline model with `VectorAssembler` and `DecisionTreeRegressor` stages. The max tree depth should be 5. Train it on the training data, then compute an R^2 score (`r2_score`) for predictions on the test data. The R^2 score should be your answer for this question.
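
For reference, the R^2 score compares the model's squared error against that of always predicting the mean of the actual values. The following plain-Python sketch shows the standard definition. The helper name `r_squared` and the sample ratings are made up for illustration, and this is not necessarily how the score must be computed for the project.

```python
def r_squared(actual, predicted):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical ratings, for illustration only.
actual = [800, 1200, 1600, 2000]
perfect = r_squared(actual, actual)                         # exact predictions
baseline = r_squared(actual, [1400, 1400, 1400, 1400])      # always the mean
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean, and negative values mean it does worse.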
#### Q10: Do the problems with a missing `cf_rating` appear more or less challenging than other problems?
...
...
@@ -228,9 +231,9 @@ Use your model to predict the `cf_score` in the dataset where it is missing.
Answer with a tuple with 3 numbers:
* average `cf_rating` in the training dataset
* average `cf_rating` in the test dataset
* average **prediction** of `cf_rating` in the missing dataset
- average `cf_rating` in the training dataset
- average `cf_rating` in the test dataset
- average **prediction** of `cf_rating` in the missing dataset
For example:
...
...
@@ -241,8 +244,6 @@ For example:
We should be able to run the following on your submission to directly create the mini cluster: