P5 draft, write questions

```python
problems_df.limit(5).show()
```
If loaded properly, you should see:
![image.png](image.png)
#### Q1: How many problems are there with a `cf_rating` of at least 1600, having `private_tests`, and a name containing "_A." (Case Sensitive)? Answer by directly using the RDD API.
Find the number of problems that meet all these criteria:
- `cf_rating` of at least 1600
- Has `private_tests`
- The problem `name` contains `"_A."` (**Case Sensitive**)
You must use `problems_df.rdd` to answer this question. After calling `filter`, call `count` to get the final answer.
Remember that if you have a Spark DataFrame `df`, you can get the underlying RDD using `df.rdd`.
**REMEMBER TO INCLUDE `#q1` AT THE TOP OF THIS CELL**
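
For reference, here is a minimal sketch of the RDD filter/count pattern; the condition shown is a hypothetical placeholder, not the actual Q1 criteria.

```python
# Minimal sketch of filtering an RDD derived from a DataFrame; each element of
# problems_df.rdd is a Row, so columns can be read as attributes.
# The condition below is a hypothetical placeholder, not the Q1 criteria.
num = (problems_df.rdd
       .filter(lambda row: row.cf_rating >= 1000 and "Sample" in row.name)
       .count())
print(num)
```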
#### Q2: How many problems are there with a `cf_rating` of at least 1600, having `private_tests`, and a name containing "_A." (Case Sensitive)? Answer by using the DataFrame API.
Solve the same problem as Q1, but you **must** use `problems_df.filter` and `expr`. Again, call `count` to get the final answer. If done correctly, you should get the same answer as in Q1; the point is to give you practice interacting with Spark in different ways.
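
As a rough sketch (with a placeholder condition, not the Q2 criteria), the `filter` + `expr` pattern looks like this:

```python
from pyspark.sql.functions import expr

# Sketch of DataFrame filtering with expr(); the SQL expression below is a
# hypothetical placeholder, not the actual Q2 criteria.
num = problems_df.filter(expr("cf_rating >= 1000 AND name LIKE '%Sample%'")).count()
print(num)
```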
#### Q3: How many problems are there with a `cf_rating` of at least 1600, having `private_tests`, and a name containing "_A." (Case Sensitive)? Answer by using Spark SQL.
Solve the same problem as the previous two questions, but this time use `spark.sql`. First, write the problems to a table:
```python
problems_df.write.saveAsTable("problems", mode="overwrite")
```
Writing the problem data to a Hive table lets you refer to it by name in your `spark.sql` queries.
Again, the result after calling `count` should match your answers for Q1 and Q2.
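
As a rough sketch (with a placeholder WHERE clause, not the Q3 criteria), querying the saved table looks like this; `spark` is assumed to be the SparkSession you created earlier:

```python
# Sketch of querying the Hive table by name; the WHERE clause is a
# hypothetical placeholder, not the actual Q3 criteria.
row = spark.sql("""
    SELECT COUNT(*) AS num
    FROM problems
    WHERE cf_rating >= 1000
""").collect()[0]
print(row["num"])
```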
## Part 2: Hive Data Warehouse
#### Q4: Does the query plan for a GROUP BY on solutions data need to shuffle/exchange rows if the data is pre-bucketed?
We've already added the `problems` table to Hive. Now, write the data from `solutions.jsonl` to a Hive table named `solutions`. Unlike the `problems` table, use [`bucketBy`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.bucketBy.html) with your `saveAsTable` call to create 4 buckets on the column `"language"`. This divides your data into 4 buckets/groups, with all rows having the same language in the same bucket. This makes some queries faster (for example, when you `GROUP BY` on `language`, Spark might avoid shuffling data across partitions/machines).

Use Spark SQL to explain the query plan for this query:

```sql
SELECT language, COUNT(*)
FROM solutions
GROUP BY language
```
The `explain` output suffices for your answer. It should show `== Physical Plan ==` followed by the plan details, and you've bucketed correctly if `Bucketed: true` appears in the output. Take note (for your own sake) whether any `Exchange` appears, and think about why an exchange/shuffle is or is not needed between the `partial_count` and `count` aggregates.
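
A rough sketch of what the bucketed write and the plan inspection might look like; `solutions_df` is assumed to be the DataFrame you loaded from `solutions.jsonl`:

```python
# Sketch of a bucketed write followed by plan inspection; solutions_df is an
# assumed name for the DataFrame loaded from solutions.jsonl.
(solutions_df.write
    .bucketBy(4, "language")
    .saveAsTable("solutions", mode="overwrite"))

spark.sql("""
    SELECT language, COUNT(*)
    FROM solutions
    GROUP BY language
""").explain()
```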
#### Q5: What tables/views are in our warehouse?
You'll notice additional CSV files in `nb/data` that we haven't used yet. Create a Hive view for each using `createOrReplaceTempView`. Use these files:
You may use any method for this question. Join the `solutions` table with the `problems` table.
Answer Q6 with code and a single integer. **DO NOT HARDCODE THE CODEFORCES ID**.
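
A rough sketch of one way to join the two tables with Spark SQL; the join keys below are hypothetical placeholders, so check the actual schemas (for example with `spark.table("problems").printSchema()`) and adapt the query to what Q6 actually asks for:

```python
# Sketch of joining the Hive tables; the join keys are hypothetical
# placeholders -- inspect the real schemas before relying on them.
spark.sql("""
    SELECT COUNT(*) AS num
    FROM solutions s
    JOIN problems p ON s.problem_id = p.problem_id
""").show()
```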
#### Q7: How many problems are of easy/medium/hard difficulty?
The `problems_df` has a numeric `difficulty` column. Use the [`withColumn` function](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html) to create a new column called `difficulty_str` that categorizes each problem as follows:
- `<= 5` is `Easy`
- `<= 10` is `Medium`
- Anything else is `Hard`

Your answer should return this dictionary:
**Hint:** https://www.w3schools.com/sql/sql_case.asp
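
A minimal sketch of the `withColumn` + SQL `CASE` pattern; the `Hard` fallback and the `{category: count}` dictionary shape are assumptions, so match the exact rules and format the question specifies:

```python
from pyspark.sql.functions import expr

# Sketch of adding difficulty_str with a CASE expression, then counting rows
# per category; the 'Hard' fallback and the dict shape are assumptions.
categorized = problems_df.withColumn("difficulty_str", expr("""
    CASE WHEN difficulty <= 5  THEN 'Easy'
         WHEN difficulty <= 10 THEN 'Medium'
         ELSE 'Hard' END
"""))
answer = {row["difficulty_str"]: row["count"]
          for row in categorized.groupBy("difficulty_str").count().collect()}
print(answer)
```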
#### Q8: Does caching make it faster to compute averages over a subset of a bigger dataset?
To test the impact of caching, we are going to run the same query with and without cached data. Implement a query that first filters rows of `problem_tests` to get the rows where `is_generated` is `False`, then computes the average `input_chars` and `output_chars` per problem id. Use a variable to refer to the filtered DataFrame, and cache only that `is_generated` filtering.

Run the following experiment:

1. compute the averages
2. make a call to cache the filtered data
3. compute the averages
4. compute the averages
5. uncache the data

Measure the number of seconds each of the three average calculations takes. Answer with a list of the three times, in order, as follows:
```
[0.48511195182800293, 0.47667789459228516, 0.1396317481994629]
```
Your numbers may vary significantly, but the final run should usually be the fastest.
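
A rough sketch of one way to structure the timing experiment; the DataFrame name `problem_tests_df` and the `problem_id` column are assumptions, so adapt them to the actual data:

```python
import time
from pyspark.sql.functions import avg

# Sketch of the caching experiment; problem_tests_df and problem_id are
# assumed names -- adapt them to the actual problem_tests data.
filtered = problem_tests_df.filter("is_generated = false")
averages = filtered.groupBy("problem_id").agg(
    avg("input_chars"), avg("output_chars"))

def timed_run():
    start = time.time()
    averages.collect()            # force Spark to actually run the query
    return time.time() - start

times = []
times.append(timed_run())         # run 1: nothing cached yet
filtered.cache()                  # cache only the filtered data (lazy)
times.append(timed_run())         # run 2: populates and then uses the cache
times.append(timed_run())         # run 3: fully cached
filtered.unpersist()              # uncache when done
print(times)
```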
## Part 4: Machine Learning with Spark