P5 draft, write questions

```python
problems_df.limit(5).show()
```
If loaded properly, you should see:
![image.png](image.png)
#### Q1: How many problems are there with a `cf_rating` of at least 1600, having `private_tests`, and a name containing "_A." (Case Sensitive)? Answer by directly using the RDD API.
Find the number of problems that meet all these criteria:
- `cf_rating` of at least 1600
- Has `private_tests`
- The problem `name` contains `"_A."` (**Case Sensitive**)
You must use `problems_df.rdd` to answer this question. After calling `filter`, call `count` to get the final answer.
Remember that if you have a Spark DataFrame `df`, you can get the underlying RDD using `df.rdd`.
**REMEMBER TO INCLUDE `#q1` AT THE TOP OF THIS CELL**
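
For reference, here is a minimal sketch of the RDD filter/count pattern; the condition shown is a hypothetical placeholder, not the actual Q1 criteria.

```python
# Minimal sketch of filtering an RDD derived from a DataFrame; each element of
# problems_df.rdd is a Row, so columns can be read as attributes.
# The condition below is a hypothetical placeholder, not the Q1 criteria.
num = (problems_df.rdd
       .filter(lambda row: row.cf_rating >= 1000 and "Sample" in row.name)
       .count())
print(num)
```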
#### Q2: How many problems are there with a `cf_rating` of at least 1600, having `private_tests`, and a name containing "_A." (Case Sensitive)? Answer by using the DataFrame API.
Solve the same problem as Q1, but you **must** use `problems_df.filter` and `expr`. Again, call `count` to get the final answer. If done correctly, you should get the same answer as in Q1; the point is to give you practice interacting with Spark in different ways.
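
As a rough sketch (with a placeholder condition, not the Q2 criteria), the `filter` + `expr` pattern looks like this:

```python
from pyspark.sql.functions import expr

# Sketch of DataFrame filtering with expr(); the SQL expression below is a
# hypothetical placeholder, not the actual Q2 criteria.
num = problems_df.filter(expr("cf_rating >= 1000 AND name LIKE '%Sample%'")).count()
print(num)
```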
#### Q3: How many problems are there with a `cf_rating` of at least 1600, having `private_tests`, and a name containing "_A." (Case Sensitive)? Answer by using Spark SQL.
Solve the same problem as the previous two questions, but this time use `spark.sql`. First, write the problems to a table:
```python
problems_df.write.saveAsTable("problems", mode="overwrite")
```
Writing the problem data to a Hive table lets you refer to it by name in your `spark.sql` queries.
Again, the result after calling `count` should match your answers for Q1 and Q2.
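
As a rough sketch (with a placeholder WHERE clause, not the Q3 criteria), querying the saved table looks like this; `spark` is assumed to be the SparkSession you created earlier:

```python
# Sketch of querying the Hive table by name; the WHERE clause is a
# hypothetical placeholder, not the actual Q3 criteria.
row = spark.sql("""
    SELECT COUNT(*) AS num
    FROM problems
    WHERE cf_rating >= 1000
""").collect()[0]
print(row["num"])
```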
## Part 2: Hive Data Warehouse
#### Q4: Does the query plan for a GROUP BY on solutions data need to shuffle/exchange rows if the data is pre-bucketed?
We've already added the `problems` table to Hive. Now, write the data from `solutions.jsonl` to a Hive table named `solutions`. Unlike the `problems` table, use [`bucketBy`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.bucketBy.html) with your `saveAsTable` call to create 4 buckets on the column `"language"`. This divides your data into 4 buckets/groups, with all rows having the same language in the same bucket. This makes some queries faster (for example, when you `GROUP BY` on `language`, Spark might avoid shuffling data across partitions/machines).

Use Spark SQL to explain the query plan for this query:

```sql
SELECT language, COUNT(*)
FROM solutions
GROUP BY language
```
The `explain` output suffices for your answer. It should show `== Physical Plan ==` followed by the plan details, and you've bucketed correctly if `Bucketed: true` appears in the output. Take note (for your own sake) whether any `Exchange` appears, and think about why an exchange/shuffle is or is not needed between the `partial_count` and `count` aggregates.
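
A rough sketch of what the bucketed write and the plan inspection might look like; `solutions_df` is assumed to be the DataFrame you loaded from `solutions.jsonl`:

```python
# Sketch of a bucketed write followed by plan inspection; solutions_df is an
# assumed name for the DataFrame loaded from solutions.jsonl.
(solutions_df.write
    .bucketBy(4, "language")
    .saveAsTable("solutions", mode="overwrite"))

spark.sql("""
    SELECT language, COUNT(*)
    FROM solutions
    GROUP BY language
""").explain()
```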
#### Q5: What tables/views are in our warehouse?
You'll notice additional CSV files in `nb/data` that we haven't used yet. Create a Hive view for each using `createOrReplaceTempView`. Use these files:
You may use any method for this question. Join the `solutions` table with the `problems` table.
Answer Q6 with code and a single integer. **DO NOT HARDCODE THE CODEFORCES ID**.
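
A rough sketch of one way to join the two tables with Spark SQL; the join keys below are hypothetical placeholders, so check the actual schemas (for example with `spark.table("problems").printSchema()`) and adapt the query to what Q6 actually asks for:

```python
# Sketch of joining the Hive tables; the join keys are hypothetical
# placeholders -- inspect the real schemas before relying on them.
spark.sql("""
    SELECT COUNT(*) AS num
    FROM solutions s
    JOIN problems p ON s.problem_id = p.problem_id
""").show()
```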
#### Q7: How many problems are of easy/medium/hard difficulty?
The `problems_df` has a numeric `difficulty` column. Use the [`withColumn` function](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html) to create a new column called `difficulty_str` that categorizes each problem as follows:
- `<= 5` is `Easy`
- `<= 10` is `Medium`
- Anything else is `Hard`

Your answer should return this dictionary:
**Hint:** https://www.w3schools.com/sql/sql_case.asp
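
A minimal sketch of the `withColumn` + SQL `CASE` pattern; the `Hard` fallback and the `{category: count}` dictionary shape are assumptions, so match the exact rules and format the question specifies:

```python
from pyspark.sql.functions import expr

# Sketch of adding difficulty_str with a CASE expression, then counting rows
# per category; the 'Hard' fallback and the dict shape are assumptions.
categorized = problems_df.withColumn("difficulty_str", expr("""
    CASE WHEN difficulty <= 5  THEN 'Easy'
         WHEN difficulty <= 10 THEN 'Medium'
         ELSE 'Hard' END
"""))
answer = {row["difficulty_str"]: row["count"]
          for row in categorized.groupBy("difficulty_str").count().collect()}
print(answer)
```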
#### Q8: Does caching make it faster to compute averages over a subset of a bigger dataset?
To test the impact of caching, we are going to run the same query with and without cached data. Implement a query that first filters rows of `problem_tests` to get the rows where `is_generated` is `False`, then computes the average `input_chars` and `output_chars` per problem id. Use a variable to refer to the filtered DataFrame, and cache only that `is_generated` filtering.

Run the following experiment:

1. compute the averages
2. make a call to cache the filtered data
3. compute the averages
4. compute the averages
5. uncache the data

Measure the number of seconds each of the three average calculations takes. Answer with a list of the three times, in order, as follows:
```
[0.48511195182800293, 0.47667789459228516, 0.1396317481994629]
```
Your numbers may vary significantly, but the final run should usually be the fastest.
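
A rough sketch of one way to structure the timing experiment; the DataFrame name `problem_tests_df` and the `problem_id` column are assumptions, so adapt them to the actual data:

```python
import time
from pyspark.sql.functions import avg

# Sketch of the caching experiment; problem_tests_df and problem_id are
# assumed names -- adapt them to the actual problem_tests data.
filtered = problem_tests_df.filter("is_generated = false")
averages = filtered.groupBy("problem_id").agg(
    avg("input_chars"), avg("output_chars"))

def timed_run():
    start = time.time()
    averages.collect()            # force Spark to actually run the query
    return time.time() - start

times = []
times.append(timed_run())         # run 1: nothing cached yet
filtered.cache()                  # cache only the filtered data (lazy)
times.append(timed_run())         # run 2: populates and then uses the cache
times.append(timed_run())         # run 3: fully cached
filtered.unpersist()              # uncache when done
print(times)
```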
## Part 4: Machine Learning with Spark