FROM ubuntu:24.04
RUN apt-get update && apt-get install -y wget curl openjdk-11-jdk python3-pip nano
# SPARK
RUN wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
# Jupyter
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 --break-system-packages
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6
services:
  nb:
    image: spark-demo
    ports:
    - "127.0.0.1:5000:5000"
    - "127.0.0.1:4040:4040"
    volumes:
    - "./nb:/nb"
    command: python3 -m jupyterlab --no-browser --ip=0.0.0.0 --port=5000 --allow-root --NotebookApp.token=''
  nn:
    image: spark-demo
    hostname: nn
    command: sh -c "hdfs namenode -format -force && hdfs namenode -D dfs.replication=1 -fs hdfs://nn:9000"
  dn:
    image: spark-demo
    command: hdfs datanode -fs hdfs://nn:9000
  spark-boss:
    image: spark-demo
    hostname: boss
    command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-master.sh && sleep infinity"
  spark-worker:
    image: spark-demo
    command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://boss:7077 -c 2 -m 2g && sleep infinity"
    deploy:
      replicas: 2
FROM ubuntu:24.04
RUN apt-get update && apt-get install -y wget curl openjdk-11-jdk python3-pip nano
# SPARK
#RUN wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
RUN wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
# Jupyter
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 matplotlib==3.10.1 --break-system-packages
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6
services:
  nb:
    image: spark-demo
    ports:
    - "127.0.0.1:5000:5000"
    - "127.0.0.1:4040:4040"
    volumes:
    - "./nb:/nb"
    command: python3 -m jupyterlab --no-browser --ip=0.0.0.0 --port=5000 --allow-root --NotebookApp.token=''
  nn:
    image: spark-demo
    hostname: nn
    command: sh -c "hdfs namenode -format -force && hdfs namenode -D dfs.replication=1 -fs hdfs://nn:9000"
  dn:
    image: spark-demo
    command: hdfs datanode -fs hdfs://nn:9000
  spark-boss:
    image: spark-demo
    hostname: boss
    command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-master.sh && sleep infinity"
  spark-worker:
    image: spark-demo
    command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://boss:7077 -c 2 -m 2g && sleep infinity"
    deploy:
      replicas: 2
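A typical way to bring this demo up (a sketch; it assumes the Dockerfile and Compose file above are saved as `Dockerfile` and `docker-compose.yml` in the same directory):

```
docker build . -t spark-demo
docker compose up -d
```

JupyterLab should then be reachable at http://localhost:5000, and the Spark driver UI at http://localhost:4040 once a session starts, per the port mappings above.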
date,holiday
01/01/2013,New Year's Day
01/01/2014,New Year's Day
01/01/2015,New Year's Day
01/01/2016,New Year's Day
01/01/2018,New Year's Day
01/01/2019,New Year's Day
01/01/2020,New Year's Day
01/01/2021,New Year's Day
01/02/2012,New Year's Day
01/02/2017,New Year's Day
01/15/2018,"Birthday of Martin Luther King, Jr."
01/16/2012,"Birthday of Martin Luther King, Jr."
01/16/2017,"Birthday of Martin Luther King, Jr."
01/17/2011,"Birthday of Martin Luther King, Jr."
01/17/2022,"Birthday of Martin Luther King, Jr."
01/18/2016,"Birthday of Martin Luther King, Jr."
01/18/2021,"Birthday of Martin Luther King, Jr."
01/19/2015,"Birthday of Martin Luther King, Jr."
01/20/2014,"Birthday of Martin Luther King, Jr."
01/20/2020,"Birthday of Martin Luther King, Jr."
01/20/2021,Inauguration Day
01/21/2013,"Birthday of Martin Luther King, Jr."
01/21/2019,"Birthday of Martin Luther King, Jr."
02/15/2016,Washington's Birthday
02/15/2021,Washington's Birthday
02/16/2015,Washington's Birthday
02/17/2014,Washington's Birthday
02/17/2020,Washington's Birthday
02/18/2013,Washington's Birthday
02/18/2019,Washington's Birthday
02/19/2018,Washington's Birthday
02/20/2012,Washington's Birthday
02/20/2017,Washington's Birthday
02/21/2011,Washington's Birthday
02/21/2022,Washington's Birthday
05/25/2015,Memorial Day
05/25/2020,Memorial Day
05/26/2014,Memorial Day
05/27/2013,Memorial Day
05/27/2019,Memorial Day
05/28/2012,Memorial Day
05/28/2018,Memorial Day
05/29/2017,Memorial Day
05/30/2011,Memorial Day
05/30/2016,Memorial Day
05/30/2022,Memorial Day
05/31/2021,Memorial Day
06/18/2021,Juneteenth National Independence Day
06/20/2022,Juneteenth National Independence Day
07/03/2015,Independence Day
07/03/2020,Independence Day
07/04/2011,Independence Day
07/04/2012,Independence Day
07/04/2013,Independence Day
07/04/2014,Independence Day
07/04/2016,Independence Day
07/04/2017,Independence Day
07/04/2018,Independence Day
07/04/2019,Independence Day
07/04/2022,Independence Day
07/05/2021,Independence Day
09/01/2014,Labor Day
09/02/2013,Labor Day
09/02/2019,Labor Day
09/03/2012,Labor Day
09/03/2018,Labor Day
09/04/2017,Labor Day
09/05/2011,Labor Day
09/05/2016,Labor Day
09/05/2022,Labor Day
09/06/2021,Labor Day
09/07/2015,Labor Day
09/07/2020,Labor Day
10/08/2012,Columbus Day
10/08/2018,Columbus Day
10/09/2017,Columbus Day
10/10/2011,Columbus Day
10/10/2016,Columbus Day
10/10/2022,Columbus Day
10/11/2021,Columbus Day
10/12/2015,Columbus Day
10/12/2020,Columbus Day
10/13/2014,Columbus Day
10/14/2013,Columbus Day
10/14/2019,Columbus Day
11/10/2017,Veterans Day
11/11/2011,Veterans Day
11/11/2013,Veterans Day
11/11/2014,Veterans Day
11/11/2015,Veterans Day
11/11/2016,Veterans Day
11/11/2019,Veterans Day
11/11/2020,Veterans Day
11/11/2021,Veterans Day
11/11/2022,Veterans Day
11/12/2012,Veterans Day
11/12/2018,Veterans Day
11/22/2012,Thanksgiving Day
11/22/2018,Thanksgiving Day
11/23/2017,Thanksgiving Day
11/24/2011,Thanksgiving Day
11/24/2016,Thanksgiving Day
11/24/2022,Thanksgiving Day
11/25/2021,Thanksgiving Day
11/26/2015,Thanksgiving Day
11/26/2020,Thanksgiving Day
11/27/2014,Thanksgiving Day
11/28/2013,Thanksgiving Day
11/28/2019,Thanksgiving Day
12/24/2021,Christmas Day
12/25/2012,Christmas Day
12/25/2013,Christmas Day
12/25/2014,Christmas Day
12/25/2015,Christmas Day
12/25/2017,Christmas Day
12/25/2018,Christmas Day
12/25/2019,Christmas Day
12/25/2020,Christmas Day
12/26/2011,Christmas Day
12/26/2016,Christmas Day
12/26/2022,Christmas Day
12/31/2022,New Year's Day
This diff is collapsed.
This diff is collapsed.
%% Cell type:code id:c8dca847-54af-4284-97d8-0682e88a6e8d tags:
``` python
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "2G")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())
```
%% Output
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/27 01:41:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%% Cell type:code id:2294e4e0-ab19-496c-980f-31df757e7837 tags:
``` python
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv
```
%% Cell type:code id:cb54bacc-b52a-4c25-93d2-2ba0f61de9b0 tags:
``` python
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("hdfs://nn:9000/sf.csv"))
```
%% Output
%% Cell type:code id:c1298818-83f6-444b-b8a0-4be5b16fd6fb tags:
``` python
from pyspark.sql.functions import col, expr
cols = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
df.select(cols).write.format("parquet").save("hdfs://nn:9000/sf.parquet")
```
%% Output
23/10/27 01:43:57 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
%% Cell type:code id:37d1ded3-ed8a-4e39-94cb-dd3a3272af91 tags:
``` python
!hdfs dfs -rm hdfs://nn:9000/sf.csv
```
%% Cell type:code id:abea48b5-e012-4ae2-a53a-e40350f94e20 tags:
``` python
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")
```
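A quick sanity check on the result (a sketch, not part of the original notebook; it assumes the cluster above is still running and `df` is the DataFrame just loaded):

```python
# Hypothetical follow-up cell: confirm the Parquet data is readable from HDFS.
df.printSchema()
print(df.count())
```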
# P4 (4% of grade): SQL and HDFS
## Overview
Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
- Mar 5: A hint about HDFS environment variables added; a dataflow diagram added; some minor typos fixed.
- Mar 5: Fix the wrong expected file size in Part 1 and the sum of blocks in Part 2.
- Mar 6: Released `autobadger` for `p4` (`0.1.6`).
- Mar 7:
  - Some minor updates to the p4 `Readme.md`.
  - Updated `autobadger` to version `0.1.7`:
    - Fixed exception handling, so Autobadger can now correctly print error messages.
    - Expanded the expected file size range in test4 `test_Hdfs_size`.
    - Made the error messages clearer.
## Introduction
You'll need to deploy a system including 6 Docker containers, like this:
<img src="arch.png" width=600>
The data flow roughly follows this diagram:
<img src="dataflow.png" width=600>
We have provided the other components; you only need to complete the work within the gRPC server and its Dockerfile.
### Client
This project will use `docker exec` to run the client in the gRPC server's container. Usage of `client.py` is as follows:
```
# Inside the server container
python3 client.py DbToHdfs
```
**Hint 2:** Consider writing a .sh script that helps you quickly test code changes. For example, you may want it to rebuild your Dockerfiles, clean up an old Compose cluster, and deploy a new cluster.
**Hint 3:** If you're low on disk space, consider running `docker system prune --volumes -f`
## Part 1: `DbToHdfs` gRPC Call
In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).
**DbToHdfs:** To be more specific, you need to:
1. Connect to the SQL server, with `CS544` as the database name and `abc` as the password. There are two tables in the database: `loans` and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the `loan_type` column of the `loans` table to their corresponding loan types. There should be **447367** rows in table `loans`. For example:
```mysql
mysql> show tables;
+-----------------+
| Tables_in_CS544 |
+-----------------+
| loan_types |
| loans |
+-----------------+
mysql> select count(*) from loans;
+----------+
| count(*) |
+----------+
| 447367 |
+----------+
```
2. What are the actual types of those loans?
Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, whose value is the corresponding `loan_type_name` from the `loan_types` table, matched on the `loan_type_id` column in `loans`.
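A sketch of such a join (the key column name in `loan_types` is an assumption here; check the real schema with `DESCRIBE loan_types;`):
```mysql
-- Sketch only: `lt.id` is an assumed column name, not confirmed by the project.
SELECT l.*, lt.loan_type_name
FROM loans l
INNER JOIN loan_types lt ON l.loan_type_id = lt.id;
```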
3. Filter the rows, keeping only those where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in HDFS, with **2x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html); see the sketch below.
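A minimal sketch of the upload, assuming the joined and filtered result is already in a pandas DataFrame named `df` (the variable name and connection details are illustrative, not a required API):
```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

# Assumption: `df` is a pandas DataFrame holding the joined, filtered table.
table = pa.Table.from_pandas(df)

# Filesystem-level defaults give the required 2x replication and 1-MB blocks.
# (HadoopFileSystem needs JAVA_HOME/CLASSPATH set correctly; see the hint in Part 2.)
hdfs = fs.HadoopFileSystem("nn", 9000, replication=2,
                           default_block_size=1024*1024)
with hdfs.open_output_stream("/hdma-wi-2021.parquet") as f:
    pq.write_table(table, f)
```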
To check whether the upload was correct, you can use `docker exec -it <container_name> bash` to enter the gRPC server's container and use the HDFS command `hdfs dfs -du -h <path>` to see the file size. The expected result should look like:
```
14.4 M 28.9 M hdfs://nn:9000/hdma-wi-2021.parquet
```
Note: your file size might differ slightly from this.
>That's because when we join two tables, rows from one table get matched with rows in the other, but the order of output rows is not guaranteed. If we have the same rows in a different order, the compressibility of snappy (used by Parquet by default) will vary, because it is based on compression windows, and there may be more or less redundancy in a window depending on row ordering.
**Hint 1:** We used similar tables in lecture: https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/15-sql
**Hint 2:** To get more familiar with these tables, you can use SQL queries to print the table schema or retrieve sample data. A convenient way to do this is to use `docker exec -it <container name> bash` to enter the SQL Server's container, then run the mysql client (`mysql -p CS544`) and perform queries.
**Hint 3:** After `docker compose up`, the SQL Server needs some time to load the data before it is ready. Therefore, you need to wait a while or, preferably, add a retry mechanism for the SQL connection, as in the sketch below.
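A minimal retry sketch (assuming the `mysql-connector-python` package and that the SQL service is reachable at host `mysql`; adapt the connector, host, and credentials to your setup):
```python
import time
import mysql.connector  # assumption: mysql-connector-python is installed

def connect_with_retry(retries=30, delay=2):
    # The SQL server may still be loading data right after `docker compose up`,
    # so keep retrying until it accepts connections.
    for _ in range(retries):
        try:
            return mysql.connector.connect(
                host="mysql",      # assumption: the compose service name
                database="CS544",
                user="root",       # assumption
                password="abc",
            )
        except mysql.connector.Error:
            time.sleep(delay)
    raise RuntimeError("could not connect to the SQL server")
```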
## Part 2: `BlockLocations` gRPC Call

In this part, your task is to implement the `BlockLocations` gRPC call (you can find the interface definition in the proto file).
For example, running `docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet` should show something like this:
```
{'7eb74ce67e75': 15, 'f7747b42d254': 7, '39750756065d': 8}
```
Note: the DataNode location is the randomly generated container ID of the container running that DataNode, so your IDs will differ.

The documents [here](https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/hadoop-hdfs/WebHDFS.html) describe the WebHDFS REST API.
Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
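A hedged sketch of one way to do this over WebHDFS (it assumes the NameNode's HTTP server is on the default address `nn:9870` and that the `requests` package is available):
```python
import requests
from collections import Counter

# Assumption: the NameNode serves WebHDFS at nn:9870 (the Hadoop 3 default).
resp = requests.get("http://nn:9870/webhdfs/v1/hdma-wi-2021.parquet",
                    params={"op": "GETFILEBLOCKLOCATIONS"})
resp.raise_for_status()

# Each BlockLocation lists the hosts holding a replica of that block;
# counting hosts across all blocks gives blocks per DataNode.
blocks = resp.json()["BlockLocations"]["BlockLocation"]
counts = Counter(host for block in blocks for host in block["hosts"])
print(dict(counts))
```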
**Hint:** You have to set the `CLASSPATH` environment variable appropriately to access HDFS. See the example [here](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/lec/18-hdfs/notebook.Dockerfile?ref_type=heads).
## Part 3: `CalcAvgLoan` gRPC Call
In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).
```
docker compose up -d
```
Then run the client like this:
```
docker exec p4-server-1 python3 /client.py DbToHdfs
docker exec p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet
docker exec p4-server-1 python3 /client.py CalcAvgLoan -c 55001
```
Note that we will copy in the provided files (docker-compose.yml, client.py, lender.proto, hdma-wi-2021.sql.gz, etc.), overwriting anything you might have changed. Please do NOT push hdma-wi-2021.sql.gz to your repo, because it is large and we want to keep the repos small.
Please make sure you have `client.py` copied into the p4-server image. We will run it inside that container.
## Tester
Please be sure that your installed `autobadger` is version `0.1.7`. You can print the version with:
```bash
autobadger --info
```
See [projects.md](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/projects.md#testing) for more information.