Skip to content
Snippets Groups Projects

Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Source

Select target project
No results found

Target

Select target project
  • cdis/cs/courses/cs544/s25/main
  • zzhang2478/main
  • spark667/main
  • vijayprabhak/main
  • vijayprabhak/544-main
  • wyang338/cs-544-s-25
  • jmin39/main
7 results
Show changes
Commits on Source (34)
Showing
with 11338 additions and 45 deletions
File added
File added
File added
File added
File added
File added
File added
File added
This diff is collapsed.
This diff is collapsed.
FROM ubuntu:24.04
RUN apt-get update; apt-get install -y wget curl openjdk-11-jdk python3-pip nano
# SPARK
RUN wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
# Jupyter
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 --break-system-packages
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6
services:
nb:
image: spark-demo
ports:
- "127.0.0.1:5000:5000"
- "127.0.0.1:4040:4040"
volumes:
- "./nb:/nb"
command: python3 -m jupyterlab --no-browser --ip=0.0.0.0 --port=5000 --allow-root --NotebookApp.token=''
nn:
image: spark-demo
hostname: nn
command: sh -c "hdfs namenode -format -force && hdfs namenode -D dfs.replication=1 -fs hdfs://nn:9000"
dn:
image: spark-demo
command: hdfs datanode -fs hdfs://nn:9000
spark-boss:
image: spark-demo
hostname: boss
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-master.sh && sleep infinity"
spark-worker:
image: spark-demo
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://boss:7077 -c 2 -m 2g && sleep infinity"
deploy:
replicas: 2
This diff is collapsed.
FROM ubuntu:24.04
RUN apt-get update; apt-get install -y wget curl openjdk-11-jdk python3-pip nano
# SPARK
#RUN wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
RUN wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
# Jupyter
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 matplotlib==3.10.1 --break-system-packages
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6
services:
nb:
image: spark-demo
ports:
- "127.0.0.1:5000:5000"
- "127.0.0.1:4040:4040"
volumes:
- "./nb:/nb"
command: python3 -m jupyterlab --no-browser --ip=0.0.0.0 --port=5000 --allow-root --NotebookApp.token=''
nn:
image: spark-demo
hostname: nn
command: sh -c "hdfs namenode -format -force && hdfs namenode -D dfs.replication=1 -fs hdfs://nn:9000"
dn:
image: spark-demo
command: hdfs datanode -fs hdfs://nn:9000
spark-boss:
image: spark-demo
hostname: boss
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-master.sh && sleep infinity"
spark-worker:
image: spark-demo
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://boss:7077 -c 2 -m 2g && sleep infinity"
deploy:
replicas: 2
date,holiday
01/01/2013,New Year's Day
01/01/2014,New Year's Day
01/01/2015,New Year's Day
01/01/2016,New Year's Day
01/01/2018,New Year's Day
01/01/2019,New Year's Day
01/01/2020,New Year's Day
01/01/2021,New Year's Day
01/02/2012,New Year's Day
01/02/2017,New Year's Day
01/15/2018,"Birthday of Martin Luther King, Jr."
01/16/2012,"Birthday of Martin Luther King, Jr."
01/16/2017,"Birthday of Martin Luther King, Jr."
01/17/2011,"Birthday of Martin Luther King, Jr."
01/17/2022,"Birthday of Martin Luther King, Jr."
01/18/2016,"Birthday of Martin Luther King, Jr."
01/18/2021,"Birthday of Martin Luther King, Jr."
01/19/2015,"Birthday of Martin Luther King, Jr."
01/20/2014,"Birthday of Martin Luther King, Jr."
01/20/2020,"Birthday of Martin Luther King, Jr."
01/20/2021,Inauguration Day
01/21/2013,"Birthday of Martin Luther King, Jr."
01/21/2019,"Birthday of Martin Luther King, Jr."
02/15/2016,Washington's Birthday
02/15/2021,Washington's Birthday
02/16/2015,Washington's Birthday
02/17/2014,Washington's Birthday
02/17/2020,Washington's Birthday
02/18/2013,Washington's Birthday
02/18/2019,Washington's Birthday
02/19/2018,Washington's Birthday
02/20/2012,Washington's Birthday
02/20/2017,Washington's Birthday
02/21/2011,Washington's Birthday
02/21/2022,Washington's Birthday
05/25/2015,Memorial Day
05/25/2020,Memorial Day
05/26/2014,Memorial Day
05/27/2013,Memorial Day
05/27/2019,Memorial Day
05/28/2012,Memorial Day
05/28/2018,Memorial Day
05/29/2017,Memorial Day
05/30/2011,Memorial Day
05/30/2016,Memorial Day
05/30/2022,Memorial Day
05/31/2021,Memorial Day
06/18/2021,Juneteenth National Independence Day
06/20/2022,Juneteenth National Independence Day
07/03/2015,Independence Day
07/03/2020,Independence Day
07/04/2011,Independence Day
07/04/2012,Independence Day
07/04/2013,Independence Day
07/04/2014,Independence Day
07/04/2016,Independence Day
07/04/2017,Independence Day
07/04/2018,Independence Day
07/04/2019,Independence Day
07/04/2022,Independence Day
07/05/2021,Independence Day
09/01/2014,Labor Day
09/02/2013,Labor Day
09/02/2019,Labor Day
09/03/2012,Labor Day
09/03/2018,Labor Day
09/04/2017,Labor Day
09/05/2011,Labor Day
09/05/2016,Labor Day
09/05/2022,Labor Day
09/06/2021,Labor Day
09/07/2015,Labor Day
09/07/2020,Labor Day
10/08/2012,Columbus Day
10/08/2018,Columbus Day
10/09/2017,Columbus Day
10/10/2011,Columbus Day
10/10/2016,Columbus Day
10/10/2022,Columbus Day
10/11/2021,Columbus Day
10/12/2015,Columbus Day
10/12/2020,Columbus Day
10/13/2014,Columbus Day
10/14/2013,Columbus Day
10/14/2019,Columbus Day
11/10/2017,Veterans Day
11/11/2011,Veterans Day
11/11/2013,Veterans Day
11/11/2014,Veterans Day
11/11/2015,Veterans Day
11/11/2016,Veterans Day
11/11/2019,Veterans Day
11/11/2020,Veterans Day
11/11/2021,Veterans Day
11/11/2022,Veterans Day
11/12/2012,Veterans Day
11/12/2018,Veterans Day
11/22/2012,Thanksgiving Day
11/22/2018,Thanksgiving Day
11/23/2017,Thanksgiving Day
11/24/2011,Thanksgiving Day
11/24/2016,Thanksgiving Day
11/24/2022,Thanksgiving Day
11/25/2021,Thanksgiving Day
11/26/2015,Thanksgiving Day
11/26/2020,Thanksgiving Day
11/27/2014,Thanksgiving Day
11/28/2013,Thanksgiving Day
11/28/2019,Thanksgiving Day
12/24/2021,Christmas Day
12/25/2012,Christmas Day
12/25/2013,Christmas Day
12/25/2014,Christmas Day
12/25/2015,Christmas Day
12/25/2017,Christmas Day
12/25/2018,Christmas Day
12/25/2019,Christmas Day
12/25/2020,Christmas Day
12/26/2011,Christmas Day
12/26/2016,Christmas Day
12/26/2022,Christmas Day
12/31/2022,New Year's Day
This diff is collapsed.
This diff is collapsed.
%% Cell type:code id:c8dca847-54af-4284-97d8-0682e88a6e8d tags:
``` python
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("cs544")
.master("spark://boss:7077")
.config("spark.executor.memory", "2G")
.config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate())
```
%% Output
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/27 01:41:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%% Cell type:code id:2294e4e0-ab19-496c-980f-31df757e7837 tags:
``` python
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv
```
%% Cell type:code id:cb54bacc-b52a-4c25-93d2-2ba0f61de9b0 tags:
``` python
df = (spark.read.format("csv")
.option("header", True)
.option("inferSchema", True)
.load("hdfs://nn:9000/sf.csv"))
```
%% Output
%% Cell type:code id:c1298818-83f6-444b-b8a0-4be5b16fd6fb tags:
``` python
from pyspark.sql.functions import col, expr
cols = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
df.select(cols).write.format("parquet").save("hdfs://nn:9000/sf.parquet")
```
%% Output
23/10/27 01:43:57 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
%% Cell type:code id:37d1ded3-ed8a-4e39-94cb-dd3a3272af91 tags:
``` python
!hdfs dfs -rm hdfs://nn:9000/sf.csv
```
%% Cell type:code id:abea48b5-e012-4ae2-a53a-e40350f94e20 tags:
``` python
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")
```
# DRAFT! Don't start yet.
# P4 (4% of grade): SQL and HDFS
## Overview
In this project, you will depoly a data system consisting of an SQL server, an HDFS cluster, a gRPC server, and a client. You will need to read and filter data from a SQL database and persist it to HDFS. Then, you will partition the data to enable efficient request processing. Additionally, you will need to write a fault-tolerant application that works even when an HDFS DataNode fails (we will test this scenario).
In this project, you will depoly a data system consisting of an SQL server, an HDFS cluster, a gRPC server, and a client. You will need to read and filter data from a SQL database and persist it to HDFS. Additionally, you will write a fault-tolerant application that works even when an HDFS DataNode fails (we will test this scenario).
Learning objectives:
* communicate with the SQL Server using SQL queries
......@@ -16,22 +14,38 @@ Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
* none yet
- Mar 5: A hint about HDFS environment variables added; a dataflow diagram added; some minor typos fixed.
- Mar 5: Fix the wrong expected file size in Part 1 and sum of blocks in Part 2.
- Mar 6: Released `autobadger` for `p4` (`0.1.6`)
- Mar 7:
- Some minor updates on p4 `Readme.md`.
- Update `autobadgere` to version `0.1.7`
- Fixed exception handling, now Autobadger can correctly print error messages.
- Expanded the expected file size range in test4 `test_Hdfs_size`.
- Make the error messages clearer.
## Introduction
You'll need to deploy a system including 5 docker containers like this:
You'll need to deploy a system including 6 docker containers like this:
<img src="arch.png" width=600>
The data flow roughly follows this:
<img src="dataflow.png" width=600>
We have provided the other components; what you only need is to complete the work within the gRPC server and its Dockerfile.
### Client
This project will use `docker exec -it` to run the client on the gRPC server's container. Usage of `client.py` is as follows:
This project will use `docker exec` to run the client on the gRPC server's container. Usage of `client.py` is as follows:
```
#Inside the server container
python3 client.py DbToHdfs
python3 client.py BlockLocations -f <file_path>
python3 client.py PartitionByCounty
python3 client.py CalcAvgLoan -c <county_code>
```
......@@ -43,8 +57,7 @@ Take a look at the provided Docker compose file. There are several services, inc
You are required to write a server.py and Dockerfile.server based on the provided proto file (you may not modify it).
The gRPC interfaces have already been defined (see `lender.proto` for details). There are no constraints on the return values of `DbToHdfs` and `PartitionByCounty`, so you may return what you think is helpful.
The gRPC interfaces have already been defined (see `lender.proto` for details). There are no constraints on the return values of `DbToHdfs`, so you may return what you think is helpful.
### Docker image naming
You need to build the Docker image following this naming convention.
......@@ -67,7 +80,7 @@ export PROJECT=p4
**Hint 2:** Think about whether there is any .sh script that will help you quickly test code changes. For example, you may want it to rebuild your Dockerfiles, cleanup an old Compose cluster, and deploy a new cluster.
**Hint 3:** If you're low on disk space, consider running `docker system prune -a --volumes -f`
**Hint 3:** If you're low on disk space, consider running `docker system prune --volumes -f`
## Part 1: `DbToHdfs` gRPC Call
......@@ -75,34 +88,37 @@ In this part, your task is to implement the `DbToHdfs` gRPC call (you can find t
**DbToHdfs:** To be more specific, you need to:
1. Connect to the SQL server, with the database name as `CS544` and the password as `abc`. There are two tables in databse: `loans` ,and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the loan_type column of the loans table to their corresponding loan types. There should be **447367** rows in table `loans`. It's like:
```mysql
mysql> show tables;
+-----------------+
| Tables_in_CS544 |
+-----------------+
| loan_types |
| loans |
+-----------------+
mysql> select count(*) from new_table;
+----------+
| count(*) |
+----------+
| 426716 |
+----------+
```
2. What are the actual types for those loans?
Perform an inner join on these two tables so that a new column `loan_type_name` added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
```mysql
mysql> show tables;
+-----------------+
| Tables_in_CS544 |
+-----------------+
| loan_types |
| loans |
+-----------------+
mysql> select count(*) from loans;
+----------+
| count(*) |
+----------+
| 447367 |
+----------+
```
2. What are the actual types for those loans?
Perform an inner join on these two tables so that a new column `loan_type_name` added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table based on the matching `loan_type_id` in `loans`.
3. Filter all rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, this table should have only **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in the HDFS, with **2x** replication and a **1-MB** block size, using PyArrow (https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html).
To check whether the upload was correct, you can use `docker exec -it` to enter the gRPC server's container and use HDFS command `hdfs dfs -du -h <path>`to see the file size. The expected result is:
```
15.3 M 30.5 M hdfs://nn:9000/hdma-wi-2021.parquet
```
To check whether the upload was correct, you can use `docker exec -it <container_name> bash` to enter the gRPC server's container and use HDFS command `hdfs dfs -du -h <path>`to see the file size. The expected result should like:
```
14.4 M 28.9 M hdfs://nn:9000/hdma-wi-2021.parquet
```
Note: Your file size might have slight difference from this.
>That's because when we join two tables, rows from one table get matches with rows in the other, but the order of output rows is not guaranteed. If we have the same rows in a different order, the compressibility of snappy (used by Parquet by default) will vary because it is based on compression windows, and there may be more or less redundancy in a window depending on row ordering.
**Hint 1:** We used similar tables in lecture: https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/15-sql
**Hint 2:** To get more familiar with these tables, you can use SQL queries to print the table schema or retrieve sample data. A convenient way to do this is to use `docker exec -it` to enter the SQL Server, then run mysql client `mysql -p CS544` to access the SQL Server and then perform queries.
**Hint 2:** To get more familiar with these tables, you can use SQL queries to print the table schema or retrieve sample data. A convenient way to do this is to use `docker exec -it <container name> bash` to enter the SQL Server, then run mysql client `mysql -p CS544` to access the SQL Server and then perform queries.
**Hint 3:** After `docker compose up`, the SQL Server needs some time to load the data before it is ready. Therefore, you need to wait for a while, or preferably, add a retry mechanism for the SQL connection.
......@@ -117,7 +133,7 @@ In this part, your task is to implement the `BlockLocations` gRPC call (you can
For example, running `docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet` should show something like this:
```
{'7eb74ce67e75': 15, 'f7747b42d254': 6, '39750756065d': 11}
{'7eb74ce67e75': 15, 'f7747b42d254': 7, '39750756065d': 8}
```
Note: DataNode location is the randomly generated container ID for the
......@@ -128,13 +144,15 @@ The documents [here](https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/h
Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
**Hint:** You have to set appropriate environment variable `CLASSPATH` to access HDFS correctly. See example [here](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/lec/18-hdfs/notebook.Dockerfile?ref_type=heads).
## Part 3: `CalcAvgLoan` gRPC Call
In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).
The call should read hdma-wi-2021.parquet, filtering to rows with the specified county code. One way to do this would be to pass a `("column", "=", ????)` tuple inside a `filters` list upon read: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html
The call should return the average loan amount from the filtered table.
The call should return the average loan amount from the filtered table as an integer (rounding down if necessary).
As an optimization, your code should also write the filtered data to a file named `partitions/<county_code>.parquet`. If there are later calls for the same county_code, your code should use the smaller, county-specific Parquet file (instead of filtering the big Parquet file with all loan applications). The county-specific Parquet file should have 1x replication. When `CalcAvgLoan` returns the average, it should also use the "source" field to indicate whether the data came from big Parquet file (`source="create"` because a new county-specific file had to be created) or a county-specific file was previously created (`source="reuse"`).
......@@ -150,7 +168,7 @@ After a `DbToHdfs` call and a few `CalcAvgLoan` calls, your HDFS directory struc
```
├── hdma-wi-2021.parquet
├── partitioned/
├── partitions/
│ ├── 55001.parquet
│ ├── 55003.parquet
│ └── ...
......@@ -158,18 +176,18 @@ After a `DbToHdfs` call and a few `CalcAvgLoan` calls, your HDFS directory struc
## Part 4: Fault Tolerance
In this part, your task is to modify the `CalcAvgLoan` gRPC calls you implemented in Part 3.
A "fault" is something that goes wrong, like a hard disk failing or an entire DataNode crashing. Fault tolerant code continues functioning for some kinds of faults.
Imagine a scenario where one (or even two) of the three DataNodes fail and go offline. In this case, can the logic implemented in Part 3 still function correctly? If we still want our gRPC server to correctly respond to requests for calculating the average loan, what should we do?
In this part, your task is to make `CalcAvgLoan` tolerant to a single DataNode failure (we will kill one during testing!).
Keep in mind that `/hdma-wi-2021.parquet` has a replication factor of 3, while `/partitioned` only has a replication factor of 1.
Recall that `CalcAvgLoan` sometimes uses small, county-specific Parquet files that have 1x replication, and sometimes it uses the big Parquet file (hdma-wi-2021.parquet) of all loan applications that uses 2x replication. Your fault tolerance strategy should be as follows:
**CalcAvgLoan:** To be more specific, modify this gRPC call to support the server in correctly handling responses even when one (or even two) DataNode is offline. You may try reading the partitioned parquet file first. If unsuccessful, then go back to the large `hdma-wi-2021.parquet` file and complete the computation. What's more, you have to return with the field `source` filled:
1. hdma-wi-2021.parquet: if you created this with 2x replication earlier, you don't need to do anything else here, because HDFS can automatically handle a single DataNode failure for you
2. partitions/<COUNTY_CODE>.parquet: this data only has 1x replication, so HDFS might lose it when the DataNode fails. That's fine, because all the rows are still in the big Parquet file. You should write code to detect this scenario and recreate the lost/corrupted county-specific file by reading the big file again with the county filter. If you try to read an HDFS file with missing data using PyArrow, the client will retry for a while (perhaps 30 seconds or so), then raise an OSError exception, which you should catch and handle
1. "partitioned": calculation performed on parquet partitioned before
2. "unpartitioned": parquet partitioned before is lost, and calculation performed on the initial unpartitioned table
CalcAvgLoan should now use the "source" field in the return value to indicate how the average was computed: "create" (from the big file, because a county-specific file didn't already exist), "recreate" (from the big file, because a county-specific file was corrupted/lost), or "reuse" (there was a valid county-specific file that was used).
To simulate a data node failure, you should use `docker kill` to terminate a node and then wait until you confirm that the number of `live DataNodes` has decreased using the `hdfs dfsadmin -fs <hdfs_path> -report` command.
**Hint:** to manually test DataNode failure, you should use `docker kill` to terminate a node and then wait until you confirm that the number of `live DataNodes` has decreased using the `hdfs dfsadmin -fs <hdfs_path> -report` command.
## Submission
......@@ -187,10 +205,25 @@ docker build . -f Dockerfile.server -t p4-server
docker compose up -d
```
We will copy in the all the dockerfiles except "Dockerfile.server", "docker-compose.yml", "client.py", "lender.proto", and "hdma-wi-2021.sql.gz", overwriting anything you might have changed.
Then run the client like this:
```
docker exec p4-server-1 python3 /client.py DbToHdfs
docker exec p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet
docker exec p4-server-1 python3 /client.py CalcAvgLoan -c 55001
```
Note that we will copy in the the provided files (docker-compose.yml, client.py, lender.proto, hdma-wi-2021.sql.gz, etc.), overwriting anything you might have changed. Please do NOT push hdma-wi-2021.sql.gz to your repo because it is large, and we want to keep the repos small.
Please make sure you have `client.py` copied into p4-server-1. We will run client.py in p4-server-1 to test your code.
Please make sure you have `client.py` copied into the p4-server image. We will run client.py in the p4-server-1 container to test your code.
## Tester
Not released yet.
Please be sure that your installed `autobadger` is on version `0.1.7`. You can print the version using
```bash
autobadger --info
```
See [projects.md](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/projects.md#testing) for more information.