I think both approaches work for this project: the Jupyter approach is more detailed and fundamental, while the application approach is more engaging.
Jupyter Way
Part 1: Setup and SQL Query
Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
In `Dockerfile.sql`, we download the data, deploy the SQL server, and get it ready to be queried. Then the whole system can be brought up by running `docker compose up`.
Q1: Connect to SQL server and query
In Jupyter, use `mysql.connector` to connect to the SQL server, run the specified queries, and print the results.
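A minimal sketch of a Q1 cell, assuming a MySQL container reachable as host `sql` and placeholder credentials, database, and table names:

```python
# Q1 sketch: host, user, password, database, and table below are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="sql", user="root",
                               password="example", database="mydb")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM some_table")  # hypothetical query
for row in cur.fetchall():
    print(row)
cur.close()
```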
Q2: Persist a table from SQL
Read a table from the SQL server and save it separately as `input.parquet`.
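A possible Q2 cell, reusing the placeholder connection settings from Q1; `some_table` is again hypothetical:

```python
# Q2 sketch: read one table into pandas, then persist it locally as input.parquet.
import mysql.connector
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn = mysql.connector.connect(host="sql", user="root",
                               password="example", database="mydb")
df = pd.read_sql("SELECT * FROM some_table", conn)
pq.write_table(pa.Table.from_pandas(df), "input.parquet")
conn.close()
```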
Part 2: Data Upload and HDFS Status
Q3: Check the number of live DataNodes
Run the `hdfs dfsadmin -fs hdfs://boss:9000 -report` command to get the status of HDFS. Then upload `input.parquet` to HDFS with 2x replication.
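One way to run both steps from the notebook is via `subprocess`; this sketch assumes the `hdfs` client is on the notebook container's PATH:

```python
# Q3 sketch: report cluster status, then upload input.parquet with 2x replication.
import subprocess

report = subprocess.run(["hdfs", "dfsadmin", "-fs", "hdfs://boss:9000", "-report"],
                        capture_output=True, text=True).stdout
print(report)  # the "Live datanodes (N):" line gives the DataNode count

subprocess.run(["hdfs", "dfs", "-D", "dfs.replication=2",
                "-put", "input.parquet", "hdfs://boss:9000/input.parquet"],
               check=True)
```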
Q4: What are the logical and physical sizes of the Parquet files?
Run `hdfs dfs -du -h hdfs://boss:9000/`.
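For reference, `-du` prints two sizes per path: the logical file size and the physical space consumed across all replicas. A sketch of running it from the notebook:

```python
# Q4 sketch: first column = logical size, second = space consumed including replication.
import subprocess

print(subprocess.run(["hdfs", "dfs", "-du", "-h", "hdfs://boss:9000/"],
                     capture_output=True, text=True).stdout)
```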
Part 3: PyArrow
Q5: What is the average of XXX? (something like this)
Use PyArrow to read from HDFS and do some calculation. Then ask them to do some more complex calculations and store the results as `output.parquet` back in HDFS with 1x replication.
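A sketch of the Q5 flow, assuming the notebook container has the Hadoop client libraries that `pyarrow.fs.HadoopFileSystem` needs; the column name `x` and the group-by aggregation are placeholders for whatever calculation the assignment finally asks for:

```python
# Q5 sketch: read input.parquet from HDFS, compute an average, then write a
# derived table back with a replication factor of 1.
import pyarrow.compute as pc
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("boss", 9000)
tbl = pq.read_table("/input.parquet", filesystem=hdfs)
print(pc.mean(tbl["x"]))  # "What is the average of X?"

# A second filesystem handle whose default replication is 1 is one way to
# control the replication factor of the written file.
hdfs_1x = fs.HadoopFileSystem("boss", 9000, replication=1)
result = tbl.group_by("x").aggregate([("x", "count")])  # placeholder "more complex" calculation
pq.write_table(result, "/output.parquet", filesystem=hdfs_1x)
```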
Q6: Block distribution across the two DataNodes for `output.parquet` (2x)
Use the WebHDFS `OPEN` operation with an `offset` per block (starting at 0) and `noredirect=true` to get it. The output looks like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
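A sketch of the Q6 counting loop, assuming WebHDFS is exposed on the NameNode at `http://boss:9870` (the default HTTP port in Hadoop 3; adjust if the compose file maps it differently). It issues one `OPEN` per block offset and counts which DataNode host each response points to:

```python
# Q6 sketch: count how many blocks of /output.parquet each DataNode serves.
import requests
from urllib.parse import urlparse

BASE = "http://boss:9870/webhdfs/v1"   # assumed WebHDFS endpoint
PATH = "/output.parquet"

# GETFILESTATUS gives the file length and block size needed to step through blocks.
status = requests.get(f"{BASE}{PATH}?op=GETFILESTATUS").json()["FileStatus"]
length, block_size = status["length"], status["blockSize"]

counts = {}
for offset in range(0, length, block_size):
    # noredirect=true returns the DataNode URL as JSON instead of a 307 redirect.
    r = requests.get(f"{BASE}{PATH}?op=OPEN&offset={offset}&noredirect=true")
    datanode = urlparse(r.json()["Location"]).hostname
    counts[datanode] = counts.get(datanode, 0) + 1

print(counts)  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
```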
Q7: Block distribution across the two DataNodes for `output.parquet` (1x)
Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block for counting.
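A sketch for Q7 using the same assumed WebHDFS endpoint; each `BlockLocation` entry lists the hosts holding that block, so counting hosts covers every replica:

```python
# Q7 sketch: tally block replicas per DataNode via GETFILEBLOCKLOCATIONS.
import requests
from collections import Counter

resp = requests.get(
    "http://boss:9870/webhdfs/v1/output.parquet?op=GETFILEBLOCKLOCATIONS").json()

counts = Counter()
for block in resp["BlockLocations"]["BlockLocation"]:
    for host in block["hosts"]:   # one entry per replica of this block
        counts[host] += 1
print(dict(counts))
```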
Part 4: Disaster Strikes
Kill one DataNode manually.
Q8: How many live DataNodes are in the cluster?
Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again, this time expecting `Live datanodes (1)`. Ask students to access `result.parquet`, which is expected to fail.
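The failing read could be demonstrated with something like the following sketch (PyArrow surfaces HDFS errors as `OSError`; the file name follows the question as written):

```python
# Q8 sketch: with only one DataNode alive, reading the 1x-replicated file should fail.
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("boss", 9000)
try:
    pq.read_table("/result.parquet", filesystem=hdfs)
except OSError as e:
    print("read failed as expected:", e)
```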
Q9: How many blocks of `single.parquet` were lost?
Use `OPEN` or `GETFILEBLOCKLOCATIONS` to determine that.
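With `GETFILEBLOCKLOCATIONS`, one way to count lost blocks is to count blocks whose `hosts` list is empty after the DataNode is marked dead; a sketch, keeping the file name from the question:

```python
# Q9 sketch: blocks with no remaining live host are counted as lost.
import requests

resp = requests.get(
    "http://boss:9870/webhdfs/v1/single.parquet?op=GETFILEBLOCKLOCATIONS").json()
blocks = resp["BlockLocations"]["BlockLocation"]
lost = sum(1 for b in blocks if not b["hosts"])
print(f"{lost} of {len(blocks)} blocks lost")
```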
Q10: Return a specific line of the output by recalculating it from the replicated `input.parquet`.
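A sketch of the Q10 recovery path, reusing the placeholder calculation from the Q5 sketch; the row index is hypothetical:

```python
# Q10 sketch: input.parquet (2x replication) is still readable, so recompute
# the output from it and return the requested row.
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("boss", 9000)
tbl = pq.read_table("/input.parquet", filesystem=hdfs)
result = tbl.group_by("x").aggregate([("x", "count")])  # same placeholder calculation as Q5
print(result.slice(5, 1))  # hypothetical "specific line": row index 5
```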
Application Way
Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
Then the main system can be brought up by running `docker compose up`.
Students need to:
- Define interfaces, using `grpc` or `flask`.
- Write a `server.py` (see the sketch after this list): read data from SQL, save it as `input.parquet`, store `input.parquet` in HDFS with 1x rep, do the calculation, store `output.parquet` in HDFS with 1x rep, then start serving (`grpc` or `flask`).
- Manually kill one DataNode.
- Add logic for data disaster recovery:
  - If the output data is incomplete, read from the input and compute the result directly.
  - If a DataNode has restarted, recompute and store the output.
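A minimal sketch of what `server.py` could look like with Flask (a gRPC variant would follow the same structure); the connection settings, column name `x`, placeholder calculation, and the `/query` route are illustrative assumptions, not part of the spec:

```python
# server.py sketch: build the datasets, serve results, and fall back to
# recomputation when HDFS blocks are lost.
import mysql.connector
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from flask import Flask, jsonify
from pyarrow import fs

app = Flask(__name__)
hdfs = fs.HadoopFileSystem("boss", 9000, replication=1)  # 1x replication for writes


def build_datasets():
    """Read from SQL, persist input.parquet, compute and persist output.parquet."""
    conn = mysql.connector.connect(host="sql", user="root",
                                   password="example", database="mydb")
    df = pd.read_sql("SELECT * FROM some_table", conn)  # placeholder table
    conn.close()
    tbl = pa.Table.from_pandas(df)
    pq.write_table(tbl, "/input.parquet", filesystem=hdfs)
    out = tbl.group_by("x").aggregate([("x", "count")])  # placeholder calculation
    pq.write_table(out, "/output.parquet", filesystem=hdfs)


@app.route("/query")
def query():
    try:
        out = pq.read_table("/output.parquet", filesystem=hdfs)
    except OSError:
        # Disaster recovery: if the output is unreadable, recompute it from the
        # input; if the input is also gone, rebuild everything from SQL.
        try:
            tbl = pq.read_table("/input.parquet", filesystem=hdfs)
            out = tbl.group_by("x").aggregate([("x", "count")])
            pq.write_table(out, "/output.parquet", filesystem=hdfs)
        except OSError:
            build_datasets()
            out = pq.read_table("/output.parquet", filesystem=hdfs)
    return jsonify(out.to_pydict())


if __name__ == "__main__":
    build_datasets()
    app.run(host="0.0.0.0", port=5000)
```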