I think both approaches work for this project: the Jupyter approach is more detailed and fundamental, while the application approach is more engaging.
Jupyter Way
Part 1: Setup and SQL Query
Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
In `Dockerfile.sql`, we download the data, deploy the SQL server, and get it ready to be queried. Then the whole system can be brought up by running `docker compose up`.
Q1: Connect to SQL server and query
In Jupyter, use `mysql.connector` to connect to the SQL server, run the specified queries, and print the results.
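A minimal sketch of a Q1 cell, assuming a MySQL container reachable as host `sql` and placeholder credentials, database, and table names:

```python
# Q1 sketch: host, user, password, database, and table below are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="sql", user="root",
                               password="example", database="mydb")
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM some_table")  # hypothetical query
for row in cur.fetchall():
    print(row)
cur.close()
```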
Q2: Persist a table from SQL
Read a table from the SQL server and save it separately as `input.parquet`.
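A possible Q2 cell, reusing the placeholder connection settings from Q1; `some_table` is again hypothetical:

```python
# Q2 sketch: read one table into pandas, then persist it locally as input.parquet.
import mysql.connector
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

conn = mysql.connector.connect(host="sql", user="root",
                               password="example", database="mydb")
df = pd.read_sql("SELECT * FROM some_table", conn)
pq.write_table(pa.Table.from_pandas(df), "input.parquet")
conn.close()
```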
Part 2: Data Upload and HDFS Status
Q3: Check the number of live DataNodes
Run the `hdfs dfsadmin -fs hdfs://boss:9000 -report` command to get the status of HDFS. Then upload `input.parquet` to HDFS with 2x replication.
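One way to run both steps from the notebook is via `subprocess`; this sketch assumes the `hdfs` client is on the notebook container's PATH:

```python
# Q3 sketch: report cluster status, then upload input.parquet with 2x replication.
import subprocess

report = subprocess.run(["hdfs", "dfsadmin", "-fs", "hdfs://boss:9000", "-report"],
                        capture_output=True, text=True).stdout
print(report)  # the "Live datanodes (N):" line gives the DataNode count

subprocess.run(["hdfs", "dfs", "-D", "dfs.replication=2",
                "-put", "input.parquet", "hdfs://boss:9000/input.parquet"],
               check=True)
```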
Q4: What are the logical and physical sizes of the Parquet files?
Run `hdfs dfs -du -h hdfs://boss:9000/`.
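For reference, `-du` prints two sizes per path: the logical file size and the physical space consumed across all replicas. A sketch of running it from the notebook:

```python
# Q4 sketch: first column = logical size, second = space consumed including replication.
import subprocess

print(subprocess.run(["hdfs", "dfs", "-du", "-h", "hdfs://boss:9000/"],
                     capture_output=True, text=True).stdout)
```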
Part 3: PyArrow
Q5: What is the average of XXX? (something like this)
Use PyArrow to read from HDFS and do some calculation. Then ask them to do some more complex calculations and store the results as `output.parquet` back in HDFS with 1x replication.
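A sketch of the Q5 flow, assuming the notebook container has the Hadoop client libraries that `pyarrow.fs.HadoopFileSystem` needs; the column name `x` and the group-by aggregation are placeholders for whatever calculation the assignment finally asks for:

```python
# Q5 sketch: read input.parquet from HDFS, compute an average, then write a
# derived table back with a replication factor of 1.
import pyarrow.compute as pc
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("boss", 9000)
tbl = pq.read_table("/input.parquet", filesystem=hdfs)
print(pc.mean(tbl["x"]))  # "What is the average of X?"

# A second filesystem handle whose default replication is 1 is one way to
# control the replication factor of the written file.
hdfs_1x = fs.HadoopFileSystem("boss", 9000, replication=1)
result = tbl.group_by("x").aggregate([("x", "count")])  # placeholder "more complex" calculation
pq.write_table(result, "/output.parquet", filesystem=hdfs_1x)
```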
Q6: Block distribution across the two DataNodes for `output.parquet` (2x)
Use the WebHDFS `OPEN` operation with an `offset` per block (starting at 0) and `noredirect=true` to get it. The output looks like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
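A sketch of the Q6 counting loop, assuming WebHDFS is exposed on the NameNode at `http://boss:9870` (the default HTTP port in Hadoop 3; adjust if the compose file maps it differently). It issues one `OPEN` per block offset and counts which DataNode host each response points to:

```python
# Q6 sketch: count how many blocks of /output.parquet each DataNode serves.
import requests
from urllib.parse import urlparse

BASE = "http://boss:9870/webhdfs/v1"   # assumed WebHDFS endpoint
PATH = "/output.parquet"

# GETFILESTATUS gives the file length and block size needed to step through blocks.
status = requests.get(f"{BASE}{PATH}?op=GETFILESTATUS").json()["FileStatus"]
length, block_size = status["length"], status["blockSize"]

counts = {}
for offset in range(0, length, block_size):
    # noredirect=true returns the DataNode URL as JSON instead of a 307 redirect.
    r = requests.get(f"{BASE}{PATH}?op=OPEN&offset={offset}&noredirect=true")
    datanode = urlparse(r.json()["Location"]).hostname
    counts[datanode] = counts.get(datanode, 0) + 1

print(counts)  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
```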
Q7: Block distribution across the two DataNodes for `output.parquet` (1x)
Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block for counting.
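A sketch for Q7 using the same assumed WebHDFS endpoint; each `BlockLocation` entry lists the hosts holding that block, so counting hosts covers every replica:

```python
# Q7 sketch: tally block replicas per DataNode via GETFILEBLOCKLOCATIONS.
import requests
from collections import Counter

resp = requests.get(
    "http://boss:9870/webhdfs/v1/output.parquet?op=GETFILEBLOCKLOCATIONS").json()

counts = Counter()
for block in resp["BlockLocations"]["BlockLocation"]:
    for host in block["hosts"]:   # one entry per replica of this block
        counts[host] += 1
print(dict(counts))
```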
Part 4: Disaster Strikes
Kill one DataNode manually.
Q8: How many live DataNodes are in the cluster?
Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again, this time expecting `Live datanodes (1)`. Ask students to access `result.parquet`, which is expected to fail.
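The failing read could be demonstrated with something like the following sketch (PyArrow surfaces HDFS errors as `OSError`; the file name follows the question as written):

```python
# Q8 sketch: with only one DataNode alive, reading the 1x-replicated file should fail.
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("boss", 9000)
try:
    pq.read_table("/result.parquet", filesystem=hdfs)
except OSError as e:
    print("read failed as expected:", e)
```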
Q9: How many blocks of `single.parquet` were lost?
Use `OPEN` or `GETFILEBLOCKLOCATIONS` to determine that.
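With `GETFILEBLOCKLOCATIONS`, one way to count lost blocks is to count blocks whose `hosts` list is empty after the DataNode is marked dead; a sketch, keeping the file name from the question:

```python
# Q9 sketch: blocks with no remaining live host are counted as lost.
import requests

resp = requests.get(
    "http://boss:9870/webhdfs/v1/single.parquet?op=GETFILEBLOCKLOCATIONS").json()
blocks = resp["BlockLocations"]["BlockLocation"]
lost = sum(1 for b in blocks if not b["hosts"])
print(f"{lost} of {len(blocks)} blocks lost")
```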
Q10: Return a specific line of the output by recalculating it from the replicated `input.parquet`.
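A sketch of the Q10 recovery path, reusing the placeholder calculation from the Q5 sketch; the row index is hypothetical:

```python
# Q10 sketch: input.parquet (2x replication) is still readable, so recompute
# the output from it and return the requested row.
import pyarrow.parquet as pq
from pyarrow import fs

hdfs = fs.HadoopFileSystem("boss", 9000)
tbl = pq.read_table("/input.parquet", filesystem=hdfs)
result = tbl.group_by("x").aggregate([("x", "count")])  # same placeholder calculation as Q5
print(result.slice(5, 1))  # hypothetical "specific line": row index 5
```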
Application Way
Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
Then the main system can be brought up by running `docker compose up`.
Students need to:
- Define interfaces, using `grpc` or `flask`.
- Write a `server.py` (see the sketch after this list): read data from SQL, save it as `input.parquet`, store `input.parquet` in HDFS with 1x rep, do the calculation, store `output.parquet` in HDFS with 1x rep, then start serving (`grpc` or `flask`).
- Manually kill one DataNode.
- Add logic for data disaster recovery:
  - If the output data is incomplete, read from the input and compute the result directly.
  - If a DataNode has restarted, recompute and store the output.
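A minimal sketch of what `server.py` could look like with Flask (a gRPC variant would follow the same structure); the connection settings, column name `x`, placeholder calculation, and the `/query` route are illustrative assumptions, not part of the spec:

```python
# server.py sketch: build the datasets, serve results, and fall back to
# recomputation when HDFS blocks are lost.
import mysql.connector
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from flask import Flask, jsonify
from pyarrow import fs

app = Flask(__name__)
hdfs = fs.HadoopFileSystem("boss", 9000, replication=1)  # 1x replication for writes


def build_datasets():
    """Read from SQL, persist input.parquet, compute and persist output.parquet."""
    conn = mysql.connector.connect(host="sql", user="root",
                                   password="example", database="mydb")
    df = pd.read_sql("SELECT * FROM some_table", conn)  # placeholder table
    conn.close()
    tbl = pa.Table.from_pandas(df)
    pq.write_table(tbl, "/input.parquet", filesystem=hdfs)
    out = tbl.group_by("x").aggregate([("x", "count")])  # placeholder calculation
    pq.write_table(out, "/output.parquet", filesystem=hdfs)


@app.route("/query")
def query():
    try:
        out = pq.read_table("/output.parquet", filesystem=hdfs)
    except OSError:
        # Disaster recovery: if the output is unreadable, recompute it from the
        # input; if the input is also gone, rebuild everything from SQL.
        try:
            tbl = pq.read_table("/input.parquet", filesystem=hdfs)
            out = tbl.group_by("x").aggregate([("x", "count")])
            pq.write_table(out, "/output.parquet", filesystem=hdfs)
        except OSError:
            build_datasets()
            out = pq.read_table("/output.parquet", filesystem=hdfs)
    return jsonify(out.to_pydict())


if __name__ == "__main__":
    build_datasets()
    app.run(host="0.0.0.0", port=5000)
```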