I think both approaches are fine for this project. Using Jupyter will be more detailed and fundamental, while the application approach will be more engaging.

Jupyter Way

Part 1: Setup and SQL Query

Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.

In `Dockerfile.sql`, we download the data, deploy the SQL server, and get it ready to be queried.

Then the whole system can be brought up by running `docker compose up`.

Q1: Connect to SQL server and query

In Jupyter, use `mysql-connector-python` to connect to the SQL server, run specific queries, and print the results.
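
A minimal sketch; the container host name, credentials, database, and table are placeholders that would come from `docker-compose.yml`:

```python
# Minimal sketch: connect and run a query. Host, credentials, database, and
# table name are placeholders, not part of the spec.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql", user="root", password="abc", database="cs"
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM some_table")  # placeholder query
print(cur.fetchall())
conn.close()
```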

Q2: Persist a table from SQL

Read a table from the SQL server and save it separately as `input.parquet`.
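
A sketch of that round trip using pandas; the connection details and table name are the same placeholders as above:

```python
# Sketch: pull the whole table into pandas and write it out as Parquet.
import mysql.connector
import pandas as pd

conn = mysql.connector.connect(
    host="mysql", user="root", password="abc", database="cs"
)
df = pd.read_sql("SELECT * FROM some_table", conn)  # pandas may warn about non-SQLAlchemy connections
df.to_parquet("input.parquet")
conn.close()
```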

Part 2: Data Upload and HDFS Status

Q3: Check the number of living datanodes

Run the `hdfs dfsadmin -fs hdfs://boss:9000 -report` command to get the status of HDFS.
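
The same check can be wrapped in Python from the notebook, assuming the `hdfs` CLI is installed in the notebook container:

```python
# Capture the dfsadmin report and pull out the live-DataNode line.
import subprocess

report = subprocess.check_output(
    ["hdfs", "dfsadmin", "-fs", "hdfs://boss:9000", "-report"]
).decode()
print([line for line in report.splitlines() if line.startswith("Live datanodes")])
```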


Then upload `input.parquet` to HDFS with 2x replication.
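
One way to do the upload from Python is `pyarrow.fs.HadoopFileSystem`, whose `replication` setting applies to the files it creates; this sketch assumes the notebook image has the Hadoop client libraries that pyarrow needs:

```python
# Copy the local Parquet file into HDFS with 2x replication.
import pyarrow.fs as fs

hdfs = fs.HadoopFileSystem("boss", 9000, replication=2)
with open("input.parquet", "rb") as src:
    with hdfs.open_output_stream("/input.parquet") as dst:
        dst.write(src.read())
```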

Q4: What are the logical and physical sizes of the parquet files?

Run `hdfs dfs -du -h hdfs://boss:9000/`. The first size reported is the logical file size; the second is the physical space consumed across all replicas.

Part 3: PyArrow

Q5: What is the average of XXX (something like this)

Use PyArrow to read from HDFS and do some calculation.

Ask them to do some more complex calculations and store the results as `output.parquet` back in HDFS with 1x replication.
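
A sketch covering Q5 and this follow-up: read `input.parquet` from HDFS, compute a simple statistic, then write `output.parquet` back with 1x replication. The column names are placeholders:

```python
import pyarrow.fs as fs
import pyarrow.compute as pc
import pyarrow.parquet as pq

hdfs = fs.HadoopFileSystem("boss", 9000)
table = pq.read_table("/input.parquet", filesystem=hdfs)
print(pc.mean(table["some_column"]).as_py())  # Q5-style answer

# a "more complex" derivation (placeholder): add a ratio column
result = table.append_column(
    "ratio", pc.divide(table["some_column"], table["other_column"])
)

# a connection opened with replication=1 writes output.parquet with one replica
hdfs_1x = fs.HadoopFileSystem("boss", 9000, replication=1)
pq.write_table(result, "/output.parquet", filesystem=hdfs_1x)
```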

Q6: Block distribution across the two DataNodes for `input.parquet` (2x). Use the WebHDFS OPEN operation with noredirect=true, stepping the offset through the file one block at a time, to find which DataNode serves each block.

The output looks like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
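
A sketch of the counting loop. It assumes the NameNode's WebHDFS endpoint is `http://boss:9870` (port 9000 above is the RPC port, not the HTTP one) and that the file lives at the HDFS root:

```python
import requests
from urllib.parse import urlparse

NAMENODE = "http://boss:9870"
PATH = "/input.parquet"

status = requests.get(
    f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "GETFILESTATUS"}
).json()["FileStatus"]

counts = {}
for offset in range(0, status["length"], status["blockSize"]):
    # noredirect=true makes the NameNode return the DataNode URL as JSON
    # instead of a 307 redirect
    r = requests.get(
        f"{NAMENODE}/webhdfs/v1{PATH}",
        params={"op": "OPEN", "offset": offset, "noredirect": "true"},
    ).json()
    host = urlparse(r["Location"]).hostname
    counts[host] = counts.get(host, 0) + 1

print(counts)  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
```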

Q7: Block distribution across the two DataNodes for `output.parquet` (1x)

Use the WebHDFS GETFILEBLOCKLOCATIONS operation and iterate over every block to count replicas per DataNode.
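
A sketch of the per-DataNode count via GETFILEBLOCKLOCATIONS, under the same WebHDFS endpoint assumption as above:

```python
import requests
from collections import Counter

NAMENODE = "http://boss:9870"
PATH = "/output.parquet"

blocks = requests.get(
    f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "GETFILEBLOCKLOCATIONS"}
).json()["BlockLocations"]["BlockLocation"]

counts = Counter()
for block in blocks:
    for host in block["hosts"]:  # with 1x replication, one host per block
        counts[host] += 1
print(dict(counts))
```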

Part 4: Disaster Strikes

Kill one DataNode manually.

Q8: How many live DataNodes are in the cluster? Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again, this time expecting `Live datanodes (1)`.


Ask students to access `output.parquet`, which is expected to fail.

Q9: How many blocks of `output.parquet` were lost?

Use the WebHDFS OPEN or GETFILEBLOCKLOCATIONS operations to determine this.
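
A rough sketch of the GETFILEBLOCKLOCATIONS route. It assumes the NameNode has already marked the killed DataNode as dead, so blocks whose only replica lived there are reported with an empty host list; before that happens the old locations may still be returned, and probing each block offset with OPEN and counting failed reads is an alternative:

```python
import requests

NAMENODE = "http://boss:9870"
PATH = "/output.parquet"

blocks = requests.get(
    f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "GETFILEBLOCKLOCATIONS"}
).json()["BlockLocations"]["BlockLocation"]

lost = sum(1 for b in blocks if not b["hosts"])
print(f"{lost} of {len(blocks)} blocks lost")
```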

Q10: Recover a specific row of the output by recomputing it from the replicated `input.parquet`.

Application Way

Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.

Then the main system can be brought up by running `docker compose up`.

Students need to:

  1. Define the interfaces (gRPC or Flask).
  2. Write a `server.py`: read data from SQL, save it as `input.parquet`, store `input.parquet` in HDFS with 1x replication, do the calculation, store `output.parquet` in HDFS with 1x replication, then start serving (gRPC or Flask).
  3. Manually kill one DataNode.
  4. Add logic for data disaster recovery (see the sketch after this list):
  • If the output data is incomplete, read from the input and compute the result directly.
  • If a DataNode has restarted, recompute and store the output.
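
A hedged sketch of the recovery logic in `server.py` (Flask flavor); the route, paths, and the `compute()` helper are placeholders, not part of the spec:

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq
from flask import Flask, jsonify

app = Flask(__name__)
hdfs = fs.HadoopFileSystem("boss", 9000)

def compute(table):
    return table  # placeholder for the real calculation

def load_output():
    """Return the output table, recomputing from the input if output.parquet
    is unreadable (e.g. its single replica died with a DataNode)."""
    try:
        return pq.read_table("/output.parquet", filesystem=hdfs)
    except OSError:
        result = compute(pq.read_table("/input.parquet", filesystem=hdfs))
        try:
            # if a DataNode has come back, persist the recomputed output again
            pq.write_table(result, "/output.parquet", filesystem=hdfs)
        except OSError:
            pass
        return result

@app.route("/stats")
def stats():
    return jsonify({"rows": load_output().num_rows})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```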