I think both approaches are fine for this project. Using Jupyter will be more detailed and fundamental, while the application approach will be more engaging.

# Jupyter Way

## Part 1: Setup and SQL Query

Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`. In `Dockerfile.sql`, we download the data, deploy the SQL server, and get it ready to be queried. The whole system can then be brought up by running `docker compose up`.

**Q1: Connect to the SQL server and run a query**

In Jupyter, use `mysql.connector` to connect to the SQL server, run specific queries, and print the results (see the sketch below).

**Q2: Persist a table from SQL**

Read a table from the SQL server and save it separately as `input.parquet` (see the sketch below).

## Part 2: Data Upload and HDFS Status

**Q3: Check the number of live DataNodes**

Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` to get the status of HDFS. Then upload `input.parquet` to HDFS with 2x replication (see the sketch below).

**Q4: What are the logical and physical sizes of the Parquet files?**

Run `hdfs dfs -du -h hdfs://boss:9000/`.

## Part 3: PyArrow

**Q5: What is the average of `XXX`? (something like this)**

Use PyArrow to read from HDFS and do a simple calculation. Then ask them to do some more complex calculations and store the results as `output.parquet` back in HDFS with 1x replication (see the sketch below).

**Q6: Block distribution across the two DataNodes for `input.parquet` (2x)**

Use the WebHDFS `OPEN` operation with `noredirect=true` at each block offset (starting from `offset=0`) and count which DataNode each request redirects to (see the sketch below). The output looks like `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`.

**Q7: Block distribution across the two DataNodes for `output.parquet` (1x)**

Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block for counting (see the sketch below).

## Part 4: Disaster Strikes

Kill one DataNode manually.

**Q8: How many live DataNodes are in the cluster?**

Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again, this time expecting `Live datanodes (1)`. Then ask students to access `output.parquet`, which is expected to fail.

**Q9: How many blocks of `output.parquet` were lost?**

Use `OPEN` or `GETFILEBLOCKLOCATIONS` to find out.

**Q10: Return a specific line of the output by recomputing it from the replicated `input.parquet`**

# Application Way

Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`. The main system can then be brought up by running `docker compose up`.

Students need to:

1. Define the interfaces, using `grpc` or `flask`.
2. Write a `server.py`: read the data from SQL, save it as `input.parquet`, store `input.parquet` in HDFS with 1x replication, do the calculation, store `output.parquet` in HDFS with 1x replication, then start serving (`grpc` or `flask`). A skeleton appears at the end of this document.
3. Manually kill one DataNode.
4. Add logic for data disaster recovery:
   * If the output data is incomplete, read from the input and compute the result directly.
   * If a DataNode has restarted, recompute and store the output.
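
Below are minimal reference sketches for the steps above. For Q1, a sketch of the expected notebook code, assuming the compose file names the SQL service `mysql` and assuming placeholder credentials, database, and a hypothetical `trips` table:

```python
# Q1 sketch: connect to the SQL server from Jupyter and print a query result.
# Host, credentials, database, and the "trips" table are placeholders that
# depend on docker-compose.yml and the dataset chosen.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql", user="root", password="abc", database="cs544"
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM trips")
print(cur.fetchone()[0])
```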
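
For Q2, one way to persist the table, reusing the Q1 connection. `pd.read_sql` accepts a DB-API connection (a SQLAlchemy engine would avoid the pandas warning it emits):

```python
# Q2 sketch: pull the whole (hypothetical) "trips" table and save it locally
# as input.parquet.
import pandas as pd

df = pd.read_sql("SELECT * FROM trips", conn)
df.to_parquet("input.parquet")
```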
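
For Q3, a sketch of the cluster check and the 2x-replicated upload, run from the notebook via `subprocess`; the `hdfs://boss:9000` address is taken from the Q4 command above:

```python
# Q3 sketch: report DataNode status, then upload input.parquet with 2 replicas.
import subprocess

print(subprocess.check_output(
    ["hdfs", "dfsadmin", "-fs", "hdfs://boss:9000", "-report"]).decode())

# -D dfs.replication=2 sets the replication factor at write time
subprocess.run(["hdfs", "dfs", "-D", "dfs.replication=2", "-put", "-f",
                "input.parquet", "hdfs://boss:9000/input.parquet"], check=True)
```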
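
For Q5, a sketch using PyArrow's HDFS bindings. The `fare` column and the filter are stand-ins for whatever `XXX` and the "more complex calculation" end up being; the `replication=1` constructor argument is what makes `output.parquet` land with a single replica:

```python
# Q5 sketch: read input.parquet from HDFS, compute an aggregate, and write
# a derived output.parquet back with 1x replication.
import pyarrow.fs
import pyarrow.parquet as pq
import pyarrow.compute as pc

hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000, replication=1)

table = pq.read_table("/input.parquet", filesystem=hdfs)
print(pc.mean(table["fare"]))  # placeholder for the "average of XXX" question

out = table.filter(pc.greater(table["fare"], 10))  # stand-in "complex" calc
pq.write_table(out, "/output.parquet", filesystem=hdfs)
```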
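
For Q6, a sketch of the `OPEN`-based counting: probe the NameNode's WebHDFS endpoint once per block and tally which DataNode each returned `Location` points at. Port 9870 is the Hadoop 3 default NameNode web port; as in the example output above, the hostnames come out as container IDs:

```python
# Q6 sketch: count blocks per DataNode for the 2x-replicated input.parquet
# by probing WebHDFS OPEN at every block offset with noredirect=true.
from collections import Counter
from urllib.parse import urlparse
import requests

base = "http://boss:9870/webhdfs/v1/input.parquet"
status = requests.get(base, params={"op": "GETFILESTATUS"}).json()["FileStatus"]

counts = Counter()
for offset in range(0, status["length"], status["blockSize"]):
    r = requests.get(base, params={"op": "OPEN", "offset": offset,
                                   "noredirect": "true"}).json()
    counts[urlparse(r["Location"]).hostname] += 1  # DataNode serving the block
print(dict(counts))  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
```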
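
For Q7, the same tally via `GETFILEBLOCKLOCATIONS`, which returns every block (with its hosts) in one call:

```python
# Q7 sketch: count blocks per DataNode for the 1x-replicated output.parquet.
from collections import Counter
import requests

r = requests.get("http://boss:9870/webhdfs/v1/output.parquet",
                 params={"op": "GETFILEBLOCKLOCATIONS"}).json()
counts = Counter()
for block in r["BlockLocations"]["BlockLocation"]:
    for host in block["hosts"]:
        counts[host] += 1
print(dict(counts))
```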
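
For the application way, a skeleton of the `server.py` from step 2 plus the step 4 recovery branch, choosing `flask` of the two interface options. The endpoint, port, and `fare` column are invented for illustration, and the initial SQL-to-Parquet pipeline is elided:

```python
# server.py skeleton (flask variant). Hostnames, the /stats endpoint, and the
# "fare" column are placeholders.
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq
import pyarrow.compute as pc
from flask import Flask, jsonify

app = Flask(__name__)
hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000, replication=1)  # 1x-rep writes

def recompute():
    # step-4 fallback: rebuild output.parquet from input.parquet
    tbl = pq.read_table("/input.parquet", filesystem=hdfs)
    out = pa.table({"mean_fare": [pc.mean(tbl["fare"]).as_py()]})
    pq.write_table(out, "/output.parquet", filesystem=hdfs)
    return out

@app.route("/stats")
def stats():
    try:
        out = pq.read_table("/output.parquet", filesystem=hdfs)
    except OSError:
        # output blocks lost (e.g. a DataNode died): recover per step 4
        out = recompute()
    return jsonify(out.to_pylist())

if __name__ == "__main__":
    # startup would first run the SQL -> input.parquet -> output.parquet
    # pipeline (omitted here), then begin serving
    app.run(host="0.0.0.0", port=5000)
```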