From 71846d2717b655962d079743f9a5a2533c4582b1 Mon Sep 17 00:00:00 2001
From: wyang338 <weichuyang777@gmail.com>
Date: Sat, 22 Feb 2025 19:41:51 -0600
Subject: [PATCH] Jupyter_way outline done

---
 p4/Outline.md | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 p4/Outline.md

diff --git a/p4/Outline.md b/p4/Outline.md
new file mode 100644
index 0000000..ee5ad84
--- /dev/null
+++ b/p4/Outline.md
@@ -0,0 +1,64 @@
+# Part 1: Setup and SQL Query
+
+Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
+
+`Dockerfile.sql` downloads the data and deploys a SQL server, ready to be queried.
+
+The whole system can then be brought up by running `docker compose up`.
+
+**Q1: Connect to the SQL server and run a query**
+
+In Jupyter, use `mysql.connector` to connect to the SQL server, run specific queries, and print the results.
+
+**Q2: Persist a table from SQL**
+
+Read a table from the SQL server and save it separately as `input.parquet`.
+
+# Part 2: Data Upload and HDFS Status
+
+**Q3: Check the number of live DataNodes**
+
+Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` to get the status of HDFS.
+
+Then upload `input.parquet` to HDFS with 2x replication.
+
+**Q4: What are the logical and physical sizes of the parquet files?**
+
+Run `hdfs dfs -du -h hdfs://boss:9000/`. The first size reported is the logical file size; the second is the physical space consumed across all replicas.
+
+# Part 3: PyArrow
+
+**Q5: What is the average of `XXX`? (something like this)**
+
+Use PyArrow to read from HDFS and do some calculation.
+
+Then ask them to do some more complex calculations and store the results back to HDFS as `output.parquet` with 1x replication.
+
+**Q6: Block distribution across the two DataNodes for `input.parquet` (2x)**
+
+Use the WebHDFS `OPEN` operation with `noredirect=true`, stepping the `offset` from 0 through the file one block at a time, to see which DataNode serves each block.
+
+The output looks like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
+
+**Q7: Block distribution across the two DataNodes for `output.parquet` (1x)**
+
+Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block for counting.
+
+# Part 4: Disaster Strikes
+
+Kill one DataNode manually (e.g., with `docker kill`).
+
+**Q8: How many live DataNodes are in the cluster?**
+
+Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again, this time expecting `Live datanodes (1)`.
+
+Then ask students to access `output.parquet`, which is expected to fail because some of its single-replica blocks lived on the dead DataNode.
+
+**Q9: How many blocks of `output.parquet` were lost?**
+
+Use `OPEN` or `GETFILEBLOCKLOCATIONS` to count them.
+
+**Q10: Return a specific line of the output by recomputing it from the replicated `input.parquet`**
+
+Because `input.parquet` was stored with 2x replication, it survives the failure, so the lost results can be recomputed from it on demand.
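+
+# Appendix: Example Sketches
+
+The snippets below are rough sketches of how each step might look from the notebook, not reference solutions. Host names, ports, credentials, table names, and column names are placeholders until `docker-compose.yml` is finalized.
+
+For Q1, a minimal sketch using `mysql.connector` (from the `mysql-connector-python` package):
+
+```python
+import mysql.connector
+
+# Placeholder connection details -- the real ones come from docker-compose.yml.
+conn = mysql.connector.connect(
+    host="mysql", port=3306, user="root", password="abc", database="cs544"
+)
+cur = conn.cursor()
+cur.execute("SELECT COUNT(*) FROM loans")  # "loans" is a placeholder table name
+print(cur.fetchone()[0])
+cur.close()
+```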
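+
+For Q2, one way to persist a table as `input.parquet` is through pandas, which writes parquet via PyArrow (reusing the connection from Q1; the table name is again a placeholder):
+
+```python
+import pandas as pd
+
+# Pull the whole (placeholder) table into a DataFrame, then write it locally;
+# the file gets uploaded to HDFS in Part 2.
+df = pd.read_sql("SELECT * FROM loans", conn)
+df.to_parquet("input.parquet")
+```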
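+
+For Q3 and the upload, the HDFS CLI can be driven from the notebook with `subprocess`, assuming the NameNode answers at `hdfs://boss:9000`:
+
+```python
+import subprocess
+
+# The dfsadmin report contains a "Live datanodes (N):" line.
+report = subprocess.check_output(
+    ["hdfs", "dfsadmin", "-fs", "hdfs://boss:9000", "-report"]).decode()
+print([line for line in report.splitlines() if "Live datanodes" in line])
+
+# Upload input.parquet with 2x replication; -D overrides dfs.replication
+# for this one command.
+subprocess.check_call(
+    ["hdfs", "dfs", "-D", "dfs.replication=2",
+     "-put", "-f", "input.parquet", "hdfs://boss:9000/input.parquet"])
+```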
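+
+For Q5, a sketch of reading from HDFS with PyArrow and writing the result back with 1x replication; it assumes the notebook image has `libhdfs` and `CLASSPATH` configured, and the column name is a placeholder:
+
+```python
+import pyarrow.fs
+import pyarrow.compute as pc
+import pyarrow.parquet as pq
+
+# replication=1 makes files written through this filesystem single-replica.
+hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000, replication=1)
+
+table = pq.read_table("/input.parquet", filesystem=hdfs)
+print(pc.mean(table["XXX"]).as_py())  # placeholder column name
+
+# ... the more complex calculations go here ...
+pq.write_table(table, "/output.parquet", filesystem=hdfs)  # 1x replication
+```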
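+
+For Q6, a sketch of counting which DataNode serves each block via WebHDFS `OPEN`; it assumes WebHDFS is on the NameNode's default HTTP port (9870):
+
+```python
+import requests
+from collections import Counter
+from urllib.parse import urlparse
+
+BASE = "http://boss:9870/webhdfs/v1"  # assumed WebHDFS endpoint
+path = "/input.parquet"
+
+# The file length and block size tell us where each block starts.
+status = requests.get(f"{BASE}{path}?op=GETFILESTATUS").json()["FileStatus"]
+
+counts = Counter()
+for offset in range(0, status["length"], status["blockSize"]):
+    # With noredirect=true, the NameNode returns JSON naming the DataNode
+    # instead of redirecting to it.
+    r = requests.get(f"{BASE}{path}?op=OPEN&offset={offset}&noredirect=true")
+    counts[urlparse(r.json()["Location"]).hostname] += 1
+
+print(dict(counts))  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
+```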
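+
+For Q7, `GETFILEBLOCKLOCATIONS` returns every block along with its `hosts` list (one entry per replica), so counting is a simple loop:
+
+```python
+import requests
+from collections import Counter
+
+BASE = "http://boss:9870/webhdfs/v1"  # assumed WebHDFS endpoint
+r = requests.get(f"{BASE}/output.parquet?op=GETFILEBLOCKLOCATIONS")
+blocks = r.json()["BlockLocations"]["BlockLocation"]
+
+counts = Counter()
+for block in blocks:
+    for host in block["hosts"]:
+        counts[host] += 1
+print(dict(counts))
+```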
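+
+For Q9, one plausible check (an assumption about HDFS behavior worth verifying): once the NameNode marks the killed DataNode as dead, blocks whose only replica lived there should come back with an empty `hosts` list:
+
+```python
+import requests
+
+BASE = "http://boss:9870/webhdfs/v1"  # assumed WebHDFS endpoint
+r = requests.get(f"{BASE}/output.parquet?op=GETFILEBLOCKLOCATIONS")
+blocks = r.json()["BlockLocations"]["BlockLocation"]
+
+lost = sum(1 for b in blocks if len(b["hosts"]) == 0)
+print(f"{lost} of {len(blocks)} blocks lost")
+```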