From 71846d2717b655962d079743f9a5a2533c4582b1 Mon Sep 17 00:00:00 2001
From: wyang338 <weichuyang777@gmail.com>
Date: Sat, 22 Feb 2025 19:41:51 -0600
Subject: [PATCH] Jupyter_way outline done

---
 p4/Outline.md | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 p4/Outline.md

diff --git a/p4/Outline.md b/p4/Outline.md
new file mode 100644
index 0000000..ee5ad84
--- /dev/null
+++ b/p4/Outline.md
@@ -0,0 +1,64 @@
+# Part 1: Setup and SQL Query
+
+Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
+
+`Dockerfile.sql` downloads the data and deploys a SQL server, ready to be queried.
+
+The whole system can then be brought up by running `docker compose up`.
+
+**Q1: Connect to the SQL server and run a query**
+
+In Jupyter, use `mysql.connector` to connect to the SQL server, run specific queries, and print the results.
+
+**Q2: Persist a table from SQL**
+
+Read a table from the SQL server and save it separately as `input.parquet`.
+
+# Part 2: Data Upload and HDFS Status
+
+**Q3: Check the number of live DataNodes**
+
+Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` to get the status of HDFS.
+
+Then upload `input.parquet` to HDFS with 2x replication.
+
+**Q4: What are the logical and physical sizes of the parquet files?**
+
+Run `hdfs dfs -du -h hdfs://boss:9000/`. The first size reported is the logical file size; the second is the physical space consumed across all replicas.
+
+# Part 3: PyArrow
+
+**Q5: What is the average of `XXX`? (something like this)**
+
+Use PyArrow to read from HDFS and do some calculation.
+
+Then ask them to do some more complex calculations and store the results back to HDFS as `output.parquet` with 1x replication.
+
+**Q6: Block distribution across the two DataNodes for `input.parquet` (2x)**
+
+Use the WebHDFS `OPEN` operation with `noredirect=true`, stepping the `offset` from 0 through the file one block at a time, to see which DataNode serves each block.
+
+The output looks like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
+
+**Q7: Block distribution across the two DataNodes for `output.parquet` (1x)**
+
+Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block for counting.
+
+# Part 4: Disaster Strikes
+
+Kill one DataNode manually (e.g., with `docker kill`).
+
+**Q8: How many live DataNodes are in the cluster?**
+
+Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again, this time expecting `Live datanodes (1)`.
+
+Then ask students to access `output.parquet`, which is expected to fail because some of its single-replica blocks lived on the dead DataNode.
+
+**Q9: How many blocks of `output.parquet` were lost?**
+
+Use `OPEN` or `GETFILEBLOCKLOCATIONS` to count them.
+
+**Q10: Return a specific line of the output by recomputing it from the replicated `input.parquet`**
+
+Because `input.parquet` was stored with 2x replication, it survives the failure, so the lost results can be recomputed from it on demand.
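+
+# Appendix: Example Sketches
+
+The snippets below are rough sketches of how each step might look from the notebook, not reference solutions. Host names, ports, credentials, table names, and column names are placeholders until `docker-compose.yml` is finalized.
+
+For Q1, a minimal sketch using `mysql.connector` (from the `mysql-connector-python` package):
+
+```python
+import mysql.connector
+
+# Placeholder connection details -- the real ones come from docker-compose.yml.
+conn = mysql.connector.connect(
+    host="mysql", port=3306, user="root", password="abc", database="cs544"
+)
+cur = conn.cursor()
+cur.execute("SELECT COUNT(*) FROM loans")  # "loans" is a placeholder table name
+print(cur.fetchone()[0])
+cur.close()
+```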
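+
+For Q2, one way to persist a table as `input.parquet` is through pandas, which writes parquet via PyArrow (reusing the connection from Q1; the table name is again a placeholder):
+
+```python
+import pandas as pd
+
+# Pull the whole (placeholder) table into a DataFrame, then write it locally;
+# the file gets uploaded to HDFS in Part 2.
+df = pd.read_sql("SELECT * FROM loans", conn)
+df.to_parquet("input.parquet")
+```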
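+
+For Q3 and the upload, the HDFS CLI can be driven from the notebook with `subprocess`, assuming the NameNode answers at `hdfs://boss:9000`:
+
+```python
+import subprocess
+
+# The dfsadmin report contains a "Live datanodes (N):" line.
+report = subprocess.check_output(
+    ["hdfs", "dfsadmin", "-fs", "hdfs://boss:9000", "-report"]).decode()
+print([line for line in report.splitlines() if "Live datanodes" in line])
+
+# Upload input.parquet with 2x replication; -D overrides dfs.replication
+# for this one command.
+subprocess.check_call(
+    ["hdfs", "dfs", "-D", "dfs.replication=2",
+     "-put", "-f", "input.parquet", "hdfs://boss:9000/input.parquet"])
+```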
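+
+For Q5, a sketch of reading from HDFS with PyArrow and writing the result back with 1x replication; it assumes the notebook image has `libhdfs` and `CLASSPATH` configured, and the column name is a placeholder:
+
+```python
+import pyarrow.fs
+import pyarrow.compute as pc
+import pyarrow.parquet as pq
+
+# replication=1 makes files written through this filesystem single-replica.
+hdfs = pyarrow.fs.HadoopFileSystem("boss", 9000, replication=1)
+
+table = pq.read_table("/input.parquet", filesystem=hdfs)
+print(pc.mean(table["XXX"]).as_py())  # placeholder column name
+
+# ... the more complex calculations go here ...
+pq.write_table(table, "/output.parquet", filesystem=hdfs)  # 1x replication
+```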
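+
+For Q6, a sketch of counting which DataNode serves each block via WebHDFS `OPEN`; it assumes WebHDFS is on the NameNode's default HTTP port (9870):
+
+```python
+import requests
+from collections import Counter
+from urllib.parse import urlparse
+
+BASE = "http://boss:9870/webhdfs/v1"  # assumed WebHDFS endpoint
+path = "/input.parquet"
+
+# The file length and block size tell us where each block starts.
+status = requests.get(f"{BASE}{path}?op=GETFILESTATUS").json()["FileStatus"]
+
+counts = Counter()
+for offset in range(0, status["length"], status["blockSize"]):
+    # With noredirect=true, the NameNode returns JSON naming the DataNode
+    # instead of redirecting to it.
+    r = requests.get(f"{BASE}{path}?op=OPEN&offset={offset}&noredirect=true")
+    counts[urlparse(r.json()["Location"]).hostname] += 1
+
+print(dict(counts))  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
+```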
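+
+For Q7, `GETFILEBLOCKLOCATIONS` returns every block along with its `hosts` list (one entry per replica), so counting is a simple loop:
+
+```python
+import requests
+from collections import Counter
+
+BASE = "http://boss:9870/webhdfs/v1"  # assumed WebHDFS endpoint
+r = requests.get(f"{BASE}/output.parquet?op=GETFILEBLOCKLOCATIONS")
+blocks = r.json()["BlockLocations"]["BlockLocation"]
+
+counts = Counter()
+for block in blocks:
+    for host in block["hosts"]:
+        counts[host] += 1
+print(dict(counts))
+```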
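+
+For Q9, one plausible check (an assumption about HDFS behavior worth verifying): once the NameNode marks the killed DataNode as dead, blocks whose only replica lived there should come back with an empty `hosts` list:
+
+```python
+import requests
+
+BASE = "http://boss:9870/webhdfs/v1"  # assumed WebHDFS endpoint
+r = requests.get(f"{BASE}/output.parquet?op=GETFILEBLOCKLOCATIONS")
+blocks = r.json()["BlockLocations"]["BlockLocation"]
+
+lost = sum(1 for b in blocks if len(b["hosts"]) == 0)
+print(f"{lost} of {len(blocks)} blocks lost")
+```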