I think both approaches are fine for this project: the Jupyter way is more detailed and fundamental, while the application way is more engaging.
# Jupyter Way
## Part 1: Setup and SQL Query
Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
In Jupyter, use `mysqlconnector` to connect to the SQL server, then run the specific queries.
Read a table from the SQL server and save it separately as `input.parquet`.
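A minimal sketch of this step, assuming the `mysql-connector-python` package and placeholder host, credentials, database, and table names (`mysql`, `demo`, `orders`) standing in for whatever the real compose file defines:

```python
import mysql.connector
import pandas as pd

# Connect to the SQL server; host/user/password/database are
# placeholders for the values docker-compose actually sets up.
conn = mysql.connector.connect(
    host="mysql", user="root", password="secret", database="demo"
)

# Read one table into a DataFrame and save it as input.parquet
# (to_parquet uses pyarrow under the hood).
df = pd.read_sql("SELECT * FROM orders", conn)
df.to_parquet("input.parquet")
conn.close()
```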
## Part 2: Data Upload and HDFS Status
**Q3: Check the number of live datanodes**
Then upload `input.parquet` to HDFS with 2x replication.
Run `hdfs dfs -du -h hdfs://boss:9000/`
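A sketch of how Q3 and the upload might look from the notebook, assuming the NameNode web UI is reachable at `boss:9870` (the Hadoop 3 default port; the real one depends on the compose file) and that the notebook image has the Hadoop client libraries `pyarrow` needs:

```python
import requests
import pyarrow.fs as fs

# Q3: the NameNode JMX endpoint reports the number of live datanodes.
jmx = requests.get(
    "http://boss:9870/jmx",
    params={"qry": "Hadoop:service=NameNode,name=FSNamesystemState"},
).json()
print("live datanodes:", jmx["beans"][0]["NumLiveDataNodes"])

# Upload with 2x replication: HadoopFileSystem takes the
# replication factor at construction time.
hdfs = fs.HadoopFileSystem("boss", port=9000, replication=2)
with open("input.parquet", "rb") as src, \
        hdfs.open_output_stream("/input.parquet") as dst:
    dst.write(src.read())
```

Alternatively, the same upload should work from the shell with `hdfs dfs -D dfs.replication=2 -put input.parquet hdfs://boss:9000/`.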
## Part 3: PyArrow
**Q5: What is the average of `XXX`?** (something like this)
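A minimal sketch for a question of this shape, with `XXX` standing in for whatever column is actually asked about:

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Read input.parquet back from HDFS and average one column;
# "XXX" is a placeholder for the real column name.
hdfs = fs.HadoopFileSystem("boss", port=9000)
table = pq.read_table("/input.parquet", filesystem=hdfs)
print(pc.mean(table.column("XXX")).as_py())
```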
For the block-counting question, the output is like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block to count them per datanode.
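A sketch of that loop, assuming WebHDFS is exposed on `boss:9870` and that the response follows the documented `BlockLocations` shape:

```python
from collections import Counter
import requests

# Ask the NameNode for the block locations of input.parquet.
resp = requests.get(
    "http://boss:9870/webhdfs/v1/input.parquet",
    params={"op": "GETFILEBLOCKLOCATIONS"},
).json()

# Count how many blocks each datanode host holds.
counts = Counter()
for block in resp["BlockLocations"]["BlockLocation"]:
    for host in block["hosts"]:
        counts[host] += 1
print(dict(counts))  # e.g. {'755329887c2a': 9, 'c181cd6fd6fe': 7}
```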
## Part 4: Disaster Strikes
Kill one datanode manually.
Ask students to access `result.parquet`, which is expected to fail.
Use the WebHDFS `OPEN` or `GETFILEBLOCKLOCATIONS` operations to observe this.
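One possible check via `OPEN`, again assuming the `boss:9870` endpoint; with 1x replication and a dead datanode, the read should fail either with a non-200 status or a broken redirect to the missing datanode:

```python
import requests

# Attempt to read result.parquet through WebHDFS; this is
# expected to fail one way or another after the datanode dies.
try:
    resp = requests.get(
        "http://boss:9870/webhdfs/v1/result.parquet",
        params={"op": "OPEN"},
        timeout=10,
    )
    if resp.status_code != 200:
        print("read failed as expected:", resp.status_code)
except requests.RequestException as e:
    # The OPEN redirect points at a datanode that may be gone.
    print("read failed as expected:", e)
```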
**Q10: Return the specific line of output by recalculating from the replicated `input.parquet`**
# Application Way
Offer them `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode`.
Then the whole system can be brought up by running `docker compose up`.
Students need to:
1. Define the service interface, using `grpc` or `flask`.
2. Write a `server.py` that reads the data from SQL, saves it as `input.parquet`, stores `input.parquet` in HDFS with 1x replication, does the calculation, stores `output.parquet` in HDFS with 1x replication, and then starts serving (`grpc` or `flask`).
3. Manually kill one datanode.
4. Add logic for data disaster recovery (see the sketch after this list):
   * If the output data is incomplete, read from the input and compute the result directly.
   * If a datanode has restarted, recompute and store the output.
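A minimal sketch of that recovery path with `flask` and `pyarrow`; the route name, HDFS paths, and the `compute` step are all illustrative, not part of the spec:

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq
from flask import Flask, jsonify

app = Flask(__name__)
hdfs = fs.HadoopFileSystem("boss", port=9000, replication=1)

def compute(table):
    # Placeholder for the real calculation over input.parquet.
    return table

def recompute_and_store():
    # Fall back to the input and rebuild the output in HDFS.
    table = pq.read_table("/input.parquet", filesystem=hdfs)
    result = compute(table)
    pq.write_table(result, "/output.parquet", filesystem=hdfs)
    return result

@app.route("/result")
def result():
    try:
        # Normal path: serve from the precomputed output.
        table = pq.read_table("/output.parquet", filesystem=hdfs)
    except OSError:
        # Output blocks lost (1x replication, dead datanode):
        # recompute from the input and re-store the output.
        table = recompute_and_store()
    return jsonify(table.to_pydict())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```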