diff --git a/p4/Outline.md b/p4/Outline.md
index ee5ad8422aa0f47d881baf0fb8326631df2ac58e..06c58c13fe9a2e61f12cb49a134695c76bdc566c 100644
--- a/p4/Outline.md
+++ b/p4/Outline.md
@@ -1,4 +1,9 @@
-# Part1: Setup and SQL Query
+I think both approaches are fine for this project. Using Jupyter will be more detailed and fundamental, while the application approach will be more engaging.
+
+
+
+# Jupyter Way
+## Part1: Setup and SQL Query
 
 Offer them `docker-compose.yml` , `Dockerfile.sql` , `Dockerfile.hdfs` and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode`, `Dockerfile.datanode` .
 
@@ -14,7 +19,7 @@ In jupyter, use `mysqlconnector` to connect to SQL server, then do specific quer
 
 Read a table from SQL server and save it separately `input.parquet`.
 
-# Part2 Data Upload and HDFS status
+## Part2 Data Upload and HDFS status
 
 **Q3: Check the number of living datanodes**
 
@@ -27,7 +32,7 @@ Then upload and `input.parquet` to HDFS with 2x replication.
 
 Run `hdfs dfs -du -h hdfs://boss:9000/`
 
-# Part3 PyArrow
+## Part3 PyArrow
 
 **Q5: What is the average of `XXX` (something like this)**
 
@@ -46,7 +51,7 @@ output is like:`{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
 
 Use the WebHDFS `GETFILEBLOCKLOCATIONS` and iterate every block for counting.
 
-# Part 4: Disaster Strikes
+## Part 4: Disaster Strikes
 
 Kill one datanode manually.
 
@@ -61,4 +66,20 @@ Ask students to access `result.parquet` , which expected to fail.
 
 Use `OPEN` or `GETFILEBLOCKLOCATIONS` to get that.
 
-**Q10: return specific line of output by recalculate with replicated** `input.parquet`
\ No newline at end of file
+**Q10: return specific line of output by recalculate with replicated** `input.parquet`
+
+# Application Way
+Offer them `docker-compose.yml` , `Dockerfile.sql` , `Dockerfile.hdfs` and `Dockerfile.notebook`, while they are required to complete `Dockerfile.namenode`, `Dockerfile.datanode`.
+
+Then the main system can be established by running `docker compose up`.
+
+Students need to:
+
+1. Define interfaces, `grpc` or `flask`
+2. Write a `server.py`: read data from SQL, save them as `input.parquet`, store `input.parquet` in HDFS with 1x rep, do calculation, store `output.parquet` in HDFS with 1x rep, then start serving(`grpc` or `flask`).
+3. Manually kill one datanode.
+4. Add logic for data disaster recovery:
+<blockquote>
+
+ * If the output data is incomplete, read from the input and compute the result directly.
+ * If a data node has restarted, recompute and store the output.</blockquote>
\ No newline at end of file
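
A note on step 4 of the Application Way: the recovery rule could be sketched roughly as below. This is only an illustration, not the intended `server.py`. The function name `load_with_fallback` and its four callable parameters are hypothetical, and it assumes that a failed HDFS read or write surfaces as an `OSError`, which may not hold for every client library.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(
    read_output: Callable[[], T],
    read_input: Callable[[], T],
    compute: Callable[[T], T],
    store_output: Callable[[T], None],
) -> Tuple[T, str]:
    """Return (result, source) following the recovery rule from the outline.

    All four callables are hypothetical hooks supplied by the caller, e.g.
    thin wrappers around whatever HDFS client the project ends up using.
    """
    try:
        # Normal path: the stored output is complete and readable.
        return read_output(), "output"
    except OSError:
        # Assumption: a lost block shows up as OSError when reading.
        pass

    # Output is incomplete: recompute the result directly from the input.
    result = compute(read_input())

    try:
        # If the datanode is back, this restores the stored output.
        store_output(result)
    except OSError:
        # Storage is still degraded; serve the recomputed result anyway.
        return result, "recomputed"
    return result, "recomputed-and-stored"
```

Passing the I/O steps in as callables keeps the fallback logic independent of the serving layer (`grpc` or `flask`) and makes it easy to exercise without a running cluster.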