# Part 1: Setup and SQL Query

Provide students with `docker-compose.yml`, `Dockerfile.sql`, `Dockerfile.hdfs`, and `Dockerfile.notebook`; they are required to complete `Dockerfile.namenode` and `Dockerfile.datanode` themselves.

In `Dockerfile.sql`, we download the data, deploy the SQL server, and get it ready to be queried.

The whole system can then be brought up by running `docker compose up`.

**Q1: Connect to the SQL server and query it**

In Jupyter, use `mysqlconnector` to connect to the SQL server, run the specified queries, and print the results.
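
A minimal sketch of what the Q1 notebook cell might look like, assuming the `mysql-connector-python` package (imported as `mysql.connector`). The host name, credentials, database, and table name below are hypothetical placeholders, not values fixed by this outline. `run_query` only relies on the standard DB-API cursor interface, so it works with any compliant connection.

```python
from contextlib import closing


def connect_mysql(host="mysql", user="root", password="example", database="mydb"):
    """Open a MySQL connection. Every default here is a placeholder."""
    # Imported here so the helper below can be used without the MySQL driver.
    import mysql.connector  # provided by the mysql-connector-python package

    return mysql.connector.connect(
        host=host, user=user, password=password, database=database
    )


def run_query(conn, sql):
    """Run one query on a DB-API 2.0 connection and return all rows."""
    with closing(conn.cursor()) as cur:
        cur.execute(sql)
        return cur.fetchall()


# Example use in the notebook (table name is hypothetical):
#   conn = connect_mysql()
#   for row in run_query(conn, "SELECT COUNT(*) FROM my_table"):
#       print(row)
#   conn.close()
```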

**Q2: Persist a table from SQL**

Read a table from the SQL server and save it as a separate file, `input.parquet`.

# Part 2: Data Upload and HDFS Status

**Q3: Check the number of live DataNodes**

Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` to get the status of HDFS.
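
A short sketch of how the live DataNode count could be pulled out of the report from Python. It assumes the `-fs` option is given the NameNode URI `hdfs://boss:9000` that appears elsewhere in this outline, and that the report contains a line of the form `Live datanodes (N)`. The function names are hypothetical.

```python
import re
import subprocess


def dfsadmin_report(namenode="hdfs://boss:9000"):
    """Return the text of `hdfs dfsadmin -report` for the given NameNode."""
    result = subprocess.run(
        ["hdfs", "dfsadmin", "-fs", namenode, "-report"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def live_datanodes(report):
    """Extract N from the 'Live datanodes (N)' line of a dfsadmin report."""
    match = re.search(r"Live datanodes \((\d+)\)", report)
    if match is None:
        raise ValueError("no 'Live datanodes (N)' line found in report")
    return int(match.group(1))


# Example use:
#   print(live_datanodes(dfsadmin_report()))
```

The same helper can be reused after a DataNode is killed to confirm that the count has dropped.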

<br>
Then upload `input.parquet` to HDFS with 2x replication.
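
The replication factor can be set per command with the generic `-D dfs.replication=N` option. The sketch below builds such a command; the helper name and the destination path are assumptions, and `-put` is used here as one of several copy subcommands that would work.

```python
import subprocess


def upload_command(local_path, dest_uri, replication):
    """Build an `hdfs dfs -put` command that sets the replication factor."""
    return [
        "hdfs", "dfs",
        "-D", f"dfs.replication={replication}",
        "-put", local_path, dest_uri,
    ]


# Example use (destination path is an assumption):
#   subprocess.run(
#       upload_command("input.parquet", "hdfs://boss:9000/input.parquet", 2),
#       check=True,
#   )
```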

**Q4: What are the logical and physical sizes of the parquet files?**

Run `hdfs dfs -du -h hdfs://boss:9000/`

# Part 3: PyArrow

**Q5: What is the average of `XXX` (something like this)**

Use PyArrow to read from HDFS and do some calculation.  
<br>


Ask them to do some more complex calculations and store the results as `output.parquet` back in HDFS with 1x replication.

**Q6: Block distribution across the two DataNodes for** `output.parquet` (2x)

Use the WebHDFS `OPEN` operation with `noredirect=true`, starting at `offset` 0 and repeating at each block boundary, to find which DataNode serves each block.

The output looks like: `{'755329887c2a': 9, 'c181cd6fd6fe': 7}`
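
A sketch of one way to produce such a dictionary, under stated assumptions: the NameNode's WebHDFS endpoint is taken to be `http://boss:9870/webhdfs/v1` (9870 is the usual Hadoop 3 default, but is not given in this outline), and the file length and block size are read with `GETFILESTATUS`. With `noredirect=true`, `OPEN` returns a JSON object whose `Location` field is a DataNode URL, so the host name of that URL identifies the DataNode chosen for the block at the requested offset. The `fetch` parameter is only there so the logic can be exercised without a cluster.

```python
import json
from collections import Counter
from urllib.parse import urlparse
from urllib.request import urlopen

WEBHDFS = "http://boss:9870/webhdfs/v1"  # assumed NameNode HTTP address


def get_json(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def blocks_per_datanode(path, fetch=get_json, base=WEBHDFS):
    """Count how many blocks of `path` each DataNode is chosen to serve."""
    status = fetch(f"{base}{path}?op=GETFILESTATUS")["FileStatus"]
    counts = Counter()
    # Each block starts at a multiple of the block size.
    for offset in range(0, status["length"], status["blockSize"]):
        reply = fetch(f"{base}{path}?op=OPEN&offset={offset}&noredirect=true")
        counts[urlparse(reply["Location"]).hostname] += 1
    return dict(counts)


# Example use:
#   print(blocks_per_datanode("/output.parquet"))
```

Note that `OPEN` reports only one DataNode per block, even when the block has two replicas, so the counts sum to the number of blocks in the file.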

**Q7: Block distribution across the two DataNodes for** `output.parquet` (1x)

Use the WebHDFS `GETFILEBLOCKLOCATIONS` operation and iterate over every block to count how many each DataNode holds.
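
The counting step could look roughly like the sketch below. It assumes the same WebHDFS endpoint as before (`http://boss:9870/webhdfs/v1`, not stated in this outline) and the response shape documented for WebHDFS, where `BlockLocations.BlockLocation` is a list of blocks and each block lists its `hosts`.

```python
import json
from collections import Counter
from urllib.request import urlopen

WEBHDFS = "http://boss:9870/webhdfs/v1"  # assumed NameNode HTTP address


def fetch_block_locations(path, base=WEBHDFS):
    """Return the parsed GETFILEBLOCKLOCATIONS response for `path`."""
    with urlopen(f"{base}{path}?op=GETFILEBLOCKLOCATIONS") as resp:
        return json.load(resp)


def count_block_hosts(payload):
    """Count, per host, how many block replicas the response lists."""
    counts = Counter()
    for block in payload["BlockLocations"]["BlockLocation"]:
        counts.update(block["hosts"])
    return dict(counts)


# Example use:
#   print(count_block_hosts(fetch_block_locations("/output.parquet")))
```

For a file with 1x replication each block lists a single host, so the counts sum to the number of blocks.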

# Part 4: Disaster Strikes

Kill one datanode manually.

**Q8: How many live DataNodes are in the cluster?**

Run `hdfs dfsadmin -fs hdfs://boss:9000 -report` again; this time, expect `Live datanodes (1)`.


<br>
Ask students to access `result.parquet`, which is expected to fail.

**Q9: How many blocks of `single.parquet` were lost?**

Use `OPEN` or `GETFILEBLOCKLOCATIONS` to find out.
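
If the `GETFILEBLOCKLOCATIONS` route is taken, the lost blocks can be counted as those with no replica on a live DataNode. The sketch below assumes the documented response shape (`BlockLocations.BlockLocation`, each with a `hosts` list); `live_hosts` is a hypothetical argument holding the host names of the DataNodes still running. How quickly the NameNode stops listing a dead DataNode depends on cluster settings, so the exact contents of `hosts` right after the failure are an assumption.

```python
def count_lost_blocks(payload, live_hosts):
    """Count blocks that have no replica on any host in `live_hosts`."""
    live = set(live_hosts)
    blocks = payload["BlockLocations"]["BlockLocation"]
    # A block is lost when none of its listed hosts is still alive.
    return sum(1 for block in blocks if not live.intersection(block["hosts"]))
```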

**Q10: Return a specific row of the output by recomputing it from the replicated** `input.parquet`
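
One way to structure this is a read-with-fallback helper: try the stored result first, and recompute from the replicated input only if the read fails. This is a sketch, not a prescribed solution; the helper name and both callables are hypothetical, and it assumes that a failed HDFS read surfaces in Python as an `OSError`, which is how PyArrow reports I/O errors.

```python
def load_with_fallback(read_output, recompute_from_input):
    """Return the stored result, or recompute it if reading fails."""
    try:
        return read_output()
    except OSError:
        # The stored blocks are unavailable; rebuild the result from the input.
        return recompute_from_input()


# Example use (the two callables are placeholders for notebook functions):
#   result = load_with_fallback(
#       lambda: read_output_table(),
#       lambda: compute_output(read_input_table()),
#   )
```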