diff --git a/Labs/Lab12/README.md b/Labs/Lab12/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..53807fbf3d729d3ed08bf430ade515ba0f91620e
--- /dev/null
+++ b/Labs/Lab12/README.md
@@ -0,0 +1,9 @@
+# Lab 12: Regression
+
+1. Continue working on P5
+
+2. Related lab documents: [SQL](./SQL.md), [Matrix](./counting-cells.md), [Raster](./raster.md)
+
+# Screenshot Requirement
+
+A screenshot that shows your progress
\ No newline at end of file
diff --git a/Labs/Lab12/SQL.md b/Labs/Lab12/SQL.md
new file mode 100644
index 0000000000000000000000000000000000000000..69bb6c44952996d82f56fed349b2a134450b1c0e
--- /dev/null
+++ b/Labs/Lab12/SQL.md
@@ -0,0 +1,156 @@
+# SQL Database Queries
+
+SQLite databases contain multiple tables. We can write queries to ask
+questions about the data in these tables. One way is to put our
+queries in strings and use `pd.read_sql` to get the results back in
+a DataFrame.
+
+We'll give some examples here that will help you get the data you need
+for P6. The `INNER JOIN` and `GROUP BY` operations will be especially
+useful.
+
+```python
+import pandas as pd
+import sqlite3
+
+df = pd.DataFrame([
+    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "apples", "quantity": 3, "price": 1},
+    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "oranges", "quantity": 4, "price": 0.8},
+    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "cantaloupe", "quantity": 5, "price": 2},
+    {"state": "wi", "city": "milwaukee", "address": "456 State St", "item": "apples", "quantity": 6, "price": 0.9},
+    {"state": "wi", "city": "milwaukee", "address": "456 State St.", "item": "oranges", "quantity": 8, "price": 1.2},
+])
+connection = sqlite3.connect("grocery.db")
+df.to_sql("sales", connection, if_exists="replace", index=False)
+```
+
+Take a look at the data:
+
+```python
+pd.read_sql("SELECT * FROM sales", connection)
+```
+
+To review `GROUP BY`, run this query, which computes how much revenue
+each kind of fruit generates:
+
+```python
+pd.read_sql("SELECT item, SUM(quantity*price) AS dollars FROM sales GROUP BY item", connection)
+```
+
+Now, try to write a query that computes the revenue at each location.
+
+<details>
+  <summary>ANSWER</summary>
+  <code>
+  pd.read_sql("SELECT state, city, address, SUM(quantity*price) AS dollars FROM sales GROUP BY state, city, address", connection)
+  </code>
+</details>
+
+Notice a problem? The issue is that all address information is
+repeated for each location. That wastes space, but much worse, it
+opens the possibility for typos leading to results such as this:
+
+<img src="err.png" width=400>
+
+To avoid these issues, it's common in practice to break up such a
+table into two smaller tables, perhaps named `locations` and `sales`.
+A `location_id` field might make it possible to combine the
+information. Run the following to rebuild the database with these
+two tables:
+
+```python
+df = pd.DataFrame([
+    {"location_id": 1, "state": "wi", "city": "madison", "address": "123 Main St."},
+    {"location_id": 2, "state": "wi", "city": "milwaukee", "address": "456 State St."},
+])
+df.to_sql("locations", connection, if_exists="replace", index=False)
+
+df = pd.DataFrame([
+    {"location_id": 1, "item": "apples", "quantity": 3, "price": 1},
+    {"location_id": 1, "item": "oranges", "quantity": 4, "price": 0.8},
+    {"location_id": 1, "item": "cantaloupe", "quantity": 5, "price": 2},
+    {"location_id": 2, "item": "apples", "quantity": 6, "price": 0.9},
+    {"location_id": 2, "item": "oranges", "quantity": 8, "price": 1.2},
+])
+df.to_sql("sales", connection, if_exists="replace", index=False)
+```
+
+Take a look at each table:
+
+* `pd.read_sql("SELECT * FROM sales", connection)`
+* `pd.read_sql("SELECT * FROM locations", connection)`
+
+Note that you *could* figure out the location for each sale in the
+`sales` table by using the `location_id` to find that information in
+`locations`.
+
+There's an easier way: `INNER JOIN` (there are other kinds of joins
+that we won't discuss in CS 320).
+
+Try running this:
+
+```python
+pd.read_sql("SELECT * FROM locations INNER JOIN sales", connection)
+```
+
+Notice that the `INNER JOIN` is creating a row for every combination
+of the 2 rows in `locations` and the 5 rows in `sales`, for a total of
+10 result rows. Most of these results are meaningless: `location_id`
+appears twice in every output row, and in most rows the two values are
+inconsistent.
+
+We need to add an `ON` clause to match up each `sales` row with the
+`locations` row that has the same `location_id`. Add
+`ON locations.location_id = sales.location_id` to the end of the query:
+
+```python
+pd.read_sql("SELECT * FROM locations INNER JOIN sales ON locations.location_id = sales.location_id", connection)
+```
+
+The `location_id` was only useful for matching up the rows, so you may
+want to drop it in pandas (there's no simple way to do this in SQL):
+
+```python
+pd.read_sql("""
+    SELECT * FROM
+    locations INNER JOIN sales
+    ON locations.location_id = sales.location_id""",
+    connection).drop(columns="location_id")
+```
+
+We can also write the same kinds of queries we used before, when we
+only had one table; the `GROUP BY` clause comes after the
+`INNER JOIN`. How much revenue did each fruit generate?
+
+```python
+pd.read_sql("""
+    SELECT item, SUM(quantity*price) AS dollars
+    FROM locations INNER JOIN sales
+    ON locations.location_id = sales.location_id
+    GROUP BY item""", connection)
+```
+
+Now, try writing a query to answer this question: how much revenue
+was there at each location?
+
+<details>
+  <summary>ANSWER (option 1)</summary>
+  <code>
+pd.read_sql("""
+    SELECT state, city, address, SUM(quantity*price) AS dollars
+    FROM locations INNER JOIN sales
+    ON locations.location_id = sales.location_id
+    GROUP BY state, city, address""", connection)
+  </code>
+</details>
+
+<details>
+  <summary>ANSWER (option 2)</summary>
+  <code>
+pd.read_sql("""
+    SELECT state, city, address, SUM(quantity*price) AS dollars
+    FROM locations INNER JOIN sales
+    ON locations.location_id = sales.location_id
+    GROUP BY locations.location_id""", connection)
+  </code>
+</details>
diff --git a/Labs/Lab12/counting-cells.md b/Labs/Lab12/counting-cells.md
new file mode 100644
index 0000000000000000000000000000000000000000..d8f98ce746ea137a390fbc7cb3e864169d438422
--- /dev/null
+++ b/Labs/Lab12/counting-cells.md
@@ -0,0 +1,49 @@
+# Counting Cells
+
+Run the following:
+
+```python
+import numpy as np
+a = np.array([
+    [0,0,5,8],
+    [1,2,4,8],
+    [2,4,6,9],
+])
+```
+
+How many even numbers are in this matrix? What percentage of the
+numbers are even? We'll walk you through the solution. Please run
+each step (which builds on the previous one) to see what's happening.
+
+First step: mod by 2, which gives a 0 in every even cell and a 1 in
+every odd cell:
+
+```python
+a % 2
+```
+
+Now, let's do an elementwise comparison to get a True in every place
+where there is an even number:
+
+```python
+a % 2 == 0
+```
+
+It will be easier to count matches if we represent True as 1 and
+False as 0:
+
+```python
+(a % 2 == 0).astype(int)
+```
+
+How many is that?
+
+```python
+(a % 2 == 0).astype(int).sum()
+```
+
+And what percent of the total is that?
+
+```python
+(a % 2 == 0).astype(int).mean() * 100
+```
+
+This may be useful for counting what percentage of an area matches a
+given land type in P6.
diff --git a/Labs/Lab12/raster.md b/Labs/Lab12/raster.md
new file mode 100644
index 0000000000000000000000000000000000000000..b0cf862d3cc0198361ef99bf299d2c00d0c3b356
--- /dev/null
+++ b/Labs/Lab12/raster.md
@@ -0,0 +1,93 @@
+# Geographic Raster Data
+
+In class, we learned geopandas, which is a *vector-based* GIS tool --
+that means geo data is represented by vectors of coordinates, which
+form polygons and other shapes.
+
+*Raster* data is the other common kind of geo data you'll encounter.
+With raster data, you break land into a matrix, with the number in
+each cell telling you something about the land at that position.
+
+In this part, we'll learn a bit about the `rasterio` module. It will
+help us create numpy arrays corresponding to how land is used in a
+given WI county (this will be useful for predicting things like a
+county's population).
+
+First, install some packages:
+
+```
+pip3 install rasterio Pillow
+```
+
+P6 includes a `land.zip` dataset. Let's open it (the path below
+assumes your lab notebook is in a directory alongside the `p6`
+directory, such as this lab's directory -- modify the path to
+`land.zip` if your notebook is elsewhere):
+
+```python
+import rasterio
+land = rasterio.open("zip://../p6/land.zip!wi.tif")
+```
+
+This is the dataset for all of WI. Let's say we only want to see Dane
+County (which contains Madison). We can get its outline from TIGERweb,
+a service run by the US Census Bureau:
+
+1. go to https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/
+2. click "TIGERweb/tigerWMS_Census2020"
+3. click "Counties (82)"
+4. at the bottom, click "Query"
+5. in the "Where" box, type `NAME='Dane County'` exactly
+6. under "Format", choose "GeoJSON"
+7. click "Query (GET)"
+8. copy the URL
+
+Paste the URL into the following code snippet:
+
+```python
+import geopandas as gpd
+url = "????"
+dane = gpd.read_file(url)
+dane.plot()
+```
+
+You should see a rough outline of Dane County.
+
+**NOTE: do not make requests to TIGERweb as part of P6. We have
+already done so and saved the results in a geojson file we provide.**
+
+We can use that outline as a *mask* on the raster data to get a numpy
+array of land use. A mask identifies the specific cells in a matrix
+that matter to us (note that we need to convert our geopandas data to
+the same CRS as the rasterio data):
+
+```python
+from rasterio.mask import mask
+matrix, _ = mask(land, dane.to_crs(land.crs)["geometry"], crop=True)
+matrix = matrix[0]
+```
+
+Let's visualize the county:
+
+```python
+import matplotlib.pyplot as plt
+plt.imshow(matrix)
+```
+
+It should look like this:
+
+<img src="dane.png" width=400>
+
+Browse the legend here: https://www.mrlc.gov/data/legends/national-land-cover-database-2019-nlcd2019-legend
+
+We see water is encoded as 11, so we can highlight all the water
+regions in Dane County like this:
+
+```python
+plt.imshow(matrix == 11)
+```
+
+Try filtering the matrix in different ways to see where each of the
+following land covers is dominant (a sketch for one of them appears
+after this list):
+
+* Deciduous Forest
+* Cultivated Crops
+* Developed, Low Intensity
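+
+For example, here is a minimal sketch for the first one. It assumes
+the `matrix` variable from the cells above, and it assumes Deciduous
+Forest is encoded as 41, which is what the NLCD legend linked above
+lists -- double-check the code against the legend before relying on
+it. The `.mean()` trick from the counting-cells document gives the
+percentage of matching cells:
+
+```python
+import matplotlib.pyplot as plt
+
+# Highlight cells whose land-cover code is 41 (Deciduous Forest,
+# per the NLCD legend linked above -- verify before relying on it).
+plt.imshow(matrix == 41)
+
+# Percentage of cells that match. Note the denominator includes the
+# cells outside the county boundary, which mask(..., crop=True)
+# filled with zeros.
+print((matrix == 41).mean() * 100)
+```
+
+The same pattern works for the other land covers -- only the code
+taken from the legend changes.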