# Lab 12: Regression
1. Continue working on P6
2. Related lab documents: [SQL](./SQL.md), [Matrix](./counting-cells.md), [Raster](./raster.md)
# Screenshot Requirement
A screenshot that shows your progress
# SQL Database Queries
SQLite databases contain multiple tables. We can write queries to ask
questions about the data in these tables. One way is by putting our
queries in strings and using `pd.read_sql` to get the results back in
a DataFrame.
We'll give some examples here that will help you get the data you need
for P6. The `INNER JOIN` and `GROUP BY` operations will be especially
useful.
```python
import pandas as pd
import sqlite3
df = pd.DataFrame([
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "apples", "quantity": 3, "price": 1},
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "oranges", "quantity": 4, "price": 0.8},
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "cantaloupe", "quantity": 5, "price": 2},
    {"state": "wi", "city": "milwaukee", "address": "456 State St", "item": "apples", "quantity": 6, "price": 0.9},
    {"state": "wi", "city": "milwaukee", "address": "456 State St.", "item": "oranges", "quantity": 8, "price": 1.2},
])
connection = sqlite3.connect("grocery.db")
df.to_sql("sales", connection, if_exists="replace", index=False)
```
Take a look at the data:
```python
pd.read_sql("SELECT * FROM sales", connection)
```
To review `GROUP BY`, take a look at this query, which computes how much
revenue each kind of fruit generates, and run it:
```python
pd.read_sql("SELECT item, SUM(quantity*price) AS dollars FROM sales GROUP BY item", connection)
```
Now, try to write a query that computes the revenue at each location.
<details>
<summary>ANSWER</summary>
<code>
pd.read_sql("SELECT state, city, address, SUM(quantity*price) AS dollars FROM sales GROUP BY state, city, address", connection)
</code>
</details>
Notice a problem? The issue is that all of the address information is
repeated for each location. That wastes space, but much worse, it
opens the door to typos that lead to results such as this:
<img src="err.png" width=400>
To avoid these issues, it's common in practice to break up such a
table into two smaller tables, perhaps named `locations` and `sales`.
A `location_id` field might make it possible to combine the
information.
```python
df = pd.DataFrame([
    {"location_id": 1, "state": "wi", "city": "madison", "address": "123 Main St."},
    {"location_id": 2, "state": "wi", "city": "milwaukee", "address": "456 State St."},
])
df.to_sql("locations", connection, if_exists="replace", index=False)

df = pd.DataFrame([
    {"location_id": 1, "item": "apples", "quantity": 3, "price": 1},
    {"location_id": 1, "item": "oranges", "quantity": 4, "price": 0.8},
    {"location_id": 1, "item": "cantaloupe", "quantity": 5, "price": 2},
    {"location_id": 2, "item": "apples", "quantity": 6, "price": 0.9},
    {"location_id": 2, "item": "oranges", "quantity": 8, "price": 1.2},
])
df.to_sql("sales", connection, if_exists="replace", index=False)
```
Take a look at each table:
* `pd.read_sql("SELECT * FROM sales", connection)`
* `pd.read_sql("SELECT * FROM locations", connection)`
Note that you *could* figure out the location for each sale in the
`sales` table by using the `location_id` to find that information in
`locations`.
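For instance, here's a minimal sketch of that manual lookup in pandas (the variable names are ours):
```python
# manual approach: pull both tables into pandas, then find the
# locations row whose location_id matches the first sale
sales_df = pd.read_sql("SELECT * FROM sales", connection)
locations_df = pd.read_sql("SELECT * FROM locations", connection)
loc_id = sales_df.loc[0, "location_id"]              # location_id of the first sale
locations_df[locations_df["location_id"] == loc_id]  # matching row in locations
```
Doing this for every sale would be tedious.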
There's an easier way: `INNER JOIN` (there are other kinds of joins
that we won't discuss in CS 320).
Try running this:
```python
pd.read_sql("SELECT * FROM locations INNER JOIN sales", connection)
```
Notice that the `INNER JOIN` is creating a row for every combination
of the 2 rows in `locations` and the 5 rows in `sales`, for a total of
10 result rows. Most of these results are meaningless: every output
row has `location_id` appearing twice, and in many rows the two values
are inconsistent.
We need to add an `ON` clause to match up each `sales` row with the
`locations` row that has the same `location_id`. Add `ON
locations.location_id = sales.location_id` to the end of the query:
```python
pd.read_sql("SELECT * FROM locations INNER JOIN sales ON locations.location_id = sales.location_id", connection)
```
The `location_id` was only useful for matching up the rows, so you may
want to drop it in pandas (there's no simple way to exclude a column
in SQL; you would have to list every other column explicitly):
```python
pd.read_sql("""
SELECT * FROM
locations INNER JOIN sales
ON locations.location_id = sales.location_id""",
connection).drop(columns="location_id")
```
We can also run the same kinds of queries as before, when we only had
one table. The `GROUP BY` will come after the `INNER JOIN`. How much
revenue did each fruit generate?
```python
pd.read_sql("""
SELECT item, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY item""", connection)
```
Now, try writing a query to answer the question: how much revenue was
there at each location?
<details>
<summary>ANSWER (option 1)</summary>
<code>
pd.read_sql("""
SELECT state, city, address, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY state, city, address""", connection)
</code>
</details>
<details>
<summary>ANSWER (option 2)</summary>
<code>
pd.read_sql("""
SELECT state, city, address, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY locations.location_id""", connection)
</code>
</details>
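Option 2 works because SQLite allows a query to select columns that don't appear in its `GROUP BY` clause; since each `location_id` corresponds to exactly one state, city, and address, both options give the same result here.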
# Counting Cells
Run the following:
```python
import numpy as np
a = np.array([
    [0, 0, 5, 8],
    [1, 2, 4, 8],
    [2, 4, 6, 9],
])
```
How many even numbers are in this matrix? What percentage of the
numbers are even? We'll walk you through the solution. Please run
each step (which builds on the previous) to see what's happening.
First step: mod by 2, which gives a 0 in every even cell (and a 1 in every odd cell):
```python
a % 2
```
Now, let's do an elementwise comparison to get a True in every place where there is an even number:
```python
a % 2 == 0
```
It will be easier to count matches if we represent True as 1 and False as 0:
```python
(a % 2 == 0).astype(int)
```
How many is that?
```python
(a % 2 == 0).astype(int).sum()
```
And what percent of the total is that?
```python
(a % 2 == 0).astype(int).mean() * 100
```
This may be useful for counting what percentage of an area matches a
given land type in P6.
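As an aside, the `.astype(int)` step is optional: numpy already treats `True` as 1 and `False` as 0 in arithmetic, so `.sum()` and `.mean()` work directly on the boolean matrix. Here's a sketch of that counting pattern, with 8 standing in as a placeholder for whatever value you actually care about:
```python
# percentage of cells equal to a particular value (8 is just a placeholder)
(a == 8).mean() * 100
```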
# Geographic Raster Data
In class, we learned geopandas, which is a *vector-based* GIS tool --
that means geo data is represented by vectors of coordinates, which
form polygons and other shapes.
*Raster* data is the other common kind of geo data you'll encounter.
With raster data, you break land into a matrix, with numbers in each
cell telling you something about the land at a given position.
In this part, we'll learn a bit about the `rasterio` module. It will
help us create numpy arrays corresponding to how land is used in a
given WI county (this will be useful for predicting things like a
county's population).
First, install some packages:
```
pip3 install rasterio Pillow
```
P6 includes a `land.zip` dataset. Let's open it (the path below assumes
your lab notebook is in a directory alongside `p6` -- you may need to modify the path to
`land.zip` if your notebook is elsewhere):
```python
import rasterio
land = rasterio.open("zip://../p6/land.zip!wi.tif")
```
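If you're curious, you can peek at what `rasterio.open` returned; `crs`, `width`, and `height` are attributes of the dataset object:
```python
# basic metadata about the raster
print(land.crs)                 # coordinate reference system
print(land.width, land.height)  # size of the grid in cells
```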
This is the dataset for all of WI. Let's say we want to only see Dane
County (which contains Madison). We can get this from TIGERweb, a
service run by the US Census Bureau.
1. go to https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/
2. click "TIGERweb/tigerWMS_Census2020"
3. click "Counties (82)"
4. at the bottom, click "Query"
5. in the "Where" box, type `NAME='Dane County'` exactly
6. under "Format", choose "GeoJSON"
7. click "Query (GET)"
8. copy the URL
Paste the URL into the following code snippet:
```python
import geopandas as gpd
url = "????"
dane = gpd.read_file(url)
dane.plot()
```
You should see a rough outline of Dane County.
**NOTE: do not make requests to TIGERweb as part of P6. We have
already done so and saved the results in a geojson file we
provide.**
We can use that outline as a *mask* on the raster data to get a numpy
array of land use. A mask identifies specific cells in a matrix that
matter to us (note that we need to convert our geopandas data to the
same CRS as the rasterio data):
```python
from rasterio.mask import mask

# convert the county outline to the raster's CRS, then crop the raster
# to the cells inside that outline
matrix, _ = mask(land, dane.to_crs(land.crs)["geometry"], crop=True)
matrix = matrix[0]  # mask returns a 3D (band, row, col) array; keep the only band
```
Let's visualize the county:
```python
import matplotlib.pyplot as plt
plt.imshow(matrix)
```
It should look like this:
<img src="dane.png" width=400>
Browse the legend here: https://www.mrlc.gov/data/legends/national-land-cover-database-2019-nlcd2019-legend
We see water is encoded as 11. We can highlight all the water regions in Dane County like this:
```python
plt.imshow(matrix == 11)
```
Try filtering the matrix in different ways to see where the following land covers are dominant (a sketch for the first one follows the list):
* Deciduous Forest
* Cultivated Crops
* Developed, Low Intensity
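According to the NLCD legend linked above, Deciduous Forest is code 41, so a sketch for the first one might look like this (look up the codes for the other two yourself):
```python
# highlight cells classified as Deciduous Forest (code 41 in the NLCD legend)
plt.imshow(matrix == 41)
```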