%% Cell type:markdown id:471a762b tags:
# Visualization 2
%% Cell type:markdown id:2478daaa-cb6c-4d73-92af-01ae91e773fe tags:
### Geographic Data / Maps
#### Installation
```bash
pip3 install --upgrade pip
pip3 install geopandas shapely descartes geopy netaddr
sudo apt install -y python3-rtree
```
- `import geopandas as gpd`
- `.shp` => Shapefile
- `gpd.datasets.get_path(<dataset name>)`
  - example: `gpd.datasets.get_path("naturalearth_lowres")`
- `gpd.read_file(<path>)`
%% Cell type:code id:e6f50cc3 tags:
``` python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
import requests
import re
import os
# new import statements
import geopandas as gpd
from shapely.geometry import Point, Polygon, box
```
%% Cell type:code id:257671bc-1e79-47c2-aa7a-1725562d23ef tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets/naturalearth_lowres
```
%% Cell type:code id:e7269706-188a-48e3-882a-ac068170b1af tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets
```
%% Cell type:code id:273ed288 tags:
``` python
# Find the path for "naturalearth_lowres"
path = gpd.datasets.get_path("naturalearth_lowres")
# Read the geojson file for "naturalearth_lowres"
gdf = gpd.read_file("naturalearth_lowres.geojson")
# set index using "name" column
gdf = gdf.set_index("name")
```
%% Cell type:code id:cf4d871c-d6d6-4d99-be96-75faf804d4a7 tags:
``` python
gdf.head()
```
%% Cell type:code id:5c983208-7851-4d25-92e7-3f110039411a tags:
``` python
type(gdf).__mro__
```
%% Cell type:code id:653edec9-8b82-4684-ae82-a9fa399459ce tags:
``` python
# All shapefiles have a column called "geometry"
# Get the "geometry" column
gdf["geometry"]
```
%% Cell type:code id:e2eb071a-b29a-4edd-8747-50b02fd43108 tags:
``` python
type(gdf["geometry"]).__mro__
```
%% Cell type:code id:da5b5783-d78e-460f-bd22-b41e7f24f871 tags:
``` python
# First country's name and geometry
print(gdf.index[0])
gdf["geometry"].iat[0]
```
%% Cell type:code id:8f4da997-567d-47f6-8594-f0a37b92b9e6 tags:
``` python
# Second country's name and geometry
print(gdf.index[1])
gdf["geometry"].iat[1]
```
%% Cell type:code id:d380b32f-c401-4601-b6d3-02ac344b3f08 tags:
``` python
# Geometry for "United States of America"
# gdf.at[<row_index>, <column_name>]
gdf.at["United States of America", "geometry"]
```
%% Cell type:code id:3528f252-36b7-4ca5-9108-0951367837cc tags:
``` python
# Type of Tanzania's geometry
print(gdf.index[1], type(gdf["geometry"].iat[1]))
# Type of United States of America's geometry
print("United States of America", type(gdf.at["United States of America", "geometry"]))
```
%% Cell type:markdown id:47711b72-a24c-4da8-a804-3faf88a5fe8b tags:
- `gdf.plot(figsize=(<width>, <height>), column=<column name>)`
- `ax.set_axis_off()`
%% Cell type:code id:9874c7de-226e-4296-af60-45e091836a19 tags:
``` python
ax = gdf.plot(figsize=(8,4))
```
%% Cell type:code id:f5547cc5-3404-421e-aeed-4b4cf8ee8382 tags:
``` python
# Set facecolor="0.7", edgecolor="black"
ax = gdf.plot(figsize=(8,4), facecolor="0.7", edgecolor="black")
# Turn off the axes
ax.set_axis_off()
```
%% Cell type:code id:58da716e tags:
``` python
# Color the map based on population: column="pop_est", cmap="cool", legend=True
ax = gdf.plot(figsize=(8,4), column="pop_est", cmap="cool", legend=True)
# Turn off the axes
ax.set_axis_off()
```
%% Cell type:markdown id:c7a85431-bdab-46ab-8e29-5f91d8a2e0bc tags:
#### Create a map where countries with >100M people are red, others are gray.
%% Cell type:code id:962a552b-94a6-4690-8018-f82173dd6096 tags:
``` python
# Create a map where countries with >100M people are red, others are gray
# Add a new column called color to gdf and set default value to "lightgray"
gdf["color"] = "lightgray"
# Boolean indexing to set color to red for countries with "pop_est" > 1e8
gdf.loc[gdf["pop_est"] > 1e8, "color"] = "red"
# Create the plot
ax = gdf.plot(figsize=(8,4), color=gdf["color"])
ax.set_axis_off()
```
%% Cell type:markdown id:5a32c52c-509f-4f7f-8dd6-bc7fe53c64fd tags:
### All shapefile geometries are shapely shapes.
%% Cell type:code id:33e92def-a7db-4d90-adc8-a3d01d472ac1 tags:
``` python
type(gdf["geometry"].iat[2])
```
%% Cell type:markdown id:39c95c23 tags:
### Shapely shapes
- `from shapely.geometry import Point, Polygon, box`
- `Polygon([(<x1>, <y1>), (<x2>, <y2>), (<x3>, <y3>), ...])`
- `box(minx, miny, maxx, maxy)`
- `Point(<x>, <y>)`
- `<shapely object>.buffer(<size>)`
- example: `Point(5, 5).buffer(3)` creates a circle
%% Cell type:code id:61716db9 tags:
``` python
triangle = Polygon([(0, 0), (10, 0), (5, 5)])  # example coordinates for a triangle
triangle
```
%% Cell type:code id:6a36021a-8653-4698-a0ba-818c14091d79 tags:
``` python
type(triangle)
```
%% Cell type:code id:bddd958d-6fde-42df-8ae3-459e25fa3f04 tags:
``` python
box1 = box(0, 0, 8, 8)  # example bounds; box is not a type, just a function that creates a box
box1
```
%% Cell type:code id:eb7ade8d-96d2-4770-bce9-b3e35996b0b8 tags:
``` python
type(box1)
```
%% Cell type:code id:e5308c61-b1fa-433b-bb05-485ca7bd23da tags:
``` python
point = Point(2, 2)  # example coordinates
point
```
%% Cell type:code id:5c669619-af78-477d-b807-3e6d99278f2f tags:
``` python
type(point)
```
%% Cell type:code id:f015d27e-8fd8-446f-992f-4f7a88cc582d tags:
``` python
# use buffer to create a circle from a point
circle = point.buffer(3)
circle
```
%% Cell type:code id:39fe5752-8038-46cc-b4a6-320e6a78bbd1 tags:
``` python
type(circle)
```
%% Cell type:code id:800cc48d-c241-4439-9161-ff309e50373f tags:
``` python
triangle_buffer = triangle.buffer(3)
triangle_buffer
```
%% Cell type:code id:6b9478d3-7cc9-4e80-ae84-e56d7bfd0b71 tags:
``` python
type(triangle_buffer)
```
%% Cell type:markdown id:1a340443-d750-43d5-99cd-28e7034ce898 tags:
#### Shapely methods
- `union`: any point that is in either shape (OR)
- `intersection`: any point that is in both shapes (AND)
- `difference`: subtraction
- `intersects`: do they overlap? returns True / False
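A quick numeric sketch of these four methods; the two overlapping squares below are chosen just for illustration:

```python
from shapely.geometry import box

a = box(0, 0, 2, 2)   # 2x2 square, area 4
b = box(1, 1, 3, 3)   # another 2x2 square, overlapping a in a 1x1 corner

print(a.union(b).area)         # OR: 4 + 4 - 1 = 7.0
print(a.intersection(b).area)  # AND: the shared 1x1 corner = 1.0
print(a.difference(b).area)    # a minus the overlap = 3.0
print(a.intersects(b))         # True
```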
%% Cell type:code id:d1c5f9f7 tags:
``` python
# union of triangle and box1:
# any point that is in either shape (OR)
triangle.union(box1)
```
%% Cell type:code id:8a2d3357 tags:
``` python
# intersection of triangle and box1:
# any point that is in both shapes (AND)
triangle.intersection(box1)
```
%% Cell type:code id:f327bb99-be34-4a00-a216-a6f40df48268 tags:
``` python
# difference of triangle and box1
triangle.difference(box1)  # subtraction
```
%% Cell type:code id:9082b54a tags:
``` python
# difference of box1 and triangle
box1.difference(triangle) # subtraction
```
%% Cell type:code id:0def4359-78f6-4f14-80ae-05c25ff64bc5 tags:
``` python
# check whether triangle intersects box1
# that is, do they overlap?
triangle.intersects(box1)
```
%% Cell type:markdown id:51a82cf3-fe28-46d2-a019-d87f6a0f41b6 tags:
Is the point "near" (<6 units) the triangle?
%% Cell type:code id:7a87b70f tags:
``` python
triangle.union(point.buffer(6))
```
%% Cell type:code id:e04daa60 tags:
``` python
triangle.intersects(point.buffer(6))
```
%% Cell type:markdown id:bf500303 tags:
#### Extracting "Europe" data from "naturalearth_lowres"
%% Cell type:code id:410b08cf tags:
``` python
# Europe bounding box
eur_window = box(-10.67, 34.5, 31.55, 71.05)
```
%% Cell type:markdown id:d45fdfcb-b244-4eff-a9d6-5cc4fefdfbd8 tags:
Can we use `intersects` method?
%% Cell type:code id:24f4a32b-9ed4-468b-a194-ebe0395fcdb6 tags:
``` python
gdf.intersects(eur_window)
```
%% Cell type:code id:ff455178-84da-45bf-b5b9-a71a7f9160f7 tags:
``` python
# Incorrect v1
gdf[gdf.intersects(eur_window)].plot()
```
%% Cell type:code id:03aa8da8-72f4-45fd-8931-e410d1204226 tags:
``` python
# Incorrect v2
gdf[~gdf.intersects(eur_window)].plot()
```
%% Cell type:markdown id:4173f3c5-017f-44ff-b713-158219fcf3e5 tags:
Can we use `intersection` method?
%% Cell type:code id:d7226fd9-3bb2-48c3-a5e9-5ce1ca0dd88f tags:
``` python
gdf.intersection(eur_window)
```
%% Cell type:code id:108b8b8a tags:
``` python
gdf.intersection(eur_window).plot()
```
%% Cell type:markdown id:f13e4e16-286a-4681-8bc5-9bd3fcf599a7 tags:
How can we get rid of empty polygons (and remove the warning)?
%% Cell type:code id:8c0dd051-a808-4fd4-b5ff-fee6ed580753 tags:
``` python
eur = gdf.intersection(eur_window)
eur.is_empty
```
%% Cell type:markdown id:acb1b886-f7c8-41d4-979a-2309c2b973bd tags:
Remove all the empty polygons using `is_empty`.
%% Cell type:code id:55a76a00 tags:
``` python
eur = eur[~eur.is_empty]
eur
```
%% Cell type:code id:08c59df7 tags:
``` python
eur.plot()
```
%% Cell type:markdown id:97a12d23 tags:
#### Centroids of European countries
%% Cell type:code id:eb5c83b7 tags:
``` python
# plot the centroids
ax = eur.plot(facecolor="lightgray", edgecolor="k")
eur.centroid.plot(ax=ax)
```
%% Cell type:markdown id:dbfab5e7-5941-438a-baf2-e05369106004 tags:
### Lat / Lon CRS
- Lon is the x-coordinate
- Lat is the y-coordinate
- tells you where the point on Earth is
- **IMPORTANT**: degrees are not a unit of distance. Moving 1 degree of longitude near the equator covers far more ground than moving 1 degree of longitude near the North Pole.

Use `.crs` to access the CRS of a GeoDataFrame.
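The degrees-vs-distance point can be checked with a rough sketch that assumes a spherical Earth (the 6,371 km mean radius is an approximation used only for this illustration):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)

def meters_per_degree_lon(lat_deg):
    # a degree of longitude spans a circle that shrinks toward the poles
    return (math.pi / 180) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))

print(round(meters_per_degree_lon(0)))   # equator: ~111 km per degree
print(round(meters_per_degree_lon(60)))  # 60 degrees north: exactly half of that
```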
%% Cell type:code id:89251896 tags:
``` python
eur.crs
```
%% Cell type:markdown id:2a6ba447-6f1a-4d31-8c33-ec1eeb9e0deb tags:
#### A single CRS doesn't work for the whole Earth
- Setting a different CRS for Europe that is based on meters.
- https://spatialreference.org/ref/?search=europe
%% Cell type:code id:586038b1 tags:
``` python
# Setting CRS to "EPSG:3035"
eur2 = eur.to_crs("EPSG:3035")
eur2.crs
```
%% Cell type:code id:d46c124c-9aff-4c2f-81fe-08885b606800 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot()  # oops: without ax=ax, this draws on a separate figure
```
%% Cell type:code id:045b9c33 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax)
```
%% Cell type:markdown id:0634941f tags:
#### How much error does lat/lon computation introduce?
%% Cell type:code id:c0b72aff tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax, color="k") # black => correct
eur.centroid.to_crs("EPSG:3035").plot(ax=ax, color="r") # red => miscalculated
```
%% Cell type:code id:ca9e306e tags:
``` python
type(eur2.iloc[0])
```
%% Cell type:code id:f489c88d-5964-4358-b8c4-fc80d28b5491 tags:
``` python
type(eur2).__mro__
```
%% Cell type:markdown id:85880c73 tags:
#### Area of European countries
%% Cell type:code id:3e4874d9 tags:
``` python
eur2.area # area in sq meters
```
%% Cell type:markdown id:95f55824 tags:
What is the area in **sq miles**?
%% Cell type:code id:85ee20c2 tags:
``` python
# Conversion: sq meters -> sq km (/ 1000 / 1000), sq km -> sq miles (/ 2.59)
(eur2.area / 1000 / 1000 / 2.59).sort_values(ascending=False)
# careful! some countries (e.g., Russia) were cropped when we did intersection
```
%% Cell type:code id:cd600837 tags:
``` python
# area in square degrees (lat/lon units), not a real-world area
eur.area
```
%% Cell type:markdown id:daf1245f-9939-468d-9a5f-34225c2dbc51 tags:
### CRS
- `<GeoDataFrame object>.crs`: gives you information about current CRS.
- `<GeoDataFrame object>.to_crs(<TARGET CRS>)`: changes CRS to `<TARGET CRS>`.
%% Cell type:markdown id:9da3ee4c tags:
### Madison area emergency services
- Data source: https://data-cityofmadison.opendata.arcgis.com/
- Search for:
- "City limit"
- "Lakes and rivers"
- "Fire stations"
- "Police stations"
- CRS for Madison area: https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system#/media/File:Universal_Transverse_Mercator_zones.svg
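The UTM zone for a given longitude can be computed directly. A quick sketch (Madison sits near longitude -89.4, which lands in zone 16, matching the `epsg:32616` code used below):

```python
def utm_zone(lon_deg):
    # UTM zones are 6 degrees of longitude wide, numbered 1..60 starting at 180 W
    return int((lon_deg + 180) // 6) + 1

print(utm_zone(-89.4))  # Madison, WI -> 16
```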
%% Cell type:code id:a6f80847 tags:
``` python
city = gpd.read_file("City_Limit.zip").to_crs("epsg:32616")
```
%% Cell type:code id:7f8595be tags:
``` python
city.crs
```
%% Cell type:code id:9ebd361f tags:
``` python
water = gpd.read_file("Lakes_and_Rivers.zip").to_crs(city.crs)
fire = gpd.read_file("Fire_Stations.zip").to_crs(city.crs)
police = gpd.read_file("Police_Stations.zip").to_crs(city.crs)
```
%% Cell type:markdown id:723e2bff-545f-4a8c-9f72-8a9906a99b1b tags:
#### Run this on your virtual machine
`sudo sh -c "echo 'Options = UnsafeLegacyRenegotiation' >> /etc/ssl/openssl.cnf"`
then restart notebook!
%% Cell type:markdown id:b6069860-7dd9-43f2-970e-fd15980135ef tags:
#### GeoJSON
How do we find the URL below?
- Go to info page of a dataset, for example: https://data-cityofmadison.opendata.arcgis.com/datasets/police-stations/explore?location=43.081769%2C-89.391550%2C12.81
- Then click on "I want to use this" > "View API Resources" > "GeoJSON"
%% Cell type:code id:3a095d5e tags:
``` python
url = "https://maps.cityofmadison.com/arcgis/rest/services/Public/OPEN_DATA/MapServer/2/query?outFields=*&where=1%3D1&f=geojson"
police2 = gpd.read_file(url).to_crs(city.crs)
```
%% Cell type:code id:248be81e tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
%% Cell type:code id:3a609d81 tags:
``` python
fire.to_file("fire.geojson")
```
%% Cell type:markdown id:77600884 tags:
### Geocoding: street address => lat / lon
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts a street address into lat/lon
#### Daily incident reports: https://www.cityofmadison.com/fire/daily-reports
%% Cell type:code id:6b0b2aa0 tags:
``` python
url = "https://www.cityofmadison.com/fire/daily-reports"
r = requests.get(url)
r
```
%% Cell type:code id:bee28b41 tags:
``` python
r.raise_for_status() # give me an exception if not 200 (e.g., 404)
```
%% Cell type:code id:0bae00e7 tags:
``` python
# sometimes the following doesn't work
# pd.read_html(url)
```
%% Cell type:code id:47173ec2 tags:
``` python
# print(r.text)
```
%% Cell type:markdown id:39f166c5 tags:
Find all the street addresses (the `<p>... block ...<br>` lines) using regex.
%% Cell type:code id:ac7b9482-5512-49e4-b0b8-7f9ed34b5844 tags:
``` python
# <p>1700 block Thierer Road<br>
# <p>2300 block Pennsylvania Avenue<br>
# addrs = re.findall(r'<p>1700 block Thierer Road<br>', r.text)
```
%% Cell type:code id:8e9b49d2-3e0d-4a39-9dcf-13b288a165e7 tags:
``` python
# pattern based on the example lines shown above (the site's structure may change)
addrs = re.findall(r'<p>(\d+ block [^<]*)<br>', r.text)
addrs = pd.Series(addrs)
addrs
```
%% Cell type:markdown id:4fc8ba0c-9d45-4251-8055-1310cabf15a9 tags:
#### Without a city and state name, geocoding returns a match for the best-known location with that street name.
%% Cell type:code id:095b9ebe-b583-4098-a3a2-47dd46610d59 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue")
geo_info
```
%% Cell type:code id:e52ee0fc-d73c-4942-9bec-d1586a702f68 tags:
``` python
geo_info["address"].loc[0]
```
%% Cell type:markdown id:e80b714c-a962-4a47-b7cc-3a19d0da8ac0 tags:
#### To get the correct address we want, we should concatenate "Madison, Wisconsin" to the end of the address.
%% Cell type:code id:cf3f590b-8d00-46c4-ac57-6f2f61509f68 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue, Madison, Wisconsin")
geo_info
```
%% Cell type:markdown id:3caf8c12-67a3-4e19-a758-d556d010eead tags:
#### Addresses with "block" often won't work or won't give you the correct lat/lon. We need to remove the word "block" before geocoding.
%% Cell type:code id:54103e4a-5ac2-4f19-811b-385c02623ed2 tags:
``` python
gpd.tools.geocode("800 block W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:66072f4b-2286-4d66-ba2c-18ccd2e37ed6 tags:
``` python
gpd.tools.geocode("800 W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:cf982302 tags:
``` python
fixed_addrs = addrs.str.replace(" block ", " ") + ", Madison, Wisconsin"
fixed_addrs
```
%% Cell type:markdown id:d6a267ed-6876-4866-a66f-74390a3d4ee1 tags:
#### Using a different provider than the default one
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts a street address into lat/lon
- We will be using "OpenStreetMap", for which the argument is "nominatim"
- We also need to pass a `user_agent` argument indicating where the request is coming from; for example: "cs320bot"
%% Cell type:code id:ab0e699f tags:
``` python
incidents = gpd.tools.geocode(fixed_addrs, provider="nominatim", user_agent="cs320bot")
incidents
```
%% Cell type:markdown id:4ad73b2c-5171-45c2-a3f0-eef30f93c492 tags:
It is often a good idea to drop NA values, although in this version of the example no geocoding failed.
%% Cell type:code id:41a7d12e-73be-4442-b43c-285229e6cfdb tags:
``` python
incidents = incidents.dropna()
incidents
```
%% Cell type:markdown id:b30733e6-9491-4be0-bf20-bba64903334d tags:
#### Self-practice
If you want practice with regex, try to write a regular expression and use the match result to make sure that "Madison" and "Wisconsin" are part of each address.
%% Cell type:code id:843bbba2-3de5-4cc0-b76b-92e7bebdd379 tags:
``` python
# self-practice
for addr in incidents["address"]:
print(addr)
```
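One possible sketch for the self-practice prompt above; the sample strings are made up to mimic the shape of geocoder output:

```python
import re

# hypothetical samples shaped like geocoded addresses
samples = [
    "1700 Thierer Road, Madison, Dane County, Wisconsin, 53704, United States",
    "800 West Johnson Street, Madison, Dane County, Wisconsin, 53706, United States",
]

pattern = r"\bMadison\b.*\bWisconsin\b"
for addr in samples:
    # re.search returns a match object (truthy) if both words appear in order
    assert re.search(pattern, addr), addr
print("every address mentions Madison and Wisconsin")
```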
%% Cell type:code id:1a04c2b0 tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
incidents.to_crs(city.crs).plot(ax=ax, color="k", label="Incidents")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
......
%% Cell type:markdown id:471a762b tags:
# Visualization 2
%% Cell type:markdown id:2478daaa-cb6c-4d73-92af-01ae91e773fe tags:
### Geographic Data / Maps
#### Installation
```python
pip3 install --upgrade pip
pip3 install geopandas shapely descartes geopy netaddr
sudo apt install -y python3-rtree
```
- `import geopandas as gpd`
- `.shp` => Shapefile
- `gpd.datasets.get_path(<shp file path>)`:
- example: `gpd.datasets.get_path("naturalearth_lowres")`
- `gpd.read_file(<path>)`
%% Cell type:code id:e6f50cc3 tags:
``` python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
import requests
import re
import os
# new import statements
import geopandas as gpd
from shapely.geometry import Point, Polygon, box
```
%% Cell type:code id:257671bc-1e79-47c2-aa7a-1725562d23ef tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets/naturalearth_lowres
```
%% Cell type:code id:e7269706-188a-48e3-882a-ac068170b1af tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets
```
%% Cell type:code id:273ed288 tags:
``` python
# Find the path for "naturalearth_lowres"
path =
# Read the shapefile for "naturalearth_lowres" and
# Read the geojson file for "naturalearth_lowres"
gdf = gpd.read_file("naturalearth_lowres.geojson")
# set index using "name" column
gdf =
gdf = gdf.set_index("name")
```
%% Cell type:code id:cf4d871c-d6d6-4d99-be96-75faf804d4a7 tags:
``` python
gdf.head()
```
%% Cell type:code id:5c983208-7851-4d25-92e7-3f110039411a tags:
``` python
type(gdf).__mro__
```
%% Cell type:code id:653edec9-8b82-4684-ae82-a9fa399459ce tags:
``` python
# All shapefiles have a column called "geometry"
# Get the "geometry" column
gdf["geometry"]
```
%% Cell type:code id:e2eb071a-b29a-4edd-8747-50b02fd43108 tags:
``` python
type(gdf["geometry"]).__mro__
```
%% Cell type:code id:da5b5783-d78e-460f-bd22-b41e7f24f871 tags:
``` python
# First country's name and geometry
```
%% Cell type:code id:8f4da997-567d-47f6-8594-f0a37b92b9e6 tags:
``` python
# Second country's name geometry
print(gdf.index[1])
gdf["geometry"].iat[1]
```
%% Cell type:code id:d380b32f-c401-4601-b6d3-02ac344b3f08 tags:
``` python
# Geometry for "United States of America"
# gdf.at[<row_index>, <column_name>]
```
%% Cell type:code id:3528f252-36b7-4ca5-9108-0951367837cc tags:
``` python
# Type of Tanzania's geometry
print(gdf.index[1], type(gdf["geometry"].iat[1]))
# Type of United States of America's geometry
print("United States of America", type(gdf.at["United States of America", "geometry"]))
```
%% Cell type:markdown id:47711b72-a24c-4da8-a804-3faf88a5fe8b tags:
- `gdf.plot(figsize=(<width>, <height>), column=<column name>)`
- `ax.set_axis_off()`
%% Cell type:code id:9874c7de-226e-4296-af60-45e091836a19 tags:
``` python
ax = gdf.plot(figsize=(8,4))
```
%% Cell type:code id:f5547cc5-3404-421e-aeed-4b4cf8ee8382 tags:
``` python
# Set facecolor="0.7", edgecolor="black"
ax = gdf.plot(figsize=(8,4))
# Turn off the axes
# ax.set_axis_off()
```
%% Cell type:code id:58da716e tags:
``` python
# Color the map based on population column, column="pop_est" and set cmap="cool" and legend=True
ax = gdf.plot(figsize=(8,4))
# Turn off the axes
ax.set_axis_off()
```
%% Cell type:markdown id:c7a85431-bdab-46ab-8e29-5f91d8a2e0bc tags:
#### Create a map where countries with >100M people are red, others are gray.
%% Cell type:code id:962a552b-94a6-4690-8018-f82173dd6096 tags:
``` python
# Create a map where countries with >100M people are red, others are gray
# Add a new column called color to gdf and set default value to "lightgray"
# Boolean indexing to set color to red for countries with "pop_est" > 1e8
# Create the plot
# ax = gdf.plot(figsize=(8,4), color=gdf["color"])
# ax.set_axis_off()
```
%% Cell type:markdown id:5a32c52c-509f-4f7f-8dd6-bc7fe53c64fd tags:
### All shapefile geometries are shapely shapes.
%% Cell type:code id:33e92def-a7db-4d90-adc8-a3d01d472ac1 tags:
``` python
type(gdf["geometry"].iat[2])
```
%% Cell type:markdown id:39c95c23 tags:
### Shapely shapes
- `from shapely.geometry import Point, Polygon, box`
- `Polygon([(<x1>, <y1>), (<x2>, <y2>), (<x3>, <y3>), ...])`
- `box(minx, miny, maxx, maxy)`
- `Point(<x>, <y>)`
- `<shapely object>.buffer(<size>)`
- example: `Point(5, 5).buffer(3)` creates a circle
%% Cell type:code id:61716db9 tags:
``` python
triangle = # triangle
triangle
```
%% Cell type:code id:6a36021a-8653-4698-a0ba-818c14091d79 tags:
``` python
type(triangle)
```
%% Cell type:code id:bddd958d-6fde-42df-8ae3-459e25fa3f04 tags:
``` python
box1 = # not a type; just a function that creates box
box1
```
%% Cell type:code id:eb7ade8d-96d2-4770-bce9-b3e35996b0b8 tags:
``` python
type(box1)
```
%% Cell type:code id:e5308c61-b1fa-433b-bb05-485ca7bd23da tags:
``` python
point =
point
```
%% Cell type:code id:5c669619-af78-477d-b807-3e6d99278f2f tags:
``` python
type(point)
```
%% Cell type:code id:f015d27e-8fd8-446f-992f-4f7a88cc582d tags:
``` python
# use buffer to create a circle from a point
circle =
circle
```
%% Cell type:code id:39fe5752-8038-46cc-b4a6-320e6a78bbd1 tags:
``` python
type(circle)
```
%% Cell type:code id:800cc48d-c241-4439-9161-ff309e50373f tags:
``` python
triangle_buffer = triangle.buffer(3)
triangle_buffer
```
%% Cell type:code id:6b9478d3-7cc9-4e80-ae84-e56d7bfd0b71 tags:
``` python
type(triangle_buffer)
```
%% Cell type:markdown id:1a340443-d750-43d5-99cd-28e7034ce898 tags:
#### Shapely methods
- Shapely methods:
- `union`: any point that is in either shape (OR)
- `intersection`: any point that is in both shapes (AND)
- `difference`: subtraction
- `intersects`: do they overlap? returns True / False
%% Cell type:code id:d1c5f9f7 tags:
``` python
# union triangle and box1
# it will give any point that is in either shape (OR)
```
%% Cell type:code id:8a2d3357 tags:
``` python
# intersection triangle and box1
# any point that is in both shapes (AND)
```
%% Cell type:code id:f327bb99-be34-4a00-a216-a6f40df48268 tags:
``` python
# difference of triangle and box1
```
%% Cell type:code id:9082b54a tags:
``` python
# difference of box1 and triangle
box1.difference(triangle) # subtraction
```
%% Cell type:code id:0def4359-78f6-4f14-80ae-05c25ff64bc5 tags:
``` python
# check whether triangle intersects box1
# the is, check do they overlap?
```
%% Cell type:markdown id:51a82cf3-fe28-46d2-a019-d87f6a0f41b6 tags:
Is the point "near" (<6 units) the triangle?
%% Cell type:code id:7a87b70f tags:
``` python
triangle.union(point.buffer(6))
```
%% Cell type:code id:e04daa60 tags:
``` python
triangle.intersects(point.buffer(6))
```
%% Cell type:markdown id:bf500303 tags:
#### Extracting "Europe" data from "naturalearth_lowres"
%% Cell type:code id:410b08cf tags:
``` python
# Europe bounding box
eur_window = box(-10.67, 34.5, 31.55, 71.05)
```
%% Cell type:markdown id:d45fdfcb-b244-4eff-a9d6-5cc4fefdfbd8 tags:
Can we use `intersects` method?
%% Cell type:code id:24f4a32b-9ed4-468b-a194-ebe0395fcdb6 tags:
``` python
gdf.intersects(eur_window)
```
%% Cell type:code id:ff455178-84da-45bf-b5b9-a71a7f9160f7 tags:
``` python
# Incorrect v1
gdf[gdf.intersects(eur_window)].plot()
```
%% Cell type:code id:03aa8da8-72f4-45fd-8931-e410d1204226 tags:
``` python
# Incorrect v2
gdf[~gdf.intersects(eur_window)].plot()
```
%% Cell type:markdown id:4173f3c5-017f-44ff-b713-158219fcf3e5 tags:
Can we use `intersection` method?
%% Cell type:code id:d7226fd9-3bb2-48c3-a5e9-5ce1ca0dd88f tags:
``` python
gdf.intersection(eur_window)
```
%% Cell type:code id:108b8b8a tags:
``` python
gdf.intersection(eur_window).plot()
```
%% Cell type:markdown id:f13e4e16-286a-4681-8bc5-9bd3fcf599a7 tags:
How can we get rid of empty polygons (and remove the warning)?
%% Cell type:code id:8c0dd051-a808-4fd4-b5ff-fee6ed580753 tags:
``` python
eur = gdf.intersection(eur_window)
eur.is_empty
```
%% Cell type:markdown id:acb1b886-f7c8-41d4-979a-2309c2b973bd tags:
Remove all the empty polygons using `is_empty`.
%% Cell type:code id:55a76a00 tags:
``` python
eur = eur[~eur.is_empty]
eur
```
%% Cell type:code id:08c59df7 tags:
``` python
eur.plot()
```
%% Cell type:markdown id:97a12d23 tags:
#### Centroids of European countries
%% Cell type:code id:eb5c83b7 tags:
``` python
# plot the centroids
ax = eur.plot(facecolor="lightgray", edgecolor="k")
eur.centroid.plot(ax=ax)
```
%% Cell type:markdown id:dbfab5e7-5941-438a-baf2-e05369106004 tags:
### Lat / Lon CRS
- Lon is x-coord
- Lat is y-coord
- tells you where the point on Earth is
- **IMPORTANT**: degrees are not a unit of distance. 1 degree of longitute near the equator is a lot farther than moving 1 degree of longitute near the north pole
Using `.crs` to access CRS of a gdf.
%% Cell type:code id:89251896 tags:
``` python
eur.crs
```
%% Cell type:markdown id:2a6ba447-6f1a-4d31-8c33-ec1eeb9e0deb tags:
#### Single CRS doesn't work for the whole earth
- Setting a different CRS for Europe that is based on meters.
- https://spatialreference.org/ref/?search=europe
%% Cell type:code id:586038b1 tags:
``` python
# Setting CRS to "EPSG:3035"
eur2 = eur.to_crs("EPSG:3035")
eur2.crs
```
%% Cell type:code id:d46c124c-9aff-4c2f-81fe-08885b606800 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot()
```
%% Cell type:code id:045b9c33 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax)
```
%% Cell type:markdown id:0634941f tags:
#### How much error does lat/long computation introduce?
#### How much error does lat/lon computation introduce?
%% Cell type:code id:c0b72aff tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax, color="k") # black => correct
eur.centroid.to_crs("EPSG:3035").plot(ax=ax, color="r") # red => miscalculated
```
%% Cell type:code id:ca9e306e tags:
``` python
type(eur2.iloc[0])
```
%% Cell type:code id:f489c88d-5964-4358-b8c4-fc80d28b5491 tags:
``` python
type(eur2).__mro__
```
%% Cell type:markdown id:85880c73 tags:
#### Area of European countries
%% Cell type:code id:3e4874d9 tags:
``` python
eur2.area # area in sq meters
```
%% Cell type:markdown id:95f55824 tags:
What is the area in **sq miles**?
%% Cell type:code id:85ee20c2 tags:
``` python
# Conversion: / 1000 / 1000 / 2.59
(eur2.area / 1000 / 1000 / 2.59).sort_values(ascending=False)
# careful! some countries (e.g., Russia) were cropped when we did intersection
```
%% Cell type:code id:cd600837 tags:
``` python
# area on screen, not real area
eur.area
```
%% Cell type:markdown id:daf1245f-9939-468d-9a5f-34225c2dbc51 tags:
### CRS
- `<GeoDataFrame object>.crs`: gives you information about current CRS.
- `<GeoDataFrame object>.to_crs(<TARGET CRS>)`: changes CRS to `<TARGET CRS>`.
%% Cell type:markdown id:9da3ee4c tags:
### Madison area emergency services
- Data source: https://data-cityofmadison.opendata.arcgis.com/
- Search for:
- "City limit"
- "Lakes and rivers"
- "Fire stations"
- "Police stations"
- CRS for Madison area: https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system#/media/File:Universal_Transverse_Mercator_zones.svg
%% Cell type:code id:a6f80847 tags:
``` python
city = gpd.read_file("City_Limit.zip").to_crs("epsg:32616")
```
%% Cell type:code id:7f8595be tags:
``` python
city.crs
```
%% Cell type:code id:9ebd361f tags:
``` python
water = gpd.read_file("Lakes_and_Rivers.zip").to_crs(city.crs)
fire = gpd.read_file("Fire_Stations.zip").to_crs(city.crs)
police = gpd.read_file("Police_Stations.zip").to_crs(city.crs)
```
%% Cell type:markdown id:723e2bff-545f-4a8c-9f72-8a9906a99b1b tags:
#### Run this on your virtual machine
`sudo sh -c "echo 'Options = UnsafeLegacyRenegotiation' >> /etc/ssl/openssl.cnf"`
then restart notebook!
%% Cell type:markdown id:b6069860-7dd9-43f2-970e-fd15980135ef tags:
#### GeoJSON
How to find the below URL?
- Go to info page of a dataset, for example: https://data-cityofmadison.opendata.arcgis.com/datasets/police-stations/explore?location=43.081769%2C-89.391550%2C12.81
- Then click on "I want to use this" > "View API Resources" > "GeoJSON"
%% Cell type:code id:3a095d5e tags:
``` python
url = "https://maps.cityofmadison.com/arcgis/rest/services/Public/OPEN_DATA/MapServer/2/query?outFields=*&where=1%3D1&f=geojson"
police2 = gpd.read_file(url).to_crs(city.crs)
```
%% Cell type:code id:248be81e tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
%% Cell type:code id:3a609d81 tags:
``` python
fire.to_file("fire.geojson")
```
%% Cell type:markdown id:77600884 tags:
### Geocoding: street address => lat / lon
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/long
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/lon
#### Daily incident reports: https://www.cityofmadison.com/fire/daily-reports
%% Cell type:code id:6b0b2aa0 tags:
``` python
url = "https://www.cityofmadison.com/fire/daily-reports"
r = requests.get(url)
r
```
%% Cell type:code id:bee28b41 tags:
``` python
r.raise_for_status() # give me an exception if not 200 (e.g., 404)
```
%% Cell type:code id:0bae00e7 tags:
``` python
# doesn't work
# sometime the following doesn't work
# pd.read_html(url)
```
%% Cell type:code id:47173ec2 tags:
``` python
# print(r.text)
```
%% Cell type:markdown id:39f166c5 tags:
Find all **span** tags with **streetAddress** using regex.
%% Cell type:code id:ac7b9482-5512-49e4-b0b8-7f9ed34b5844 tags:
``` python
# <p>1700 block Thierer Road<br>
# <p>2300 block Pennsylvania Avenue<br>
# addrs = re.findall(r'<p>1700 block Thierer Road<br>', r.text)
```
%% Cell type:code id:8e9b49d2-3e0d-4a39-9dcf-13b288a165e7 tags:
``` python
# pattern is a best guess based on the examples above: capture the street address between <p> and <br>
addrs = re.findall(r"<p>(\d+ block [^<]*)<br>", r.text)
addrs = pd.Series(addrs)
addrs
```
%% Cell type:markdown id:4fc8ba0c-9d45-4251-8055-1310cabf15a9 tags:
#### Without a city and state name, geocoding returns a match for the most famous location with that street name.
%% Cell type:code id:095b9ebe-b583-4098-a3a2-47dd46610d59 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue")
geo_info
```
%% Cell type:code id:e52ee0fc-d73c-4942-9bec-d1586a702f68 tags:
``` python
geo_info["address"].loc[0]
```
%% Cell type:markdown id:e80b714c-a962-4a47-b7cc-3a19d0da8ac0 tags:
#### To get the correct address we want, we should concatenate "Madison, Wisconsin" to the end of the address.
%% Cell type:code id:cf3f590b-8d00-46c4-ac57-6f2f61509f68 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue, Madison, Wisconsin")
geo_info
```
%% Cell type:markdown id:3caf8c12-67a3-4e19-a758-d556d010eead tags:
#### Addresses with "block" often won't work or won't give you the correct lat/lon. We need to remove the word "block" before geocoding.
%% Cell type:code id:54103e4a-5ac2-4f19-811b-385c02623ed2 tags:
``` python
gpd.tools.geocode("800 block W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:66072f4b-2286-4d66-ba2c-18ccd2e37ed6 tags:
``` python
gpd.tools.geocode("800 W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:cf982302 tags:
``` python
fixed_addrs = addrs.str.replace(" block ", " ") + ", Madison, Wisconsin"
fixed_addrs
```
%% Cell type:markdown id:d6a267ed-6876-4866-a66f-74390a3d4ee1 tags:
#### Using a different provider than the default one
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/lon
- We will be using "OpenStreetMap", for which the argument is "nominatim"
- We also need to pass an argument to the `user_agent` parameter, indicating where the request is coming from; for example: "cs320bot"
%% Cell type:code id:ab0e699f tags:
``` python
incidents = gpd.tools.geocode(fixed_addrs, provider="nominatim", user_agent="cs320bot")
incidents
```
%% Cell type:markdown id:4ad73b2c-5171-45c2-a3f0-eef30f93c492 tags:
It is often a good idea to drop NA values, although in this version of the example there are no failed geocodings.
%% Cell type:code id:41a7d12e-73be-4442-b43c-285229e6cfdb tags:
``` python
incidents = incidents.dropna()
incidents
```
%% Cell type:markdown id:b30733e6-9491-4be0-bf20-bba64903334d tags:
#### Self-practice
If you want practice with regex, try writing a regular expression and using the match result to make sure that "Madison" and "Wisconsin" are part of each address.
%% Cell type:code id:843bbba2-3de5-4cc0-b76b-92e7bebdd379 tags:
``` python
# self-practice
for addr in incidents["address"]:
print(addr)
```
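One possible sketch for this self-practice; the addresses below are made-up examples rather than real geocoder output:

``` python
import re

import pandas as pd

# hypothetical geocoded addresses (illustrative only)
addresses = pd.Series([
    "1300, East Washington Avenue, Madison, Dane County, Wisconsin, 53703, United States",
    "800, West Johnson Street, Madison, Dane County, Wisconsin, 53706, United States",
    "1600 Pennsylvania Avenue NW, Washington, District of Columbia",
])

# "Madison" must appear somewhere before "Wisconsin"
pattern = r"Madison.*Wisconsin"
in_madison = addresses.apply(lambda a: re.search(pattern, a) is not None)
print(in_madison.tolist())  # [True, True, False]
```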
%% Cell type:code id:1a04c2b0 tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
incidents.to_crs(city.crs).plot(ax=ax, color="k", label="Incidents")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
......
%% Cell type:markdown id:9e33d56e tags:
# Regression 1
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
# new import statements
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
```
%% Cell type:markdown id:570b4253 tags:
#### Covid deaths analysis
- Source: https://github.com/cs320-wisc/s22/tree/main/lec/29%20Regression%201
- Specifically, let's analyze "COVID-19 Data by Census Tract V2"
- Status Flag Values: -999: Census tracts, municipalities, school districts, and zip codes with 0–4 aggregate counts for any data have been suppressed. County data with 0-4 aggregate counts by demographic factors (e.g., by age group, race, ethnicity) have been suppressed.
%% Cell type:code id:696eec18 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df = gpd.read_file(dataset_file)
```
%% Cell type:code id:c3e6454f tags:
``` python
df.head()
```
%% Cell type:code id:905da51e tags:
``` python
# Explore the columns
df.columns
```
%% Cell type:code id:a3434aba tags:
``` python
# Create a geographic plot
df.plot()
```
%% Cell type:markdown id:e3e73632 tags:
### Predicting "DTH_CUM_CP"
%% Cell type:markdown id:43cebffa tags:
### How can we get a clean dataset of COVID deaths in WI?
%% Cell type:code id:fa2f30ae tags:
``` python
# Replace -999 with 2; 2 is between 0-4; random choice instead of using 0
df = df.replace(-999, 2)
# we must communicate in final results what percent of values were guessed (imputed)
```
%% Cell type:markdown id:4cff709d tags:
How would we know if the data is now clean?
%% Cell type:code id:950c2041 tags:
``` python
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df.plot.scatter(x="POP", y="DTH_CUM_CP")
```
%% Cell type:markdown id:4073a940 tags:
Which points are concerning? Let's take a closer look.
#### Which rows have "DTH_CUM_CP" greater than 300?
%% Cell type:code id:a655c465 tags:
``` python
df[df["DTH_CUM_CP"] > 300]
```
%% Cell type:markdown id:d377143e tags:
#### Valid rows have "GEOID" that only contains digits
Using `str` methods to perform filtering: `str.fullmatch` does a full string match given a regex. Because it matches the full string, anchor characters (`^`, `$`) aren't needed.
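A tiny illustration with a made-up Series:

``` python
import pandas as pd

# only the first value consists entirely of digits
ids = pd.Series(["55025001", "55025002x", "TOTAL"])
mask = ids.str.fullmatch(r"\d+")  # no ^/$ anchors needed
print(ids[mask].tolist())  # ['55025001']
```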
%% Cell type:code id:529781db tags:
``` python
```
%% Cell type:code id:af16925b tags:
``` python
df["GEOID"]
```
%% Cell type:code id:1d583d06 tags:
``` python
df = df[df["GEOID"].str.fullmatch(r"\d+")]
df.plot.scatter(x="POP", y="DTH_CUM_CP")
```
%% Cell type:markdown id:1be50600 tags:
### How can we train/fit models to known data to predict unknowns?
- Feature(s) => Predictions
- Population => Deaths
- Cases => Deaths
- Cases by Age => Deaths
- General structure for fitting models:
```python
model = <some model>
model.fit(X, y)
y = model.predict(X)
```
- where `X` needs to be a matrix or a `DataFrame` and `y` needs to be an array (vector) or a `Series`
- after fitting, `model` object instance stores the information about relationship between features (x values) and predictions (y values)
- `predict` returns a `numpy` array, which can be treated like a list
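The general structure can be sketched end-to-end on a tiny synthetic dataset (all numbers here are made up):

``` python
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data: y is exactly 3*x + 1
X = pd.DataFrame({"x": [0, 1, 2, 3, 4]})  # features must be a DataFrame
y = pd.Series([1, 4, 7, 10, 13])          # labels are a Series

model = LinearRegression()
model.fit(X, y)              # learn the relationship
pred = model.predict(X)      # numpy array of predictions
print(pred)
```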
%% Cell type:markdown id:0d0e65c3 tags:
### Predicting "DTH_CUM_CP" using "POP" as feature.
%% Cell type:code id:3dbdbba4 tags:
``` python
# We must specify a list of columns to make sure we extract a DataFrame and not a Series
# Feature DataFrame
df[["POP"]]
```
%% Cell type:code id:22aad05e tags:
``` python
# Label Series: "DTH_CUM_CP"
df["DTH_CUM_CP"]
```
%% Cell type:markdown id:797d6831 tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:51ad5b05 tags:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"
model = LinearRegression()
model.fit(df[xcols], df[ycol])
```
%% Cell type:code id:94c3a0e7-4099-4f46-a8af-b065ae0b6873 tags:
``` python
# Let's now make predictions for the known data
# less interesting because we are predicting what we already know
y = model.predict(df[xcols])
y
```
%% Cell type:markdown id:e589d923 tags:
Predicting for new values of x.
%% Cell type:code id:dd8f0440 tags:
``` python
predict_df = pd.DataFrame({"POP": [1000, 2000, 3000]})
predict_df
```
%% Cell type:code id:5315289f tags:
``` python
# Predict for the new data
model.predict(predict_df)
```
%% Cell type:code id:0cea4830 tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:markdown id:1c649201 tags:
### How can we visualize model predictions?
- Let's predict deaths for "POP" ranges like 0, 1000, 2000, ..., 20000
%% Cell type:code id:496a67c9 tags:
``` python
predict_df = pd.DataFrame({"POP": range(0, 20000, 1000)})
predict_df
```
%% Cell type:code id:b825412a tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:code id:ca0b47f5 tags:
``` python
# Line plot of "POP" vs "predicted deaths" overlaid on a
# scatter plot of the known "POP" vs "DTH_CUM_CP" values
ax = df.plot.scatter(x="POP", y="DTH_CUM_CP")
predict_df.plot.line(x="POP", y="predicted deaths", ax=ax, color="red")
```
%% Cell type:markdown id:3bb2d3f8 tags:
### How can we get a formula for the relationship?
- `y=mx+c`, where `y` is our predictions and `x` are the features used for the fit
- Slope of the line (`m`) given by `model.coef_[0]`
- Intercept of the line (`c`) given by `model.intercept_`
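As a sanity check, the `y=mx+c` formula can be evaluated by hand and compared against `model.predict` (synthetic numbers here, not the covid data):

``` python
import pandas as pd
from sklearn.linear_model import LinearRegression

# points lying exactly on deaths = 0.004 * POP + 1
X = pd.DataFrame({"POP": [1000.0, 2000.0, 4000.0]})
y = pd.Series([5.0, 9.0, 17.0])
model = LinearRegression().fit(X, y)

m, c = model.coef_[0], model.intercept_
manual = m * 3000 + c                                          # plug into y = m*x + c
predicted = model.predict(pd.DataFrame({"POP": [3000.0]}))[0]  # same answer
print(round(manual, 4), round(predicted, 4))
```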
%% Cell type:markdown id:a31d19b3 tags:
Model coefficients
%% Cell type:code id:c894c1b1 tags:
``` python
model
```
%% Cell type:code id:4850a47e tags:
``` python
# Slope of the line
model.coef_
```
%% Cell type:code id:32690085 tags:
``` python
# Intercept of the line
model.intercept_
```
%% Cell type:code id:cfbcee12 tags:
``` python
print(f"deaths ~= {round(model.coef_[0], 4)} * population + {round(model.intercept_, 4)}")
```
%% Cell type:markdown id:738ccba5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:91e4af81 tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:code id:286e2569 tags:
``` python
xcols, ycol
```
%% Cell type:code id:bb36ca0f tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:92dadb4c tags:
``` python
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:ebd81950 tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()`
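Because the score is a ratio of variances, the ddof convention cancels, so a by-hand computation matches sklearn exactly. A small made-up check:

``` python
import numpy as np
import sklearn.metrics

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

known_var = y_true.var()                      # variance of the known values
explained_variance = (y_true - y_pred).var()  # variance of the residuals
manual = (known_var - explained_variance) / known_var

library = sklearn.metrics.explained_variance_score(y_true, y_pred)
print(manual, library)  # both 0.95
```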
%% Cell type:markdown id:36bc3bb5 tags:
What is the variation in known deaths?
%% Cell type:code id:55a3dfcd tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:c33a6fb1 tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:dfb076b1 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:73a55a32 tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:547452da tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the R^2 score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:d16ba67b tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:6e60fed7 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
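With numpy's population variance (`ddof=0`), this identity is exact; pandas `.var()` defaults to `ddof=1`, which causes the minor differences. A small made-up check:

``` python
import numpy as np
import sklearn.metrics

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

known_var = y_true.var()                   # population variance (ddof=0)
r2_val = ((y_true - y_pred) ** 2).mean()   # mean squared error
manual_r2 = (known_var - r2_val) / known_var

library_r2 = sklearn.metrics.r2_score(y_true, y_pred)
print(manual_r2, library_r2)  # both 0.95
```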
%% Cell type:code id:d34ea427 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:b1c3574b tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # there might be minor rounding off differences
```
%% Cell type:markdown id:adc33af9 tags:
#### `model.score(X, y)`
- invokes `predict` method for calculating predictions (`y`) based on features (`X`) and compares the predictions with true values of y
%% Cell type:code id:b3bde089 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Example: https://www.mathworks.com/discovery/overfitting.html
- Split data into train and test
%% Cell type:code id:87a77fb4 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:bdd7cad0 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:2f45dd74 tags:
The problem with manual splitting is that we need to make sure the data is not sorted in some way.
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits it, using a 75%-25% train/test split by default
- produces new train and test data every single time
- `test_size` parameter can take two kinds of values:
    - actual number of rows that we want in the test data
    - fraction representing the proportion of the data placed in the test split
- default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
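A quick demonstration of `test_size`, plus the `random_state` parameter for reproducible shuffling (the DataFrame is synthetic):

``` python
import pandas as pd
from sklearn.model_selection import train_test_split

df_demo = pd.DataFrame({"x": range(100)})

# fractional test_size: 25 of 100 rows go to test
train_demo, test_demo = train_test_split(df_demo, test_size=0.25, random_state=320)
print(len(train_demo), len(test_demo))  # 75 25

# same random_state => identical split on every run
train_again, _ = train_test_split(df_demo, test_size=0.25, random_state=320)
print(train_demo.equals(train_again))  # True
```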
%% Cell type:code id:1d3a86c9 tags:
``` python
len(train), len(test)
```
%% Cell type:code id:49f7dfe8 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:1a29cf9d tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:7934e9ee tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you an entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- does many different train/test splits of the data, fitting and scoring the model on each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
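A runnable sketch on synthetic, near-linear data (the dataset and the explicit `cv=5` are illustrative choices):

``` python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic data: y = 2*x plus small noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x": np.arange(50, dtype=float)})
y = 2.0 * X["x"] + rng.normal(0, 0.1, size=50)

# five fit/score rounds, each holding out a different fifth of the data
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(len(scores), scores.mean())
```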
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(model, train[xcols], train[ycol])
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bars by passing an argument to the yerr parameter
means.plot.bar(figsize=(3, 3), yerr=err)
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
%% Cell type:code id:2538534d tags:
``` python
model = LinearRegression()
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
model.coef_
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_).plot.bar(figsize=(3, 2))
```
%% Cell type:code id:a606f455-c8b4-4d93-a2f4-737c4e799fa0 tags:
``` python
```
......
%% Cell type:markdown id:9e33d56e tags:
# Regression 2
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# new import statements
from sklearn.linear_model import LinearRegression
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
import numpy as np
```
%% Cell type:code id:696eec18 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df = gpd.read_file(dataset_file)
```
%% Cell type:markdown id:b37d6f28-fabd-4418-8c85-8e7cd24331f5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:a2bf3bc0-03b8-4c41-93a9-bef8a76bf3de tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:markdown id:c7fbd149-0c50-46ec-bfb1-f33e9fece60e tags:
### Predicting "DTH_CUM_CP"
%% Cell type:code id:cebb790b-9374-4d02-afe0-40e1efc8f176 tags:
``` python
df = df.replace(-999, 2)
# Removing outlier point
df = df[df["GEOID"].str.fullmatch(r"\d+")]
```
%% Cell type:code id:f4a4dd37-d7d3-41ca-8319-1442b56b2a24 tags:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"
```
%% Cell type:markdown id:87c4b295-39ca-445c-865d-4829d77f55ce tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:78c2c891-caed-4359-bc91-a97de21db812 tags:
``` python
model = LinearRegression()
model.fit(df[xcols], df[ycol])
```
%% Cell type:code id:7787a889-c693-40b3-8e66-9a266bb0c683 tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:7952f67c-57b3-4ca8-bb42-b6287f012157 tags:
``` python
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:39caf7a5-c782-4c02-8c3e-8385eddb108f tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()` (despite the name, this second quantity is the *residual* variance — the variance left unexplained by the model)
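A quick self-contained check of this formula on toy numbers (plain Python, no sklearn; the `y_pred` values are just made-up predictions):

``` python
def var(xs):
    # sample variance (ddof=1), matching pandas Series.var()
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]   # hypothetical predictions

known_var = var(y_true)
residual_var = var([t - p for t, p in zip(y_true, y_pred)])
score = (known_var - residual_var) / known_var
print(round(score, 4))   # → 0.95
```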
%% Cell type:markdown id:908b8af9-6ba1-4e92-9569-36168b253fae tags:
What is the variation in known deaths?
%% Cell type:code id:5fc126c4-de9a-4a35-abe6-8d5820227f6f tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:a2f489d5-1325-4870-b6f8-d2b7949f12cf tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:e049b2bd-35e8-4a75-9d6d-d70c2b998541 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:ced64de4-d1dc-464e-8855-3341a8f3b15a tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:e85ad41f-8d4b-471f-b689-333346d7d660 tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the R^2 score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:557a7646-7843-4956-8d72-9104aa7113de tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:3ceeca6f-e7d4-47fa-b53e-8c3bbba41594 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
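A toy-number check of why this is only an approximation: `known_var` uses the sample variance (dividing by `n - 1`) while `r2_val` divides by `n`, so the result differs slightly from the exact R^2 (the `y_pred` values are made up):

``` python
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]   # hypothetical predictions
n = len(y_true)
mean = sum(y_true) / n

known_var = sum((t - mean) ** 2 for t in y_true) / (n - 1)       # ddof=1
r2_val = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n   # MSE
approx = (known_var - r2_val) / known_var
print(round(approx, 4))   # → 0.9625

# the exact R^2 compares sums of squares directly: 1 - SS_res / SS_tot
exact = 1 - sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / \
            sum((t - mean) ** 2 for t in y_true)
print(round(exact, 4))    # → 0.95
```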
%% Cell type:code id:c1ea5c04-3dd7-412c-9cbf-32f4a69bdb20 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:e3e83156-8e02-4216-9738-bed6b0052608 tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # close to sklearn's value; the small gap comes from known_var using the sample variance (ddof=1)
```
%% Cell type:markdown id:d1795241-d9d7-42c5-8bf4-614ec7d972d7 tags:
#### `model.score(X, y)`
- invokes the `predict` method to calculate predictions from the features (`X`), then compares those predictions against the true values `y`, returning the R^2 score
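For a regression estimator, `score` returns the same number as calling `sklearn.metrics.r2_score` on the model's own predictions. A minimal sketch on synthetic data (assumes scikit-learn and numpy are installed):

``` python
import numpy as np
import sklearn.metrics
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.0, 8.1, 9.9])   # roughly y = 2x

model = LinearRegression()
model.fit(X, y)

by_score = model.score(X, y)                            # predicts internally
by_metric = sklearn.metrics.r2_score(y, model.predict(X))
print(np.isclose(by_score, by_metric))   # → True
```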
%% Cell type:code id:a50b84d0-5c1b-4f84-9f41-d6157057d645 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Split data into train and test
%% Cell type:code id:80cef390-831c-443a-9149-9141c8d34ef2 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:68c16194-1c7b-4f3a-81a3-4b0352f36872 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data, then splits it into train and test (75%-25% by default)
- produces new train and test data every single time (unless `random_state` is fixed)
- `test_size` parameter can take two kinds of values:
    - actual number of rows that we want in the test data
    - fraction of the dataset to put in the test split
    - default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id:5c3de0c2 tags:
``` python
xcols, ycol
```
%% Cell type:code id:c52c86c4-8afe-4d49-97a5-e2fcb18242c5 tags:
``` python
train, test = train_test_split(df)
len(train), len(test)
```
%% Cell type:code id:b3ccdf1a-0958-402c-80e1-2eed12abdf88 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:adeea246-f80a-46c2-b62b-9e95bef5d02f tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:ba4c6310-0f35-44ac-9250-79b6eb04f7cc tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you an entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- do many different train/test splits of the values, fitting and scoring the model across each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
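Under the hood, for a regressor `cross_val_score(model, X, y, cv=5)` fits on four folds and scores on the held-out fold, five times. A sketch of the equivalence on synthetic data (assumes scikit-learn and numpy):

``` python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=40)

auto = cross_val_score(LinearRegression(), X, y, cv=5)   # 5 fits, 5 R^2 scores

# the same five fits/scores done by hand
manual = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    m = LinearRegression().fit(X[train_idx], y[train_idx])
    manual.append(m.score(X[val_idx], y[val_idx]))

print(np.allclose(auto, manual))   # → True
```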
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(model, train[xcols], train[ycol])
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bars by passing the err Series to the yerr parameter
means.plot.bar(figsize=(3, 3), yerr=err)
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
- Features: Positive cases per age range
%% Cell type:code id:d3016079 tags:
``` python
df.columns
```
%% Cell type:code id:2538534d tags:
``` python
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model = LinearRegression()
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
model.coef_
```
%% Cell type:code id:98fa5442 tags:
``` python
pd.Series(model.coef_)
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_, index=xcols).plot.bar(figsize=(3, 2))
```
%% Cell type:markdown tags:
### Pipeline, Transformer and Estimator
%% Cell type:markdown id:93be2ad2 tags:
<div>
<img src="attachment:Simple_model.png" width="300"/>
</div>
%% Cell type:markdown id:c854dab5 tags:
### Lake Michigan Waves
Source: https://data.cityofchicago.org/Parks-Recreation/Beach-Water-Quality-Automated-Sensors/qmqz-2xku
%% Cell type:code id:a3109957 tags:
``` python
# Read the "waves.csv" file
df = pd.read_csv("waves.csv")
df.head()
```
%% Cell type:code id:18d71eb8 tags:
``` python
df.columns
```
%% Cell type:markdown id:d3f9dc12 tags:
#### Can we predict wave height based on wave period (time between waves)?
%% Cell type:code id:149aa67c tags:
``` python
df = df[(df["Wave Period"] > 0) & (df["Wave Height"] > 0)]
df.plot.scatter(x="Wave Period", y="Wave Height", color="k", alpha=0.05)
```
%% Cell type:markdown id:4da30be1 tags:
Observation: non-linear relationship => fitting a straight line will not work.
%% Cell type:markdown id:62bd175c tags:
#### Can we predict wave height based on beach name and wave period?
%% Cell type:code id:3f045b08 tags:
``` python
beach_names = sorted(set(df["Beach Name"]))
beach_names
```
%% Cell type:code id:bc626c9d tags:
``` python
fig, axes = plt.subplots(2, 3, figsize=(12, 8), sharey=True)
plt.subplots_adjust(hspace=0.3)
axes = list(axes.reshape(-1))
for b in beach_names:
ax = axes.pop(0)
ax.set_title(b)
beach_df = df[df["Beach Name"] == b]
beach_df.plot.scatter(x="Wave Period", y="Wave Height",
color="k", alpha=0.1, ax=ax)
```
%% Cell type:markdown id:ed02085d tags:
Observation: which beach it is (a categorical feature) matters for wave height.
%% Cell type:markdown id:ceebe2c1 tags:
### Four Models
1. wave period (linear)
2. wave period (polynomial)
3. beach
4. beach and wave period
%% Cell type:markdown id:2d4fe892 tags:
`train_test_split(<dataframe>, random_state=<some number>)`
- `random_state` enables us to control the randomization
- when we pass the same number, we will get the same training and test data (pseudo-randomization)
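A quick demonstration that the same `random_state` reproduces the same split (toy DataFrame; assumes scikit-learn and pandas):

``` python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(100)})
train_a, test_a = train_test_split(toy, random_state=320)
train_b, test_b = train_test_split(toy, random_state=320)

# identical seed => identical rows, in the same order
print(train_a.index.equals(train_b.index))   # → True
```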
%% Cell type:code id:a454b7ec tags:
``` python
train, test = train_test_split(df, random_state=320)
train.head()
```
%% Cell type:markdown id:d6f3871a tags:
#### Model 1: Wave Period (Linear)
%% Cell type:markdown id:4c15ac91 tags:
`cross_val_score(estimator, X, y, cv=<number>)`
- `cv` enables us to specify how many folds we want for the cross-validation
%% Cell type:code id:199a5ca6 tags:
``` python
xcols = ["Wave Period"]
ycol = "Wave Height"
m1 = LinearRegression()
scores = cross_val_score(m1, train[xcols], train[ycol], cv=10)
scores
```
%% Cell type:code id:15bec5d5 tags:
``` python
scores.mean()
```
%% Cell type:markdown id:aa5a272e tags:
We want the mean score to be large.
%% Cell type:code id:306e21bd tags:
``` python
scores.std()
```
%% Cell type:markdown id:ea190e50 tags:
We want the standard deviation to be small, to make sure that our model isn't too sensitive to changes in training and test data.
%% Cell type:markdown id:b126b01f tags:
#### Model 2: Wave Period (Polynomial)
LinearRegression can do this:
```
y = 3*x1 + 5*x2
```
It CANNOT do this:
```
y = 3*x + 5*x^2
```
TRICK:
```
x1 = x
x2 = x^2
```
<div>
<img src="attachment:Pipeline.png" width="500"/>
</div>
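The trick in code: stack x and x^2 as two input columns, and LinearRegression recovers the curve's coefficients (synthetic, noise-free data; assumes scikit-learn and numpy):

``` python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 4, 50)
y = 3 * x + 5 * x ** 2             # the non-linear curve we want to fit

X = np.column_stack([x, x ** 2])   # x1 = x, x2 = x^2
model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 6))    # → [3. 5.]
```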
%% Cell type:code id:5b6f8a12 tags:
``` python
# Let's make a copy before we add new columns
# Recommendation: don't change the data referred to by a specific variable
# throughout your code
train2 = train[xcols].copy()
train2.head()
```
%% Cell type:markdown id:6be8df92 tags:
Manually adding columns for x^2, x^3, etc.:
%% Cell type:code id:54f8d249 tags:
``` python
train2["Wave Period ^ 2"] = train2["Wave Period"] ** 2
train2["Wave Period ^ 3"] = train2["Wave Period"] ** 3
train2["sqrt(Wave Period)"] = train2["Wave Period"] ** 0.5
train2.head()
```
%% Cell type:markdown id:155f7e7a tags:
#### `PolynomialFeatures(degree=<val>, include_bias=False)`
- `degree` specifies the highest polynomial degree we want
- `include_bias` default value is True, which adds a column of 1s - we typically don't use that
- returns an object instance on which we can invoke `fit` and `transform`:
    - `transform(X)`: transform data to polynomial features
    - `fit_transform(X[, y])`: fit to data, then transform it
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
%% Cell type:code id:e6c43483 tags:
``` python
pf = PolynomialFeatures(degree=4, include_bias=False)
```
%% Cell type:code id:6f81a519 tags:
``` python
# Fit the data
pf.fit(train[xcols])
```
%% Cell type:code id:8d6d2e1c tags:
``` python
# Transform the data
pf.transform(train[xcols])
```
%% Cell type:code id:ecfc2493 tags:
``` python
# Convert result of transformation into a DataFrame - step 1
pd.DataFrame(pf.transform(train[xcols]))
```
%% Cell type:markdown id:129b21dc tags:
How can we add meaningful column names? `pf.get_feature_names_out()`
%% Cell type:code id:1ab1eaa3 tags:
``` python
pf.get_feature_names_out()
```
%% Cell type:code id:2d62b3d3 tags:
``` python
# Convert result of transformation into a DataFrame - step 2
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out())
```
%% Cell type:code id:4beebe68 tags:
``` python
# Putting all the steps together
pf = PolynomialFeatures(degree=4, include_bias=False)
pf.fit(train[xcols])
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:29d11ffb tags:
`fit_transform(X[, y])`: fit to data, then transform it.
%% Cell type:code id:91c3a781 tags:
``` python
new_data = pf.fit_transform(train[xcols])
pd.DataFrame(new_data, columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:86de680e tags:
#### `Pipeline(...)`
- Argument: list of steps in the pipeline:
- each step represented as a tuple with name of the step and the object instance
- last step will be the estimator
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
%% Cell type:code id:bb509930 tags:
``` python
m2 = Pipeline([
    ("pf", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression())
])
m2
```
%% Cell type:code id:ccf76663 tags:
``` python
scores = cross_val_score(m2, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:b7acca2c tags:
**Conclusion:** mean R^2 score increased from 0.0029 (linear model) to 0.0489 (polynomial) - not bad!
%% Cell type:markdown id:7dff1435 tags:
#### Model 3: Beach Name (Categorical)
%% Cell type:code id:76a31355 tags:
``` python
train["Beach Name"].unique()
```
%% Cell type:markdown id:61644774 tags:
Naive way of assigning numerical values to a categorical column:
- `Ohio Street Beach`: 1
- `Calumet Beach`: 2
- `Rainbow Beach`: 3, etc.
- Problem: `Calumet Beach` would be treated as the average of `Ohio Street Beach` and `Rainbow Beach` => this doesn't make any sense!
%% Cell type:markdown id:41c6be06 tags:
#### `OneHotEncoder()`
- encodes categorical features as a one-hot numeric array
- returns a "sparse matrix", which needs to be explicitly converted into an array using the `toarray()` method before `DataFrame` conversion
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
%% Cell type:code id:55ec83e1-2800-41a2-80ce-f05a1de54c8a tags:
``` python
xcols = ["Beach Name"]
```
%% Cell type:code id:ef069c7b tags:
``` python
oh = OneHotEncoder()
oh.fit(train[xcols])
```
%% Cell type:code id:5cb24e10 tags:
``` python
# fit_transform: fit to the data, then transform it
oh.fit_transform(train[xcols])
```
%% Cell type:code id:d5303636 tags:
``` python
# All steps
oh = OneHotEncoder()
pd.DataFrame(oh.fit_transform(train[xcols]).toarray(), columns=oh.get_feature_names_out())
```
%% Cell type:code id:625a8edc tags:
``` python
m3 = Pipeline([
("oh", OneHotEncoder()),
("lr", LinearRegression())
])
m3
```
%% Cell type:code id:5494c1fa tags:
``` python
scores = cross_val_score(m3, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:2ce15dac tags:
**Conclusion:** mean R^2 score is slightly lower than 0.0489 (polynomial).
%% Cell type:markdown id:e4bf025e tags:
<div>
<img src="attachment:ColumnTransformer.png" width="500"/>
</div>
%% Cell type:markdown id:0f5a1a98 tags:
#### `make_column_transformer(...)`
- Arguments: the transformations to apply
    - each transformer argument is a `tuple` with the object instance as the first item and the list of feature columns as the second
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
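A small sketch of how the two transformers' outputs get concatenated side by side (toy frame with made-up rows; assumes scikit-learn and pandas):

``` python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

toy = pd.DataFrame({"Beach Name": ["Calumet Beach", "Rainbow Beach", "Calumet Beach"],
                    "Wave Period": [2.0, 3.0, 4.0]})

trans = make_column_transformer(
    (PolynomialFeatures(degree=2, include_bias=False), ["Wave Period"]),
    (OneHotEncoder(), ["Beach Name"]),
)
out = trans.fit_transform(toy)
# 2 polynomial columns (x, x^2) + 2 one-hot columns = 4 columns
print(out.shape)   # → (3, 4)
```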
%% Cell type:markdown id:a41f509f tags:
#### Model 4: Beach Name (Categorical) and Wave Period (Polynomial)
%% Cell type:code id:40378fb3 tags:
``` python
custom_trans = make_column_transformer(
(PolynomialFeatures(), ["Wave Period"]),
(OneHotEncoder(), ["Beach Name"]),
)
custom_trans
```
%% Cell type:code id:2d6e0ba2 tags:
``` python
m4 = Pipeline([
("transformers", custom_trans),
("lr", LinearRegression()),
])
m4
```
%% Cell type:code id:4248010b tags:
``` python
xcols = ["Beach Name", "Wave Period"]
```
%% Cell type:code id:57b9ffc8 tags:
``` python
scores = cross_val_score(m4, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:73ade800 tags:
**Conclusion:** mean R^2 score increased to 0.0885 when compared to 0.0489 (polynomial).
%% Cell type:markdown id:247947ef tags:
### Let's evaluate this model by running it against the test data
%% Cell type:code id:e7e58157 tags:
``` python
m4.fit(train[xcols], train[ycol])
m4.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:cd689ae1 tags:
How to extract `coef_` for this `Pipeline`?
%% Cell type:code id:50c7e102 tags:
``` python
m4["transformers"].get_feature_names_out()
```
%% Cell type:code id:28ccbf92 tags:
``` python
m4["lr"].coef_
```
%% Cell type:code id:09bc63d9-ab2b-4723-94cf-94ebac037d8d tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.bar()
```
%% Cell type:code id:acd00dc3 tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.barh()
```
%% Cell type:markdown id:279ae068-9fc8-45d3-9bcf-c9ae9252f19f tags:
What are the beaches that have the biggest waves?
%% Cell type:code id:2ff1d03b-48bb-4150-bc27-42f5064f88df tags:
``` python
s.sort_values(ascending=False)
```
......
%% Cell type:markdown id:9e33d56e tags:
# Regression 2
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# new import statements
from sklearn.linear_model import LinearRegression
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
import numpy as np
```
%% Cell type:code id:696eec18 tags:
%% Cell type:markdown id:8d0baac0-fcc2-4cdf-9276-9db86ffacd10 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df = gpd.read_file(dataset_file)
```
%% Cell type:markdown id:b37d6f28-fabd-4418-8c85-8e7cd24331f5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:a2bf3bc0-03b8-4c41-93a9-bef8a76bf3de tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:markdown id:c7fbd149-0c50-46ec-bfb1-f33e9fece60e tags:
### Predicting "DTH_CUM_CP"
%% Cell type:code id:cebb790b-9374-4d02-afe0-40e1efc8f176 tags:
``` python
df = df.replace(-999, 2)
# Removing outlier point
df = df[df["GEOID"].str.fullmatch(r"\d+")]
```
%% Cell type:code id:f4a4dd37-d7d3-41ca-8319-1442b56b2a24 tags:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"
```
%% Cell type:markdown id:87c4b295-39ca-445c-865d-4829d77f55ce tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:78c2c891-caed-4359-bc91-a97de21db812 tags:
``` python
model = LinearRegression()
model.fit(df[xcols], df[ycol])
```
%% Cell type:code id:7787a889-c693-40b3-8e66-9a266bb0c683 tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:7952f67c-57b3-4ca8-bb42-b6287f012157 tags:
``` python
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:39caf7a5-c782-4c02-8c3e-8385eddb108f tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()`
%% Cell type:markdown id:908b8af9-6ba1-4e92-9569-36168b253fae tags:
What is the variation in known deaths?
%% Cell type:code id:5fc126c4-de9a-4a35-abe6-8d5820227f6f tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:a2f489d5-1325-4870-b6f8-d2b7949f12cf tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:e049b2bd-35e8-4a75-9d6d-d70c2b998541 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:ced64de4-d1dc-464e-8855-3341a8f3b15a tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:e85ad41f-8d4b-471f-b689-333346d7d660 tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:557a7646-7843-4956-8d72-9104aa7113de tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:3ceeca6f-e7d4-47fa-b53e-8c3bbba41594 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
%% Cell type:code id:c1ea5c04-3dd7-412c-9cbf-32f4a69bdb20 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:e3e83156-8e02-4216-9738-bed6b0052608 tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # there might be minor rounding off differences
```
%% Cell type:markdown id:d1795241-d9d7-42c5-8bf4-614ec7d972d7 tags:
#### `model.score(X, y)`
- invokes `predict` method for calculating predictions (`y`) based on features (`X`) and compares the predictions with true values of y
%% Cell type:code id:a50b84d0-5c1b-4f84-9f41-d6157057d645 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Split data into train and test
%% Cell type:code id:80cef390-831c-443a-9149-9141c8d34ef2 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:68c16194-1c7b-4f3a-81a3-4b0352f36872 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits based on 75%-25% split between train and test
- produces new train and test data every single time
- `test_size` parameter can take two kind of values:
- actual number of rows that we want in test data
- fractional number representing the ratio of train versus test data
- default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id:5c3de0c2 tags:
``` python
xcols, ycol
```
%% Cell type:code id:c52c86c4-8afe-4d49-97a5-e2fcb18242c5 tags:
``` python
train, test = train_test_split(df)
len(train), len(test)
```
%% Cell type:code id:b3ccdf1a-0958-402c-80e1-2eed12abdf88 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:adeea246-f80a-46c2-b62b-9e95bef5d02f tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:ba4c6310-0f35-44ac-9250-79b6eb04f7cc tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- do many different train/test splits of the values, fitting and scoring the model across each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(???)
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bar by passing argument to paramenter yerr
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
- Features: Positive cases per age range
%% Cell type:code id:d3016079 tags:
``` python
df.columns
```
%% Cell type:code id:2538534d tags:
``` python
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model = LinearRegression()
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
```
%% Cell type:code id:98fa5442 tags:
``` python
pd.Series(model.coef_)
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_, index=???).plot.bar(figsize=(3, 2))
```
### Pipeline, Transformer and Estimator
%% Cell type:markdown id:93be2ad2 tags:
<div>
<img src="attachment:Simple_model.png" width="300"/>
</div>
%% Cell type:markdown id:c854dab5 tags:
### Lake Michigan Waves
Source: https://data.cityofchicago.org/Parks-Recreation/Beach-Water-Quality-Automated-Sensors/qmqz-2xku
%% Cell type:code id:a3109957 tags:
``` python
# Read the "waves.csv" file
df = pd.read_csv("waves.csv")
df.head()
```
%% Cell type:code id:18d71eb8 tags:
``` python
df.columns
```
%% Cell type:markdown id:d3f9dc12 tags:
#### Can we predict wave height based on wave period (time between waves)?
%% Cell type:code id:149aa67c tags:
``` python
df = df[(df["Wave Period"] > 0) & (df["Wave Height"] > 0)]
df.plot.scatter(x="Wave Period", y="Wave Height", color="k", alpha=0.05)
```
%% Cell type:markdown id:4da30be1 tags:
Observation: non-linear relationship => fitting straight line will not work.
%% Cell type:markdown id:62bd175c tags:
#### Can we predict wave height based on beach name and wave period?
%% Cell type:code id:3f045b08 tags:
``` python
beach_names = sorted(set(df["Beach Name"]))
beach_names
```
%% Cell type:code id:bc626c9d tags:
``` python
fig, axes = plt.subplots(2, 3, figsize=(12, 8), sharey=True)
plt.subplots_adjust(hspace=0.3)
axes = list(axes.reshape(-1))
for b in beach_names:
ax = axes.pop(0)
ax.set_title(b)
beach_df = df[df["Beach Name"] == b]
beach_df.plot.scatter(x="Wave Period", y="Wave Height",
color="k", alpha=0.1, ax=ax)
```
%% Cell type:markdown id:ed02085d tags:
Obversation: which beach (categorical feature) is important.
%% Cell type:markdown id:ceebe2c1 tags:
### Four Models
1. wave period (linear)
2. wave period (polynomial)
3. beach
4. beach and wave period
%% Cell type:markdown id:2d4fe892 tags:
`train_test_split(<dataframe>, random_state=<some number>)`
- `random_state` enables us to control the randomization
- when we pass the same number, we get the same training and test split (pseudo-randomization)
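A minimal sketch (on a hypothetical toy DataFrame, not the waves data) of what `random_state` guarantees: two calls with the same seed select exactly the same rows.

%% Cell type:markdown id:a1f20c77 tags:
A minimal sketch (on a hypothetical toy DataFrame, not the waves data) of what `random_state` guarantees: two calls with the same seed select exactly the same rows.
%% Cell type:code id:b2e31d88 tags:
``` python
# Hypothetical demo: fixing random_state makes the split reproducible
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(8)})
a_train, a_test = train_test_split(toy, random_state=320)
b_train, b_test = train_test_split(toy, random_state=320)
# same seed => identical row selection (default test_size is 0.25)
print(a_train.index.equals(b_train.index))  # True
```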
%% Cell type:code id:a454b7ec tags:
``` python
train, test = train_test_split(df, random_state=320)
train.head()
```
%% Cell type:markdown id:d6f3871a tags:
#### Model 1: Wave Period (Linear)
%% Cell type:markdown id:4c15ac91 tags:
`cross_val_score(estimator, X, y, cv=<number>)`
- `cv` lets us specify how many folds we want for the cross-validation
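A quick self-contained sketch (synthetic data, hypothetical variable names): `cross_val_score` fits the estimator once per fold and returns one R^2 score per fold.

%% Cell type:markdown id:c3d42e99 tags:
A quick self-contained sketch (synthetic data, hypothetical variable names): `cross_val_score` fits the estimator once per fold and returns one R^2 score per fold.
%% Cell type:code id:d4e53faa tags:
``` python
# Hypothetical demo: 5-fold cross-validation on nearly-linear synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2 * X_demo[:, 0] + rng.normal(0, 0.1, size=100)
demo_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(len(demo_scores), demo_scores.mean())  # 5 scores, mean close to 1
```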
%% Cell type:code id:199a5ca6 tags:
``` python
xcols = ["Wave Period"]
ycol = "Wave Height"
m1 = LinearRegression()
scores = cross_val_score(m1, train[xcols], train[ycol], cv=10)
scores
```
%% Cell type:code id:15bec5d5 tags:
``` python
scores.mean()
```
%% Cell type:markdown id:aa5a272e tags:
We want the mean score to be large.
%% Cell type:code id:306e21bd tags:
``` python
scores.std()
```
%% Cell type:markdown id:ea190e50 tags:
We want the standard deviation to be small, to make sure that our model isn't too sensitive to changes in training and test data.
%% Cell type:markdown id:b126b01f tags:
#### Model 2: Wave Period (Polynomial)
LinearRegression can do this:
```
y = 3*x1 + 5*x2
```
It CANNOT do this:
```
y = 3*x + 5*x^2
```
TRICK:
```
x1 = x
x2 = x^2
```
<div>
<img src="attachment:Pipeline.png" width="500"/>
</div>
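The trick above can be sketched on synthetic data (hypothetical variable names): once x^2 is added as its own column, plain `LinearRegression` recovers the quadratic exactly.

%% Cell type:markdown id:e5f64abb tags:
The trick above can be sketched on synthetic data (hypothetical variable names): once x^2 is added as its own column, plain `LinearRegression` recovers the quadratic exactly.
%% Cell type:code id:f6a75bcc tags:
``` python
# Hypothetical demo of the trick: x1 = x, x2 = x^2 as two separate columns
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

xs = np.linspace(-2, 2, 50)
features = pd.DataFrame({"x1": xs, "x2": xs ** 2})
ys = 3 * xs + 5 * xs ** 2           # the curve a plain line cannot fit
lr_demo = LinearRegression().fit(features, ys)
print(lr_demo.coef_)                # recovers [3, 5]
```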
%% Cell type:code id:5b6f8a12 tags:
``` python
# Let's make a copy before we add new columns
# Recommendation: don't change the data referred to by a specific variable
# throughout your code
train2 = train[xcols].copy()
train2.head()
```
%% Cell type:markdown id:6be8df92 tags:
Manually adding columns for x^2, x^3, etc.,
%% Cell type:code id:54f8d249 tags:
``` python
train2["Wave Period ^ 2"] = train2["Wave Period"] ** 2
train2["Wave Period ^ 3"] = train2["Wave Period"] ** 3
train2["sqrt(Wave Period"] = train2["Wave Period"] ** 0.5
train2["sqrt(Wave Period)"] = train2["Wave Period"] ** 0.5
train2.head()
```
%% Cell type:markdown id:155f7e7a tags:
#### `PolynomialFeatures(degree=<val>, include_bias=False)`
- `degree` lets us specify the polynomial degree we need
- `include_bias` default value is True, which will add a column of 1s - we typically don't use that.
- returns an object instance on which we can invoke `fit` and `transform`:
    - `transform(X)`: transform data to polynomial features
    - `fit_transform(X[, y])`: fit to data, then transform it
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
%% Cell type:code id:e6c43483 tags:
``` python
pf = PolynomialFeatures(degree=4, include_bias=False)
```
%% Cell type:code id:6f81a519 tags:
``` python
# Fit the data
pf.fit(train[xcols])
```
%% Cell type:code id:8d6d2e1c tags:
``` python
# Transform the data
pf.transform(train[xcols])
```
%% Cell type:code id:ecfc2493 tags:
``` python
# Convert result of transformation into a DataFrame - step 1
pd.DataFrame(pf.transform(train[xcols]))
```
%% Cell type:markdown id:129b21dc tags:
How can we add meaningful column names? `pf.get_feature_names_out()`
%% Cell type:code id:1ab1eaa3 tags:
``` python
pf.get_feature_names_out()
```
%% Cell type:code id:2d62b3d3 tags:
``` python
# Convert result of transformation into a DataFrame - step 2
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out())
```
%% Cell type:code id:4beebe68 tags:
``` python
# Putting all the steps together
pf = PolynomialFeatures(degree=4, include_bias=False)
pf.fit(train[xcols])
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:29d11ffb tags:
`fit_transform(X[, y])`: fit to data, then transform it.
%% Cell type:code id:91c3a781 tags:
``` python
new_data = pf.fit_transform(train[xcols])
pd.DataFrame(new_data, columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:86de680e tags:
#### `Pipeline(...)`
- Argument: list of steps in the pipeline:
- each step represented as a tuple with name of the step and the object instance
- last step will be the estimator
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
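A minimal self-contained sketch (synthetic quadratic data, hypothetical names) of a two-step pipeline like the one built below: the transformer's output feeds the estimator.

%% Cell type:markdown id:c9da8eff tags:
A minimal self-contained sketch (synthetic quadratic data, hypothetical names) of a two-step pipeline like the one built below: the transformer's output feeds the estimator.
%% Cell type:code id:daeb9f00 tags:
``` python
# Hypothetical demo: transformer + estimator chained in one Pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X_q = np.linspace(0, 4, 40).reshape(-1, 1)
y_q = 1 + 2 * X_q[:, 0] ** 2
pipe_demo = Pipeline([
    ("pf", PolynomialFeatures(degree=2, include_bias=False)),  # transformer step
    ("lr", LinearRegression()),                                # estimator (last step)
])
pipe_demo.fit(X_q, y_q)
print(pipe_demo.score(X_q, y_q))  # ~1.0: the quadratic is captured exactly
```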
%% Cell type:code id:bb509930 tags:
``` python
m2 = Pipeline([
    ("pf", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression()),
])
m2
```
%% Cell type:code id:ccf76663 tags:
``` python
scores = cross_val_score(m2, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:b7acca2c tags:
**Conclusion:** mean R^2 score increased from 0.0029 (linear model) to 0.0489 (polynomial) - not bad!
%% Cell type:markdown id:7dff1435 tags:
#### Model 3: Beach Name (Categorical)
%% Cell type:code id:76a31355 tags:
``` python
train["Beach Name"].unique()
```
%% Cell type:markdown id:61644774 tags:
Naive way of assigning numerical values to a categorical column:
- `Ohio Street Beach`: 1
- `Calumet Beach`: 2
- `Rainbow Beach`: 3, etc.
- Problem: `Calumet Beach` becomes the average of `Ohio Street Beach` and `Rainbow Beach` => this doesn't make any sense!
%% Cell type:markdown id:41c6be06 tags:
#### `OneHotEncoder()`
- encodes categorical features as a one-hot numeric array
- returns a "sparse matrix", which needs to be explicitly converted into an array using the `toarray()` method before `DataFrame` conversion
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
%% Cell type:code id:55ec83e1-2800-41a2-80ce-f05a1de54c8a tags:
``` python
xcols = ["Beach Name"]
```
%% Cell type:code id:ef069c7b tags:
``` python
# Create and fit the encoder
oh = OneHotEncoder()
oh.fit(train[xcols])
```
%% Cell type:code id:5cb24e10 tags:
``` python
# fit_transform: fit to the data, then transform it
oh.fit_transform(train[xcols])
```
%% Cell type:code id:d5303636 tags:
``` python
# All steps
oh = OneHotEncoder()
pd.DataFrame(oh.fit_transform(train[xcols]).toarray(), columns=oh.get_feature_names_out())
```
%% Cell type:code id:625a8edc tags:
``` python
m3 = Pipeline([
("oh", OneHotEncoder()),
("lr", LinearRegression())
])
m3
```
%% Cell type:code id:5494c1fa tags:
``` python
scores = cross_val_score(m3, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:2ce15dac tags:
**Conclusion:** mean R^2 score is slightly lower than 0.0489 (polynomial).
%% Cell type:markdown id:e4bf025e tags:
<div>
<img src="attachment:ColumnTransformer.png" width="500"/>
</div>
%% Cell type:markdown id:0f5a1a98 tags:
#### `make_column_transformer(...)`
- Arguments: the transformations to apply
- each transformer argument is a `tuple` with the transformer instance as the first item and the list of feature columns as the second
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
%% Cell type:markdown id:a41f509f tags:
#### Model 4: Beach Name (Categorical) and Wave Period (Polynomial)
%% Cell type:code id:40378fb3 tags:
``` python
custom_trans = make_column_transformer(
(PolynomialFeatures(), ["Wave Period"]),
(OneHotEncoder(), ["Beach Name"]),
)
custom_trans
```
%% Cell type:code id:2d6e0ba2 tags:
``` python
m4 = Pipeline([
("transformers", custom_trans),
("lr", LinearRegression()),
])
m4
```
%% Cell type:code id:4248010b tags:
``` python
xcols = ["Beach Name", "Wave Period"]
```
%% Cell type:code id:57b9ffc8 tags:
``` python
scores = cross_val_score(m4, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:73ade800 tags:
**Conclusion:** mean R^2 score increased to 0.0885 when compared to 0.0489 (polynomial).
%% Cell type:markdown id:247947ef tags:
### Let's evaluate this model by running it against the test data
%% Cell type:code id:e7e58157 tags:
``` python
m4.fit(train[xcols], train[ycol])
m4.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:cd689ae1 tags:
How to extract `coef_` for this `Pipeline`?
%% Cell type:code id:50c7e102 tags:
``` python
m4["transformers"].get_feature_names_out()
```
%% Cell type:code id:28ccbf92 tags:
``` python
m4["lr"].coef_
```
%% Cell type:code id:09bc63d9-ab2b-4723-94cf-94ebac037d8d tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.bar()
```
%% Cell type:code id:acd00dc3 tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.barh()
```
%% Cell type:markdown id:279ae068-9fc8-45d3-9bcf-c9ae9252f19f tags:
What are the beaches that have the biggest waves?
%% Cell type:code id:2ff1d03b-48bb-4150-bc27-42f5064f88df tags:
``` python
# Sort the coefficients to see which beaches have the biggest waves
s.sort_values(ascending=False)
```