%% Cell type:markdown id:471a762b tags:
# Visualization 2
%% Cell type:markdown id:2478daaa-cb6c-4d73-92af-01ae91e773fe tags:
### Geographic Data / Maps
#### Installation
```bash
pip3 install --upgrade pip
pip3 install geopandas shapely descartes geopy netaddr
sudo apt install -y python3-rtree
```
- `import geopandas as gpd`
- `.shp` => Shapefile
- `gpd.datasets.get_path(<dataset name>)`
  - example: `gpd.datasets.get_path("naturalearth_lowres")`
- `gpd.read_file(<path>)`
%% Cell type:code id:e6f50cc3 tags:
``` python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
import requests
import re
import os
# new import statements
import geopandas as gpd
from shapely.geometry import Point, Polygon, box
```
%% Cell type:code id:257671bc-1e79-47c2-aa7a-1725562d23ef tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets/naturalearth_lowres
```
%% Cell type:code id:e7269706-188a-48e3-882a-ac068170b1af tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets
```
%% Cell type:code id:273ed288 tags:
``` python
# Find the path for "naturalearth_lowres"
path = gpd.datasets.get_path("naturalearth_lowres")
# Read the geojson file for "naturalearth_lowres"
gdf = gpd.read_file("naturalearth_lowres.geojson")
# set index using "name" column
gdf = gdf.set_index("name")
```
%% Cell type:code id:cf4d871c-d6d6-4d99-be96-75faf804d4a7 tags:
``` python
gdf.head()
```
%% Cell type:code id:5c983208-7851-4d25-92e7-3f110039411a tags:
``` python
type(gdf).__mro__
```
%% Cell type:code id:653edec9-8b82-4684-ae82-a9fa399459ce tags:
``` python
# All shapefiles have a column called "geometry"
# Get the "geometry" column
gdf["geometry"]
```
%% Cell type:code id:e2eb071a-b29a-4edd-8747-50b02fd43108 tags:
``` python
type(gdf["geometry"]).__mro__
```
%% Cell type:code id:da5b5783-d78e-460f-bd22-b41e7f24f871 tags:
``` python
# First country's name and geometry
print(gdf.index[0])
gdf["geometry"].iat[0]
```
%% Cell type:code id:8f4da997-567d-47f6-8594-f0a37b92b9e6 tags:
``` python
# Second country's name and geometry
print(gdf.index[1])
gdf["geometry"].iat[1]
```
%% Cell type:code id:d380b32f-c401-4601-b6d3-02ac344b3f08 tags:
``` python
# Geometry for "United States of America"
# gdf.at[<row_index>, <column_name>]
gdf.at["United States of America", "geometry"]
```
%% Cell type:code id:3528f252-36b7-4ca5-9108-0951367837cc tags:
``` python
# Type of Tanzania's geometry
print(gdf.index[1], type(gdf["geometry"].iat[1]))
# Type of United States of America's geometry
print("United States of America", type(gdf.at["United States of America", "geometry"]))
```
%% Cell type:markdown id:47711b72-a24c-4da8-a804-3faf88a5fe8b tags:
- `gdf.plot(figsize=(<width>, <height>), column=<column name>)`
- `ax.set_axis_off()`
%% Cell type:code id:9874c7de-226e-4296-af60-45e091836a19 tags:
``` python
ax = gdf.plot(figsize=(8,4))
```
%% Cell type:code id:f5547cc5-3404-421e-aeed-4b4cf8ee8382 tags:
``` python
# Set facecolor="0.7", edgecolor="black"
ax = gdf.plot(figsize=(8,4), facecolor="0.7", edgecolor="black")
# Turn off the axes
ax.set_axis_off()
```
%% Cell type:code id:58da716e tags:
``` python
# Color the map based on population: column="pop_est", cmap="cool", legend=True
ax = gdf.plot(figsize=(8,4), column="pop_est", cmap="cool", legend=True)
# Turn off the axes
ax.set_axis_off()
```
%% Cell type:markdown id:c7a85431-bdab-46ab-8e29-5f91d8a2e0bc tags:
#### Create a map where countries with >100M people are red, others are gray.
%% Cell type:code id:962a552b-94a6-4690-8018-f82173dd6096 tags:
``` python
# Create a map where countries with >100M people are red, others are gray
# Add a new column called color to gdf and set default value to "lightgray"
gdf["color"] = "lightgray"
# Boolean indexing to set color to red for countries with "pop_est" > 1e8
gdf.loc[gdf["pop_est"] > 1e8, "color"] = "red"
# Create the plot
ax = gdf.plot(figsize=(8,4), color=gdf["color"])
ax.set_axis_off()
```
%% Cell type:markdown id:5a32c52c-509f-4f7f-8dd6-bc7fe53c64fd tags:
### All shapefile geometries are shapely shapes.
%% Cell type:code id:33e92def-a7db-4d90-adc8-a3d01d472ac1 tags:
``` python
type(gdf["geometry"].iat[2])
```
%% Cell type:markdown id:39c95c23 tags:
### Shapely shapes
- `from shapely.geometry import Point, Polygon, box`
- `Polygon([(<x1>, <y1>), (<x2>, <y2>), (<x3>, <y3>), ...])`
- `box(minx, miny, maxx, maxy)`
- `Point(<x>, <y>)`
- `<shapely object>.buffer(<size>)`
- example: `Point(5, 5).buffer(3)` creates a circle
%% Cell type:code id:61716db9 tags:
``` python
triangle = Polygon([(0, 0), (10, 0), (5, 5)])  # example coordinates for a triangle
triangle
```
%% Cell type:code id:6a36021a-8653-4698-a0ba-818c14091d79 tags:
``` python
type(triangle)
```
%% Cell type:code id:bddd958d-6fde-42df-8ae3-459e25fa3f04 tags:
``` python
box1 = box(0, 0, 8, 8)  # example bounds; box is not a type, just a function that creates a box
box1
```
%% Cell type:code id:eb7ade8d-96d2-4770-bce9-b3e35996b0b8 tags:
``` python
type(box1)
```
%% Cell type:code id:e5308c61-b1fa-433b-bb05-485ca7bd23da tags:
``` python
point = Point(2, 2)  # example coordinates
point
```
%% Cell type:code id:5c669619-af78-477d-b807-3e6d99278f2f tags:
``` python
type(point)
```
%% Cell type:code id:f015d27e-8fd8-446f-992f-4f7a88cc582d tags:
``` python
# use buffer to create a circle from a point
circle = point.buffer(3)
circle
```
%% Cell type:code id:39fe5752-8038-46cc-b4a6-320e6a78bbd1 tags:
``` python
type(circle)
```
%% Cell type:code id:800cc48d-c241-4439-9161-ff309e50373f tags:
``` python
triangle_buffer = triangle.buffer(3)
triangle_buffer
```
%% Cell type:code id:6b9478d3-7cc9-4e80-ae84-e56d7bfd0b71 tags:
``` python
type(triangle_buffer)
```
%% Cell type:markdown id:1a340443-d750-43d5-99cd-28e7034ce898 tags:
#### Shapely methods
- `union`: any point that is in either shape (OR)
- `intersection`: any point that is in both shapes (AND)
- `difference`: subtraction
- `intersects`: do they overlap? returns True / False
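A quick numeric sketch of these four methods; the two overlapping squares below are chosen just for illustration:

```python
from shapely.geometry import box

a = box(0, 0, 2, 2)   # 2x2 square, area 4
b = box(1, 1, 3, 3)   # another 2x2 square, overlapping a in a 1x1 corner

print(a.union(b).area)         # OR: 4 + 4 - 1 = 7.0
print(a.intersection(b).area)  # AND: the shared 1x1 corner = 1.0
print(a.difference(b).area)    # a minus the overlap = 3.0
print(a.intersects(b))         # True
```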
%% Cell type:code id:d1c5f9f7 tags:
``` python
# union of triangle and box1:
# any point that is in either shape (OR)
triangle.union(box1)
```
%% Cell type:code id:8a2d3357 tags:
``` python
# intersection of triangle and box1:
# any point that is in both shapes (AND)
triangle.intersection(box1)
```
%% Cell type:code id:f327bb99-be34-4a00-a216-a6f40df48268 tags:
``` python
# difference of triangle and box1
triangle.difference(box1)  # subtraction
```
%% Cell type:code id:9082b54a tags:
``` python
# difference of box1 and triangle
box1.difference(triangle) # subtraction
```
%% Cell type:code id:0def4359-78f6-4f14-80ae-05c25ff64bc5 tags:
``` python
# check whether triangle intersects box1
# that is, do they overlap?
triangle.intersects(box1)
```
%% Cell type:markdown id:51a82cf3-fe28-46d2-a019-d87f6a0f41b6 tags:
Is the point "near" (<6 units) the triangle?
%% Cell type:code id:7a87b70f tags:
``` python
triangle.union(point.buffer(6))
```
%% Cell type:code id:e04daa60 tags:
``` python
triangle.intersects(point.buffer(6))
```
%% Cell type:markdown id:bf500303 tags:
#### Extracting "Europe" data from "naturalearth_lowres"
%% Cell type:code id:410b08cf tags:
``` python
# Europe bounding box
eur_window = box(-10.67, 34.5, 31.55, 71.05)
```
%% Cell type:markdown id:d45fdfcb-b244-4eff-a9d6-5cc4fefdfbd8 tags:
Can we use `intersects` method?
%% Cell type:code id:24f4a32b-9ed4-468b-a194-ebe0395fcdb6 tags:
``` python
gdf.intersects(eur_window)
```
%% Cell type:code id:ff455178-84da-45bf-b5b9-a71a7f9160f7 tags:
``` python
# Incorrect v1
gdf[gdf.intersects(eur_window)].plot()
```
%% Cell type:code id:03aa8da8-72f4-45fd-8931-e410d1204226 tags:
``` python
# Incorrect v2
gdf[~gdf.intersects(eur_window)].plot()
```
%% Cell type:markdown id:4173f3c5-017f-44ff-b713-158219fcf3e5 tags:
Can we use `intersection` method?
%% Cell type:code id:d7226fd9-3bb2-48c3-a5e9-5ce1ca0dd88f tags:
``` python
gdf.intersection(eur_window)
```
%% Cell type:code id:108b8b8a tags:
``` python
gdf.intersection(eur_window).plot()
```
%% Cell type:markdown id:f13e4e16-286a-4681-8bc5-9bd3fcf599a7 tags:
How can we get rid of empty polygons (and remove the warning)?
%% Cell type:code id:8c0dd051-a808-4fd4-b5ff-fee6ed580753 tags:
``` python
eur = gdf.intersection(eur_window)
eur.is_empty
```
%% Cell type:markdown id:acb1b886-f7c8-41d4-979a-2309c2b973bd tags:
Remove all the empty polygons using `is_empty`.
%% Cell type:code id:55a76a00 tags:
``` python
eur = eur[~eur.is_empty]
eur
```
%% Cell type:code id:08c59df7 tags:
``` python
eur.plot()
```
%% Cell type:markdown id:97a12d23 tags:
#### Centroids of European countries
%% Cell type:code id:eb5c83b7 tags:
``` python
# plot the centroids
ax = eur.plot(facecolor="lightgray", edgecolor="k")
eur.centroid.plot(ax=ax)
```
%% Cell type:markdown id:dbfab5e7-5941-438a-baf2-e05369106004 tags:
### Lat / Lon CRS
- Lon is the x-coordinate
- Lat is the y-coordinate
- tells you where the point on Earth is
- **IMPORTANT**: degrees are not a unit of distance. Moving 1 degree of longitude near the equator covers far more ground than moving 1 degree of longitude near the North Pole.

Use `.crs` to access the CRS of a GeoDataFrame.
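The degrees-vs-distance point can be checked with a rough sketch that assumes a spherical Earth (the 6,371 km mean radius is an approximation used only for this illustration):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation)

def meters_per_degree_lon(lat_deg):
    # a degree of longitude spans a circle that shrinks toward the poles
    return (math.pi / 180) * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))

print(round(meters_per_degree_lon(0)))   # equator: ~111 km per degree
print(round(meters_per_degree_lon(60)))  # 60 degrees north: exactly half of that
```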
%% Cell type:code id:89251896 tags:
``` python
eur.crs
```
%% Cell type:markdown id:2a6ba447-6f1a-4d31-8c33-ec1eeb9e0deb tags:
#### A single CRS doesn't work for the whole Earth
- Setting a different CRS for Europe that is based on meters.
- https://spatialreference.org/ref/?search=europe
%% Cell type:code id:586038b1 tags:
``` python
# Setting CRS to "EPSG:3035"
eur2 = eur.to_crs("EPSG:3035")
eur2.crs
```
%% Cell type:code id:d46c124c-9aff-4c2f-81fe-08885b606800 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot()  # oops: without ax=ax, this draws on a separate figure
```
%% Cell type:code id:045b9c33 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax)
```
%% Cell type:markdown id:0634941f tags:
#### How much error does lat/lon computation introduce?
%% Cell type:code id:c0b72aff tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax, color="k") # black => correct
eur.centroid.to_crs("EPSG:3035").plot(ax=ax, color="r") # red => miscalculated
```
%% Cell type:code id:ca9e306e tags:
``` python
type(eur2.iloc[0])
```
%% Cell type:code id:f489c88d-5964-4358-b8c4-fc80d28b5491 tags:
``` python
type(eur2).__mro__
```
%% Cell type:markdown id:85880c73 tags:
#### Area of European countries
%% Cell type:code id:3e4874d9 tags:
``` python
eur2.area # area in sq meters
```
%% Cell type:markdown id:95f55824 tags:
What is the area in **sq miles**?
%% Cell type:code id:85ee20c2 tags:
``` python
# Conversion: sq meters -> sq km (/ 1000 / 1000), sq km -> sq miles (/ 2.59)
(eur2.area / 1000 / 1000 / 2.59).sort_values(ascending=False)
# careful! some countries (e.g., Russia) were cropped when we did intersection
```
%% Cell type:code id:cd600837 tags:
``` python
# area in square degrees (lat/lon units), not a real-world area
eur.area
```
%% Cell type:markdown id:daf1245f-9939-468d-9a5f-34225c2dbc51 tags:
### CRS
- `<GeoDataFrame object>.crs`: gives you information about current CRS.
- `<GeoDataFrame object>.to_crs(<TARGET CRS>)`: changes CRS to `<TARGET CRS>`.
%% Cell type:markdown id:9da3ee4c tags:
### Madison area emergency services
- Data source: https://data-cityofmadison.opendata.arcgis.com/
- Search for:
- "City limit"
- "Lakes and rivers"
- "Fire stations"
- "Police stations"
- CRS for Madison area: https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system#/media/File:Universal_Transverse_Mercator_zones.svg
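The UTM zone for a given longitude can be computed directly. A quick sketch (Madison sits near longitude -89.4, which lands in zone 16, matching the `epsg:32616` code used below):

```python
def utm_zone(lon_deg):
    # UTM zones are 6 degrees of longitude wide, numbered 1..60 starting at 180 W
    return int((lon_deg + 180) // 6) + 1

print(utm_zone(-89.4))  # Madison, WI -> 16
```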
%% Cell type:code id:a6f80847 tags:
``` python
city = gpd.read_file("City_Limit.zip").to_crs("epsg:32616")
```
%% Cell type:code id:7f8595be tags:
``` python
city.crs
```
%% Cell type:code id:9ebd361f tags:
``` python
water = gpd.read_file("Lakes_and_Rivers.zip").to_crs(city.crs)
fire = gpd.read_file("Fire_Stations.zip").to_crs(city.crs)
police = gpd.read_file("Police_Stations.zip").to_crs(city.crs)
```
%% Cell type:markdown id:723e2bff-545f-4a8c-9f72-8a9906a99b1b tags:
#### Run this on your virtual machine
`sudo sh -c "echo 'Options = UnsafeLegacyRenegotiation' >> /etc/ssl/openssl.cnf"`
then restart notebook!
%% Cell type:markdown id:b6069860-7dd9-43f2-970e-fd15980135ef tags:
#### GeoJSON
How do we find the URL below?
- Go to info page of a dataset, for example: https://data-cityofmadison.opendata.arcgis.com/datasets/police-stations/explore?location=43.081769%2C-89.391550%2C12.81
- Then click on "I want to use this" > "View API Resources" > "GeoJSON"
%% Cell type:code id:3a095d5e tags:
``` python
url = "https://maps.cityofmadison.com/arcgis/rest/services/Public/OPEN_DATA/MapServer/2/query?outFields=*&where=1%3D1&f=geojson"
police2 = gpd.read_file(url).to_crs(city.crs)
```
%% Cell type:code id:248be81e tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
%% Cell type:code id:3a609d81 tags:
``` python
fire.to_file("fire.geojson")
```
%% Cell type:markdown id:77600884 tags:
### Geocoding: street address => lat / lon
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts a street address into lat/lon
#### Daily incident reports: https://www.cityofmadison.com/fire/daily-reports
%% Cell type:code id:6b0b2aa0 tags:
``` python
url = "https://www.cityofmadison.com/fire/daily-reports"
r = requests.get(url)
r
```
%% Cell type:code id:bee28b41 tags:
``` python
r.raise_for_status() # give me an exception if not 200 (e.g., 404)
```
%% Cell type:code id:0bae00e7 tags:
``` python
# sometimes the following doesn't work
# pd.read_html(url)
```
%% Cell type:code id:47173ec2 tags:
``` python
# print(r.text)
```
%% Cell type:markdown id:39f166c5 tags:
Find all the street addresses (the `<p>... block ...<br>` lines) using regex.
%% Cell type:code id:ac7b9482-5512-49e4-b0b8-7f9ed34b5844 tags:
``` python
# <p>1700 block Thierer Road<br>
# <p>2300 block Pennsylvania Avenue<br>
# addrs = re.findall(r'<p>1700 block Thierer Road<br>', r.text)
```
%% Cell type:code id:8e9b49d2-3e0d-4a39-9dcf-13b288a165e7 tags:
``` python
# pattern based on the example lines shown above (the site's structure may change)
addrs = re.findall(r'<p>(\d+ block [^<]*)<br>', r.text)
addrs = pd.Series(addrs)
addrs
```
%% Cell type:markdown id:4fc8ba0c-9d45-4251-8055-1310cabf15a9 tags:
#### Without a city and state name, geocoding returns a match for the best-known location with that street name.
%% Cell type:code id:095b9ebe-b583-4098-a3a2-47dd46610d59 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue")
geo_info
```
%% Cell type:code id:e52ee0fc-d73c-4942-9bec-d1586a702f68 tags:
``` python
geo_info["address"].loc[0]
```
%% Cell type:markdown id:e80b714c-a962-4a47-b7cc-3a19d0da8ac0 tags:
#### To get the correct address we want, we should concatenate "Madison, Wisconsin" to the end of the address.
%% Cell type:code id:cf3f590b-8d00-46c4-ac57-6f2f61509f68 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue, Madison, Wisconsin")
geo_info
```
%% Cell type:markdown id:3caf8c12-67a3-4e19-a758-d556d010eead tags:
#### Addresses with "block" often won't work or won't give you the correct lat/lon. We need to remove the word "block" before geocoding.
%% Cell type:code id:54103e4a-5ac2-4f19-811b-385c02623ed2 tags:
``` python
gpd.tools.geocode("800 block W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:66072f4b-2286-4d66-ba2c-18ccd2e37ed6 tags:
``` python
gpd.tools.geocode("800 W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:cf982302 tags:
``` python
fixed_addrs = addrs.str.replace(" block ", " ") + ", Madison, Wisconsin"
fixed_addrs
```
%% Cell type:markdown id:d6a267ed-6876-4866-a66f-74390a3d4ee1 tags:
#### Using a different provider than the default one
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts a street address into lat/lon
- We will be using "OpenStreetMap", for which the argument is "nominatim"
- We also need to pass a `user_agent` argument indicating where the request is coming from; for example: "cs320bot"
%% Cell type:code id:ab0e699f tags:
``` python
incidents = gpd.tools.geocode(fixed_addrs, provider="nominatim", user_agent="cs320bot")
incidents
```
%% Cell type:markdown id:4ad73b2c-5171-45c2-a3f0-eef30f93c492 tags:
It is often a good idea to drop NA values, although in this version of the example no geocoding failed.
%% Cell type:code id:41a7d12e-73be-4442-b43c-285229e6cfdb tags:
``` python
incidents = incidents.dropna()
incidents
```
%% Cell type:markdown id:b30733e6-9491-4be0-bf20-bba64903334d tags:
#### Self-practice
If you want practice with regex, try to write a regular expression and use the match result to make sure that "Madison" and "Wisconsin" are part of each address.
%% Cell type:code id:843bbba2-3de5-4cc0-b76b-92e7bebdd379 tags:
``` python
# self-practice
for addr in incidents["address"]:
print(addr)
```
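One possible sketch for the self-practice prompt above; the sample strings are made up to mimic the shape of geocoder output:

```python
import re

# hypothetical samples shaped like geocoded addresses
samples = [
    "1700 Thierer Road, Madison, Dane County, Wisconsin, 53704, United States",
    "800 West Johnson Street, Madison, Dane County, Wisconsin, 53706, United States",
]

pattern = r"\bMadison\b.*\bWisconsin\b"
for addr in samples:
    # re.search returns a match object (truthy) if both words appear in order
    assert re.search(pattern, addr), addr
print("every address mentions Madison and Wisconsin")
```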
%% Cell type:code id:1a04c2b0 tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
incidents.to_crs(city.crs).plot(ax=ax, color="k", label="Incidents")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
......
%% Cell type:markdown id:471a762b tags:
# Visualization 2
%% Cell type:markdown id:2478daaa-cb6c-4d73-92af-01ae91e773fe tags:
### Geographic Data / Maps
#### Installation
```python
pip3 install --upgrade pip
pip3 install geopandas shapely descartes geopy netaddr
sudo apt install -y python3-rtree
```
- `import geopandas as gpd`
- `.shp` => Shapefile
- `gpd.datasets.get_path(<shp file path>)`:
- example: `gpd.datasets.get_path("naturalearth_lowres")`
- `gpd.read_file(<path>)`
%% Cell type:code id:e6f50cc3 tags:
``` python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import math
import requests
import re
import os
# new import statements
import geopandas as gpd
from shapely.geometry import Point, Polygon, box
```
%% Cell type:code id:257671bc-1e79-47c2-aa7a-1725562d23ef tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets/naturalearth_lowres
```
%% Cell type:code id:e7269706-188a-48e3-882a-ac068170b1af tags:
``` python
!ls /home/gurmail.singh/.local/lib/python3.8/site-packages/geopandas/datasets
```
%% Cell type:code id:273ed288 tags:
``` python
# Find the path for "naturalearth_lowres"
path =
# Read the shapefile for "naturalearth_lowres" and
# Read the geojson file for "naturalearth_lowres"
gdf = gpd.read_file("naturalearth_lowres.geojson")
# set index using "name" column
gdf =
gdf = gdf.set_index("name")
```
%% Cell type:code id:cf4d871c-d6d6-4d99-be96-75faf804d4a7 tags:
``` python
gdf.head()
```
%% Cell type:code id:5c983208-7851-4d25-92e7-3f110039411a tags:
``` python
type(gdf).__mro__
```
%% Cell type:code id:653edec9-8b82-4684-ae82-a9fa399459ce tags:
``` python
# All shapefiles have a column called "geometry"
# Get the "geometry" column
gdf["geometry"]
```
%% Cell type:code id:e2eb071a-b29a-4edd-8747-50b02fd43108 tags:
``` python
type(gdf["geometry"]).__mro__
```
%% Cell type:code id:da5b5783-d78e-460f-bd22-b41e7f24f871 tags:
``` python
# First country's name and geometry
```
%% Cell type:code id:8f4da997-567d-47f6-8594-f0a37b92b9e6 tags:
``` python
# Second country's name geometry
print(gdf.index[1])
gdf["geometry"].iat[1]
```
%% Cell type:code id:d380b32f-c401-4601-b6d3-02ac344b3f08 tags:
``` python
# Geometry for "United States of America"
# gdf.at[<row_index>, <column_name>]
```
%% Cell type:code id:3528f252-36b7-4ca5-9108-0951367837cc tags:
``` python
# Type of Tanzania's geometry
print(gdf.index[1], type(gdf["geometry"].iat[1]))
# Type of United States of America's geometry
print("United States of America", type(gdf.at["United States of America", "geometry"]))
```
%% Cell type:markdown id:47711b72-a24c-4da8-a804-3faf88a5fe8b tags:
- `gdf.plot(figsize=(<width>, <height>), column=<column name>)`
- `ax.set_axis_off()`
%% Cell type:code id:9874c7de-226e-4296-af60-45e091836a19 tags:
``` python
ax = gdf.plot(figsize=(8,4))
```
%% Cell type:code id:f5547cc5-3404-421e-aeed-4b4cf8ee8382 tags:
``` python
# Set facecolor="0.7", edgecolor="black"
ax = gdf.plot(figsize=(8,4))
# Turn off the axes
# ax.set_axis_off()
```
%% Cell type:code id:58da716e tags:
``` python
# Color the map based on population column, column="pop_est" and set cmap="cool" and legend=True
ax = gdf.plot(figsize=(8,4))
# Turn off the axes
ax.set_axis_off()
```
%% Cell type:markdown id:c7a85431-bdab-46ab-8e29-5f91d8a2e0bc tags:
#### Create a map where countries with >100M people are red, others are gray.
%% Cell type:code id:962a552b-94a6-4690-8018-f82173dd6096 tags:
``` python
# Create a map where countries with >100M people are red, others are gray
# Add a new column called color to gdf and set default value to "lightgray"
# Boolean indexing to set color to red for countries with "pop_est" > 1e8
# Create the plot
# ax = gdf.plot(figsize=(8,4), color=gdf["color"])
# ax.set_axis_off()
```
%% Cell type:markdown id:5a32c52c-509f-4f7f-8dd6-bc7fe53c64fd tags:
### All shapefile geometries are shapely shapes.
%% Cell type:code id:33e92def-a7db-4d90-adc8-a3d01d472ac1 tags:
``` python
type(gdf["geometry"].iat[2])
```
%% Cell type:markdown id:39c95c23 tags:
### Shapely shapes
- `from shapely.geometry import Point, Polygon, box`
- `Polygon([(<x1>, <y1>), (<x2>, <y2>), (<x3>, <y3>), ...])`
- `box(minx, miny, maxx, maxy)`
- `Point(<x>, <y>)`
- `<shapely object>.buffer(<size>)`
- example: `Point(5, 5).buffer(3)` creates a circle
%% Cell type:code id:61716db9 tags:
``` python
triangle = # triangle
triangle
```
%% Cell type:code id:6a36021a-8653-4698-a0ba-818c14091d79 tags:
``` python
type(triangle)
```
%% Cell type:code id:bddd958d-6fde-42df-8ae3-459e25fa3f04 tags:
``` python
box1 = # not a type; just a function that creates box
box1
```
%% Cell type:code id:eb7ade8d-96d2-4770-bce9-b3e35996b0b8 tags:
``` python
type(box1)
```
%% Cell type:code id:e5308c61-b1fa-433b-bb05-485ca7bd23da tags:
``` python
point =
point
```
%% Cell type:code id:5c669619-af78-477d-b807-3e6d99278f2f tags:
``` python
type(point)
```
%% Cell type:code id:f015d27e-8fd8-446f-992f-4f7a88cc582d tags:
``` python
# use buffer to create a circle from a point
circle =
circle
```
%% Cell type:code id:39fe5752-8038-46cc-b4a6-320e6a78bbd1 tags:
``` python
type(circle)
```
%% Cell type:code id:800cc48d-c241-4439-9161-ff309e50373f tags:
``` python
triangle_buffer = triangle.buffer(3)
triangle_buffer
```
%% Cell type:code id:6b9478d3-7cc9-4e80-ae84-e56d7bfd0b71 tags:
``` python
type(triangle_buffer)
```
%% Cell type:markdown id:1a340443-d750-43d5-99cd-28e7034ce898 tags:
#### Shapely methods
- Shapely methods:
- `union`: any point that is in either shape (OR)
- `intersection`: any point that is in both shapes (AND)
- `difference`: subtraction
- `intersects`: do they overlap? returns True / False
%% Cell type:code id:d1c5f9f7 tags:
``` python
# union triangle and box1
# it will give any point that is in either shape (OR)
```
%% Cell type:code id:8a2d3357 tags:
``` python
# intersection triangle and box1
# any point that is in both shapes (AND)
```
%% Cell type:code id:f327bb99-be34-4a00-a216-a6f40df48268 tags:
``` python
# difference of triangle and box1
```
%% Cell type:code id:9082b54a tags:
``` python
# difference of box1 and triangle
box1.difference(triangle) # subtraction
```
%% Cell type:code id:0def4359-78f6-4f14-80ae-05c25ff64bc5 tags:
``` python
# check whether triangle intersects box1
# the is, check do they overlap?
```
%% Cell type:markdown id:51a82cf3-fe28-46d2-a019-d87f6a0f41b6 tags:
Is the point "near" (<6 units) the triangle?
%% Cell type:code id:7a87b70f tags:
``` python
triangle.union(point.buffer(6))
```
%% Cell type:code id:e04daa60 tags:
``` python
triangle.intersects(point.buffer(6))
```
%% Cell type:markdown id:bf500303 tags:
#### Extracting "Europe" data from "naturalearth_lowres"
%% Cell type:code id:410b08cf tags:
``` python
# Europe bounding box
eur_window = box(-10.67, 34.5, 31.55, 71.05)
```
%% Cell type:markdown id:d45fdfcb-b244-4eff-a9d6-5cc4fefdfbd8 tags:
Can we use `intersects` method?
%% Cell type:code id:24f4a32b-9ed4-468b-a194-ebe0395fcdb6 tags:
``` python
gdf.intersects(eur_window)
```
%% Cell type:code id:ff455178-84da-45bf-b5b9-a71a7f9160f7 tags:
``` python
# Incorrect v1
gdf[gdf.intersects(eur_window)].plot()
```
%% Cell type:code id:03aa8da8-72f4-45fd-8931-e410d1204226 tags:
``` python
# Incorrect v2
gdf[~gdf.intersects(eur_window)].plot()
```
%% Cell type:markdown id:4173f3c5-017f-44ff-b713-158219fcf3e5 tags:
Can we use `intersection` method?
%% Cell type:code id:d7226fd9-3bb2-48c3-a5e9-5ce1ca0dd88f tags:
``` python
gdf.intersection(eur_window)
```
%% Cell type:code id:108b8b8a tags:
``` python
gdf.intersection(eur_window).plot()
```
%% Cell type:markdown id:f13e4e16-286a-4681-8bc5-9bd3fcf599a7 tags:
How can we get rid of empty polygons (and remove the warning)?
%% Cell type:code id:8c0dd051-a808-4fd4-b5ff-fee6ed580753 tags:
``` python
eur = gdf.intersection(eur_window)
eur.is_empty
```
%% Cell type:markdown id:acb1b886-f7c8-41d4-979a-2309c2b973bd tags:
Remove all the empty polygons using `is_empty`.
%% Cell type:code id:55a76a00 tags:
``` python
eur = eur[~eur.is_empty]
eur
```
%% Cell type:code id:08c59df7 tags:
``` python
eur.plot()
```
%% Cell type:markdown id:97a12d23 tags:
#### Centroids of European countries
%% Cell type:code id:eb5c83b7 tags:
``` python
# plot the centroids
ax = eur.plot(facecolor="lightgray", edgecolor="k")
eur.centroid.plot(ax=ax)
```
%% Cell type:markdown id:dbfab5e7-5941-438a-baf2-e05369106004 tags:
### Lat / Lon CRS
- Lon is x-coord
- Lat is y-coord
- tells you where the point on Earth is
- **IMPORTANT**: degrees are not a unit of distance. 1 degree of longitute near the equator is a lot farther than moving 1 degree of longitute near the north pole
Using `.crs` to access CRS of a gdf.
%% Cell type:code id:89251896 tags:
``` python
eur.crs
```
%% Cell type:markdown id:2a6ba447-6f1a-4d31-8c33-ec1eeb9e0deb tags:
#### Single CRS doesn't work for the whole earth
- Setting a different CRS for Europe that is based on meters.
- https://spatialreference.org/ref/?search=europe
%% Cell type:code id:586038b1 tags:
``` python
# Setting CRS to "EPSG:3035"
eur2 = eur.to_crs("EPSG:3035")
eur2.crs
```
%% Cell type:code id:d46c124c-9aff-4c2f-81fe-08885b606800 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot()
```
%% Cell type:code id:045b9c33 tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax)
```
%% Cell type:markdown id:0634941f tags:
#### How much error does lat/long computation introduce?
#### How much error does lat/lon computation introduce?
%% Cell type:code id:c0b72aff tags:
``` python
ax = eur2.plot(facecolor="lightgray", edgecolor="k")
eur2.centroid.plot(ax=ax, color="k") # black => correct
eur.centroid.to_crs("EPSG:3035").plot(ax=ax, color="r") # red => miscalculated
```
%% Cell type:code id:ca9e306e tags:
``` python
type(eur2.iloc[0])
```
%% Cell type:code id:f489c88d-5964-4358-b8c4-fc80d28b5491 tags:
``` python
type(eur2).__mro__
```
%% Cell type:markdown id:85880c73 tags:
#### Area of European countries
%% Cell type:code id:3e4874d9 tags:
``` python
eur2.area # area in sq meters
```
%% Cell type:markdown id:95f55824 tags:
What is the area in **sq miles**?
%% Cell type:code id:85ee20c2 tags:
``` python
# Conversion: / 1000 / 1000 / 2.59
(eur2.area / 1000 / 1000 / 2.59).sort_values(ascending=False)
# careful! some countries (e.g., Russia) were cropped when we did intersection
```
%% Cell type:code id:cd600837 tags:
``` python
# area on screen, not real area
eur.area
```
%% Cell type:markdown id:daf1245f-9939-468d-9a5f-34225c2dbc51 tags:
### CRS
- `<GeoDataFrame object>.crs`: gives you information about current CRS.
- `<GeoDataFrame object>.to_crs(<TARGET CRS>)`: changes CRS to `<TARGET CRS>`.
%% Cell type:markdown id:9da3ee4c tags:
### Madison area emergency services
- Data source: https://data-cityofmadison.opendata.arcgis.com/
- Search for:
- "City limit"
- "Lakes and rivers"
- "Fire stations"
- "Police stations"
- CRS for Madison area: https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system#/media/File:Universal_Transverse_Mercator_zones.svg
%% Cell type:code id:a6f80847 tags:
``` python
city = gpd.read_file("City_Limit.zip").to_crs("epsg:32616")
```
%% Cell type:code id:7f8595be tags:
``` python
city.crs
```
%% Cell type:code id:9ebd361f tags:
``` python
water = gpd.read_file("Lakes_and_Rivers.zip").to_crs(city.crs)
fire = gpd.read_file("Fire_Stations.zip").to_crs(city.crs)
police = gpd.read_file("Police_Stations.zip").to_crs(city.crs)
```
%% Cell type:markdown id:723e2bff-545f-4a8c-9f72-8a9906a99b1b tags:
#### Run this on your virtual machine
`sudo sh -c "echo 'Options = UnsafeLegacyRenegotiation' >> /etc/ssl/openssl.cnf"`
then restart notebook!
%% Cell type:markdown id:b6069860-7dd9-43f2-970e-fd15980135ef tags:
#### GeoJSON
How to find the below URL?
- Go to info page of a dataset, for example: https://data-cityofmadison.opendata.arcgis.com/datasets/police-stations/explore?location=43.081769%2C-89.391550%2C12.81
- Then click on "I want to use this" > "View API Resources" > "GeoJSON"
%% Cell type:code id:3a095d5e tags:
``` python
url = "https://maps.cityofmadison.com/arcgis/rest/services/Public/OPEN_DATA/MapServer/2/query?outFields=*&where=1%3D1&f=geojson"
police2 = gpd.read_file(url).to_crs(city.crs)
```
%% Cell type:code id:248be81e tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
%% Cell type:code id:3a609d81 tags:
``` python
fire.to_file("fire.geojson")
```
%% Cell type:markdown id:77600884 tags:
### Geocoding: street address => lat / lon
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/long
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/lon
#### Daily incident reports: https://www.cityofmadison.com/fire/daily-reports
%% Cell type:code id:6b0b2aa0 tags:
``` python
url = "https://www.cityofmadison.com/fire/daily-reports"
r = requests.get(url)
r
```
%% Cell type:code id:bee28b41 tags:
``` python
r.raise_for_status() # give me an exception if not 200 (e.g., 404)
```
%% Cell type:code id:0bae00e7 tags:
``` python
# doesn't work
# sometime the following doesn't work
# pd.read_html(url)
```
%% Cell type:code id:47173ec2 tags:
``` python
# print(r.text)
```
%% Cell type:markdown id:39f166c5 tags:
Find all **span** tags with **streetAddress** using regex.
%% Cell type:code id:ac7b9482-5512-49e4-b0b8-7f9ed34b5844 tags:
``` python
# <p>1700 block Thierer Road<br>
# <p>2300 block Pennsylvania Avenue<br>
# addrs = re.findall(r'<p>1700 block Thierer Road<br>', r.text)
```
%% Cell type:code id:8e9b49d2-3e0d-4a39-9dcf-13b288a165e7 tags:
``` python
# pattern is a best guess based on the examples above: capture the street address between <p> and <br>
addrs = re.findall(r"<p>(\d+ block [^<]*)<br>", r.text)
addrs = pd.Series(addrs)
addrs
```
%% Cell type:markdown id:4fc8ba0c-9d45-4251-8055-1310cabf15a9 tags:
#### Without a city and state name, geocoding returns a match for the most famous location with that street name.
%% Cell type:code id:095b9ebe-b583-4098-a3a2-47dd46610d59 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue")
geo_info
```
%% Cell type:code id:e52ee0fc-d73c-4942-9bec-d1586a702f68 tags:
``` python
geo_info["address"].loc[0]
```
%% Cell type:markdown id:e80b714c-a962-4a47-b7cc-3a19d0da8ac0 tags:
#### To get the correct address we want, we should concatenate "Madison, Wisconsin" to the end of the address.
%% Cell type:code id:cf3f590b-8d00-46c4-ac57-6f2f61509f68 tags:
``` python
geo_info = gpd.tools.geocode("2300 block Pennsylvania Avenue, Madison, Wisconsin")
geo_info
```
%% Cell type:markdown id:3caf8c12-67a3-4e19-a758-d556d010eead tags:
#### Addresses with "block" often won't work or won't give you the correct lat/lon. We need to remove the word "block" before geocoding.
%% Cell type:code id:54103e4a-5ac2-4f19-811b-385c02623ed2 tags:
``` python
gpd.tools.geocode("800 block W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:66072f4b-2286-4d66-ba2c-18ccd2e37ed6 tags:
``` python
gpd.tools.geocode("800 W. Johnson Street, Madison, Wisconsin")
```
%% Cell type:code id:cf982302 tags:
``` python
fixed_addrs = addrs.str.replace(" block ", " ") + ", Madison, Wisconsin"
fixed_addrs
```
%% Cell type:markdown id:d6a267ed-6876-4866-a66f-74390a3d4ee1 tags:
#### Using a different provider than the default one
- `gpd.tools.geocode(<street address>, provider=<geocoding service name>, user_agent=<user agent name>)`: converts street address into lat/lon
- We will be using "OpenStreetMap", for which the argument is "nominatim"
- We also need to pass an argument to the `user_agent` parameter, indicating where the request is coming from; for example: "cs320bot"
%% Cell type:code id:ab0e699f tags:
``` python
incidents = gpd.tools.geocode(fixed_addrs, provider="nominatim", user_agent="cs320bot")
incidents
```
%% Cell type:markdown id:4ad73b2c-5171-45c2-a3f0-eef30f93c492 tags:
It is often a good idea to drop NA values, although in this version of the example there are no failed geocodings.
%% Cell type:code id:41a7d12e-73be-4442-b43c-285229e6cfdb tags:
``` python
incidents = incidents.dropna()
incidents
```
%% Cell type:markdown id:b30733e6-9491-4be0-bf20-bba64903334d tags:
#### Self-practice
If you want practice with regex, try writing a regular expression and using the match result to make sure that "Madison" and "Wisconsin" are part of each address.
%% Cell type:code id:843bbba2-3de5-4cc0-b76b-92e7bebdd379 tags:
``` python
# self-practice
for addr in incidents["address"]:
print(addr)
```
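One possible sketch for this self-practice; the addresses below are made-up examples rather than real geocoder output:

``` python
import re

import pandas as pd

# hypothetical geocoded addresses (illustrative only)
addresses = pd.Series([
    "1300, East Washington Avenue, Madison, Dane County, Wisconsin, 53703, United States",
    "800, West Johnson Street, Madison, Dane County, Wisconsin, 53706, United States",
    "1600 Pennsylvania Avenue NW, Washington, District of Columbia",
])

# "Madison" must appear somewhere before "Wisconsin"
pattern = r"Madison.*Wisconsin"
in_madison = addresses.apply(lambda a: re.search(pattern, a) is not None)
print(in_madison.tolist())  # [True, True, False]
```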
%% Cell type:code id:1a04c2b0 tags:
``` python
ax = city.plot(color="lightgray")
water.plot(color="lightblue", ax=ax)
fire.plot(color="red", ax=ax, marker="+", label="Fire")
police2.plot(color="blue", ax=ax, label="Police")
incidents.to_crs(city.crs).plot(ax=ax, color="k", label="Incidents")
ax.legend(loc="upper left", frameon=False)
ax.set_axis_off()
```
......
%% Cell type:markdown id:9e33d56e tags:
# Regression 1
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
# new import statements
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
```
%% Cell type:markdown id:570b4253 tags:
#### Covid deaths analysis
- Source: https://github.com/cs320-wisc/s22/tree/main/lec/29%20Regression%201
- Specifically, let's analyze "COVID-19 Data by Census Tract V2"
- Status Flag Values: -999: Census tracts, municipalities, school districts, and zip codes with 0–4 aggregate counts for any data have been suppressed. County data with 0-4 aggregate counts by demographic factors (e.g., by age group, race, ethnicity) have been suppressed.
%% Cell type:code id:696eec18 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df = gpd.read_file(dataset_file)
```
%% Cell type:code id:c3e6454f tags:
``` python
df.head()
```
%% Cell type:code id:905da51e tags:
``` python
# Explore the columns
df.columns
```
%% Cell type:code id:a3434aba tags:
``` python
# Create a geographic plot
df.plot()
```
%% Cell type:markdown id:e3e73632 tags:
### Predicting "DTH_CUM_CP"
%% Cell type:markdown id:43cebffa tags:
### How can we get a clean dataset of COVID deaths in WI?
%% Cell type:code id:fa2f30ae tags:
``` python
# Replace -999 with 2; 2 is between 0-4; random choice instead of using 0
df = df.replace(-999, 2)
# we must communicate in final results what percent of values were guessed (imputed)
```
%% Cell type:markdown id:4cff709d tags:
How would we know if the data is now clean?
%% Cell type:code id:950c2041 tags:
``` python
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df.plot.scatter(x="POP", y="DTH_CUM_CP")
```
%% Cell type:markdown id:4073a940 tags:
Which points are concerning? Let's take a closer look.
#### Which rows have "DTH_CUM_CP" greater than 300?
%% Cell type:code id:a655c465 tags:
``` python
df[df["DTH_CUM_CP"] > 300]
```
%% Cell type:markdown id:d377143e tags:
#### Valid rows have "GEOID" that only contains digits
Using `str` methods to perform filtering: `str.fullmatch` does a full string match given a regex. Because it matches the full string, anchor characters (`^`, `$`) aren't needed.
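A tiny illustration with a made-up Series:

``` python
import pandas as pd

# only the first value consists entirely of digits
ids = pd.Series(["55025001", "55025002x", "TOTAL"])
mask = ids.str.fullmatch(r"\d+")  # no ^/$ anchors needed
print(ids[mask].tolist())  # ['55025001']
```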
%% Cell type:code id:529781db tags:
``` python
```
%% Cell type:code id:af16925b tags:
``` python
df["GEOID"]
```
%% Cell type:code id:1d583d06 tags:
``` python
df = df[df["GEOID"].str.fullmatch(r"\d+")]
df.plot.scatter(x="POP", y="DTH_CUM_CP")
```
%% Cell type:markdown id:1be50600 tags:
### How can we train/fit models to known data to predict unknowns?
- Feature(s) => Predictions
- Population => Deaths
- Cases => Deaths
- Cases by Age => Deaths
- General structure for fitting models:
```python
model = <some model>
model.fit(X, y)
y = model.predict(X)
```
- where `X` needs to be a matrix or a `DataFrame` and `y` needs to be an array (vector) or a `Series`
- after fitting, `model` object instance stores the information about relationship between features (x values) and predictions (y values)
- `predict` returns a `numpy` array, which can be treated like a list
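The general structure can be sketched end-to-end on a tiny synthetic dataset (all numbers here are made up):

``` python
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic data: y is exactly 3*x + 1
X = pd.DataFrame({"x": [0, 1, 2, 3, 4]})  # features must be a DataFrame
y = pd.Series([1, 4, 7, 10, 13])          # labels are a Series

model = LinearRegression()
model.fit(X, y)              # learn the relationship
pred = model.predict(X)      # numpy array of predictions
print(pred)
```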
%% Cell type:markdown id:0d0e65c3 tags:
### Predicting "DTH_CUM_CP" using "POP" as feature.
%% Cell type:code id:3dbdbba4 tags:
``` python
# We must specify a list of columns to make sure we extract a DataFrame and not a Series
# Feature DataFrame
df[["POP"]]
```
%% Cell type:code id:22aad05e tags:
``` python
# Label Series: "DTH_CUM_CP"
df["DTH_CUM_CP"]
```
%% Cell type:markdown id:797d6831 tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:51ad5b05 tags:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"
model = LinearRegression()
model.fit(df[xcols], df[ycol])
```
%% Cell type:code id:94c3a0e7-4099-4f46-a8af-b065ae0b6873 tags:
``` python
# Let's now make predictions for the known data
# less interesting because we are predicting what we already know
y = model.predict(df[xcols])
y
```
%% Cell type:markdown id:e589d923 tags:
Predicting for new values of x.
%% Cell type:code id:dd8f0440 tags:
``` python
predict_df = pd.DataFrame({"POP": [1000, 2000, 3000]})
predict_df
```
%% Cell type:code id:5315289f tags:
``` python
# Predict for the new data
model.predict(predict_df)
```
%% Cell type:code id:0cea4830 tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:markdown id:1c649201 tags:
### How can we visualize model predictions?
- Let's predict deaths for "POP" ranges like 0, 1000, 2000, ..., 20000
%% Cell type:code id:496a67c9 tags:
``` python
predict_df = pd.DataFrame({"POP": range(0, 20000, 1000)})
predict_df
```
%% Cell type:code id:b825412a tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:code id:ca0b47f5 tags:
``` python
# Line plot of "POP" vs "predicted deaths" overlaid on a
# scatter plot of the known "POP" vs "DTH_CUM_CP" values
ax = df.plot.scatter(x="POP", y="DTH_CUM_CP")
predict_df.plot.line(x="POP", y="predicted deaths", ax=ax, color="red")
```
%% Cell type:markdown id:3bb2d3f8 tags:
### How can we get a formula for the relationship?
- `y=mx+c`, where `y` is our predictions and `x` are the features used for the fit
- Slope of the line (`m`) given by `model.coef_[0]`
- Intercept of the line (`c`) given by `model.intercept_`
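As a sanity check, the `y=mx+c` formula can be evaluated by hand and compared against `model.predict` (synthetic numbers here, not the covid data):

``` python
import pandas as pd
from sklearn.linear_model import LinearRegression

# points lying exactly on deaths = 0.004 * POP + 1
X = pd.DataFrame({"POP": [1000.0, 2000.0, 4000.0]})
y = pd.Series([5.0, 9.0, 17.0])
model = LinearRegression().fit(X, y)

m, c = model.coef_[0], model.intercept_
manual = m * 3000 + c                                          # plug into y = m*x + c
predicted = model.predict(pd.DataFrame({"POP": [3000.0]}))[0]  # same answer
print(round(manual, 4), round(predicted, 4))
```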
%% Cell type:markdown id:a31d19b3 tags:
Model coefficients
%% Cell type:code id:c894c1b1 tags:
``` python
model
```
%% Cell type:code id:4850a47e tags:
``` python
# Slope of the line
model.coef_
```
%% Cell type:code id:32690085 tags:
``` python
# Intercept of the line
model.intercept_
```
%% Cell type:code id:cfbcee12 tags:
``` python
print(f"deaths ~= {round(model.coef_[0], 4)} * population + {round(model.intercept_, 4)}")
```
%% Cell type:markdown id:738ccba5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:91e4af81 tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:code id:286e2569 tags:
``` python
xcols, ycol
```
%% Cell type:code id:bb36ca0f tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:92dadb4c tags:
``` python
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:ebd81950 tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()`
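Because the score is a ratio of variances, the ddof convention cancels, so a by-hand computation matches sklearn exactly. A small made-up check:

``` python
import numpy as np
import sklearn.metrics

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

known_var = y_true.var()                      # variance of the known values
explained_variance = (y_true - y_pred).var()  # variance of the residuals
manual = (known_var - explained_variance) / known_var

library = sklearn.metrics.explained_variance_score(y_true, y_pred)
print(manual, library)  # both 0.95
```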
%% Cell type:markdown id:36bc3bb5 tags:
What is the variation in known deaths?
%% Cell type:code id:55a3dfcd tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:c33a6fb1 tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:dfb076b1 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:73a55a32 tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:547452da tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the R^2 score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:d16ba67b tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:6e60fed7 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
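With numpy's population variance (`ddof=0`), this identity is exact; pandas `.var()` defaults to `ddof=1`, which causes the minor differences. A small made-up check:

``` python
import numpy as np
import sklearn.metrics

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.5, 3.5, 6.5, 7.5])

known_var = y_true.var()                   # population variance (ddof=0)
r2_val = ((y_true - y_pred) ** 2).mean()   # mean squared error
manual_r2 = (known_var - r2_val) / known_var

library_r2 = sklearn.metrics.r2_score(y_true, y_pred)
print(manual_r2, library_r2)  # both 0.95
```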
%% Cell type:code id:d34ea427 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:b1c3574b tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # there might be minor rounding off differences
```
%% Cell type:markdown id:adc33af9 tags:
#### `model.score(X, y)`
- invokes `predict` method for calculating predictions (`y`) based on features (`X`) and compares the predictions with true values of y
%% Cell type:code id:b3bde089 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Example: https://www.mathworks.com/discovery/overfitting.html
- Split data into train and test
%% Cell type:code id:87a77fb4 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:bdd7cad0 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:2f45dd74 tags:
The problem with manual splitting is that we need to make sure the data is not sorted in some way.
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits it, using a 75%-25% train/test split by default
- produces new train and test data every single time
- `test_size` parameter can take two kinds of values:
    - actual number of rows that we want in the test data
    - fraction representing the proportion of the data placed in the test split
- default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
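A quick demonstration of `test_size`, plus the `random_state` parameter for reproducible shuffling (the DataFrame is synthetic):

``` python
import pandas as pd
from sklearn.model_selection import train_test_split

df_demo = pd.DataFrame({"x": range(100)})

# fractional test_size: 25 of 100 rows go to test
train_demo, test_demo = train_test_split(df_demo, test_size=0.25, random_state=320)
print(len(train_demo), len(test_demo))  # 75 25

# same random_state => identical split on every run
train_again, _ = train_test_split(df_demo, test_size=0.25, random_state=320)
print(train_demo.equals(train_again))  # True
```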
%% Cell type:code id:1d3a86c9 tags:
``` python
len(train), len(test)
```
%% Cell type:code id:49f7dfe8 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:1a29cf9d tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:7934e9ee tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you an entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- does many different train/test splits of the data, fitting and scoring the model on each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
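A runnable sketch on synthetic, near-linear data (the dataset and the explicit `cv=5` are illustrative choices):

``` python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic data: y = 2*x plus small noise
rng = np.random.default_rng(0)
X = pd.DataFrame({"x": np.arange(50, dtype=float)})
y = 2.0 * X["x"] + rng.normal(0, 0.1, size=50)

# five fit/score rounds, each holding out a different fifth of the data
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(len(scores), scores.mean())
```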
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(model, train[xcols], train[ycol])
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bars by passing an argument to the yerr parameter
means.plot.bar(figsize=(3, 3), yerr=err)
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
%% Cell type:code id:2538534d tags:
``` python
model = LinearRegression()
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
model.coef_
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_).plot.bar(figsize=(3, 2))
```
%% Cell type:code id:a606f455-c8b4-4d93-a2f4-737c4e799fa0 tags:
``` python
```
......
%% Cell type:markdown id:9e33d56e tags:
# Regression 2
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# new import statements
from sklearn.linear_model import LinearRegression
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
import numpy as np
```
%% Cell type:code id:696eec18 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df = gpd.read_file(dataset_file)
```
%% Cell type:markdown id:b37d6f28-fabd-4418-8c85-8e7cd24331f5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:a2bf3bc0-03b8-4c41-93a9-bef8a76bf3de tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:markdown id:c7fbd149-0c50-46ec-bfb1-f33e9fece60e tags:
### Predicting "DTH_CUM_CP"
%% Cell type:code id:cebb790b-9374-4d02-afe0-40e1efc8f176 tags:
``` python
df = df.replace(-999, 2)
# Removing outlier point
df = df[df["GEOID"].str.fullmatch(r"\d+")]
```
%% Cell type:code id:f4a4dd37-d7d3-41ca-8319-1442b56b2a24 tags:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"
```
%% Cell type:markdown id:87c4b295-39ca-445c-865d-4829d77f55ce tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:78c2c891-caed-4359-bc91-a97de21db812 tags:
``` python
model = LinearRegression()
model.fit(df[xcols], df[ycol])
```
%% Cell type:code id:7787a889-c693-40b3-8e66-9a266bb0c683 tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:7952f67c-57b3-4ca8-bb42-b6287f012157 tags:
``` python
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:39caf7a5-c782-4c02-8c3e-8385eddb108f tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()` (despite the name, this second quantity is the *residual* variance — the variance left unexplained by the model)
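A quick self-contained check of this formula on toy numbers (plain Python, no sklearn; the `y_pred` values are just made-up predictions):

``` python
def var(xs):
    # sample variance (ddof=1), matching pandas Series.var()
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]   # hypothetical predictions

known_var = var(y_true)
residual_var = var([t - p for t, p in zip(y_true, y_pred)])
score = (known_var - residual_var) / known_var
print(round(score, 4))   # → 0.95
```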
%% Cell type:markdown id:908b8af9-6ba1-4e92-9569-36168b253fae tags:
What is the variation in known deaths?
%% Cell type:code id:5fc126c4-de9a-4a35-abe6-8d5820227f6f tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:a2f489d5-1325-4870-b6f8-d2b7949f12cf tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:e049b2bd-35e8-4a75-9d6d-d70c2b998541 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:ced64de4-d1dc-464e-8855-3341a8f3b15a tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:e85ad41f-8d4b-471f-b689-333346d7d660 tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the R^2 score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:557a7646-7843-4956-8d72-9104aa7113de tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:3ceeca6f-e7d4-47fa-b53e-8c3bbba41594 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
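A toy-number check of why this is only an approximation: `known_var` uses the sample variance (dividing by `n - 1`) while `r2_val` divides by `n`, so the result differs slightly from the exact R^2 (the `y_pred` values are made up):

``` python
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]   # hypothetical predictions
n = len(y_true)
mean = sum(y_true) / n

known_var = sum((t - mean) ** 2 for t in y_true) / (n - 1)       # ddof=1
r2_val = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n   # MSE
approx = (known_var - r2_val) / known_var
print(round(approx, 4))   # → 0.9625

# the exact R^2 compares sums of squares directly: 1 - SS_res / SS_tot
exact = 1 - sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / \
            sum((t - mean) ** 2 for t in y_true)
print(round(exact, 4))    # → 0.95
```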
%% Cell type:code id:c1ea5c04-3dd7-412c-9cbf-32f4a69bdb20 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:e3e83156-8e02-4216-9738-bed6b0052608 tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # close to sklearn's value; the small gap comes from known_var using the sample variance (ddof=1)
```
%% Cell type:markdown id:d1795241-d9d7-42c5-8bf4-614ec7d972d7 tags:
#### `model.score(X, y)`
- invokes the `predict` method to calculate predictions from the features (`X`), then compares those predictions against the true values `y`, returning the R^2 score
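For a regression estimator, `score` returns the same number as calling `sklearn.metrics.r2_score` on the model's own predictions. A minimal sketch on synthetic data (assumes scikit-learn and numpy are installed):

``` python
import numpy as np
import sklearn.metrics
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.0, 8.1, 9.9])   # roughly y = 2x

model = LinearRegression()
model.fit(X, y)

by_score = model.score(X, y)                            # predicts internally
by_metric = sklearn.metrics.r2_score(y, model.predict(X))
print(np.isclose(by_score, by_metric))   # → True
```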
%% Cell type:code id:a50b84d0-5c1b-4f84-9f41-d6157057d645 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Split data into train and test
%% Cell type:code id:80cef390-831c-443a-9149-9141c8d34ef2 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:68c16194-1c7b-4f3a-81a3-4b0352f36872 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data, then splits it into train and test (75%-25% by default)
- produces new train and test data every single time (unless `random_state` is fixed)
- `test_size` parameter can take two kinds of values:
    - actual number of rows that we want in the test data
    - fraction of the dataset to put in the test split
    - default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id:5c3de0c2 tags:
``` python
xcols, ycol
```
%% Cell type:code id:c52c86c4-8afe-4d49-97a5-e2fcb18242c5 tags:
``` python
train, test = train_test_split(df)
len(train), len(test)
```
%% Cell type:code id:b3ccdf1a-0958-402c-80e1-2eed12abdf88 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:adeea246-f80a-46c2-b62b-9e95bef5d02f tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:ba4c6310-0f35-44ac-9250-79b6eb04f7cc tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you an entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- do many different train/test splits of the values, fitting and scoring the model across each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
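Under the hood, for a regressor `cross_val_score(model, X, y, cv=5)` fits on four folds and scores on the held-out fold, five times. A sketch of the equivalence on synthetic data (assumes scikit-learn and numpy):

``` python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=40)

auto = cross_val_score(LinearRegression(), X, y, cv=5)   # 5 fits, 5 R^2 scores

# the same five fits/scores done by hand
manual = []
for train_idx, val_idx in KFold(n_splits=5).split(X):
    m = LinearRegression().fit(X[train_idx], y[train_idx])
    manual.append(m.score(X[val_idx], y[val_idx]))

print(np.allclose(auto, manual))   # → True
```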
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(model, train[xcols], train[ycol])
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bars by passing the err Series to the yerr parameter
means.plot.bar(figsize=(3, 3), yerr=err)
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
- Features: Positive cases per age range
%% Cell type:code id:d3016079 tags:
``` python
df.columns
```
%% Cell type:code id:2538534d tags:
``` python
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model = LinearRegression()
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
model.coef_
```
%% Cell type:code id:98fa5442 tags:
``` python
pd.Series(model.coef_)
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_, index=xcols).plot.bar(figsize=(3, 2))
```
%% Cell type:markdown tags:
### Pipeline, Transformer and Estimator
%% Cell type:markdown id:93be2ad2 tags:
<div>
<img src="attachment:Simple_model.png" width="300"/>
</div>
%% Cell type:markdown id:c854dab5 tags:
### Lake Michigan Waves
Source: https://data.cityofchicago.org/Parks-Recreation/Beach-Water-Quality-Automated-Sensors/qmqz-2xku
%% Cell type:code id:a3109957 tags:
``` python
# Read the "waves.csv" file
df = pd.read_csv("waves.csv")
df.head()
```
%% Cell type:code id:18d71eb8 tags:
``` python
df.columns
```
%% Cell type:markdown id:d3f9dc12 tags:
#### Can we predict wave height based on wave period (time between waves)?
%% Cell type:code id:149aa67c tags:
``` python
df = df[(df["Wave Period"] > 0) & (df["Wave Height"] > 0)]
df.plot.scatter(x="Wave Period", y="Wave Height", color="k", alpha=0.05)
```
%% Cell type:markdown id:4da30be1 tags:
Observation: non-linear relationship => fitting a straight line will not work.
%% Cell type:markdown id:62bd175c tags:
#### Can we predict wave height based on beach name and wave period?
%% Cell type:code id:3f045b08 tags:
``` python
beach_names = sorted(set(df["Beach Name"]))
beach_names
```
%% Cell type:code id:bc626c9d tags:
``` python
fig, axes = plt.subplots(2, 3, figsize=(12, 8), sharey=True)
plt.subplots_adjust(hspace=0.3)
axes = list(axes.reshape(-1))
for b in beach_names:
ax = axes.pop(0)
ax.set_title(b)
beach_df = df[df["Beach Name"] == b]
beach_df.plot.scatter(x="Wave Period", y="Wave Height",
color="k", alpha=0.1, ax=ax)
```
%% Cell type:markdown id:ed02085d tags:
Observation: which beach it is (a categorical feature) matters for wave height.
%% Cell type:markdown id:ceebe2c1 tags:
### Four Models
1. wave period (linear)
2. wave period (polynomial)
3. beach
4. beach and wave period
%% Cell type:markdown id:2d4fe892 tags:
`train_test_split(<dataframe>, random_state=<some number>)`
- `random_state` enables us to control the randomization
- when we pass the same number, we will get the same training and test data (pseudo-randomization)
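A quick demonstration that the same `random_state` reproduces the same split (toy DataFrame; assumes scikit-learn and pandas):

``` python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(100)})
train_a, test_a = train_test_split(toy, random_state=320)
train_b, test_b = train_test_split(toy, random_state=320)

# identical seed => identical rows, in the same order
print(train_a.index.equals(train_b.index))   # → True
```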
%% Cell type:code id:a454b7ec tags:
``` python
train, test = train_test_split(df, random_state=320)
train.head()
```
%% Cell type:markdown id:d6f3871a tags:
#### Model 1: Wave Period (Linear)
%% Cell type:markdown id:4c15ac91 tags:
`cross_val_score(estimator, X, y, cv=<number>)`
- `cv` enables us to specify how many folds we want for the cross-validation
%% Cell type:code id:199a5ca6 tags:
``` python
xcols = ["Wave Period"]
ycol = "Wave Height"
m1 = LinearRegression()
scores = cross_val_score(m1, train[xcols], train[ycol], cv=10)
scores
```
%% Cell type:code id:15bec5d5 tags:
``` python
scores.mean()
```
%% Cell type:markdown id:aa5a272e tags:
We want the mean score to be large.
%% Cell type:code id:306e21bd tags:
``` python
scores.std()
```
%% Cell type:markdown id:ea190e50 tags:
We want the standard deviation to be small, to make sure that our model isn't too sensitive to changes in training and test data.
%% Cell type:markdown id:b126b01f tags:
#### Model 2: Wave Period (Polynomial)
LinearRegression can do this:
```
y = 3*x1 + 5*x2
```
It CANNOT do this:
```
y = 3*x + 5*x^2
```
TRICK:
```
x1 = x
x2 = x^2
```
<div>
<img src="attachment:Pipeline.png" width="500"/>
</div>
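The trick in code: stack x and x^2 as two input columns, and LinearRegression recovers the curve's coefficients (synthetic, noise-free data; assumes scikit-learn and numpy):

``` python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 4, 50)
y = 3 * x + 5 * x ** 2             # the non-linear curve we want to fit

X = np.column_stack([x, x ** 2])   # x1 = x, x2 = x^2
model = LinearRegression().fit(X, y)
print(np.round(model.coef_, 6))    # → [3. 5.]
```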
%% Cell type:code id:5b6f8a12 tags:
``` python
# Let's make a copy before we add new columns
# Recommendation: don't change the data referred to by a specific variable
# throughout your code
train2 = train[xcols].copy()
train2.head()
```
%% Cell type:markdown id:6be8df92 tags:
Manually adding columns for x^2, x^3, etc.:
%% Cell type:code id:54f8d249 tags:
``` python
train2["Wave Period ^ 2"] = train2["Wave Period"] ** 2
train2["Wave Period ^ 3"] = train2["Wave Period"] ** 3
train2["sqrt(Wave Period)"] = train2["Wave Period"] ** 0.5
train2.head()
```
%% Cell type:markdown id:155f7e7a tags:
#### `PolynomialFeatures(degree=<val>, include_bias=False)`
- `degree` specifies the highest polynomial degree we want
- `include_bias` default value is True, which adds a column of 1s - we typically don't use that
- returns an object instance on which we can invoke `fit` and `transform`:
    - `transform(X)`: transform data to polynomial features
    - `fit_transform(X[, y])`: fit to data, then transform it
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
%% Cell type:code id:e6c43483 tags:
``` python
pf = PolynomialFeatures(degree=4, include_bias=False)
```
%% Cell type:code id:6f81a519 tags:
``` python
# Fit the data
pf.fit(train[xcols])
```
%% Cell type:code id:8d6d2e1c tags:
``` python
# Transform the data
pf.transform(train[xcols])
```
%% Cell type:code id:ecfc2493 tags:
``` python
# Convert result of transformation into a DataFrame - step 1
pd.DataFrame(pf.transform(train[xcols]))
```
%% Cell type:markdown id:129b21dc tags:
How can we add meaningful column names? `pf.get_feature_names_out()`
%% Cell type:code id:1ab1eaa3 tags:
``` python
pf.get_feature_names_out()
```
%% Cell type:code id:2d62b3d3 tags:
``` python
# Convert result of transformation into a DataFrame - step 2
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out())
```
%% Cell type:code id:4beebe68 tags:
``` python
# Putting all the steps together
pf = PolynomialFeatures(degree=4, include_bias=False)
pf.fit(train[xcols])
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:29d11ffb tags:
`fit_transform(X[, y])`: fit to data, then transform it.
%% Cell type:code id:91c3a781 tags:
``` python
new_data = pf.fit_transform(train[xcols])
pd.DataFrame(new_data, columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:86de680e tags:
#### `Pipeline(...)`
- Argument: list of steps in the pipeline:
- each step represented as a tuple with name of the step and the object instance
- last step will be the estimator
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
%% Cell type:code id:bb509930 tags:
``` python
m2 = Pipeline([
    ("pf", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression())
])
m2
```
%% Cell type:code id:ccf76663 tags:
``` python
scores = cross_val_score(m2, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:b7acca2c tags:
**Conclusion:** mean R^2 score increased from 0.0029 (linear model) to 0.0489 (polynomial) - not bad!
%% Cell type:markdown id:7dff1435 tags:
#### Model 3: Beach Name (Categorical)
%% Cell type:code id:76a31355 tags:
``` python
train["Beach Name"].unique()
```
%% Cell type:markdown id:61644774 tags:
Naive way of assigning numerical values to a categorical column:
- `Ohio Street Beach`: 1
- `Calumet Beach`: 2
- `Rainbow Beach`: 3, etc.
- Problem: `Calumet Beach` would be treated as the average of `Ohio Street Beach` and `Rainbow Beach` => this doesn't make any sense!
%% Cell type:markdown id:41c6be06 tags:
#### `OneHotEncoder()`
- encodes categorical features as a one-hot numeric array
- returns a "sparse matrix", which needs to be explicitly converted into an array using the `toarray()` method before `DataFrame` conversion
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
%% Cell type:code id:55ec83e1-2800-41a2-80ce-f05a1de54c8a tags:
``` python
xcols = ["Beach Name"]
```
%% Cell type:code id:ef069c7b tags:
``` python
oh = OneHotEncoder()
oh.fit(train[xcols])
```
%% Cell type:code id:5cb24e10 tags:
``` python
# fit_transform: fit to the data, then transform it
oh.fit_transform(train[xcols])
```
%% Cell type:code id:d5303636 tags:
``` python
# All steps
oh = OneHotEncoder()
pd.DataFrame(oh.fit_transform(train[xcols]).toarray(), columns=oh.get_feature_names_out())
```
%% Cell type:code id:625a8edc tags:
``` python
m3 = Pipeline([
("oh", OneHotEncoder()),
("lr", LinearRegression())
])
m3
```
%% Cell type:code id:5494c1fa tags:
``` python
scores = cross_val_score(m3, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:2ce15dac tags:
**Conclusion:** mean R^2 score is slightly lower than 0.0489 (polynomial).
%% Cell type:markdown id:e4bf025e tags:
<div>
<img src="attachment:ColumnTransformer.png" width="500"/>
</div>
%% Cell type:markdown id:0f5a1a98 tags:
#### `make_column_transformer(...)`
- Arguments: the transformations to apply
    - each transformer argument is a `tuple` with the object instance as the first item and the list of feature columns as the second
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
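A small sketch of how the two transformers' outputs get concatenated side by side (toy frame with made-up rows; assumes scikit-learn and pandas):

``` python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

toy = pd.DataFrame({"Beach Name": ["Calumet Beach", "Rainbow Beach", "Calumet Beach"],
                    "Wave Period": [2.0, 3.0, 4.0]})

trans = make_column_transformer(
    (PolynomialFeatures(degree=2, include_bias=False), ["Wave Period"]),
    (OneHotEncoder(), ["Beach Name"]),
)
out = trans.fit_transform(toy)
# 2 polynomial columns (x, x^2) + 2 one-hot columns = 4 columns
print(out.shape)   # → (3, 4)
```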
%% Cell type:markdown id:a41f509f tags:
#### Model 4: Beach Name (Categorical) and Wave Period (Polynomial)
%% Cell type:code id:40378fb3 tags:
``` python
custom_trans = make_column_transformer(
(PolynomialFeatures(), ["Wave Period"]),
(OneHotEncoder(), ["Beach Name"]),
)
custom_trans
```
%% Cell type:code id:2d6e0ba2 tags:
``` python
m4 = Pipeline([
("transformers", custom_trans),
("lr", LinearRegression()),
])
m4
```
%% Cell type:code id:4248010b tags:
``` python
xcols = ["Beach Name", "Wave Period"]
```
%% Cell type:code id:57b9ffc8 tags:
``` python
scores = cross_val_score(m4, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:73ade800 tags:
**Conclusion:** mean R^2 score increased to 0.0885 when compared to 0.0489 (polynomial).
%% Cell type:markdown id:247947ef tags:
### Let's evaluate this model by running it against the test data
%% Cell type:code id:e7e58157 tags:
``` python
m4.fit(train[xcols], train[ycol])
m4.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:cd689ae1 tags:
How to extract `coef_` for this `Pipeline`?
%% Cell type:code id:50c7e102 tags:
``` python
m4["transformers"].get_feature_names_out()
```
%% Cell type:code id:28ccbf92 tags:
``` python
m4["lr"].coef_
```
%% Cell type:code id:09bc63d9-ab2b-4723-94cf-94ebac037d8d tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.bar()
```
%% Cell type:code id:acd00dc3 tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.barh()
```
%% Cell type:markdown id:279ae068-9fc8-45d3-9bcf-c9ae9252f19f tags:
What are the beaches that have the biggest waves?
%% Cell type:code id:2ff1d03b-48bb-4150-bc27-42f5064f88df tags:
``` python
s.sort_values(ascending=False)
```
......
%% Cell type:markdown id:9e33d56e tags:
# Regression 2
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# new import statements
from sklearn.linear_model import LinearRegression
import sklearn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
import numpy as np
```
%% Cell type:code id:696eec18 tags:
%% Cell type:markdown id:8d0baac0-fcc2-4cdf-9276-9db86ffacd10 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df = gpd.read_file(dataset_file)
```
%% Cell type:markdown id:b37d6f28-fabd-4418-8c85-8e7cd24331f5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:a2bf3bc0-03b8-4c41-93a9-bef8a76bf3de tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:markdown id:c7fbd149-0c50-46ec-bfb1-f33e9fece60e tags:
### Predicting "DTH_CUM_CP"
%% Cell type:code id:cebb790b-9374-4d02-afe0-40e1efc8f176 tags:
``` python
df = df.replace(-999, 2)
# Removing outlier point
df = df[df["GEOID"].str.fullmatch(r"\d+")]
```
%% Cell type:code id:f4a4dd37-d7d3-41ca-8319-1442b56b2a24 tags:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"
```
%% Cell type:markdown id:87c4b295-39ca-445c-865d-4829d77f55ce tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:78c2c891-caed-4359-bc91-a97de21db812 tags:
``` python
model = LinearRegression()
model.fit(df[xcols], df[ycol])
```
%% Cell type:code id:7787a889-c693-40b3-8e66-9a266bb0c683 tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:7952f67c-57b3-4ca8-bb42-b6287f012157 tags:
``` python
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:39caf7a5-c782-4c02-8c3e-8385eddb108f tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()`
%% Cell type:markdown id:908b8af9-6ba1-4e92-9569-36168b253fae tags:
What is the variation in known deaths?
%% Cell type:code id:5fc126c4-de9a-4a35-abe6-8d5820227f6f tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:a2f489d5-1325-4870-b6f8-d2b7949f12cf tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:e049b2bd-35e8-4a75-9d6d-d70c2b998541 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:ced64de4-d1dc-464e-8855-3341a8f3b15a tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:e85ad41f-8d4b-471f-b689-333346d7d660 tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:557a7646-7843-4956-8d72-9104aa7113de tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:3ceeca6f-e7d4-47fa-b53e-8c3bbba41594 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
%% Cell type:code id:c1ea5c04-3dd7-412c-9cbf-32f4a69bdb20 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:e3e83156-8e02-4216-9738-bed6b0052608 tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # there might be minor rounding off differences
```
%% Cell type:markdown id:d1795241-d9d7-42c5-8bf4-614ec7d972d7 tags:
#### `model.score(X, y)`
- invokes `predict` method for calculating predictions (`y`) based on features (`X`) and compares the predictions with true values of y
%% Cell type:code id:a50b84d0-5c1b-4f84-9f41-d6157057d645 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Split data into train and test
%% Cell type:code id:80cef390-831c-443a-9149-9141c8d34ef2 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:68c16194-1c7b-4f3a-81a3-4b0352f36872 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits based on 75%-25% split between train and test
- produces new train and test data every single time
- `test_size` parameter can take two kind of values:
- actual number of rows that we want in test data
- fractional number representing the ratio of train versus test data
- default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id:5c3de0c2 tags:
``` python
xcols, ycol
```
%% Cell type:code id:c52c86c4-8afe-4d49-97a5-e2fcb18242c5 tags:
``` python
train, test = train_test_split(df)
len(train), len(test)
```
%% Cell type:code id:b3ccdf1a-0958-402c-80e1-2eed12abdf88 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:adeea246-f80a-46c2-b62b-9e95bef5d02f tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:ba4c6310-0f35-44ac-9250-79b6eb04f7cc tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- do many different train/test splits of the values, fitting and scoring the model across each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(???)
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bar by passing argument to paramenter yerr
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
- Features: Positive cases per age range
%% Cell type:code id:d3016079 tags:
``` python
df.columns
```
%% Cell type:code id:2538534d tags:
``` python
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model = LinearRegression()
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
```
%% Cell type:code id:98fa5442 tags:
``` python
pd.Series(model.coef_)
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_, index=???).plot.bar(figsize=(3, 2))
```
### Pipeline, Transformer and Estimator
%% Cell type:markdown id:93be2ad2 tags:
<div>
<img src="attachment:Simple_model.png" width="300"/>
</div>
%% Cell type:markdown id:c854dab5 tags:
### Lake Michigan Waves
Source: https://data.cityofchicago.org/Parks-Recreation/Beach-Water-Quality-Automated-Sensors/qmqz-2xku
%% Cell type:code id:a3109957 tags:
``` python
# Read the "waves.csv" file
df = pd.read_csv("waves.csv")
df.head()
```
%% Cell type:code id:18d71eb8 tags:
``` python
df.columns
```
%% Cell type:markdown id:d3f9dc12 tags:
#### Can we predict wave height based on wave period (time between waves)?
%% Cell type:code id:149aa67c tags:
``` python
df = df[(df["Wave Period"] > 0) & (df["Wave Height"] > 0)]
df.plot.scatter(x="Wave Period", y="Wave Height", color="k", alpha=0.05)
```
%% Cell type:markdown id:4da30be1 tags:
Observation: non-linear relationship => fitting straight line will not work.
%% Cell type:markdown id:62bd175c tags:
#### Can we predict wave height based on beach name and wave period?
%% Cell type:code id:3f045b08 tags:
``` python
beach_names = sorted(set(df["Beach Name"]))
beach_names
```
%% Cell type:code id:bc626c9d tags:
``` python
fig, axes = plt.subplots(2, 3, figsize=(12, 8), sharey=True)
plt.subplots_adjust(hspace=0.3)
axes = list(axes.reshape(-1))
for b in beach_names:
ax = axes.pop(0)
ax.set_title(b)
beach_df = df[df["Beach Name"] == b]
beach_df.plot.scatter(x="Wave Period", y="Wave Height",
color="k", alpha=0.1, ax=ax)
```
%% Cell type:markdown id:ed02085d tags:
Obversation: which beach (categorical feature) is important.
%% Cell type:markdown id:ceebe2c1 tags:
### Four Models
1. wave period (linear)
2. wave period (polynomial)
3. beach
4. beach and wave period
%% Cell type:markdown id:2d4fe892 tags:
`train_test_split(<dataframe>, random_state=<some number>)`
- `random_state` enables us to control the randomization
- when we pass the same number, we get the same training and test split (pseudo-randomization)
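A minimal sketch (on a hypothetical toy DataFrame, not the waves data) of what `random_state` guarantees: two calls with the same seed select exactly the same rows.

%% Cell type:markdown id:a1f20c77 tags:
A minimal sketch (on a hypothetical toy DataFrame, not the waves data) of what `random_state` guarantees: two calls with the same seed select exactly the same rows.
%% Cell type:code id:b2e31d88 tags:
``` python
# Hypothetical demo: fixing random_state makes the split reproducible
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(8)})
a_train, a_test = train_test_split(toy, random_state=320)
b_train, b_test = train_test_split(toy, random_state=320)
# same seed => identical row selection (default test_size is 0.25)
print(a_train.index.equals(b_train.index))  # True
```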
%% Cell type:code id:a454b7ec tags:
``` python
train, test = train_test_split(df, random_state=320)
train.head()
```
%% Cell type:markdown id:d6f3871a tags:
#### Model 1: Wave Period (Linear)
%% Cell type:markdown id:4c15ac91 tags:
`cross_val_score(estimator, X, y, cv=<number>)`
- `cv` lets us specify how many folds we want for the cross-validation
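A quick self-contained sketch (synthetic data, hypothetical variable names): `cross_val_score` fits the estimator once per fold and returns one R^2 score per fold.

%% Cell type:markdown id:c3d42e99 tags:
A quick self-contained sketch (synthetic data, hypothetical variable names): `cross_val_score` fits the estimator once per fold and returns one R^2 score per fold.
%% Cell type:code id:d4e53faa tags:
``` python
# Hypothetical demo: 5-fold cross-validation on nearly-linear synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2 * X_demo[:, 0] + rng.normal(0, 0.1, size=100)
demo_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(len(demo_scores), demo_scores.mean())  # 5 scores, mean close to 1
```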
%% Cell type:code id:199a5ca6 tags:
``` python
xcols = ["Wave Period"]
ycol = "Wave Height"
m1 = LinearRegression()
scores = cross_val_score(m1, train[xcols], train[ycol], cv=10)
scores
```
%% Cell type:code id:15bec5d5 tags:
``` python
scores.mean()
```
%% Cell type:markdown id:aa5a272e tags:
We want the mean score to be large.
%% Cell type:code id:306e21bd tags:
``` python
scores.std()
```
%% Cell type:markdown id:ea190e50 tags:
We want the standard deviation to be small, to make sure that our model isn't too sensitive to changes in training and test data.
%% Cell type:markdown id:b126b01f tags:
#### Model 2: Wave Period (Polynomial)
LinearRegression can do this:
```
y = 3*x1 + 5*x2
```
It CANNOT do this:
```
y = 3*x + 5*x^2
```
TRICK:
```
x1 = x
x2 = x^2
```
<div>
<img src="attachment:Pipeline.png" width="500"/>
</div>
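The trick above can be sketched on synthetic data (hypothetical variable names): once x^2 is added as its own column, plain `LinearRegression` recovers the quadratic exactly.

%% Cell type:markdown id:e5f64abb tags:
The trick above can be sketched on synthetic data (hypothetical variable names): once x^2 is added as its own column, plain `LinearRegression` recovers the quadratic exactly.
%% Cell type:code id:f6a75bcc tags:
``` python
# Hypothetical demo of the trick: x1 = x, x2 = x^2 as two separate columns
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

xs = np.linspace(-2, 2, 50)
features = pd.DataFrame({"x1": xs, "x2": xs ** 2})
ys = 3 * xs + 5 * xs ** 2           # the curve a plain line cannot fit
lr_demo = LinearRegression().fit(features, ys)
print(lr_demo.coef_)                # recovers [3, 5]
```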
%% Cell type:code id:5b6f8a12 tags:
``` python
# Let's make a copy before we add new columns
# Recommendation: don't change the data referred to by a specific variable
# throughout your code
train2 = train[xcols].copy()
train2.head()
```
%% Cell type:markdown id:6be8df92 tags:
Manually adding columns for x^2, x^3, etc.,
%% Cell type:code id:54f8d249 tags:
``` python
train2["Wave Period ^ 2"] = train2["Wave Period"] ** 2
train2["Wave Period ^ 3"] = train2["Wave Period"] ** 3
train2["sqrt(Wave Period"] = train2["Wave Period"] ** 0.5
train2["sqrt(Wave Period)"] = train2["Wave Period"] ** 0.5
train2.head()
```
%% Cell type:markdown id:155f7e7a tags:
#### `PolynomialFeatures(degree=<val>, include_bias=False)`
- `degree` lets us specify the polynomial degree we need
- `include_bias` default value is True, which will add a column of 1s - we typically don't use that.
- returns an object instance on which we can invoke `fit` and `transform`:
    - `transform(X)`: transform data to polynomial features
    - `fit_transform(X[, y])`: fit to data, then transform it
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
%% Cell type:code id:e6c43483 tags:
``` python
pf = PolynomialFeatures(degree=4, include_bias=False)
```
%% Cell type:code id:6f81a519 tags:
``` python
# Fit the data
pf.fit(train[xcols])
```
%% Cell type:code id:8d6d2e1c tags:
``` python
# Transform the data
pf.transform(train[xcols])
```
%% Cell type:code id:ecfc2493 tags:
``` python
# Convert result of transformation into a DataFrame - step 1
pd.DataFrame(pf.transform(train[xcols]))
```
%% Cell type:markdown id:129b21dc tags:
How can we add meaningful column names? `pf.get_feature_names_out()`
%% Cell type:code id:1ab1eaa3 tags:
``` python
pf.get_feature_names_out()
```
%% Cell type:code id:2d62b3d3 tags:
``` python
# Convert result of transformation into a DataFrame - step 2
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out())
```
%% Cell type:code id:4beebe68 tags:
``` python
# Putting all the steps together
pf = PolynomialFeatures(degree=4, include_bias=False)
pf.fit(train[xcols])
pd.DataFrame(pf.transform(train[xcols]), columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:29d11ffb tags:
`fit_transform(X[, y])`: fit to data, then transform it.
%% Cell type:code id:91c3a781 tags:
``` python
new_data = pf.fit_transform(train[xcols])
pd.DataFrame(new_data, columns=pf.get_feature_names_out()).head()
```
%% Cell type:markdown id:86de680e tags:
#### `Pipeline(...)`
- Argument: list of steps in the pipeline:
- each step represented as a tuple with name of the step and the object instance
- last step will be the estimator
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
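A minimal self-contained sketch (synthetic quadratic data, hypothetical names) of a two-step pipeline like the one built below: the transformer's output feeds the estimator.

%% Cell type:markdown id:c9da8eff tags:
A minimal self-contained sketch (synthetic quadratic data, hypothetical names) of a two-step pipeline like the one built below: the transformer's output feeds the estimator.
%% Cell type:code id:daeb9f00 tags:
``` python
# Hypothetical demo: transformer + estimator chained in one Pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X_q = np.linspace(0, 4, 40).reshape(-1, 1)
y_q = 1 + 2 * X_q[:, 0] ** 2
pipe_demo = Pipeline([
    ("pf", PolynomialFeatures(degree=2, include_bias=False)),  # transformer step
    ("lr", LinearRegression()),                                # estimator (last step)
])
pipe_demo.fit(X_q, y_q)
print(pipe_demo.score(X_q, y_q))  # ~1.0: the quadratic is captured exactly
```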
%% Cell type:code id:bb509930 tags:
``` python
m2 = Pipeline([
    ("pf", PolynomialFeatures(degree=2, include_bias=False)),
    ("lr", LinearRegression()),
])
m2
```
%% Cell type:code id:ccf76663 tags:
``` python
scores = cross_val_score(m2, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:b7acca2c tags:
**Conclusion:** mean R^2 score increased from 0.0029 (linear model) to 0.0489 (polynomial) - not bad!
%% Cell type:markdown id:7dff1435 tags:
#### Model 3: Beach Name (Categorical)
%% Cell type:code id:76a31355 tags:
``` python
train["Beach Name"].unique()
```
%% Cell type:markdown id:61644774 tags:
Naive way of assigning numerical values to a categorical column:
- `Ohio Street Beach`: 1
- `Calumet Beach`: 2
- `Rainbow Beach`: 3, etc.
- Problem: `Calumet Beach` becomes the average of `Ohio Street Beach` and `Rainbow Beach` => this doesn't make any sense!
%% Cell type:markdown id:41c6be06 tags:
#### `OneHotEncoder()`
- encodes categorical features as a one-hot numeric array
- returns a "sparse matrix", which needs to be explicitly converted into an array using the `toarray()` method before `DataFrame` conversion
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
%% Cell type:code id:55ec83e1-2800-41a2-80ce-f05a1de54c8a tags:
``` python
xcols = ["Beach Name"]
```
%% Cell type:code id:ef069c7b tags:
``` python
# Create and fit the encoder
oh = OneHotEncoder()
oh.fit(train[xcols])
```
%% Cell type:code id:5cb24e10 tags:
``` python
# fit_transform: fit to the data, then transform it
oh.fit_transform(train[xcols])
```
%% Cell type:code id:d5303636 tags:
``` python
# All steps
oh = OneHotEncoder()
pd.DataFrame(oh.fit_transform(train[xcols]).toarray(), columns=oh.get_feature_names_out())
```
%% Cell type:code id:625a8edc tags:
``` python
m3 = Pipeline([
("oh", OneHotEncoder()),
("lr", LinearRegression())
])
m3
```
%% Cell type:code id:5494c1fa tags:
``` python
scores = cross_val_score(m3, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:2ce15dac tags:
**Conclusion:** mean R^2 score is slightly lower than 0.0489 (polynomial).
%% Cell type:markdown id:e4bf025e tags:
<div>
<img src="attachment:ColumnTransformer.png" width="500"/>
</div>
%% Cell type:markdown id:0f5a1a98 tags:
#### `make_column_transformer(...)`
- Arguments: the transformations to apply
- each transformer argument is a `tuple` with the transformer instance as the first item and the list of feature columns as the second
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html
%% Cell type:markdown id:a41f509f tags:
#### Model 4: Beach Name (Categorical) and Wave Period (Polynomial)
%% Cell type:code id:40378fb3 tags:
``` python
custom_trans = make_column_transformer(
(PolynomialFeatures(), ["Wave Period"]),
(OneHotEncoder(), ["Beach Name"]),
)
custom_trans
```
%% Cell type:code id:2d6e0ba2 tags:
``` python
m4 = Pipeline([
("transformers", custom_trans),
("lr", LinearRegression()),
])
m4
```
%% Cell type:code id:4248010b tags:
``` python
xcols = ["Beach Name", "Wave Period"]
```
%% Cell type:code id:57b9ffc8 tags:
``` python
scores = cross_val_score(m4, train[xcols], train[ycol], cv=10)
scores.mean()
```
%% Cell type:markdown id:73ade800 tags:
**Conclusion:** mean R^2 score increased to 0.0885 when compared to 0.0489 (polynomial).
%% Cell type:markdown id:247947ef tags:
### Let's evaluate this model by running it against the test data
%% Cell type:code id:e7e58157 tags:
``` python
m4.fit(train[xcols], train[ycol])
m4.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:cd689ae1 tags:
How to extract `coef_` for this `Pipeline`?
%% Cell type:code id:50c7e102 tags:
``` python
m4["transformers"].get_feature_names_out()
```
%% Cell type:code id:28ccbf92 tags:
``` python
m4["lr"].coef_
```
%% Cell type:code id:09bc63d9-ab2b-4723-94cf-94ebac037d8d tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.bar()
```
%% Cell type:code id:acd00dc3 tags:
``` python
s = pd.Series(m4["lr"].coef_, index=m4["transformers"].get_feature_names_out())
s.plot.barh()
```
%% Cell type:markdown id:279ae068-9fc8-45d3-9bcf-c9ae9252f19f tags:
What are the beaches that have the biggest waves?
%% Cell type:code id:2ff1d03b-48bb-4150-bc27-42f5064f88df tags:
``` python
# Sort the coefficients to see which beaches have the biggest waves
s.sort_values(ascending=False)
```