%% Cell type:markdown id:1986bd55 tags:
# Linear Algebra 1
- Installation requirements: `pip3 install rasterio Pillow`
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# new import statements
from sklearn.linear_model import LinearRegression
```
%% Cell type:markdown id:7736923a tags:
### Where do numpy arrays show up in ML?
- A `DataFrame` is just a matrix with column names and row indices added; strip those away and what's left is a matrix
%% Cell type:code id:327c8314 tags:
``` python
df = pd.DataFrame([[0, 2, 1], [2, 3, 4], [8, 5, 6]], columns=["x1", "x2", "y"])
df
```
%% Cell type:markdown id:c6af6a3e tags:
`df.values` gives us a `numpy.ndarray` of all the values.
`nd` stands for n-dimensional:
- 2-dimensional for matrix
- 1-dimensional for vector
%% Cell type:code id:4b416b92 tags:
``` python
print(type(df.values))
df.values
```
%% Cell type:code id:156f0722 tags:
``` python
model = LinearRegression()
model.fit(df[["x1", "x2"]], df["y"])
model.coef_
```
%% Cell type:code id:069a423e tags:
``` python
model.predict(df[["x1", "x2"]])
```
%% Cell type:markdown id:30e8f41d tags:
#### How does `predict` actually work?
- Matrix-multiply the features by the coefficients (`@`) and add the intercept
%% Cell type:code id:5809ea1b tags:
``` python
df[["x1", "x2"]].values @ model.coef_ + model.intercept_
```
%% Cell type:markdown id:dcd299a3 tags:
### How to create numpy arrays from scratch?
- requires `import numpy as np`
- `np.array(<object>)`: creates numpy array from object instance; documentation: https://numpy.org/doc/stable/reference/generated/numpy.array.html
- `np.ones(<shape>)`: creates an array of ones; documentation: https://numpy.org/doc/stable/reference/generated/numpy.ones.html
- `np.zeros(<shape>)`: creates an array of zeros; documentation: https://numpy.org/doc/stable/reference/generated/numpy.zeros.html
%% Cell type:code id:cae8622e tags:
``` python
# Creating numpy array using np.array
[7, 8, 9]
```
%% Cell type:code id:a3a7c724 tags:
``` python
# Creating numpy array of 8 1's
```
%% Cell type:code id:64d5a747 tags:
``` python
# Creating numpy array of 8 0's
```
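One possible way to complete the three cells above (a minimal sketch; any values work):
``` python
import numpy as np

np.array([7, 8, 9])   # numpy array built from a Python list
np.ones(8)            # array of 8 ones (floats by default)
np.zeros(8)           # array of 8 zeros (floats by default)
```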
%% Cell type:markdown id:77dba590 tags:
#### Review: `range()`
%% Cell type:code id:c9d01064 tags:
``` python
# 0 to exclusive end
# range(END)
list(range(10))
```
%% Cell type:code id:625055f6 tags:
``` python
# inclusive start to exclusive end
# range(START, END)
list(range(-4, 10))
```
%% Cell type:code id:138db7bd tags:
``` python
# inclusive start to exclusive end with a step between values
# default STEP is 1
# range(START, END, STEP)
list(range(-4, 10, 2))
```
%% Cell type:code id:2ad2439f tags:
``` python
# range cannot have floats for the STEP
list(range(-4, 10, 0.5))
```
%% Cell type:markdown id:82d9884b tags:
#### Back to `numpy`
- `np.arange([start, ]stop, [step, ])`: gives us an array based on range; documentation: https://numpy.org/doc/stable/reference/generated/numpy.arange.html
%% Cell type:code id:c7546ba3 tags:
``` python
# array range
np.arange(-4, 10, 0.5)
```
%% Cell type:markdown id:3b2907d7 tags:
#### Review: Slicing
- `seq_object[<START>:<exclusive_END>:<STEP>]`
- `<START>` is optional; default is index 0
- `<END>` is optional; default is `len` of the sequence
- slicing creates a brand new object instance
%% Cell type:code id:dc6c1b73 tags:
``` python
# REVIEW: Python slicing of lists
a = [7, 8, 9, 10]
# slice out 8 and 10
b = a[1::2]
b
```
%% Cell type:code id:0fc53657 tags:
``` python
b[1] = 100
b
```
%% Cell type:code id:1fb36b9f tags:
``` python
# original object instance doesn't change
a
```
%% Cell type:markdown id:a79b7bb6 tags:
Slicing a Python list is relatively slow because it creates a brand-new object instance (a copy).
%% Cell type:markdown id:ce7a536a tags:
#### How to slice `numpy` arrays?
- Unlike regular list slicing, `numpy` slicing is very efficient: it returns a view and doesn't copy the data
%% Cell type:code id:290899b7 tags:
``` python
a = np.array([7, 8, 9, 10])
# slice out 8 and 10
b = a[1::2]
b
```
%% Cell type:code id:34d9d044 tags:
``` python
b[1] = 100
a
```
%% Cell type:markdown id:27341ca0 tags:
How can you ensure that changes to a slice don't affect original `numpy.array`? Use `copy` method.
%% Cell type:code id:fe625eae tags:
``` python
a = np.array([7, 8, 9, 10])
b = a[1::2].copy() # copy the slice instead of sharing memory with a
b[1] = 100
b, a
```
%% Cell type:markdown id:b5b407e8 tags:
#### Creating Multi-Dimensional Arrays
- using nested data structures like list of lists
- `shape` gives us the dimension of the `numpy.array`
- `len()` gives the first dimension, that is `shape[0]`
%% Cell type:code id:c44b5951 tags:
``` python
a = np.array([1, 2, 3])
a, len(a)
```
%% Cell type:markdown id:99456855 tags:
How many numbers are there in the below `tuple`?
%% Cell type:code id:8b9fe7e2 tags:
``` python
# shape of numpy array
```
%% Cell type:markdown id:40b1d392 tags:
There is one number in this `tuple` (3), so the array is 1-dimensional with size 3.
%% Cell type:code id:96716ca6 tags:
``` python
# 2-D array using list of lists
b = np.array([[1, 2, 3], [4, 5, 6]])
b
```
%% Cell type:code id:3053945b tags:
``` python
b.shape
```
%% Cell type:markdown id:7e08dd1a tags:
2-dimensional (because there are two numbers in this `tuple`), with sizes 2 and 3 along those dimensions.
%% Cell type:code id:2340a5e6 tags:
``` python
# gives shape[0]
len(b)
```
%% Cell type:markdown id:ed5979bc tags:
#### How to reshape a `numpy.array`?
- `<obj>.reshape(<newshape>)`: reshapes the dimension of the array; documentation: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html
%% Cell type:code id:f3d08197 tags:
``` python
b
```
%% Cell type:code id:b7d18682 tags:
``` python
# Use .reshape to change the dimensions to 3 x 2
```
%% Cell type:code id:2189e09b tags:
``` python
# Use .reshape to change to 1-dimensional array
```
%% Cell type:markdown id:5588c5f8 tags:
We cannot add/remove values while reshaping.
%% Cell type:code id:75106197 tags:
``` python
b.reshape(5)
```
%% Cell type:code id:6bb8c765 tags:
``` python
b.reshape(7)
```
%% Cell type:markdown id:3d56dad4 tags:
-1 means "whatever size is needed to fit the remaining values," which lets us specify just one of the dimensions explicitly.
%% Cell type:code id:f3c16ff5 tags:
``` python
# Use .reshape to change the dimensions to 3 x something valid
```
%% Cell type:code id:22cc87bc tags:
``` python
# Use .reshape to change the dimensions to 1-dimensionl using -1
```
%% Cell type:markdown id:1b02eb22 tags:
Generate a 10x10 matrix with the numbers 0 to 99.
%% Cell type:code id:87ca1111 tags:
``` python
# Use arange and then reshape it to 10 x something valid
```
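Possible completions for the reshape exercises above (a sketch; `b` is the 2 x 3 array defined earlier):
``` python
b.reshape(3, 2)                   # 3 x 2
b.reshape(6)                      # 1-dimensional, all 6 values
b.reshape(3, -1)                  # 3 x whatever fits (here 3 x 2)
b.reshape(-1)                     # 1-dimensional using -1
np.arange(100).reshape(10, -1)    # 10 x 10 matrix with the numbers 0 to 99
```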
%% Cell type:markdown id:01e7bf19 tags:
### Vocabulary
- scalar: 0 dimensional array
- vector: 1 dimensional array
- matrix: 2 dimensional array
- tensor: n dimensional (0, 1, 2, 3, ...) array
%% Cell type:markdown id:63ca66d0 tags:
### Images as Tensors
- `wget` command:
- `wget <url> -O <local file name>`
%% Cell type:code id:9469b3b8 tags:
``` python
# Only run this cell once
!wget "https://upload.wikimedia.org/wikipedia/commons/f/f2/Coccinella_magnifica01.jpg" -O bug.jpg
```
%% Cell type:markdown id:b9d16352 tags:
#### How to read an image file?
- requires `import matplotlib.pyplot as plt`
- `plt.imread(<fname>)`: reads an image file into a 3-dimensional array --- rows(pixels), columns(pixels), colors (red/green/blue)
- `plt.imshow(<array>, cmap=<color map>)`: displays the image
%% Cell type:code id:6269ce42 tags:
``` python
a = plt.imread("bug.jpg")
type(a)
```
%% Cell type:code id:9f455a42 tags:
``` python
# 3-dimensional array
# rows(pixels), columns(pixels), colors (red/green/blue)
a.shape
```
%% Cell type:code id:44973823 tags:
``` python
plt.imshow(a)
```
%% Cell type:code id:e973dd28 tags:
``` python
a
# each inner array has 3-color representation R, G, B
# two color scales: floats (0.0 to 1.0) OR ints (0 to 255)
```
%% Cell type:markdown id:59023dad tags:
#### GOAL: crop down just to the bug using slicing
- `<array>[ROW SLICE, COLUMN SLICE, COLOR SLICE]`
%% Cell type:code id:4a1f059f tags:
``` python
plt.imshow(a[???, ???, :])
```
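A hedged example of such a crop; the slice bounds below are made-up illustrative values, not the actual location of the bug, and would need to be adjusted by trial and error:
``` python
# rows 400-1400 and columns 700-1900 are illustrative guesses
plt.imshow(a[400:1400, 700:1900, :])
```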
%% Cell type:markdown id:c41c6a04 tags:
#### GOAL: show clearly where RED is high on the image
- two formats:
- 3D (row, column, color)
- 2D (row, column) => a single channel, displayed with a colormap (dark for low values, bright for high)
%% Cell type:code id:9924cb83 tags:
``` python
a.shape
```
%% Cell type:markdown id:21c34e89 tags:
Pull out only layer 0, which is the red layer.
- 0 is red
- 1 is green
- 2 is blue
Use index only for the color dimension and slices for row and column dimensions
%% Cell type:code id:50b32a7a tags:
``` python
a[:, :, 0].shape
```
%% Cell type:code id:0d4c8b80 tags:
``` python
# instead of using black and white,
# it is just assigning some color for light and dark
plt.imshow(a[:, :, 0])
```
%% Cell type:code id:71efa2de tags:
``` python
# better to use grayscale
plt.imshow(a[:, :, 0], ???)
```
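A likely completion, using the `cmap` argument mentioned in the `imshow` bullet above (the same `cmap="gray"` used later in this notebook):
``` python
plt.imshow(a[:, :, 0], cmap="gray")
```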
%% Cell type:markdown id:9bad669a tags:
Wherever the red channel is high, the image is bright, so the bug shows up very bright. Some other bright spots weren't actually red: white and other light colors have high values in all three channels, including red, so they look bright here too.
This could be a pre-processing step for some ML algorithm that identifies red bugs.
%% Cell type:markdown id:e8a4f511 tags:
#### GOAL: show a grayscale that considers the average of all colors
- `<array>.mean(axis=<val>)`:
- `axis` should be 0 for 1st dimension, 1 for 2nd dimension, 2 for 3rd dimension
%% Cell type:code id:0235c6b7 tags:
``` python
# average over all the numbers
# gives a measure of how bright the image is overall
a.mean()
```
%% Cell type:code id:78e5988f tags:
``` python
a.shape
```
%% Cell type:code id:de0a1eae tags:
``` python
# average over each column and color combination
a.mean(axis=0).shape
```
%% Cell type:code id:c658655d tags:
``` python
# average over each row and color combination
a.mean(axis=1).shape
```
%% Cell type:code id:85656235 tags:
``` python
# average over each row and column combination
a.mean(axis=2).shape
```
%% Cell type:code id:a57d5077 tags:
``` python
plt.imshow(a.mean(axis=2), cmap="gray")
```
%% Cell type:markdown id:66d7dcef tags:
This could also be a pre-processing step for some ML algorithm that expects black and white images.
%% Cell type:markdown id:82d66d7a tags:
### Vector Multiplication: Overview
#### Elementwise Multiplication
$\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
*
\begin{bmatrix}
4 \\ 5 \\ 6
\end{bmatrix}$
$\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
*
\begin{bmatrix}
4 & 5 & 6
\end{bmatrix}$
#### Dot Product
$\begin{bmatrix}
1 & 2 & 3
\end{bmatrix}
\cdot
\begin{bmatrix}
4 \\ 5 \\ 6
\end{bmatrix}$
$\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
\cdot
\begin{bmatrix}
4 & 5 & 6
\end{bmatrix}$
%% Cell type:code id:61dca7b3 tags:
``` python
# Use .reshape to change the dimensions to something valid x 1
# vertical shape
v1 = np.array([1, 2, 3]).reshape(-1, 1)
v1
```
%% Cell type:code id:f4a3167b tags:
``` python
v2 = np.array([4, 5, 6]).reshape(-1, 1)
v2
```
%% Cell type:markdown id:888d7d1b tags:
#### Elementwise Multiplication
$\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
*
\begin{bmatrix}
4 \\ 5 \\ 6
\end{bmatrix}$
\=
$\begin{bmatrix}
4 \\ 10 \\ 18
\end{bmatrix}$
%% Cell type:code id:9c7b9d31 tags:
``` python
v1 * v2 # [1*4, 2*5, 3*6]
```
%% Cell type:markdown id:2bd28f4f tags:
#### Transpose
- swaps rows and columns (flips the axes)
%% Cell type:code id:c973a6d9 tags:
``` python
v2
```
%% Cell type:code id:db61f49e tags:
``` python
v2.T # horizontal
```
%% Cell type:code id:a95fc59a tags:
``` python
v2.T.T # vertical
```
%% Cell type:code id:1d11b30c tags:
``` python
v1.shape
```
%% Cell type:code id:278c5332 tags:
``` python
v2.T.shape
```
%% Cell type:markdown id:f9314ce6-aad0-4a5e-a03d-b5d4b8f9de3d tags:
#### Elementwise Multiplication
$\begin{bmatrix}
1 \\ 2 \\ 3
\end{bmatrix}
*
\begin{bmatrix}
4 & 5 & 6
\end{bmatrix}$
\=
?
%% Cell type:code id:8c4bb10f tags:
``` python
v1 * v2.T # how is this working?
```
%% Cell type:markdown id:ce6ee037-9479-4313-8c85-df7e75d8918b tags:
### Broadcast
Two use cases:
1. "stretch" 1 => N along any dimension to make shapes compatible
2. add dimensions of size 1 to the beginning of a shape
%% Cell type:markdown id:13b3d4cd-b1c2-41ba-aeb1-548a180b0a9b tags:
The element-wise operation `v1 * v2.T` automatically "broadcasts" `v1` to 3 x 3 (stretching its second dimension) and `v2.T` to 3 x 3 (stretching its first dimension).
%% Cell type:code id:4578b5d0-b511-4277-ac8b-edb738c22cd7 tags:
``` python
v1.shape
```
%% Cell type:code id:1344b009-c62f-462e-ba20-613275e3f31d tags:
``` python
v2.T.shape
```
%% Cell type:markdown id:03cc16fe-ddd8-495c-8f88-af491bbd0f25 tags:
How can we manually replicate that?
#### `np.concatenate([a1, a2, ...], axis=0)`
- `a1, a2, …`: sequence of arrays
- `axis`: the dimension along which we want to join the arrays
- default value is 0, which is for row dimension (down)
- value of 1 is for column dimension (across)
%% Cell type:code id:9cf0b4c5-9b2d-44df-ae1c-964fdf3f0991 tags:
``` python
v1
```
%% Cell type:code id:0f02a463-c802-4c71-bf04-22ed704e5078 tags:
``` python
v1.shape
```
%% Cell type:code id:4df19236-3d86-46b5-9702-80963b4707eb tags:
``` python
# Broadcast v1 to 3 x 3 (stretching the second dimension)
v1_broadcast =
v1_broadcast
```
%% Cell type:code id:2496189e-29a1-458c-ac97-3c68a18e4106 tags:
``` python
v2.T
```
%% Cell type:code id:ed1d5db1-ee94-4d30-84e9-4b991f4c52d5 tags:
``` python
v2.T.shape
```
%% Cell type:code id:9cf1fe4f-3846-4312-a14b-48acb83fdacb tags:
``` python
# Broadcast v2.T to 3 x 3 (stretching the first dimension)
v2t_broadcast =
v2t_broadcast
```
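One possible way to fill in the two broadcast cells above, assuming `v1` and `v2` are the 3 x 1 column vectors from the earlier cells:
``` python
# stretch v1 (3 x 1) along its second dimension: three copies side by side => 3 x 3
v1_broadcast = np.concatenate([v1, v1, v1], axis=1)

# stretch v2.T (1 x 3) along its first dimension: three copies stacked => 3 x 3
v2t_broadcast = np.concatenate([v2.T, v2.T, v2.T], axis=0)
```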
%% Cell type:code id:a2917d2c-7e82-4134-b91d-b4dcb2c2f13a tags:
``` python
v1_broadcast * v2t_broadcast # same as v1 * v2.T
```
%% Cell type:code id:d076bc12-0023-4c1f-8487-75f804a61057 tags:
``` python
v1 * v2.T
```
%% Cell type:markdown id:9ca5d5eb-d3dd-417b-b871-dc6efb3c2966 tags:
#### Generate a multiplication table from 1 to 10
%% Cell type:code id:91570a33-f578-4547-b26d-e5cf3c2d87bc tags:
``` python
# 1. generate a range of numbers from 1 to 10
# 2. reshape that to a vertical numpy array
digits =
digits
```
%% Cell type:code id:5c46433c-a4b2-4946-857f-228eb9f9dc55 tags:
``` python
digits * digits.T
```
%% Cell type:code id:09452fd8-e225-4a01-a891-4cc03c559c64 tags:
``` python
# Convert the multiplication table into a DataFrame
```
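A sketch of one possible completion for the multiplication-table cells above:
``` python
# vertical (10 x 1) array of the digits 1 through 10
digits = np.arange(1, 11).reshape(-1, 1)

# 10 x 10 multiplication table via broadcasting, labeled with the digits
pd.DataFrame(digits * digits.T, index=range(1, 11), columns=range(1, 11))
```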
%% Cell type:markdown id:4b8bf928-d85c-4817-9819-f66e7c0105c9 tags:
#### Back to bug example
Let's work through a more complex broadcasting example.
%% Cell type:code id:29e739e6-5fce-4367-ab0f-d0a809a3b9f9 tags:
``` python
# Read "bug.jpg" into a numpy array
a =
a.shape
```
%% Cell type:code id:56fe8d69-8a33-4b0b-bffe-a8016c8e93b0 tags:
``` python
# Display "bug.jpg"
plt.imshow(a)
```
%% Cell type:markdown id:9fec8809-a32e-4466-b7df-b16e9489caaf tags:
#### GOAL: create a fade effect (full color on the left, to black on the right)
- To achieve this, we need to:
1. multiply the leftmost columns by numbers close to 1 (retains the original color)
2. multiply the rightmost columns by numbers close to 0 (0 gives black)
3. multiply the middle columns by numbers close to 0.5
%% Cell type:code id:50e52114-3e64-4395-9c2c-65bb8ef46078 tags:
``` python
a.shape
```
%% Cell type:code id:681c7d5d-3308-4299-9dbe-f993c9ffe859 tags:
``` python
# Create an array called fade with 2521 numbers
fade =
print(fade.shape)
fade
# How many dimensions does fade have? 1
```
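One way to build `fade` (a sketch): 2521 evenly spaced values going from 1 on the left down to 0 on the right.
``` python
fade = 1 - np.arange(2521) / 2520   # 1.0, ..., 0.0 across the 2521 columns
print(fade.shape)                   # (2521,) -- 1-dimensional
```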
%% Cell type:code id:f6fdb073-78fd-451a-985f-cac8835fd871 tags:
``` python
a.shape
```
%% Cell type:markdown id:09d8bc07-79bc-48aa-98ac-6904acd2bc86 tags:
How can we multiply `a` and `fade`? That is, how should we `reshape` `fade`?
%% Cell type:markdown id:950e479c-a1aa-4beb-b1e8-756cb32a65fb tags:
Can we reshape fade to 1688 x 2521 x 3?
%% Cell type:code id:8fe7a385-0f53-4183-aea8-cb8461ca3bbf tags:
``` python
```
%% Cell type:markdown id:1c75f601-1906-4768-b7f3-ea247299ba71 tags:
The answer is no, because `reshape` can never add or delete values: after `reshape`, we still need exactly 2521 values.
%% Cell type:code id:eec5aabe-2749-468a-a76d-de4e8f5f8df4 tags:
``` python
# Keep in mind that we need to multiply each column by a number, so which dimension should
# be 2521?
```
%% Cell type:code id:ee3dc975-749d-4077-b47b-36e101ff1901 tags:
``` python
# Let's multiply a by the reshaped fade
plt.imshow(???)
```
%% Cell type:markdown id:7abfc514-1130-4844-9df9-439d783d2f73 tags:
Why doesn't this work? Remember that pixel values can be represented either as integers 0 to 255 or as floats 0.0 to 1.0. `a` is on the 0-255 integer scale, but multiplying by `fade` produces floats, and `imshow` expects float images to be on the 0-1 scale (hence the division by 255 below).
%% Cell type:code id:d0f07972-a874-4764-93f1-a56ebbf84136 tags:
``` python
plt.imshow(a * fade.reshape(1, 2521, 1))
```
%% Cell type:markdown id:1a027c40-5602-4bea-85ec-84c74a0cab3a tags:
### Broadcast
Two use cases:
1. "stretch" 1 => N along any dimension to make shapes compatible
2. add dimensions of size 1 to the beginning of a shape
%% Cell type:code id:25ca7735-b6c3-4524-8336-70173cfc5be1 tags:
``` python
a.shape
```
%% Cell type:code id:0f6dd991-1644-4ad6-b702-4e8c0b74e0c0 tags:
``` python
plt.imshow(a / 255.0 * fade.reshape(2521, 1))
# BROADCAST: (2521, 1) => (1, 2521, 1) => (1688, 2521, 3)
```
%% Cell type:markdown id:e96bd769-07ea-461e-8af1-91216adbd2ab tags:
### Dot Product
$\begin{bmatrix}
1 & 2 & 3
\end{bmatrix}
\cdot
\begin{bmatrix}
4 \\ 5 \\ 6
\end{bmatrix}$
%% Cell type:code id:888f3726-cc86-4fff-bd3c-22183a1d5178 tags:
``` python
v1
```
%% Cell type:code id:bedf9d2e-6ec1-41a3-ab2e-23b56239a0cb tags:
``` python
v2
```
%% Cell type:code id:5c8bb73e-181a-452b-b19e-4b03f7bfc05a tags:
``` python
v1 * v2 # 1*4, 2*5, 3*6
```
%% Cell type:code id:625b136c-7b5e-46cf-89d9-450d3a55ae77 tags:
``` python
v1.T
```
%% Cell type:code id:4b580956-9b7f-4b40-b658-9ee2383d3284 tags:
``` python
v2
```
%% Cell type:markdown id:b1d9e5bb-308f-4540-a9bc-2964f0917823 tags:
#### `np.dot(a1, a2)` or `a1 @ a2`
%% Cell type:code id:5114c384-2836-4ab0-85df-e2910ab53f3c tags:
``` python
# 1*4 + 2*5 + 3*6
```
%% Cell type:code id:df408d79-0192-46d3-9448-9286cec11288 tags:
``` python
```
%% Cell type:markdown id:1235fb92-417c-4d37-aba0-d470bddb9627 tags:
#### `.item()` gives you just the value
%% Cell type:code id:1bd3ce8b-25fb-414f-9c8a-3de0296c058d tags:
``` python
(v1.T @ v2)??? # pulls out the only number in the results
```
%% Cell type:code id:cbf53598-a5c0-425f-912b-5bf84b18a4d6 tags:
``` python
np.dot(v1.T, v2)???
```
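Hedged completions for the dot-product cells above, assuming `v1` and `v2` are the 3 x 1 column vectors from earlier:
``` python
v1.T @ v2                # 1*4 + 2*5 + 3*6 = 32, as a 1 x 1 array
np.dot(v1.T, v2)         # same result with np.dot

(v1.T @ v2).item()       # pulls out the single number: 32
np.dot(v1.T, v2).item()  # 32
```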
%% Cell type:markdown id:9e33d56e tags:
# Regression 1
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
# new import statements
```
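A possible set of "new import statements" for this lecture, based on what the markdown cells below say is required (a sketch):
``` python
import sklearn
import sklearn.metrics                       # for explained_variance_score and r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
```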
%% Cell type:markdown id:570b4253 tags:
#### Covid deaths analysis
- Source: https://github.com/cs320-wisc/s22/tree/main/lec/29%20Regression%201
- Specifically, let's analyze "COVID-19 Data by Census Tract V2"
- Status Flag Values: -999: Census tracts, municipalities, school districts, and zip codes with 0–4 aggregate counts for any data have been suppressed. County data with 0-4 aggregate counts by demographic factors (e.g., by age group, race, ethnicity) have been suppressed.
%% Cell type:code id:696eec18 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df =
```
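A likely completion, reading the GeoJSON file with geopandas:
``` python
df = gpd.read_file(dataset_file)
```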
%% Cell type:code id:c3e6454f tags:
``` python
df.head()
```
%% Cell type:code id:905da51e tags:
``` python
# Explore the columns
df
```
%% Cell type:code id:a3434aba tags:
``` python
# Create a geographic plot
df
```
%% Cell type:markdown id:e3e73632 tags:
### Predicting "DTH_CUM_CP"
%% Cell type:markdown id:43cebffa tags:
### How can we get a clean dataset of COVID deaths in WI?
%% Cell type:code id:fa2f30ae tags:
``` python
# Replace -999 with 2; 2 is between 0-4; random choice instead of using 0
df =
# we must communicate in final results what percent of values were guessed (imputed)
```
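One possible completion of the imputation cell above (a sketch):
``` python
# replace the -999 suppression flag with a guess in the 0-4 range
df = df.replace(-999, 2)
```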
%% Cell type:markdown id:4cff709d tags:
How would we know if the data is now clean?
%% Cell type:code id:950c2041 tags:
``` python
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df
```
%% Cell type:markdown id:4073a940 tags:
Which points are concerning? Let's take a closer look.
#### Which rows have "DTH_CUM_CP" greater than 300?
%% Cell type:code id:a655c465 tags:
``` python
df["DTH_CUM_CP"]
```
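One way to answer the question above with boolean filtering (a sketch):
``` python
df[df["DTH_CUM_CP"] > 300]
```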
%% Cell type:markdown id:d377143e tags:
#### Valid rows have "GEOID" that only contains digits
Using `str` methods to perform filtering: `str.fullmatch` matches the entire string against a regex, so the anchor characters (`^`, `$`) aren't needed.
%% Cell type:code id:529781db tags:
``` python
```
%% Cell type:code id:af16925b tags:
``` python
df["GEOID"]
```
%% Cell type:code id:1d583d06 tags:
``` python
df = df[df["GEOID"].str.fullmatch(r"\d+")]
df.plot.scatter(x="POP", y="DTH_CUM_CP")
```
%% Cell type:markdown id:1be50600 tags:
### How can we train/fit models to known data to predict unknowns?
- Feature(s) => Predictions
- Population => Deaths
- Cases => Deaths
- Cases by Age => Deaths
- General structure for fitting models:
```python
model = <some model>
model.fit(X, y)
y = model.predict(X)
```
- where `X` needs to be a matrix or a `DataFrame` and `y` needs to be an array (vector) or a `Series`
- after fitting, `model` object instance stores the information about relationship between features (x values) and predictions (y values)
- `predict` returns a `numpy` array, which can be treated like a list
%% Cell type:markdown id:0d0e65c3 tags:
### Predicting "DTH_CUM_CP" using "POP" as feature.
%% Cell type:code id:3dbdbba4 tags:
``` python
# We must specify a list of columns to make sure we extract a DataFrame and not a Series
# Feature DataFrame
df
```
%% Cell type:code id:22aad05e tags:
``` python
# Label Series: "DTH_CUM_CP"
df
```
%% Cell type:markdown id:797d6831 tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:51ad5b05 tags:
``` python
xcols =
ycol =
model =
model
# less interesting because we are predicting what we already know
y = model
```
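A hedged completion of the cell above, using "POP" as the single feature column:
``` python
xcols = ["POP"]
ycol = "DTH_CUM_CP"

model = LinearRegression()
model.fit(df[xcols], df[ycol])

# less interesting because we are predicting what we already know
y = model.predict(df[xcols])
```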
%% Cell type:markdown id:e589d923 tags:
Predicting for new values of x.
%% Cell type:code id:dd8f0440 tags:
``` python
predict_df = pd.DataFrame({"POP": [1000, 2000, 3000]})
predict_df
```
%% Cell type:code id:5315289f tags:
``` python
# Predict for the new data
```
%% Cell type:code id:0cea4830 tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:markdown id:1c649201 tags:
### How can we visualize model predictions?
- Let's predict deaths for "POP" ranges like 0, 1000, 2000, ..., 20000
%% Cell type:code id:496a67c9 tags:
``` python
predict_df = pd.DataFrame({"POP": range(0, 20000, 1000)})
predict_df
```
%% Cell type:code id:b825412a tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:code id:ca0b47f5 tags:
``` python
# Create a line plot to visualize relationship between "POP" and "predicted deaths"
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
```
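A possible way to draw both plots on the same axes (a sketch):
``` python
ax = predict_df.plot.line(x="POP", y="predicted deaths")
df.plot.scatter(x="POP", y="DTH_CUM_CP", ax=ax)
```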
%% Cell type:markdown id:3bb2d3f8 tags:
### How can we get a formula for the relationship?
- `y = mx + c`, where `y` is the prediction and `x` is the feature used for the fit
- Slope of the line (`m`) given by `model.coef_[0]`
- Intercept of the line (`c`) given by `model.intercept_`
%% Cell type:markdown id:a31d19b3 tags:
Model coefficients
%% Cell type:code id:c894c1b1 tags:
``` python
model
```
%% Cell type:code id:4850a47e tags:
``` python
# Slope of the line
model.coef_
```
%% Cell type:code id:32690085 tags:
``` python
# Intercept of the line
model
```
%% Cell type:code id:cfbcee12 tags:
``` python
print(f"deaths ~= {round(model.coef_[0], 4)} * population + {round(model.intercept_, 4)}")
```
%% Cell type:markdown id:738ccba5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:91e4af81 tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:code id:286e2569 tags:
``` python
xcols, ycol
```
%% Cell type:code id:bb36ca0f tags:
``` python
# Let's now make predictions for the known data
predictions = model
predictions
```
%% Cell type:code id:92dadb4c tags:
``` python
sklearn.metrics.explained_variance_score(, )
```
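Possible completions of the two cells above, assuming `xcols` and `ycol` from the earlier fit:
``` python
predictions = model.predict(df[xcols])
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```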
%% Cell type:markdown id:ebd81950 tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()` (the variance of the residuals, i.e., the variation left unexplained)
%% Cell type:markdown id:36bc3bb5 tags:
What is the variation in known deaths?
%% Cell type:code id:55a3dfcd tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:c33a6fb1 tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:dfb076b1 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:73a55a32 tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:547452da tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the R^2 score (coefficient of determination) given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:d16ba67b tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:6e60fed7 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
%% Cell type:code id:d34ea427 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:b1c3574b tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # small difference from sklearn because .var() divides by n-1 while the exact R^2 denominator divides by n
```
%% Cell type:markdown id:adc33af9 tags:
#### `model.score(X, y)`
- invokes the `predict` method to calculate predictions from the features (`X`), then compares them with the true values of `y` and returns the R^2 score
%% Cell type:code id:b3bde089 tags:
``` python
model
```
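A likely completion: score the model on the same data it was fit on (a sketch):
``` python
model.score(df[xcols], df[ycol])   # runs predict internally, returns the R^2 score
```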
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Split data into train and test
%% Cell type:code id:87a77fb4 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:bdd7cad0 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:2f45dd74 tags:
The problem with manual splitting is that we need to make sure the data isn't sorted in some way; otherwise train and test won't be representative.
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits it (by default 75% train / 25% test)
- produces a new train/test split every time it is called
- `test_size` parameter can take two kinds of values:
- an integer: the number of rows we want in the test data
- a fraction: the proportion of the data to put in the test split
- default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
%% Cell type:code id:1d3a86c9 tags:
``` python
len(train), len(test)
```
%% Cell type:code id:49f7dfe8 tags:
``` python
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:1a29cf9d tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:7934e9ee tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(, )
# Predict using test data
y = model.predict()
# We can use score directly as it automatically invokes predict
model
```
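Hedged completions for the blanks in the cell above, assuming `xcols = ["POP"]` and `ycol = "DTH_CUM_CP"` as before:
``` python
model.fit(train[xcols], train[ycol])    # fit using training data
y = model.predict(test[xcols])          # predict using test data
model.score(test[xcols], test[ycol])    # score runs predict internally and returns R^2
```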
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you a different model and score, because the split is random.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- performs many different train/test splits, fitting and scoring the model on each one
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
%% Cell type:code id:bfa17fce tags:
``` python
train, test = train_test_split(df)
model = LinearRegression()
scores =
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores
```
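Possible completions of the two cross-validation cells above (a sketch; `cross_val_score` defaults to 5 folds):
``` python
scores = cross_val_score(model, train[xcols], train[ycol])
scores.mean()
```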
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, )
model2_scores = cross_val_score(model2, )
```
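A possible completion, giving each model its own feature column; "POS_CUM_CP" is the cumulative-cases column named in the list above:
``` python
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])
```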
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bars by passing the standard deviations to the yerr parameter
means.plot.bar(figsize=(3, 3), )
```
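A likely completion, passing the standard deviations as error bars:
``` python
means.plot.bar(figsize=(3, 3), yerr=err)
```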
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
%% Cell type:code id:2538534d tags:
``` python
model = LinearRegression()
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
model.coef_
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_).plot.bar(figsize=(3, 2))
```
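Labeling the coefficients with the feature names makes the bar plot easier to interpret (a sketch of one way to do it):
``` python
pd.Series(model.coef_, index=xcols).plot.bar(figsize=(3, 2))
```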
%% Cell type:markdown id:9e33d56e tags:
# Regression 1
%% Cell type:code id:e6f50cc3 tags:
``` python
import os
import pandas as pd
import geopandas as gpd
# new import statements
```
%% Cell type:markdown id:570b4253 tags:
#### Covid deaths analysis
- Source: https://github.com/cs320-wisc/s22/tree/main/lec/29%20Regression%201
- Specifically, let's analyze "COVID-19 Data by Census Tract V2"
- Status Flag Values: -999: Census tracts, municipalities, school districts, and zip codes with 0–4 aggregate counts for any data have been suppressed. County data with 0-4 aggregate counts by demographic factors (e.g., by age group, race, ethnicity) have been suppressed.
%% Cell type:code id:696eec18 tags:
``` python
# Read the "covid.geojson" file
dataset_file = "covid.geojson"
df =
```
%% Cell type:code id:c3e6454f tags:
``` python
df.head()
```
%% Cell type:code id:905da51e tags:
``` python
# Explore the columns
df
```
%% Cell type:code id:a3434aba tags:
``` python
# Create a geographic plot
df
```
%% Cell type:markdown id:e3e73632 tags:
### Predicting "DTH_CUM_CP"
%% Cell type:markdown id:43cebffa tags:
### How can we get a clean dataset of COVID deaths in WI?
%% Cell type:code id:fa2f30ae tags:
``` python
# Replace -999 with 2; 2 is between 0-4; random choice instead of using 0
df =
# we must communicate in final results what percent of values were guessed (imputed)
```
%% Cell type:markdown id:4cff709d tags:
How would we know if the data is now clean?
%% Cell type:code id:950c2041 tags:
``` python
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df
```
%% Cell type:markdown id:4073a940 tags:
Which points are concerning? Let's take a closer look.
#### Which rows have "DTH_CUM_CP" greater than 300?
%% Cell type:code id:a655c465 tags:
``` python
df["DTH_CUM_CP"]
```
%% Cell type:markdown id:d377143e tags:
#### Valid rows have "GEOID" that only contains digits
Using `str` methods to perform filtering: `str.fullmatch` does a full string match given a reg-ex. Because it does full string match anchor characters (`^`, `$`) won't be needed.
%% Cell type:code id:529781db tags:
``` python
```
%% Cell type:code id:af16925b tags:
``` python
df["GEOID"]
```
%% Cell type:code id:1d583d06 tags:
``` python
df = df[df["GEOID"].str.fullmatch(r"\d+")]
df.plot.scatter(x="POP", y="DTH_CUM_CP")
```
%% Cell type:markdown id:1be50600 tags:
### How can we train/fit models to known data to predict unknowns?
- Feature(s) => Predictions
- Population => Deaths
- Cases => Deaths
- Cases by Age => Deaths
- General structure for fitting models:
```python
model = <some model>
model.fit(X, y)
y = model.predict(X)
```
- where `X` needs to be a matrix or a `DataFrame` and `y` needs to be an array (vector) or a `Series`
- after fitting, `model` object instance stores the information about relationship between features (x values) and predictions (y values)
- `predict` returns a `numpy` array, which can be treated like a list
%% Cell type:markdown id:0d0e65c3 tags:
### Predicting "DTH_CUM_CP" using "POP" as feature.
%% Cell type:code id:3dbdbba4 tags:
``` python
# We must specify a list of columns to make sure we extract a DataFrame and not a Series
# Feature DataFrame
df
```
%% Cell type:code id:22aad05e tags:
``` python
# Label Series: "DTH_CUM_CP"
df
```
%% Cell type:markdown id:797d6831 tags:
### Let's use `LinearRegression` model.
- `from sklearn.linear_model import LinearRegression`
%% Cell type:code id:51ad5b05 tags:
``` python
xcols =
ycol =
model =
model
# less interesting because we are predicting what we already know
y = model
```
%% Cell type:markdown id:e589d923 tags:
Predicting for new values of x.
%% Cell type:code id:dd8f0440 tags:
``` python
predict_df = pd.DataFrame({"POP": [1000, 2000, 3000]})
predict_df
```
%% Cell type:code id:5315289f tags:
``` python
# Predict for the new data
```
%% Cell type:code id:0cea4830 tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:markdown id:1c649201 tags:
### How can we visualize model predictions?
- Let's predict deaths for "POP" ranges like 0, 1000, 2000, ..., 20000
%% Cell type:code id:496a67c9 tags:
``` python
predict_df = pd.DataFrame({"POP": range(0, 20000, 1000)})
predict_df
```
%% Cell type:code id:b825412a tags:
``` python
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df)
predict_df
```
%% Cell type:code id:ca0b47f5 tags:
``` python
# Create a line plot to visualize relationship between "POP" and "predicted deaths"
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
```
%% Cell type:markdown id:3bb2d3f8 tags:
### How can we get a formula for the relationship?
- `y=mx+c`, where `y` is our predictions and `x` are the features used for the fit
- Slope of the line (`m`) given by `model.coef_[0]`
- Intercept of the line (`c`) given by `model.intercept_`
%% Cell type:markdown id:a31d19b3 tags:
Model coefficients
%% Cell type:code id:c894c1b1 tags:
``` python
model
```
%% Cell type:code id:4850a47e tags:
``` python
# Slope of the line
model.coef_
```
%% Cell type:code id:32690085 tags:
``` python
# Intercept of the line
model
```
%% Cell type:code id:cfbcee12 tags:
``` python
print(f"deaths ~= {round(model.coef_[0], 4)} * population + {round(model.intercept_, 4)}")
```
%% Cell type:markdown id:738ccba5 tags:
### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")
%% Cell type:markdown id:91e4af81 tags:
#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the explained variance score given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html
%% Cell type:code id:286e2569 tags:
``` python
xcols, ycol
```
%% Cell type:code id:bb36ca0f tags:
``` python
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions
```
%% Cell type:code id:92dadb4c tags:
``` python
import sklearn.metrics
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:ebd81950 tags:
#### Explained variance score
- `explained_variance_score = (known_var - explained_variance) / known_var`
- where `known_var = y_true.var()` and `explained_variance = (y_true - y_pred).var()`
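%% Cell type:markdown tags:
A small sketch with made-up numbers (not the COVID data) showing that this manual formula matches `sklearn.metrics.explained_variance_score`:
```python
import numpy as np
import sklearn.metrics

y_true = np.array([3.0, 5.0, 7.0, 10.0])
y_pred = np.array([3.5, 4.5, 7.5, 9.0])

known_var = y_true.var()
explained_variance = (y_true - y_pred).var()
manual = (known_var - explained_variance) / known_var
manual, sklearn.metrics.explained_variance_score(y_true, y_pred)  # both ~0.937
```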
%% Cell type:markdown id:36bc3bb5 tags:
What is the variation in known deaths?
%% Cell type:code id:55a3dfcd tags:
``` python
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var
```
%% Cell type:code id:c33a6fb1 tags:
``` python
# explained_variance
explained_variance = (df[ycol] - predictions).var()
explained_variance
```
%% Cell type:code id:dfb076b1 tags:
``` python
# explained_variance score
explained_variance_score = (known_var - explained_variance) / known_var
explained_variance_score
```
%% Cell type:code id:73a55a32 tags:
``` python
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)
```
%% Cell type:markdown id:547452da tags:
#### `sklearn.metrics.r2_score(y_true, y_pred)`
- requires `import sklearn`
- calculates the R^2 score (coefficient of determination) given:
- y_true: actual death values in our example
- y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
%% Cell type:code id:d16ba67b tags:
``` python
sklearn.metrics.r2_score(df[ycol], predictions)
```
%% Cell type:markdown id:6e60fed7 tags:
#### R^2 score (aka coefficient of determination) approximation
- `r2_score = (known_var - r2_val) / known_var`
- where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
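%% Cell type:markdown tags:
A small sketch with made-up numbers showing this approximation: with numpy's `.var()` (which divides by `n`) the formula matches `r2_score` exactly; the minor differences mentioned below come from pandas' `.var()` dividing by `n - 1` instead.
```python
import numpy as np
import sklearn.metrics

y_true = np.array([3.0, 5.0, 7.0, 10.0])
y_pred = np.array([3.5, 4.5, 7.5, 9.0])

r2_val = ((y_true - y_pred) ** 2).mean()
manual = (y_true.var() - r2_val) / y_true.var()
manual, sklearn.metrics.r2_score(y_true, y_pred)  # both ~0.935
```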
%% Cell type:code id:d34ea427 tags:
``` python
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val
```
%% Cell type:code id:b1c3574b tags:
``` python
r2_score = (known_var - r2_val) / known_var
r2_score # there might be minor rounding off differences
```
%% Cell type:markdown id:adc33af9 tags:
#### `model.score(X, y)`
- invokes `predict` method for calculating predictions (`y`) based on features (`X`) and compares the predictions with true values of y
%% Cell type:code id:b3bde089 tags:
``` python
model.score(df[xcols], df[ycol])
```
%% Cell type:markdown id:1768f9a9 tags:
#### Did our model learn, or just memorize (that is, "overfit")?
- Split data into train and test
%% Cell type:code id:87a77fb4 tags:
``` python
# Split the data into two equal parts
len(df) // 2
```
%% Cell type:code id:bdd7cad0 tags:
``` python
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)
```
%% Cell type:markdown id:2f45dd74 tags:
The problem with manual splitting is that we need to make sure the data is not sorted in some way.
%% Cell type:markdown id:3a781391 tags:
#### `train_test_split(<dataframe>, test_size=<val>)`
- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits it into train and test sets (75%-25% by default)
- produces new train and test data every single time
- `test_size` parameter can take two kinds of values:
    - an integer: the exact number of rows that we want in the test data
    - a float: the fraction of the rows to put in the test data
- default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
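%% Cell type:markdown tags:
A quick sketch (using the notebook's `df`) of one extra parameter that is easy to miss: passing `random_state` makes the shuffle reproducible, so the same split comes back on every run.
```python
from sklearn.model_selection import train_test_split

train1, test1 = train_test_split(df, test_size=0.25, random_state=320)
train2, test2 = train_test_split(df, test_size=0.25, random_state=320)
train1.index.equals(train2.index)  # True: identical rows ended up in each split
```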
%% Cell type:code id:1d3a86c9 tags:
``` python
len(train), len(test)
```
%% Cell type:code id:49f7dfe8 tags:
``` python
# Test size using row count
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=120)
len(train), len(test)
```
%% Cell type:code id:1a29cf9d tags:
``` python
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)
```
%% Cell type:code id:7934e9ee tags:
``` python
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()
```
%% Cell type:code id:0fe05a2e tags:
``` python
train, test = train_test_split(df)
# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:e0f0e21b tags:
Running the above cell again will give you an entirely different model and score.
%% Cell type:markdown id:003b1c50 tags:
#### How can we minimize noise due to random train/test splits?
### Cross validation: `cross_val_score(estimator, X, y)`
- requires `from sklearn.model_selection import cross_val_score`
- do many different train/test splits of the values, fitting and scoring the model across each combination
- cross validation documentation: https://scikit-learn.org/stable/modules/cross_validation.html
- function documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
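%% Cell type:markdown tags:
A short sketch (assuming the `train` DataFrame and columns from above): the `cv` parameter controls how many folds are used (5 is the default), and the returned array has one score per fold.
```python
from sklearn.model_selection import cross_val_score

# 5 rounds of fit/score, each holding out a different fifth of the rows
scores = cross_val_score(LinearRegression(), train[["POP"]], train["DTH_CUM_CP"], cv=5)
scores, scores.mean(), scores.std()
```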
%% Cell type:code id:bfa17fce tags:
``` python
from sklearn.model_selection import cross_val_score

train, test = train_test_split(df)
model = LinearRegression()
scores = cross_val_score(model, train[xcols], train[ycol])
scores
```
%% Cell type:code id:284f776f tags:
``` python
# Compute mean of the scores
scores.mean()
```
%% Cell type:markdown id:53c9d4d4 tags:
#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS
%% Cell type:code id:ffd9791b tags:
``` python
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train["DTH_CUM_CP"])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train["DTH_CUM_CP"])
```
%% Cell type:code id:60f0bf73 tags:
``` python
model1_scores.mean()
```
%% Cell type:code id:e3070bf6 tags:
``` python
model2_scores.mean()
```
%% Cell type:markdown id:dced0919 tags:
Which of these two models do you think will perform better? Probably model2.
%% Cell type:code id:dfedd8d4 tags:
``` python
means = pd.Series({"model1": model1_scores.mean(),
"model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))
```
%% Cell type:markdown id:312aa001 tags:
How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
%% Cell type:code id:5123c3a9 tags:
``` python
model1_scores.std()
```
%% Cell type:code id:230b9dc9 tags:
``` python
model2_scores.std()
```
%% Cell type:code id:484c7af9 tags:
``` python
err = pd.Series({"model1": model1_scores.std(),
"model2": model2_scores.std()})
err
```
%% Cell type:code id:233cd91d tags:
``` python
# Plot error bars by passing err to the yerr parameter
means.plot.bar(figsize=(3, 3), yerr=err)
```
%% Cell type:markdown id:c3b68db6 tags:
Pick a winner and run it one more time against test data.
%% Cell type:markdown id:09f5bce9 tags:
#### How can we use multiple x variables (multiple regression)?
%% Cell type:code id:2538534d tags:
``` python
model = LinearRegression()
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"
model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol])
```
%% Cell type:markdown id:92e4c272 tags:
#### How can we interpret what features the model is relying on?
%% Cell type:code id:68e3d21a tags:
``` python
model.coef_
```
%% Cell type:code id:44bd5b07 tags:
``` python
pd.Series(model.coef_).plot.bar(figsize=(3, 2))
```
%% Cell type:markdown id:dc708de6 tags:
# Linear Algebra 2
%% Cell type:code id:cbd48a28 tags:
``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
%% Cell type:markdown id:dc8a126e tags:
### Problem 1: Predicting with dot product (given `X` and `c`, compute `y`)
1. use case for dot product:
- `y = Xc + b`
2. one's column
3. matrix dot vector
$\begin{bmatrix}
1 & 2 \\ 3 & 4\\
\end{bmatrix}
\cdot
\begin{bmatrix}
10 \\ 1 \\
\end{bmatrix}$
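%% Cell type:markdown tags:
Checking the small example above with numpy (`M` and `v` are just local names for the matrix and vector shown):
```python
import numpy as np

M = np.array([[1, 2],
              [3, 4]])
v = np.array([10, 1]).reshape(-1, 1)  # vertical (2 x 1) vector
M @ v  # [[1*10 + 2*1], [3*10 + 4*1]] = [[12], [34]]
```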
%% Cell type:code id:15b52f7e tags:
``` python
houses = pd.DataFrame([[2, 1, 1985],
[3, 1, 1998],
[4, 3, 2005],
[4, 2, 2020]],
columns=["beds", "baths", "year"])
houses
```
%% Cell type:code id:dd024e62-9e23-4221-839d-2694f1d22d5e tags:
``` python
def predict_price(house):
"""
Takes row (as Series) as argument,
returns estimated price (in thousands)
"""
return ((house["beds"]*42.3) + (house["baths"]*10) +
(house["year"]*1.67) - 3213)
predict_price(houses.iloc[0])
```
%% Cell type:code id:aa256e1c tags:
``` python
# How do we convert a DataFrame into a numpy array?
X = houses.values
X
```
%% Cell type:markdown id:1ece6b08 tags:
Simplifying intercept addition by including intercept inside `c` vector.
%% Cell type:code id:eb2997b1-9650-4514-8d63-2155df96ec0e tags:
``` python
# Extract just first row of data
house0 = X[0:1, :]
house0
```
%% Cell type:code id:352304a8-7596-4b86-ae66-e7cda1d2f570 tags:
``` python
# Create a vertical array (3 x 1) with the co-efficients
c = np.array([42.3, 10, 1.67]).reshape(-1, 1)
c
```
%% Cell type:code id:3da7fdf1-37ed-4438-9241-5a8c296ca377 tags:
``` python
# horizontal @ vertical
house0 @ c
```
%% Cell type:markdown id:d7966ada-bb6b-4a5b-84c0-6fb00288d607 tags:
`y = Xc + b`
%% Cell type:code id:fdac0958-be47-4cdf-9cf3-bc323f2567f6 tags:
``` python
# y = Xc + b  (b = -3213 is the intercept from predict_price above)
X @ c - 3213
```
%% Cell type:markdown id:6301b51c-263c-486a-8608-5d5263540e58 tags:
Let's add the intercept to the c vector for ease.
%% Cell type:code id:f4865788-d8b6-4dd1-8d64-58ce5114a178 tags:
``` python
c = np.array([42.3, 10, 1.67, -3213]).reshape(-1, 1)
c
```
%% Cell type:markdown id:f4aa7d1f-8d41-4d66-8364-1e0aea5c1b60 tags:
If we directly try dot product now, it won't work because of difference in dimensions.
%% Cell type:code id:edb0a6aa-0362-47a2-b60c-8a476a431b60 tags:
``` python
house0 @ c
```
%% Cell type:code id:0834af63-713d-4b98-a7e4-d10211ed2cf2 tags:
``` python
house0.shape
```
%% Cell type:code id:37cc5e0d-4caf-4a8c-a0e1-77270308d413 tags:
``` python
c.shape
```
%% Cell type:markdown id:0b87c930-c373-45ae-abcc-fb7d2e4e2adc tags:
#### One's column
- Solution, add a 1's column to `X` using `np.concatenate`
%% Cell type:code id:2d69ab5d-0756-4c34-9bf3-5276ca43a8e4 tags:
``` python
# How can we generate an array of 1's using numpy?
ones_column = np.ones((len(houses), 1))
ones_column
```
%% Cell type:code id:c2ea3a16-8cc1-43b1-9e1d-9e9e22abb0c7 tags:
``` python
X = np.concatenate([houses.values, ones_column], axis=1)
X
```
%% Cell type:code id:36db9451-c3da-4a14-91a2-8aee734147fb tags:
``` python
# Let's extract house0 again
house0 = X[0:1, :]
house0
```
%% Cell type:code id:ddf993e4-9d73-472b-92a5-9cf1231baa5f tags:
``` python
# Let's try house0 @ c now
house0 @ c
```
%% Cell type:code id:2d2112b1-8155-48bb-abfe-b6167fb4bc3b tags:
``` python
# Extracting each house and doing the prediction with dot product
# Cumbersome
house0 = X[0:1, :]
print(house0 @ c)
house1 = X[1:2, :]
print(house1 @ c)
house2 = X[2:3, :]
print(house2 @ c)
house3 = X[3:4, :]
print(house3 @ c)
```
%% Cell type:markdown id:ed5ccacc-c777-42b5-97f7-e0f235b94e5e tags:
### `@` use cases
loops over each row of the first array and computes its dot product with the vector, which is ROW @ COEFs, that is, `X @ c`
%% Cell type:code id:f2488b35 tags:
``` python
X @ c
```
%% Cell type:markdown id:9d37347d tags:
### Problem 2: Fitting with `np.linalg.solve` (given `X` and `y`, find `c`)
%% Cell type:markdown id:fe3174f5 tags:
**Above:** we estimated house prices using a linear model based on the dot product as follows:
$Xc = y$
* $X$ (known) is a matrix with house features (from DataFrame)
* $c$ (known) is a vector of coefficients (our model parameters)
* $y$ (computed) are the prices
**Below:** what if X and y are known, and we want to find c?
%% Cell type:code id:4572020d tags:
``` python
houses = pd.DataFrame([[2, 1, 1985, 196.55],
[3, 1, 1998, 260.56],
[4, 3, 2005, 334.55],
[4, 2, 2020, 349.60]],
columns=["beds", "baths", "year", "price"])
houses
```
%% Cell type:markdown id:c3f5e71a tags:
If we assume price is linearly based on the features, with this equation:
* $beds*c_0 + baths*c_1 + year*c_2 + 1*c_3 = price$
Then we get four equations:
* $2*c_0 + 1*c_1 + 1985*c_2 + 1*c_3 = 196.55$
* $3*c_0 + 1*c_1 + 1998*c_2 + 1*c_3 = 260.56$
* $4*c_0 + 3*c_1 + 2005*c_2 + 1*c_3 = 334.55$
* $4*c_0 + 2*c_1 + 2020*c_2 + 1*c_3 = 349.60$
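%% Cell type:markdown tags:
A quick sketch of where this is heading (`X4` and `y4` are just local names so we don't overwrite the notebook's `X` and `y` yet): stacking the four equations into a square system and solving it should recover the coefficients used in `predict_price` above (42.3, 10, 1.67, and the -3213 intercept).
```python
import numpy as np

X4 = np.array([[2, 1, 1985, 1],
               [3, 1, 1998, 1],
               [4, 3, 2005, 1],
               [4, 2, 2020, 1]])
y4 = np.array([196.55, 260.56, 334.55, 349.60]).reshape(-1, 1)
np.linalg.solve(X4, y4)  # ~ [[42.3], [10], [1.67], [-3213]]
```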
%% Cell type:markdown id:fb7d682a tags:
#### `c = np.linalg.solve(X, y)`
- documentation: https://numpy.org/doc/stable/reference/generated/numpy.linalg.solve.html
%% Cell type:code id:ba119868-f8d3-4ff7-851e-9d2c431eafa3 tags:
``` python
# Add a column of 1s to this DataFrame
houses["ones"] = 1
houses
```
%% Cell type:code id:459a4245-4d29-4ad6-9baa-31c3979ed32c tags:
``` python
# Extract X ---> features: ["beds", "baths", "year", "ones"]
X = houses[["beds", "baths", "year", "ones"]].values
X
```
%% Cell type:code id:86df8279-6d9f-434b-b7ef-8c654fc93c80 tags:
``` python
# Extract y ---> prediction value: ["price"]
# Unlike the predict method's argument, we extract a single-column DataFrame here,
# so that .values gives us a vertical (n x 1) numpy array
y = houses[["price"]].values
y
```
%% Cell type:code id:218371ec tags:
``` python
# Let's take a look at the co-efficients which we were using for our prediction
c
```
%% Cell type:code id:76506e39-7d24-462b-a73f-1e7f4c748b38 tags:
``` python
c = np.linalg.solve(X, y)
c
```
%% Cell type:code id:34353088-f51d-45a6-90c0-940b6429d3f8 tags:
``` python
X @ c
```
%% Cell type:markdown id:ec46453a tags:
What is the predicted price of a 6-bedroom 5-bathroom house built in 2024?
%% Cell type:code id:b734ec2f-ec9c-4a44-b92f-7efc78c92e6d tags:
``` python
dream_house = np.array([[6, 5, 2024, 1]])
dream_house
```
%% Cell type:code id:5f5add54-1a60-484c-b7d0-ba9d1335072d tags:
``` python
dream_house @ c
```
%% Cell type:markdown id:3e282e0c tags:
### Two Perspectives on `Matrix @ vector`
$\begin{bmatrix}
4&5\\6&7\\8&9\\
\end{bmatrix}
\cdot
\begin{bmatrix}
2\\3\\
\end{bmatrix}
= ????
$
%% Cell type:code id:a6979874 tags:
``` python
X = np.array([[4, 5], [6, 7], [8, 9]])
c = np.array([2, 3]).reshape(-1, 1)
X @ c
```
%% Cell type:markdown id:3bd4097b tags:
### Row Picture
Do dot product one row at a time.
$\begin{bmatrix}
4&5\\6&7\\8&9\\
\end{bmatrix}
\cdot
\begin{bmatrix}
2\\3\\
\end{bmatrix}
=
\begin{bmatrix}
(4*2)+(5*3)\\
(6*2)+(7*3)\\
(8*2)+(9*3)\\
\end{bmatrix}
=
\begin{bmatrix}
23\\
33\\
43\\
\end{bmatrix}
$
%% Cell type:code id:0524fc15 tags:
``` python
def row_dot_product(X, c):
"""
function that performs same action as @ operator
"""
    result = []
    print(X)
    print(c)
    # loop over each row index of X
    for i in range(X.shape[0]):
        # extract each row using slicing
        # why slicing? we want a two dimensional (1 x n) array
        row = X[i:i+1, :]
        # DOT PRODUCT the row with c (gives a 1 x 1 array)
        result.append(row @ c)
    # convert result into a vertical numpy array
    return np.concatenate(result)

row_dot_product(X, c)
```
%% Cell type:code id:daefbfb9 tags:
``` python
X.shape
```
%% Cell type:markdown id:0ad5d3dc tags:
### Column Picture
$\begin{bmatrix}
c_0&c_1&c_2\\
\end{bmatrix}
\cdot
\begin{bmatrix}
x\\y\\z\\
\end{bmatrix}
=(c_0*x) + (c_1*y) + (c_2*z)
$
Dot product takes a **linear combination** of columns.
$\begin{bmatrix}
4&5\\6&7\\8&9\\
\end{bmatrix}
\cdot
\begin{bmatrix}
2\\3\\
\end{bmatrix}
=
\begin{bmatrix}
4\\6\\8\\
\end{bmatrix}*2
+
\begin{bmatrix}
5\\7\\9\\
\end{bmatrix}*3
=
\begin{bmatrix}
23\\
33\\
43\\
\end{bmatrix}
$
%% Cell type:code id:4d6f0113 tags:
``` python
def col_dot_product(X, c):
"""
same result as row_dot_product above,
but different definition / code
"""
    # initialize a vertical vector of zeros
    total = np.zeros((X.shape[0], 1))
    # loop over each col index of X
    for j in range(X.shape[1]):
        # extract each column using slicing (keeps it vertical)
        col = X[:, j:j+1]
        # extract weight for the column using indexing
        weight = c[j, 0]
        # add weighted column to total
        total += col * weight
    return total

col_dot_product(X, c)
```
%% Cell type:code id:66fae5a4 tags:
``` python
X.shape
```
%% Cell type:code id:638b7071 tags:
``` python
# Create a vertical vector / array containing 3 0's
np.zeros((3, 1))
```
%% Cell type:markdown id:bbee3c66 tags:
### Part 1: Column Space of a Matrix
Definition: the *column space* of a matrix is the set of all linear combinations of that matrix's columns.
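%% Cell type:markdown tags:
One way to check membership numerically (a sketch, not something used elsewhere in this notebook): ask `np.linalg.lstsq` for the best-fit weights and see whether they reproduce the vector exactly. The matrix `A` here is the same one defined in the next cell.
```python
import numpy as np

def in_column_space(M, v, tol=1e-8):
    """True if v is (numerically) a linear combination of M's columns."""
    weights, *_ = np.linalg.lstsq(M, v, rcond=None)  # best-fit weights
    return np.allclose(M @ weights, v, atol=tol)

A = np.array([[1, 100], [2, 10], [3, 0]])
in_column_space(A, np.array([[101], [12], [3]]))  # True: it is A @ [[1], [1]]
```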
%% Cell type:code id:aa4aa683 tags:
``` python
A = np.array([
[1, 100],
[2, 10],
[3, 0]
])
B = np.array([
[1, 0],
[0, 2],
[0, 3],
[0, 0]
])
```
%% Cell type:markdown id:fa66980f tags:
$A = \begin{bmatrix}
1&100\\
2&10\\
3&0\\
\end{bmatrix}$
%% Cell type:code id:df474f66 tags:
``` python
# this is in the column space of A (it's a weighted mix of the columns)
A @ np.array([1, 1]).reshape(-1, 1)
```
%% Cell type:code id:2b659e82 tags:
``` python
# this is in the column space of A (it's a weighted mix of the columns)
A @ np.array([-1, 0]).reshape(-1, 1)
```
%% Cell type:code id:d1862fac tags:
``` python
# this is in the column space of A (it's a weighted mix of the columns)
A @ np.array([0, 2]).reshape(-1, 1)
```
%% Cell type:code id:c4914320 tags:
``` python
# this is in the column space of A (it's a weighted mix of the columns)
A @ np.array([0, 0]).reshape(-1, 1)
```
%% Cell type:markdown id:03fb6c44 tags:
A right-sized zero vector will always be in the column space.
%% Cell type:markdown id:d1bbe5aa tags:
What vectors are in the column space of B?
$B = \begin{bmatrix}
1&0\\
0&2\\
0&3\\
0&0\\
\end{bmatrix}$
$a=\begin{bmatrix}
2\\
2\\
3\\
0
\end{bmatrix}, b=\begin{bmatrix}
0\\
0\\
0\\
1
\end{bmatrix}, c=\begin{bmatrix}
-10\\
0\\
0\\
0
\end{bmatrix}, d=\begin{bmatrix}
0\\
-2\\
3\\
0
\end{bmatrix}, e=\begin{bmatrix}
-1\\
2\\
3\\
0
\end{bmatrix}$
%% Cell type:code id:69b66edd tags:
``` python
c = np.array([-1, 1]).reshape(-1, 1) # coef
B @ c
```
%% Cell type:markdown id:8135c8d7 tags:
### Solution
- in the column space of B:
    - $a$ (weights 2 and 1 on B's columns)
    - $c$ (weights -10 and 0)
    - $e$ (weights -1 and 1)
- not in the column space:
    - $b$ (every column of B has a 0 in the last entry)
    - $d$ (rows 2 and 3 would require contradictory weights on the second column)
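%% Cell type:markdown tags:
A quick numeric check of these answers (B as defined above; the weights come from matching B's two columns):
```python
import numpy as np

B = np.array([[1, 0], [0, 2], [0, 3], [0, 0]])
print(B @ np.array([[2], [1]]))    # equals a
print(B @ np.array([[-10], [0]]))  # equals c
print(B @ np.array([[-1], [1]]))   # equals e
# b: every column of B ends in 0, so its nonzero 4th entry is unreachable
# d: rows 2 and 3 would need 2*w = -2 and 3*w = 3 for the same weight w
```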
%% Cell type:markdown id:d83b40b1 tags:
### Part 2: When can we solve for c?
Suppose $Xc = y$.
$X$ and $y$ are known, and we want to solve for $c$.
When does `c = np.linalg.solve(X, y)` work?
#### Fruit Sales Example
##### Data
* `10 apples and 0 bananas sold for $7`
* `2 apples and 8 bananas sold for $5`
* `4 apples and 4 bananas sold for $5`
##### Equations
* `10*apple + basket = 7`
* `2*apple + 8*banana + basket = 5`
* `4*apple + 4*banana + basket = 5`
%% Cell type:markdown id:52f21d69 tags:
#### There is a solution for the system of equations and `np.linalg.solve` can find it.
%% Cell type:code id:8163bbcc tags:
``` python
X = np.array([
[10, 0, 1],
[2, 8, 1],
[4, 4, 1],
])
y = np.array([7, 5, 5]).reshape(-1, 1)
c = np.linalg.solve(X, y)
c
```
%% Cell type:code id:902440c0 tags:
``` python
X
```
%% Cell type:code id:c0d632cc tags:
``` python
# Predict the price of the combination 4, 4, 1 using the solved coefficients
np.array([[4, 4, 1]]) @ c
```
%% Cell type:code id:d026099e tags:
``` python
# Predict the price of the combination 5, 5, 1 using the solved coefficients
np.array([[5, 5, 1]]) @ c
```
%% Cell type:markdown id:4357aa78 tags:
#### There is a solution for $c$ (in $Xc = y$), even if `np.linalg.solve` can't find it.
- mathematically solvable
%% Cell type:code id:5b2b7ed0 tags:
``` python
X = np.array([
[10, 0, 1],
[2, 8, 1],
[4, 4, 1],
    # adding a new combination: 10 apples, 4 bananas, 1 basket
    [10, 4, 1],
])
y = np.array([7, 5, 5, 8]).reshape(-1, 1)
# np.linalg.solve requires a square X, so this errors even though a solution exists
c = np.linalg.solve(X, y)
c
```
%% Cell type:markdown id:9ab49f76 tags:
### Equivalent statements
* there is a solution for the system of equations and `np.linalg.solve` can find it
* there is a solution for $c$ (in $Xc = y$), even if `np.linalg.solve` can't find it
* $y$ is in the column space of $X$
%% Cell type:markdown id:36411a9d tags:
### Problem with most tables
More rows than columns in our dataset means more equations than variables.
This *usually* means the equations aren't solvable, and y isn't in the column space of X.
%% Cell type:code id:f719b210 tags:
``` python
X
```
%% Cell type:code id:73e680f3 tags:
``` python
y
```
%% Cell type:markdown id:d5c036f5 tags:
Dot product both sides by `X.T` ---> this will usually make it solvable.
%% Cell type:code id:bf3ec400 tags:
``` python
c = np.linalg.solve(X.T @ X, X.T @ y)
c
```
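%% Cell type:markdown tags:
As a sanity check (a sketch using the `X` and `y` above; `np.linalg.lstsq` is not covered in CS320), numpy's least-squares solver finds the same coefficients as the `X.T` trick:
```python
np.linalg.lstsq(X, y, rcond=None)[0]  # ~ [[0.5], [0.25], [2.0]]
```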
%% Cell type:markdown id:d2e64af4 tags:
What is special about the dot product of a matrix with its transpose? The resulting shape is always square.
%% Cell type:code id:b4c45f0e tags:
``` python
(X.T @ X).shape
```
%% Cell type:markdown id:0d087fa2 tags:
**IMPORTANT**: We are not going to discuss how dot product works between two matrices. That is beyond the scope of CS320.
%% Cell type:markdown id:c382fa8e tags:
### Part 3: Projection Matrix
Say X and y are known, but we can't solve for c because X has more rows than columns:
### <font color='red'>$Xc = y$</font>
We can, however, usually (unless there are multiple equally good solutions) solve the following, which we get by multiplying both sides by $X^T$:
### <font color='red'>$X^TXc = X^Ty$</font>
If we can find a c to make the above true, we can multiply both sides by $(X^TX)^{-1}$ (which generally exists unless X columns are redundant) to get this equation:
$(X^TX)^{-1}X^TXc = (X^TX)^{-1}X^Ty$
Simplify:
$c = (X^TX)^{-1}X^Ty$
Multiply both sides by X:
### <font color='red'>$Xc = X(X^TX)^{-1}X^Ty$</font>
### Note we started with an unsolvable $Xc = y$ problem but multiplied $y$ by something to get a different $Xc = Py$ that is solvable.
Define <font color="red">$P = X(X^TX)^{-1}X^T$</font>. This is a **projection matrix**. If you multiply a vector by $P$, you get back another vector of the same size, with two properties:
1. it will be in the column space of $X$
2. the new vector will be as "close" as possible to the original vector
Note: computing P is generally very expensive.
### Fruit Sales Example
%% Cell type:code id:3506a459 tags:
``` python
X = np.array([
[10, 0, 1],
[2, 8, 1],
[4, 4, 1],
[10, 4, 1],
[10, 4, 1]
])
y = np.array([7, 5, 5, 8, 8.5]).reshape(-1, 1)
y
```
%% Cell type:markdown id:d08e23e1 tags:
Let's compute $P = X(X^TX)^{-1}X^T$.
- **IMPORTANT**: We are not going to discuss how inverse works. That is beyond the scope of CS320.
### `np.linalg.inv(a)`
- computes the (multiplicative) inverse of a matrix.
- documentation: https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html
%% Cell type:code id:243c2127 tags:
``` python
P = X @ np.linalg.inv(X.T @ X) @ X.T
P
```
%% Cell type:code id:1e0681e9 tags:
``` python
X
```
%% Cell type:code id:3e881e59 tags:
``` python
y
```
%% Cell type:markdown id:1d8e442c tags:
The new vector will be as "close" as possible to the original vector.
%% Cell type:code id:33aeabe1 tags:
``` python
P @ y
```
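%% Cell type:markdown tags:
Two quick sanity checks on `P` (a sketch using the `X`, `y`, and `P` defined above): the projected vector really is in the column space of `X` (the normal equations now have an exact solution), and projecting twice changes nothing (projection matrices are idempotent).
```python
p = P @ y
c_proj = np.linalg.solve(X.T @ X, X.T @ p)  # an exact solution now exists
print(np.allclose(X @ c_proj, p))  # True: p is in the column space of X
print(np.allclose(P @ p, p))       # True: projecting again changes nothing
```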
%% Cell type:markdown id:b36c1335 tags:
#### Scatter plot visualization
%% Cell type:markdown id:2f27f3bc tags:
**IMPORTANT**: We are not going to discuss how `np.random.normal` works. You can look up the documentation if you are interested.
%% Cell type:code id:921938e2 tags:
``` python
x = np.random.normal(5, 2, size=(10, 1))
y = 2*x + np.random.normal(size=x.shape)
df = pd.DataFrame({"x": x.reshape(-1), "y": y.reshape(-1)})
df
```
%% Cell type:code id:f6052130 tags:
``` python
df.plot.scatter(x="x", y="y", figsize=(3, 2))
```
%% Cell type:code id:8304b0fb tags:
``` python
# Extract X ---> features: ["x"]
X = df[["x"]].values
X
```
%% Cell type:code id:b9f4403e tags:
``` python
P = X @ np.linalg.inv(X.T @ X) @ X.T
P
```
%% Cell type:code id:db976c33 tags:
``` python
# Extract y ---> prediction value: ["y"]
df["p"] = P @ ???
df
```
%% Cell type:code id:9aab4539 tags:
``` python
ax = df.plot.scatter(x="x", y="y", figsize=(3, 2), color="k")
df.plot.scatter(x="x", y="p", color="r", ax=ax)
```
%% Cell type:markdown id:d1182ba3 tags:
### Euclidean Distance between columns
- how close is the new vector (`P @ y`) to the original vector (`y`)?
- $dist$ = $\sqrt{(x2 - x1)^2 + (y2 - y1)^2}$
%% Cell type:code id:bfd1ce1d tags:
``` python
coords = pd.DataFrame({
"v1": [1, 8],
"v2": [4, 12],
}, index=["x", "y"])
coords
```
%% Cell type:code id:cae18757 tags:
``` python
# distance between v1 and v2 is 5
(((coords["v2"] - coords["v1"]) ** 2).sum()) ** 0.5
```
%% Cell type:code id:1e096c19 tags:
``` python
# this is the smallest possible distance between y and p, such
# that X @ c = p is solvable
((df["y"] - df["p"]) ** 2).sum() ** 0.5
```
%% Cell type:markdown id:42945982 tags:
### Lab review
%% Cell type:code id:6758c077 tags:
``` python
# As an exception, I am providing all the relevant import statements in this cell
import numpy as np
import rasterio
from rasterio.mask import mask
from shapely.geometry import box
import geopandas as gpd
land = rasterio.open("zip://land.zip!wi.tif")
# a = land.read()
window = gpd.GeoSeries([box(-89.5, 43, -89.2, 43.2)]).set_crs("epsg:4326").to_crs(land.crs)
plt.imshow(mask(land, window, crop=True)[0][0])
```
%% Cell type:code id:5f143a2d tags:
``` python
```