%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
x = np.random.uniform(0.1,5,100)
noise = np.random.normal(scale=0.3, size=x.size)
```
%% Cell type:markdown id: tags:
## Intuition: factorization
Why is it useful to express something as a few parts multiplied together?
The factored form conveys more information: for a polynomial, the roots can be read directly off the factors.
%% Cell type:code id: tags:
``` python
# at what points does y=0?
# y = -x**3 + 7*x**2 - 14*x + 8
y = (4-x) * (2-x) * (1-x)
```
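%% Cell type:markdown id: tags:
As a quick check of the factorization idea, the sketch below recovers the roots of the expanded cubic numerically and compares them with the values read off the factors. The names `expanded_coeffs`, `numeric_roots`, and `factored_roots` are illustrative and not part of the lecture code.
%% Cell type:code id: tags:
``` python
import numpy as np

# coefficients of -x**3 + 7*x**2 - 14*x + 8, highest power first
expanded_coeffs = [-1, 7, -14, 8]

# the expanded form needs a root finder ...
numeric_roots = np.sort(np.roots(expanded_coeffs).real)

# ... while the factored form (4-x) * (2-x) * (1-x) shows the roots directly
factored_roots = np.array([1.0, 2.0, 4.0])

print(numeric_roots)
print(np.allclose(numeric_roots, factored_roots))
```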
%% Cell type:code id: tags:
``` python
pd.DataFrame({"x": x, "y": y+noise}).plot.scatter(x="x", y="y")
plt.hlines(0, -1, 6, color="k")
```
%% Cell type:markdown id: tags:
## Some cool dimensionality reduction examples:
https://pair-code.github.io/understanding-umap/ \
https://distill.pub/2016/misread-tsne/
%% Cell type:markdown id: tags:
# Matrix Multiplication
%% Cell type:code id: tags:
``` python
A = np.random.normal(size=(9, 7))
B = np.random.normal(size=(6, 14))
C = np.random.normal(size=(14, 3))
D = np.random.normal(size=(3, 10))
```
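%% Cell type:markdown id: tags:
The four shapes above appear chosen so that only some products are defined: `X @ Y` works when the number of columns of `X` equals the number of rows of `Y`, and the result has shape (rows of `X`, columns of `Y`). The sketch below lists the compatible pairs; it stores the matrices in a dictionary (`mats`, an illustrative name) so it does not overwrite `A`, `B`, `C`, or `D`.
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)
shapes = {"A": (9, 7), "B": (6, 14), "C": (14, 3), "D": (3, 10)}
mats = {name: rng.normal(size=shape) for name, shape in shapes.items()}

# a pair (left, right) is compatible when left's columns == right's rows
compatible = [
    (left, right)
    for left in mats
    for right in mats
    if left != right and mats[left].shape[1] == mats[right].shape[0]
]
print(compatible)

# chaining the compatible pairs: (6, 14) @ (14, 3) @ (3, 10) -> (6, 10)
chained = mats["B"] @ mats["C"] @ mats["D"]
print(chained.shape)
```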
%% Cell type:markdown id: tags:
# Decomposition with Principal Component Analysis (PCA)
Q: Is it possible to use fewer columns to represent this dataframe?
%% Cell type:code id: tags:
``` python
df = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
df["C"] = df["A"] * 2
df["D"] = df["A"] - df["B"]
df.head()
```
%% Cell type:markdown id: tags:
A: Yes. C is twice A, and D is A - B, so we only need A & B, plus their relationships to C & D, to represent the dataframe.
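%% Cell type:markdown id: tags:
One way to confirm the redundancy numerically is the matrix rank, which counts the linearly independent columns. This is a minimal sketch that rebuilds the same four columns under a separate name (`demo`, illustrative) and asks NumPy for the rank.
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs

demo = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
demo["C"] = demo["A"] * 2
demo["D"] = demo["A"] - demo["B"]

# four columns, but only two of them carry independent information
rank = np.linalg.matrix_rank(demo.to_numpy())
print(demo.shape, rank)
```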
%% Cell type:markdown id: tags:
# PCA on two columns
%% Cell type:code id: tags:
``` python
# plot the A & B columns
df.plot.scatter("A", "B")
```
%% Cell type:markdown id: tags:
## sklearn.decomposition.PCA
%% Cell type:code id: tags:
``` python
p = PCA()
W = p.fit_transform(df[["A", "B"]])
C = p.components_
```
%% Cell type:code id: tags:
``` python
# PCA will first find the mean
mean_point = p.mean_
mean_point
```
%% Cell type:code id: tags:
``` python
df[["A", "B"]].mean()
```
%% Cell type:code id: tags:
``` python
# plot mean point
df.plot.scatter("A", "B")
plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
```
%% Cell type:markdown id: tags:
C is called the **component matrix** \
The first row of C is the most important component \
The second row of C is the second most important component \
and so on ... \
Each row is a unit-length direction vector; in 2D, the slope of a component is its second entry divided by its first.
%% Cell type:code id: tags:
``` python
# two components for 2d data
C
```
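%% Cell type:markdown id: tags:
The rows of the component matrix can be checked directly: each should have length 1, and different rows should be perpendicular, so `comp @ comp.T` is the identity. The sketch below refits PCA on the same two blob columns under separate names (`pts`, `pca_demo`, `comp`, all illustrative).
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

pts, _ = make_blobs(centers=2, random_state=320)  # same data as columns A & B
pca_demo = PCA().fit(pts)
comp = pca_demo.components_

row_lengths = np.linalg.norm(comp, axis=1)  # length of each component
gram = comp @ comp.T                        # dot products between components
slope_first = comp[0, 1] / comp[0, 0]       # slope of the first component

print(row_lengths)
print(gram.round(6))
print(slope_first)
```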
%% Cell type:markdown id: tags:
For the first component, PCA fits a line that crosses the mean point and
along which the points are spread out the most (largest variance). \
The second component is perpendicular to the first, also crosses the mean point,
and has the largest remaining spread in its direction.
%% Cell type:code id: tags:
``` python
# plot first component
df.plot.scatter("A", "B")
plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
span = 6
point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
x = [point2[0], point3[0]]
y = [point2[1], point3[1]]
plt.plot(x, y, linestyle="-", color="red")
```
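%% Cell type:markdown id: tags:
To illustrate what "largest spread" means, the sketch below tries many candidate directions through the mean point, measures the variance of the points projected onto each one, and compares the best candidate with the first PCA component. The brute-force sweep is only for intuition and is not how PCA is computed; all names are illustrative.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

pts, _ = make_blobs(centers=2, random_state=320)  # same data as columns A & B
centered = pts - pts.mean(axis=0)

# candidate unit directions, one per degree from 0 to 180
angles = np.linspace(0, np.pi, 181)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# variance of the points projected onto each candidate direction
spread = (centered @ directions.T).var(axis=0, ddof=1)
best_direction = directions[spread.argmax()]

pca_demo = PCA().fit(pts)
first_component = pca_demo.components_[0]

# |cosine| of the angle between the two directions (sign is arbitrary)
alignment = abs(best_direction @ first_component)
print(best_direction, first_component, alignment)
```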
%% Cell type:markdown id: tags:
First column of W represents relative positions of points along the first component \
Second column of W represents relative positions of points along the second component \
and so on ...
%% Cell type:code id: tags:
``` python
W[:10]
```
%% Cell type:code id: tags:
``` python
print(W.shape, C.shape)
```
%% Cell type:code id: tags:
``` python
print(df[["A", "B"]].shape)
```
%% Cell type:code id: tags:
``` python
# use W and C to reconstruct the original A & B columns
pd.DataFrame((W @ C) + p.mean_).head()
```
%% Cell type:code id: tags:
``` python
df[["A", "B"]].head()
```
%% Cell type:code id: tags:
``` python
# use only the first component to approximately reconstruct A & B columns
# multiply the first column of W (positions of the points along the first component) by the first row of C (the first component)
pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
```
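%% Cell type:markdown id: tags:
The quality of the two reconstructions can be quantified. In this sketch (illustrative names, same two blob columns), the full reconstruction matches the data up to floating-point error, while the squared error of the one-component reconstruction, divided by `n - 1`, equals the variance of the component that was dropped.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

pts, _ = make_blobs(centers=2, random_state=320)  # same data as columns A & B
pca_demo = PCA()
scores = pca_demo.fit_transform(pts)
comp = pca_demo.components_

full = scores @ comp + pca_demo.mean_                    # both components
rank1 = scores[:, :1] @ comp[:1, :] + pca_demo.mean_     # first component only

full_error = np.abs(full - pts).max()
rank1_error = ((rank1 - pts) ** 2).sum() / (len(pts) - 1)

print(full_error)
print(rank1_error, pca_demo.explained_variance_[1])
```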
%% Cell type:markdown id: tags:
## Explained Variance
%% Cell type:code id: tags:
``` python
a = np.array([1.1, 1.9, 3.2])
a
```
%% Cell type:code id: tags:
``` python
b = np.array([1, 2, 3])
b
```
%% Cell type:code id: tags:
``` python
a - b
```
%% Cell type:code id: tags:
``` python
a.var()
```
%% Cell type:code id: tags:
``` python
(a - b).var()
```
%% Cell type:code id: tags:
``` python
1 - (a - b).var() / a.var()
```
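%% Cell type:markdown id: tags:
The three cells above compute "one minus the fraction of variance left in the error". A small helper makes the pattern reusable; `explained_fraction` is a hypothetical name introduced here, not a library function.
%% Cell type:code id: tags:
``` python
import numpy as np

def explained_fraction(actual, approx):
    """Fraction of the variance of `actual` that `approx` accounts for."""
    actual = np.asarray(actual, dtype=float)
    approx = np.asarray(approx, dtype=float)
    return 1 - (actual - approx).var() / actual.var()

frac = explained_fraction([1.1, 1.9, 3.2], [1, 2, 3])
perfect = explained_fraction([1.1, 1.9, 3.2], [1.1, 1.9, 3.2])
print(frac, perfect)
```
A perfect approximation leaves no variance in the error, so the helper returns 1.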
%% Cell type:code id: tags:
``` python
# the amount of variance explained by each component
# the first component has largest explained variance
# the second component has the second largest explained variance
# and so on
explained_variance = p.explained_variance_
explained_variance
```
%% Cell type:code id: tags:
``` python
explained_variance / explained_variance.sum()
```
%% Cell type:code id: tags:
``` python
# explained variance as a fraction of the total variance
p.explained_variance_ratio_
```
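%% Cell type:markdown id: tags:
The explained variance of a component is the sample variance of the matching column of the transformed data. The sketch below checks this on the same two blob columns, using `ddof=1` for the `n - 1` denominator; names are illustrative.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

pts, _ = make_blobs(centers=2, random_state=320)  # same data as columns A & B
pca_demo = PCA()
scores = pca_demo.fit_transform(pts)

# variance of the positions along each component
score_variance = scores.var(axis=0, ddof=1)
ratio = score_variance / score_variance.sum()

print(score_variance, pca_demo.explained_variance_)
print(ratio, pca_demo.explained_variance_ratio_)
```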
%% Cell type:markdown id: tags:
# PCA on two dependent columns
%% Cell type:code id: tags:
``` python
p = PCA()
W = p.fit_transform(df[["A", "C"]])
C = p.components_
```
%% Cell type:code id: tags:
``` python
mean = p.mean_
```
%% Cell type:code id: tags:
``` python
# plot A & C columns and the mean
df.plot.scatter("A", "C")
mean_point = [mean[0],mean[1]]
plt.plot(mean[0],mean[1], marker="X", markersize=20, color="red")
```
%% Cell type:code id: tags:
``` python
# plot the first component
df.plot.scatter("A", "C")
mean_point = [mean[0],mean[1]]
plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
span = 6
point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
x = [point2[0], point3[0]]
y = [point2[1], point3[1]]
plt.plot(x, y, linestyle="-", color="red")
```
%% Cell type:code id: tags:
``` python
p.explained_variance_
```
%% Cell type:code id: tags:
``` python
# note that the first component explains 100% of the variance
# because C is twice A
# the first component captures the 2x relationship in its slope
p.explained_variance_ratio_
```
%% Cell type:code id: tags:
``` python
# we can reconstruct A & C only using one component
pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
```
%% Cell type:code id: tags:
``` python
df[["A", "C"]].head()
```
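%% Cell type:markdown id: tags:
The slope claim can be checked numerically. In this sketch the second column is exactly twice the first, so the first component should have slope 2 and explain essentially all of the variance; names such as `dependent` and `pca_dep` are illustrative.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

pts, _ = make_blobs(centers=2, random_state=320)
col_a = pts[:, 0]                               # same values as column A
dependent = np.column_stack([col_a, 2 * col_a])  # columns A and C

pca_dep = PCA().fit(dependent)
first = pca_dep.components_[0]
slope = first[1] / first[0]
ratio = pca_dep.explained_variance_ratio_

print(slope)
print(ratio)
```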
%% Cell type:markdown id: tags:
# PCA on all columns
%% Cell type:code id: tags:
``` python
p = PCA()
W = p.fit_transform(df)
C = p.components_
```
%% Cell type:code id: tags:
``` python
# four components for 4d data
C.shape
```
%% Cell type:code id: tags:
``` python
p.explained_variance_
```
%% Cell type:code id: tags:
``` python
# note that the first two components explain 100% of the variance
ev_ratio = p.explained_variance_ratio_
ev_ratio
```
%% Cell type:code id: tags:
``` python
# we can reconstruct the original dataframe only using the first two components
pd.DataFrame(W[:, :2] @ C[:2, :] + p.mean_).head()
```
%% Cell type:code id: tags:
``` python
df.head()
```
%% Cell type:markdown id: tags:
### Cumulative plot of explained variance ratio
%% Cell type:code id: tags:
``` python
# cumsum() computes the cumulative sum
s = pd.Series(p.explained_variance_ratio_.cumsum(), index=range(1,5))
ax = s.plot.line(ylim=0)
ax.set_ylabel("Explained Variance")
ax.set_xlabel("Component")
```
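%% Cell type:markdown id: tags:
The cumulative curve can also be read programmatically: the first position where it reaches a chosen threshold gives the number of components to keep. This is a minimal sketch on the same four columns; the 0.99 threshold and all names are assumptions for illustration.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

pts, _ = make_blobs(centers=2, random_state=320)
col_a, col_b = pts[:, 0], pts[:, 1]
four_cols = np.column_stack([col_a, col_b, 2 * col_a, col_a - col_b])

cumulative = PCA().fit(four_cols).explained_variance_ratio_.cumsum()

threshold = 0.99  # illustrative choice
# argmax returns the first index where the condition is True
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_needed)
```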
%% Cell type:markdown id: tags:
# Dimensionality Reduction on Feature Columns
%% Cell type:code id: tags:
``` python
pipe = Pipeline([
    # n_components parameter:
    # an int gives the number of components to keep;
    # a float between 0 and 1 gives the fraction of variance to explain (explained_variance_ratio_)
    ("pca", PCA(2)),
    ("km", KMeans(2)),
])
pipe.fit(df) # fit PCA, transform using PCA, fit KMeans using output from PCA
groups = pipe.predict(df) # transform using PCA, then predict clusters using KMeans
```
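%% Cell type:markdown id: tags:
To make the pipeline's behavior concrete, the sketch below compares `predict` on the pipeline with the same two steps applied by hand. It uses its own copy of the four columns, and `n_init=10` and `random_state=0` are assumptions added for repeatability.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pts, _ = make_blobs(centers=2, random_state=320)
col_a, col_b = pts[:, 0], pts[:, 1]
four_cols = np.column_stack([col_a, col_b, 2 * col_a, col_a - col_b])

pipe_demo = Pipeline([
    ("pca", PCA(2)),
    ("km", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
pipe_demo.fit(four_cols)

via_pipeline = pipe_demo.predict(four_cols)

# the same work done step by step with the fitted pieces
reduced = pipe_demo.named_steps["pca"].transform(four_cols)
by_hand = pipe_demo.named_steps["km"].predict(reduced)

print(reduced.shape)
print(np.array_equal(via_pipeline, by_hand))
```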
%% Cell type:code id: tags:
``` python
# -1 is white
pd.DataFrame(pipe["pca"].transform(df)).plot.scatter(x=0, y=1, c=groups, vmin=-1)
```
%% Cell type:markdown id: tags:
# Lossy Compression
Use PCA to keep the most important information and discard the rest
%% Cell type:code id: tags:
``` python
img = plt.imread("bug.jpeg")
plt.imshow(img)
```
%% Cell type:code id: tags:
``` python
img.shape
```
%% Cell type:code id: tags:
``` python
# average over the color dimension to make the image easier to handle
img = img.mean(axis=2)
img.shape
```
%% Cell type:code id: tags:
``` python
plt.imshow(img, cmap="gray")
```
%% Cell type:code id: tags:
``` python
# we want to explain 95% of the variance
p = PCA(0.95)
W = p.fit_transform(img)
C = p.components_
m = p.mean_
```
%% Cell type:code id: tags:
``` python
original_size = len(img.reshape(-1))
original_size
```
%% Cell type:code id: tags:
``` python
compressed_size = len(W.reshape(-1)) + len(C.reshape(-1)) + len(m.reshape(-1))
compressed_size
```
%% Cell type:code id: tags:
``` python
# compression ratio
original_size / compressed_size
```
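%% Cell type:markdown id: tags:
The size arithmetic above depends only on shapes: an image with `n` rows and `d` columns stores `n * d` numbers, while `k` components need `n * k` (W) plus `k * d` (C) plus `d` (the mean). The helper below is a hypothetical sketch of that count, and the example shape is made up.
%% Cell type:code id: tags:
``` python
def compression_ratio(n_rows, n_cols, n_components):
    """Ratio of numbers stored before vs. after keeping n_components."""
    original = n_rows * n_cols
    compressed = (
        n_rows * n_components     # W: one row of positions per image row
        + n_components * n_cols   # C: one direction per component
        + n_cols                  # mean of each column
    )
    return original / compressed

# made-up example: 100 x 50 array, 5 components kept
example_ratio = compression_ratio(100, 50, 5)
print(example_ratio)
```
Keeping more components lowers the ratio, so the explained-variance target trades image quality against size.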
%% Cell type:code id: tags:
``` python
plt.imshow(W @ C + m, cmap="gray")
```
%% Cell type:code id: tags:
``` python
# savez saves numpy arrays into .npz format
# use wb to write in binary format
with open("img1.npz", "wb") as f:
    np.savez(f, img)
```
%% Cell type:code id: tags:
``` python
with open("img2.npz", "wb") as f:
    np.savez(f, W, C, m)
```
%% Cell type:code id: tags:
``` python
with np.load("img2.npz") as f:
    W, C, m = f.values()
```
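%% Cell type:markdown id: tags:
Unpacking `f.values()` relies on the arrays coming back in the order they were saved. A variation, sketched here with small random arrays and an in-memory buffer instead of a file, is to save with keyword names and load by name; the `_demo` and `_back` names are illustrative.
%% Cell type:code id: tags:
``` python
import io

import numpy as np

rng = np.random.default_rng(0)
W_demo = rng.normal(size=(5, 2))
C_demo = rng.normal(size=(2, 3))
m_demo = rng.normal(size=3)

buffer = io.BytesIO()
# keyword arguments become the names of the arrays inside the archive
np.savez(buffer, W=W_demo, C=C_demo, m=m_demo)
buffer.seek(0)

with np.load(buffer) as archive:
    names = sorted(archive.files)
    W_back, C_back, m_back = archive["W"], archive["C"], archive["m"]

print(names)
print((W_back @ C_back + m_back).shape)
```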
%% Cell type:code id: tags:
``` python
plt.imshow(W @ C + m, cmap="gray")
```
%% Cell type:code id: tags:
``` python
# the original array file (img1.npz) is 33M vs. 876K for the compressed one (img2.npz)
!ls -lh
```