Compare revisions

5087efbe
--- a/lecture_material/24-decomposition/24-pca.ipynb
+++ b/lecture_material/24-decomposition/24-pca.ipynb
--- a/lecture_material/24-decomposition/24-pca_001.ipynb
+++ b/lecture_material/24-decomposition/24-pca_001.ipynb
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+from sklearn.pipeline import Pipeline
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+from sklearn.decomposition import PCA
+
+x = np.random.uniform(0.1,5,100)
+noise = np.random.normal(scale=0.3, size=x.size)
+```
+
+%% Cell type:markdown id: tags:
+
+## Intuition: factorization
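+Why is it useful to express something as a few parts multiplied together?
+Because the factored form conveys more information: for example, the roots of a polynomial can be read directly from its factors.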
+
+%% Cell type:code id: tags:
+
+``` python
+# at what points does y=0?
+# y = -x**3 + 7*x**2 - 14*x + 8
+y = (4-x) * (2-x) * (1-x)
+```
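
%% Cell type:markdown id: tags:

A quick check of the factorization idea, as a minimal sketch using only NumPy: the expanded and factored forms describe the same cubic, but only the factored form shows the roots (1, 2, and 4) at a glance. The grid `x_grid` and the other variable names are illustrative and not part of the original notebook.

%% Cell type:code id: tags:

```python
import numpy as np

# coefficients of -x**3 + 7*x**2 - 14*x + 8, highest power first
coeffs = [-1, 7, -14, 8]

# numerically recover the roots of the expanded form
roots = np.sort(np.roots(coeffs).real)

# evaluate both forms on a grid (illustrative) to confirm they agree
x_grid = np.linspace(0.1, 5, 50)
expanded = -x_grid**3 + 7 * x_grid**2 - 14 * x_grid + 8
factored = (4 - x_grid) * (2 - x_grid) * (1 - x_grid)

print(roots)
print(np.allclose(expanded, factored))
```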
+
+%% Cell type:code id: tags:
+
+``` python
+pd.DataFrame({"x": x, "y": y+noise}).plot.scatter(x="x", y="y")
+plt.hlines(0, -1, 6, color="k")
+```
+
+%% Cell type:markdown id: tags:
+
+## Some cool dimensionality reduction examples:
+https://pair-code.github.io/understanding-umap/ \
+https://distill.pub/2016/misread-tsne/
+
+%% Cell type:markdown id: tags:
+
+# Matrix Multiplication
+
+%% Cell type:code id: tags:
+
+``` python
+A = np.random.normal(size=(9, 7))
+B = np.random.normal(size=(6, 14))
+C = np.random.normal(size=(14, 3))
+D = np.random.normal(size=(3, 10))
+```
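
%% Cell type:markdown id: tags:

The four shapes above look chosen to probe which products are defined; that reading is an assumption, since the notebook does not say. A minimal sketch of the rule: `X @ Y` works only when the number of columns of `X` equals the number of rows of `Y`, and the result takes the rows of `X` and the columns of `Y`. The seeded generator `rng` is an illustrative substitute for the unseeded calls in the notebook.

%% Cell type:code id: tags:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
A = rng.normal(size=(9, 7))
B = rng.normal(size=(6, 14))
C = rng.normal(size=(14, 3))
D = rng.normal(size=(3, 10))

# inner dimensions match: (6, 14) @ (14, 3) -> (6, 3)
print((B @ C).shape)

# chaining works as long as each adjacent pair matches: (6, 3) @ (3, 10) -> (6, 10)
print((B @ C @ D).shape)

# inner dimensions differ (7 vs. 6), so NumPy raises ValueError
failed = False
try:
    A @ B
except ValueError as err:
    failed = True
    print("A @ B is not defined:", err)
```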
+
+%% Cell type:markdown id: tags:
+
+# Decomposition with Principal Component Analysis (PCA)
+Q: Is it possible to use fewer columns to represent this dataframe?
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
+df["C"] = df["A"] * 2
+df["D"] = df["A"] - df["B"]
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+A: Yes. C is twice A and D is A - B, so A and B, together with their relationships to C and D, are enough to represent the dataframe.
+
+%% Cell type:markdown id: tags:
+
+# PCA on two columns
+
+%% Cell type:code id: tags:
+
+``` python
+# plot columns A and B
+df.plot.scatter("A", "B")
+```
+
+%% Cell type:markdown id: tags:
+
+## sklearn.decomposition.PCA
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df[["A", "B"]])
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# PCA will first find the mean
+mean_point = p.mean_
+mean_point
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "B"]].mean()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot mean point
+df.plot.scatter("A", "B")
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+```
+
+%% Cell type:markdown id: tags:
+
+C is called the **component matrix**. \
+The first row of C is the most important component, \
+the second row of C is the second most important component, \
+and so on.
+
+Each row is a unit-length direction vector; the slope of the component's line is the row's second entry divided by its first.
+
+%% Cell type:code id: tags:
+
+``` python
+# two components for 2d data
+C
+```
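
%% Cell type:markdown id: tags:

A small sketch of what the rows of the component matrix look like numerically. It rebuilds the A and B columns with the same `make_blobs` call as the notebook; the names `X`, `row_lengths`, and `gram` are illustrative. The rows should have length 1 and be mutually perpendicular, so `C @ C.T` should be close to the identity.

%% Cell type:code id: tags:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# same data source as the notebook's A and B columns
X, _ = make_blobs(centers=2, random_state=320)

p = PCA().fit(X)
C = p.components_

# each row is a direction vector of length 1
row_lengths = np.linalg.norm(C, axis=1)

# rows are perpendicular to each other, so this is close to the identity matrix
gram = C @ C.T

# slope of the first component's line, as used when plotting it
slope = C[0, 1] / C[0, 0]

print(row_lengths)
print(gram.round(6))
print(slope)
```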
+
+%% Cell type:markdown id: tags:
+
+For the first component, PCA fits a line that passes through the mean point and
+along which the points have the largest spread (variance). \
+The second component is perpendicular to the first, also passes through the mean point,
+and has the largest remaining spread along its direction.
+
+%% Cell type:code id: tags:
+
+``` python
+# plot first component
+df.plot.scatter("A", "B")
+
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+span = 6
+point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
+point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
+x = [point2[0], point3[0]]
+y = [point2[1], point3[1]]
+plt.plot(x, y, linestyle="-", color="red")
+```
+
+%% Cell type:markdown id: tags:
+
+The first column of W gives each point's position along the first component, relative to the mean point. \
+The second column of W gives each point's position along the second component, \
+and so on.
+
+%% Cell type:code id: tags:
+
+``` python
+W[:10]
+```
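
%% Cell type:markdown id: tags:

One way to see what W holds, sketched under the assumption that the data is rebuilt with the same `make_blobs` call as the notebook: subtract the mean from every point and project the result onto each component. The name `W_manual` is illustrative.

%% Cell type:code id: tags:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(centers=2, random_state=320)

p = PCA()
W = p.fit_transform(X)
C = p.components_

# center the points on the mean, then project onto each component (row of C)
W_manual = (X - p.mean_) @ C.T

print(np.allclose(W, W_manual))

# positions are relative to the mean, so each column of W averages to about 0
print(W.mean(axis=0).round(6))
```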
+
+%% Cell type:code id: tags:
+
+``` python
+print(W.shape, C.shape)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+print(df[["A", "B"]].shape)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# use W and C to reconstruct the original A & B columns
+pd.DataFrame((W @ C) + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "B"]].head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# use only the first component to approximately reconstruct the A & B columns
+# multiply the first column of W (positions of the points along the first component) by the first row of C (the first component)
+pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
+```
+
+%% Cell type:markdown id: tags:
+
+## Explained Variance
+
+%% Cell type:code id: tags:
+
+``` python
+a = np.array([1.1, 1.9, 3.2])
+a
+```
+
+%% Cell type:code id: tags:
+
+``` python
+b = np.array([1, 2, 3])
+b
+```
+
+%% Cell type:code id: tags:
+
+``` python
+a - b
+```
+
+%% Cell type:code id: tags:
+
+``` python
+a.var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+(a - b).var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+1 - (a - b).var() / a.var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# the amount of variance explained by each component
+# the first component has the largest explained variance
+# the second component has the second largest explained variance
+# and so on
+explained_variance = p.explained_variance_
+explained_variance
+```
+
+%% Cell type:code id: tags:
+
+``` python
+explained_variance / explained_variance.sum()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# explained variance as a fraction of the total variance
+p.explained_variance_ratio_
+```
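
%% Cell type:markdown id: tags:

The `1 - (a - b).var() / a.var()` pattern from the small example can be tied back to PCA. The sketch below, which rebuilds the A and B columns with the notebook's `make_blobs` call, reconstructs the data from the first component only and measures how much of the total variance survives. The names `approx`, `residual`, and `explained` are illustrative.

%% Cell type:code id: tags:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(centers=2, random_state=320)

p = PCA()
W = p.fit_transform(X)
C = p.components_

# approximate reconstruction from the first component only
approx = W[:, :1] @ C[:1, :] + p.mean_
residual = X - approx

# total variance = sum of the per-column variances
total_var = X.var(axis=0).sum()
residual_var = residual.var(axis=0).sum()

# fraction of variance kept by the one-component reconstruction
explained = 1 - residual_var / total_var

print(explained)
print(p.explained_variance_ratio_[0])
```

The two printed numbers should agree up to floating-point error.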
+
+%% Cell type:markdown id: tags:
+
+# PCA on two dependent columns
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df[["A", "C"]])
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+mean = p.mean_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot A & C columns and the mean
+df.plot.scatter("A", "C")
+mean_point = [mean[0],mean[1]]
+plt.plot(mean[0],mean[1], marker="X", markersize=20, color="red")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot the first component
+df.plot.scatter("A", "C")
+mean_point = [mean[0],mean[1]]
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+span = 6
+point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
+point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
+x = [point2[0], point3[0]]
+y = [point2[1], point3[1]]
+plt.plot(x, y, linestyle="-", color="red")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+p.explained_variance_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# note that the first component explains 100% of the variance
+# because C is exactly two times A
+# the first component captures the 2x relationship in its slope
+p.explained_variance_ratio_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we can reconstruct A & C using only one component
+pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "C"]].head()
+```
+
+%% Cell type:markdown id: tags:
+
+# PCA on all columns
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df)
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# four components for 4d data
+C.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+p.explained_variance_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# note that the first two components explain 100% of the variance
+ev_ratio = p.explained_variance_ratio_
+ev_ratio
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we can reconstruct the original dataframe using only the first two components
+pd.DataFrame(W[:, :2] @ C[:2, :] + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### Cumulative plot of explained variance ratio
+
+%% Cell type:code id: tags:
+
+``` python
+# cumsum() computes the cumulative sum
+s = pd.Series(p.explained_variance_ratio_.cumsum(), index=range(1,5))
+ax = s.plot.line(ylim=0)
+ax.set_ylabel("Explained Variance")
+ax.set_xlabel("Component")
+```
+
+%% Cell type:markdown id: tags:
+
+# Dimensionality Reduction on Feature Columns
+
+%% Cell type:code id: tags:
+
+``` python
+pipe = Pipeline([
+    ("pca", PCA(2)),
+    # n_components parameter:
+    # an int gives the number of components to keep;
+    # a float between 0 and 1 gives the fraction of variance to explain (see explained_variance_ratio_)
+    ("km", KMeans(2)),
+])
+
+pipe.fit(df) # fit PCA, transform using PCA, fit KMeans using output from PCA
+
+groups = pipe.predict(df) # transform using PCA, then assign clusters using KMeans
+```
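
%% Cell type:markdown id: tags:

A sketch of the float form of `n_components`, using a dataframe built the same way as the notebook's `df`. With a float, PCA keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. Because C and D are derived from A and B, at most two components should be needed here.

%% Cell type:code id: tags:

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# same construction as the notebook's dataframe
df = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
df["C"] = df["A"] * 2
df["D"] = df["A"] - df["B"]

# ask for enough components to explain at least 95% of the variance
p = PCA(n_components=0.95).fit(df)

print(p.n_components_)                     # number of components actually kept
print(p.explained_variance_ratio_.sum())   # variance fraction they explain
```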
+
+%% Cell type:code id: tags:
+
+``` python
+# -1 is white
+pd.DataFrame(pipe["pca"].transform(df)).plot.scatter(x=0, y=1, c=groups, vmin=-1)
+```
+
+%% Cell type:markdown id: tags:
+
+# Lossy Compression
+
+Use PCA to keep the most important information and throw away the less important parts.
+
+%% Cell type:code id: tags:
+
+``` python
+img = plt.imread("bug.jpeg")
+plt.imshow(img)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+img.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# average over the color dimension to make the image easier to handle
+img = img.mean(axis=2)
+img.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(img, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we want to explain 95% of the variance
+p = PCA(0.95)
+W = p.fit_transform(img)
+C = p.components_
+m = p.mean_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+original_size = len(img.reshape(-1))
+original_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+compressed_size = len(W.reshape(-1)) + len(C.reshape(-1)) + len(m.reshape(-1))
+compressed_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# compression ratio
+original_size / compressed_size
+```
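
%% Cell type:markdown id: tags:

The size bookkeeping above can be tried without `bug.jpeg`. The sketch below swaps in a synthetic grayscale array (two smooth patterns plus a little noise), which is an assumption made purely for illustration; the real image will give different numbers. It counts stored values before and after and checks how far the reconstruction is from the original.

%% Cell type:code id: tags:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for the image (hypothetical data, not bug.jpeg)
rng = np.random.default_rng(0)
u = np.linspace(0, 1, 200)
v = np.linspace(0, 1, 300)
img = (100 * np.outer(np.sin(2 * np.pi * u), np.cos(2 * np.pi * v))
       + 50 * np.outer(u, v)
       + rng.normal(scale=1.0, size=(200, 300)))

# keep enough components to explain 95% of the variance
p = PCA(0.95)
W = p.fit_transform(img)
C = p.components_
m = p.mean_

# number of values stored before and after
original_size = img.size
compressed_size = W.size + C.size + m.size

# lossy reconstruction and its relative error
approx = W @ C + m
rel_error = np.linalg.norm(img - approx) / np.linalg.norm(img)

print(original_size / compressed_size)
print(rel_error)
```

The ratio depends heavily on how much structure the image has: the fewer components needed to reach 95%, the smaller W and C become.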
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(W @ C + m, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# savez saves numpy arrays into .npz format
+# use wb to write in binary format
+with open("img1.npz", "wb") as f:
+    np.savez(f, img)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+with open("img2.npz", "wb") as f:
+    np.savez(f, W, C, m)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+with np.load("img2.npz") as f:
+    W, C, m = f.values()
+```
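
%% Cell type:markdown id: tags:

When arrays are passed to `np.savez` positionally, they are stored under the names `arr_0`, `arr_1`, and so on, so unpacking `f.values()` relies on the stored order. A possible alternative, sketched here with small random arrays and an in-memory buffer instead of a file (both are illustrative choices), is to save with keyword names and load by name.

%% Cell type:code id: tags:

```python
import io
import numpy as np

# small stand-ins for W, C, and m (hypothetical shapes)
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 2))
C = rng.normal(size=(2, 4))
m = rng.normal(size=4)

# save under explicit names; an in-memory buffer replaces the file on disk
buf = io.BytesIO()
np.savez(buf, W=W, C=C, m=m)
buf.seek(0)

# load each array by the name it was saved under
with np.load(buf) as f:
    names = sorted(f.files)
    W2, C2, m2 = f["W"], f["C"], f["m"]

print(names)
print(np.array_equal(W @ C + m, W2 @ C2 + m2))
```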
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(W @ C + m, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# the original image file is 33M vs. 876K for the compressed file
+!ls -lh
+```
--- a/lecture_material/24-decomposition/24-pca_002.ipynb
+++ b/lecture_material/24-decomposition/24-pca_002.ipynb
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+from sklearn.pipeline import Pipeline
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+from sklearn.decomposition import PCA
+
+x = np.random.uniform(0.1,5,100)
+noise = np.random.normal(scale=0.3, size=x.size)
+```
+
+%% Cell type:markdown id: tags:
+
+## Intuition: factorization
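+Why is it useful to express something as a few parts multiplied together?
+Because the factored form conveys more information: for example, the roots of a polynomial can be read directly from its factors.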
+
+%% Cell type:code id: tags:
+
+``` python
+# at what points does y=0?
+# y = -x**3 + 7*x**2 - 14*x + 8
+y = (4-x) * (2-x) * (1-x)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+pd.DataFrame({"x": x, "y": y+noise}).plot.scatter(x="x", y="y")
+plt.hlines(0, -1, 6, color="k")
+```
+
+%% Cell type:markdown id: tags:
+
+## Some cool dimensionality reduction examples:
+https://pair-code.github.io/understanding-umap/ \
+https://distill.pub/2016/misread-tsne/
+
+%% Cell type:markdown id: tags:
+
+# Matrix Multiplication
+
+%% Cell type:code id: tags:
+
+``` python
+A = np.random.normal(size=(9, 7))
+B = np.random.normal(size=(6, 14))
+C = np.random.normal(size=(14, 3))
+D = np.random.normal(size=(3, 10))
+```
+
+%% Cell type:markdown id: tags:
+
+# Decomposition with Principal Component Analysis (PCA)
+Q: Is it possible to use fewer columns to represent this dataframe?
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
+df["C"] = df["A"] * 2
+df["D"] = df["A"] - df["B"]
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+A: Yes. C is twice A and D is A - B, so A and B, together with their relationships to C and D, are enough to represent the dataframe.
+
+%% Cell type:markdown id: tags:
+
+# PCA on two columns
+
+%% Cell type:code id: tags:
+
+``` python
+# plot columns A and B
+df.plot.scatter("A", "B")
+```
+
+%% Cell type:markdown id: tags:
+
+## sklearn.decomposition.PCA
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df[["A", "B"]])
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# PCA will first find the mean
+mean_point = p.mean_
+mean_point
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "B"]].mean()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot mean point
+df.plot.scatter("A", "B")
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+```
+
+%% Cell type:markdown id: tags:
+
+C is called the **component matrix**. \
+The first row of C is the most important component, \
+the second row of C is the second most important component, \
+and so on.
+
+Each row is a unit-length direction vector; the slope of the component's line is the row's second entry divided by its first.
+
+%% Cell type:code id: tags:
+
+``` python
+# two components for 2d data
+C
+```
+
+%% Cell type:markdown id: tags:
+
+For the first component, PCA fits a line that passes through the mean point and
+along which the points have the largest spread (variance). \
+The second component is perpendicular to the first, also passes through the mean point,
+and has the largest remaining spread along its direction.
+
+%% Cell type:code id: tags:
+
+``` python
+# plot first component
+df.plot.scatter("A", "B")
+
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+span = 6
+point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
+point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
+x = [point2[0], point3[0]]
+y = [point2[1], point3[1]]
+plt.plot(x, y, linestyle="-", color="red")
+```
+
+%% Cell type:markdown id: tags:
+
+The first column of W gives each point's position along the first component, relative to the mean point. \
+The second column of W gives each point's position along the second component, \
+and so on.
+
+%% Cell type:code id: tags:
+
+``` python
+W[:10]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+print(W.shape, C.shape)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+print(df[["A", "B"]].shape)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# use W and C to reconstruct the original A & B columns
+pd.DataFrame((W @ C) + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "B"]].head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# use only the first component to approximately reconstruct the A & B columns
+# multiply the first column of W (positions of the points along the first component) by the first row of C (the first component)
+pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
+```
+
+%% Cell type:markdown id: tags:
+
+## Explained Variance
+
+%% Cell type:code id: tags:
+
+``` python
+a = np.array([1.1, 1.9, 3.2])
+a
+```
+
+%% Cell type:code id: tags:
+
+``` python
+b = np.array([1, 2, 3])
+b
+```
+
+%% Cell type:code id: tags:
+
+``` python
+a - b
+```
+
+%% Cell type:code id: tags:
+
+``` python
+a.var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+(a - b).var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+1 - (a - b).var() / a.var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# the amount of variance explained by each component
+# the first component has the largest explained variance
+# the second component has the second largest explained variance
+# and so on
+explained_variance = p.explained_variance_
+explained_variance
+```
+
+%% Cell type:code id: tags:
+
+``` python
+explained_variance / explained_variance.sum()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# explained variance as a fraction of the total variance
+p.explained_variance_ratio_
+```
+
+%% Cell type:markdown id: tags:
+
+# PCA on two dependent columns
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df[["A", "C"]])
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+mean = p.mean_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot A & C columns and the mean
+df.plot.scatter("A", "C")
+mean_point = [mean[0],mean[1]]
+plt.plot(mean[0],mean[1], marker="X", markersize=20, color="red")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot the first component
+df.plot.scatter("A", "C")
+mean_point = [mean[0],mean[1]]
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+span = 6
+point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
+point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
+x = [point2[0], point3[0]]
+y = [point2[1], point3[1]]
+plt.plot(x, y, linestyle="-", color="red")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+p.explained_variance_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# note that the first component explains 100% of the variance
+# because C is exactly two times A
+# the first component captures the 2x relationship in its slope
+p.explained_variance_ratio_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we can reconstruct A & C using only one component
+pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "C"]].head()
+```
+
+%% Cell type:markdown id: tags:
+
+# PCA on all columns
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df)
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# four components for 4d data
+C.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+p.explained_variance_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# note that the first two components explain 100% of the variance
+ev_ratio = p.explained_variance_ratio_
+ev_ratio
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we can reconstruct the original dataframe using only the first two components
+pd.DataFrame(W[:, :2] @ C[:2, :] + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### Cumulative plot of explained variance ratio
+
+%% Cell type:code id: tags:
+
+``` python
+# cumsum() computes the cumulative sum
+s = pd.Series(p.explained_variance_ratio_.cumsum(), index=range(1,5))
+ax = s.plot.line(ylim=0)
+ax.set_ylabel("Explained Variance")
+ax.set_xlabel("Component")
+```
+
+%% Cell type:markdown id: tags:
+
+# Dimensionality Reduction on Feature Columns
+
+%% Cell type:code id: tags:
+
+``` python
+pipe = Pipeline([
+    ("pca", PCA(2)),
+    # n_components parameter:
+    # an int gives the number of components to keep;
+    # a float between 0 and 1 gives the fraction of variance to explain (see explained_variance_ratio_)
+    ("km", KMeans(2)),
+])
+
+pipe.fit(df) # fit PCA, transform using PCA, fit KMeans using output from PCA
+
+groups = pipe.predict(df) # transform using PCA, then assign clusters using KMeans
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# -1 is white
+pd.DataFrame(pipe["pca"].transform(df)).plot.scatter(x=0, y=1, c=groups, vmin=-1)
+```
+
+%% Cell type:markdown id: tags:
+
+# Lossy Compression
+
+Use PCA to keep the most important information and throw away the less important parts.
+
+%% Cell type:code id: tags:
+
+``` python
+img = plt.imread("bug.jpeg")
+plt.imshow(img)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+img.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# average over the color dimension to make the image easier to handle
+img = img.mean(axis=2)
+img.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(img, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we want to explain 95% of the variance
+p = PCA(0.95)
+W = p.fit_transform(img)
+C = p.components_
+m = p.mean_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+original_size = len(img.reshape(-1))
+original_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+compressed_size = len(W.reshape(-1)) + len(C.reshape(-1)) + len(m.reshape(-1))
+compressed_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# compression ratio
+original_size / compressed_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(W @ C + m, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# savez saves numpy arrays into .npz format
+# use wb to write in binary format
+with open("img1.npz", "wb") as f:
+    np.savez(f, img)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+with open("img2.npz", "wb") as f:
+    np.savez(f, W, C, m)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+with np.load("img2.npz") as f:
+    W, C, m = f.values()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(W @ C + m, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# the original image file is 33M vs. 876K for the compressed file
+!ls -lh
+```
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+from sklearn.pipeline import Pipeline
+from sklearn.cluster import KMeans
+from sklearn.datasets import make_blobs
+from sklearn.decomposition import PCA
+
+x = np.random.uniform(0.1,5,100)
+noise = np.random.normal(scale=0.3, size=x.size)
+```
+
+%% Cell type:markdown id: tags:
+
+## Intuition: factorization
+Why is it useful to express something as a few parts multiplied together?
+The factored form conveys more information: for example, the points where y=0 can be read directly from the factors
+
+%% Cell type:code id: tags:
+
+``` python
+# at what points does y=0?
+# y = -x**3 + 7*x**2 - 14*x + 8
+y = (4-x) * (2-x) * (1-x)
+```
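+
+%% Cell type:markdown id: tags:
+
+As a quick check of the factorization idea, the sketch below compares the expanded and factored forms of the polynomial above. The names `coeffs`, `xs`, `expanded`, and `factored` are introduced here only for illustration.
+
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+
+# coefficients of -x**3 + 7*x**2 - 14*x + 8, highest power first
+coeffs = [-1, 7, -14, 8]
+
+# the factored form (4-x)*(2-x)*(1-x) shows the roots directly: 1, 2 and 4
+roots = np.sort(np.roots(coeffs).real)
+print(roots)
+
+# both forms give the same values on a grid of x values
+xs = np.linspace(0.1, 5, 50)
+expanded = np.polyval(coeffs, xs)
+factored = (4 - xs) * (2 - xs) * (1 - xs)
+print(np.allclose(expanded, factored))
+```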
+
+%% Cell type:code id: tags:
+
+``` python
+pd.DataFrame({"x": x, "y": y+noise}).plot.scatter(x="x", y="y")
+plt.hlines(0, -1, 6, color="k")
+```
+
+%% Cell type:markdown id: tags:
+
+## Some cool dimensionality reduction examples:
+https://pair-code.github.io/understanding-umap/ \
+https://distill.pub/2016/misread-tsne/
+
+%% Cell type:markdown id: tags:
+
+# Matrix Multiplication
+
+%% Cell type:code id: tags:
+
+``` python
+A = np.random.normal(size=(9, 7))
+B = np.random.normal(size=(6, 14))
+C = np.random.normal(size=(14, 3))
+D = np.random.normal(size=(3, 10))
+```
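+
+%% Cell type:markdown id: tags:
+
+A small sketch of which of these matrices can be multiplied, assuming the same shapes as above: `X @ Y` is defined only when the number of columns of `X` equals the number of rows of `Y`. The `compatible` flag is just an illustrative name.
+
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+
+A = np.random.normal(size=(9, 7))
+B = np.random.normal(size=(6, 14))
+C = np.random.normal(size=(14, 3))
+D = np.random.normal(size=(3, 10))
+
+# (6, 14) @ (14, 3) -> (6, 3): the inner dimensions match
+print((B @ C).shape)
+# chaining also works: (6, 14) @ (14, 3) @ (3, 10) -> (6, 10)
+print((B @ C @ D).shape)
+
+# (9, 7) @ (6, 14): inner dimensions 7 and 6 differ, so NumPy raises ValueError
+try:
+    A @ B
+    compatible = True
+except ValueError:
+    compatible = False
+print(compatible)
+```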
+
+%% Cell type:markdown id: tags:
+
+# Decomposition with Principal Component Analysis (PCA)
+Q: Is it possible to use fewer columns to represent this dataframe?
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
+df["C"] = df["A"] * 2
+df["D"] = df["A"] - df["B"]
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+A: Yes. C is two times A and D is A - B, so we only need A and B, plus their relationships to C and D, to represent the dataframe.
+
+%% Cell type:markdown id: tags:
+
+# PCA on two columns
+
+%% Cell type:code id: tags:
+
+``` python
+# plot the A & B columns
+df.plot.scatter("A", "B")
+```
+
+%% Cell type:markdown id: tags:
+
+## sklearn.decomposition.PCA
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df[["A", "B"]])
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# PCA will first find the mean
+mean_point = p.mean_
+mean_point
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "B"]].mean()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot mean point
+df.plot.scatter("A", "B")
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+```
+
+%% Cell type:markdown id: tags:
+
+C is called the **component matrix** \
+first row of C is the most important component \
+second row of C is the second most important component \
+and so on ...
+
+Each row is a unit-length direction vector for its component; for 2D data, the slope of the component is the second entry divided by the first
+
+%% Cell type:code id: tags:
+
+``` python
+# two components for 2d data
+C
+```
+
+%% Cell type:markdown id: tags:
+
+For the first component, PCA fits a line that crosses the mean point and
+along which the points have the largest spread (variance). \
+The second component is perpendicular to the first component, also crosses the mean point,
+and has the largest remaining spread in its direction.
+
+%% Cell type:code id: tags:
+
+``` python
+# plot first component
+df.plot.scatter("A", "B")
+
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+span = 6
+point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
+point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
+x = [point2[0], point3[0]]
+y = [point2[1], point3[1]]
+plt.plot(x, y, linestyle="-", color="red")
+```
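+
+%% Cell type:markdown id: tags:
+
+The perpendicularity of the components can be checked numerically. This is a minimal, self-contained sketch that regenerates the A & B data with the same `make_blobs` call. `X` and `gram` are illustrative names that do not appear elsewhere in the notebook.
+
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+from sklearn.datasets import make_blobs
+from sklearn.decomposition import PCA
+
+X = make_blobs(centers=2, random_state=320)[0]
+p = PCA().fit(X)
+C = p.components_
+
+# rows have unit length and are perpendicular to each other,
+# so C @ C.T should be (numerically) the identity matrix
+gram = C @ C.T
+print(np.round(gram, 6))
+
+# slope of the first component in the A-B plane
+slope = C[0][1] / C[0][0]
+print(slope)
+```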
+
+%% Cell type:markdown id: tags:
+
+First column of W represents relative positions of points along the first component \
+Second column of W represents relative positions of points along the second component \
+and so on ...
+
+%% Cell type:code id: tags:
+
+``` python
+W[:10]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+print(W.shape, C.shape)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+print(df[["A", "B"]].shape)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# use W and C to reconstruct the original A & B columns
+pd.DataFrame((W @ C) + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "B"]].head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# use only the first component to approximately reconstruct A & B columns
+# the first column of W (relative positions of the points along the first component) multiplied by the first row of C (the first component)
+pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
+```
+
+%% Cell type:markdown id: tags:
+
+## Explained Variance
+
+%% Cell type:code id: tags:
+
+``` python
+a = np.array([1.1, 1.9, 3.2])
+a
+```
+
+%% Cell type:code id: tags:
+
+``` python
+b = np.array([1, 2, 3])
+b
+```
+
+%% Cell type:code id: tags:
+
+``` python
+a - b
+```
+
+%% Cell type:code id: tags:
+
+``` python
+a.var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+(a - b).var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+1 - (a - b).var() / a.var()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# the amount of variance explained by each component
+# the first component has largest explained variance
+# the second component has the second largest explained variance
+# and so on
+explained_variance = p.explained_variance_
+explained_variance
+```
+
+%% Cell type:code id: tags:
+
+``` python
+explained_variance / explained_variance.sum()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# explained variance as a fraction of the total variance
+p.explained_variance_ratio_
+```
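+
+%% Cell type:markdown id: tags:
+
+One way to read `explained_variance_`: it should match the sample variance of each column of W. The sketch below checks this on the regenerated A & B data, under that assumption. `X`, `col_var`, and `total_var` are illustrative names.
+
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+from sklearn.datasets import make_blobs
+from sklearn.decomposition import PCA
+
+X = make_blobs(centers=2, random_state=320)[0]
+p = PCA()
+W = p.fit_transform(X)
+
+# sample variance (ddof=1) of the positions along each component
+col_var = W.var(axis=0, ddof=1)
+print(col_var, p.explained_variance_)
+
+# the total variance of the original columns is preserved
+total_var = X.var(axis=0, ddof=1).sum()
+print(np.isclose(col_var.sum(), total_var))
+```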
+
+%% Cell type:markdown id: tags:
+
+# PCA on two dependent columns
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df[["A", "C"]])
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+mean = p.mean_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot A & C columns and the mean
+df.plot.scatter("A", "C")
+mean_point = [mean[0],mean[1]]
+plt.plot(mean[0],mean[1], marker="X", markersize=20, color="red")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# plot the first component
+df.plot.scatter("A", "C")
+mean_point = [mean[0],mean[1]]
+plt.plot(mean_point[0], mean_point[1], marker="X", markersize=20, color="red")
+span = 6
+point2 = [span + mean_point[0], C[0][1] / C[0][0] * span + mean_point[1]]
+point3 = [-span + mean_point[0], C[0][1] / C[0][0] * (-span) + mean_point[1]]
+x = [point2[0], point3[0]]
+y = [point2[1], point3[1]]
+plt.plot(x, y, linestyle="-", color="red")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+p.explained_variance_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# note that the first component explains 100% of the variance
+# because C is two times A
+# the first component captures the 2x relationship in its slope
+p.explained_variance_ratio_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we can reconstruct A & C only using one component
+pd.DataFrame(W[:, :1] @ C[:1, :] + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["A", "C"]].head()
+```
+
+%% Cell type:markdown id: tags:
+
+# PCA on all columns
+
+%% Cell type:code id: tags:
+
+``` python
+p = PCA()
+W = p.fit_transform(df)
+C = p.components_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# four components for 4d data
+C.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+p.explained_variance_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# note that the first two components explain essentially 100% of the variance
+ev_ratio = p.explained_variance_ratio_
+ev_ratio
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we can reconstruct the original dataframe only using the first two components
+pd.DataFrame(W[:, :2] @ C[:2, :] + p.mean_).head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### Cumulative plot of explained variance ratio
+
+%% Cell type:code id: tags:
+
+``` python
+# cumsum() computes the cumulative sum
+s = pd.Series(p.explained_variance_ratio_.cumsum(), index=range(1,5))
+ax = s.plot.line(ylim=0)
+ax.set_ylabel("Explained Variance")
+ax.set_xlabel("Component")
+```
+
+%% Cell type:markdown id: tags:
+
+# Dimensionality Reduction on Feature Columns
+
+%% Cell type:code id: tags:
+
+``` python
+pipe = Pipeline([
+    ("pca", PCA(2)),
+    # n_components parameter:
+    # specify an int for the number of components to use,
+    # or a float between 0 and 1 for the fraction of variance we want to explain (explained_variance_ratio_)
+    ("km", KMeans(2)),
+])
+
+pipe.fit(df) # fit PCA, transform using PCA, fit KMeans using output from PCA
+
+groups = pipe.predict(df) # transform using PCA, then predict clusters using KMeans
+```
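+
+%% Cell type:markdown id: tags:
+
+To illustrate the two ways of setting `n_components`, the sketch below rebuilds the four-column dataframe and fits PCA once with an int and once with a float. `p_int` and `p_float` are hypothetical names used only here.
+
+%% Cell type:code id: tags:
+
+``` python
+import pandas as pd
+from sklearn.datasets import make_blobs
+from sklearn.decomposition import PCA
+
+df = pd.DataFrame(make_blobs(centers=2, random_state=320)[0], columns=["A", "B"])
+df["C"] = df["A"] * 2
+df["D"] = df["A"] - df["B"]
+
+# int: keep exactly this many components
+p_int = PCA(2).fit(df)
+print(p_int.n_components_)
+
+# float in (0, 1): keep the fewest components whose cumulative
+# explained variance ratio reaches that fraction
+p_float = PCA(0.95).fit(df)
+print(p_float.n_components_, p_float.explained_variance_ratio_.cumsum())
+```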
+
+%% Cell type:code id: tags:
+
+``` python
+# -1 is white
+pd.DataFrame(pipe["pca"].transform(df)).plot.scatter(x=0, y=1, c=groups, vmin=-1)
+```
+
+%% Cell type:markdown id: tags:
+
+# Lossy Compression
+
+Use PCA to keep the most important information and throw away the less important parts
+
+%% Cell type:code id: tags:
+
+``` python
+img = plt.imread("bug.jpeg")
+plt.imshow(img)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+img.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# average over the color channels (axis 2) to get a grayscale image that is easier to handle
+img = img.mean(axis=2)
+img.shape
+```
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(img, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# we want to explain 95% of the variance
+p = PCA(0.95)
+W = p.fit_transform(img)
+C = p.components_
+m = p.mean_
+```
+
+%% Cell type:code id: tags:
+
+``` python
+original_size = len(img.reshape(-1))
+original_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+compressed_size = len(W.reshape(-1)) + len(C.reshape(-1)) + len(m.reshape(-1))
+compressed_size
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# compression ratio
+original_size / compressed_size
+```
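+
+%% Cell type:markdown id: tags:
+
+The size accounting above can be reproduced without `bug.jpeg`. The sketch below uses a synthetic, nearly rank-5 array as a hypothetical stand-in for the grayscale image, so its numbers say nothing about the real picture. With `n` rows, `d` columns, and `k` kept components, the compressed form stores `n*k + k*d + d` values instead of `n*d`.
+
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+from sklearn.decomposition import PCA
+
+# hypothetical stand-in for the image: 200 x 300, nearly rank 5 plus small noise
+rng = np.random.default_rng(0)
+img = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 300))
+img += rng.normal(scale=0.01, size=img.shape)
+
+p = PCA(0.95)
+W = p.fit_transform(img)
+C, m = p.components_, p.mean_
+
+n, d = img.shape
+k = p.n_components_
+# stored values: W is n x k, C is k x d, m has d entries
+compressed_size = n * k + k * d + d
+print(k, img.size / compressed_size)
+
+# reconstruction error relative to the mean-centered image
+err = np.linalg.norm(img - (W @ C + m)) / np.linalg.norm(img - m)
+print(err)
+```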
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(W @ C + m, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# savez saves numpy arrays into .npz format
+# use wb to write in binary format
+with open("img1.npz", "wb") as f:
+    np.savez(f, img)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+with open("img2.npz", "wb") as f:
+    np.savez(f, W, C, m)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+with np.load("img2.npz") as f:
+    W, C, m = f.values()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+plt.imshow(W @ C + m, cmap="gray")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# the original image file (img1.npz) is 33M vs. 876K for the compressed one (img2.npz)
+!ls -lh
+```
--- a/lecture_material/24-decomposition/bug.jpg
+++ b/lecture_material/24-decomposition/bug.jpg
--- a/lecture_material/25-unsupervised-recap/25-unsupervised-recap.pdf
+++ b/lecture_material/25-unsupervised-recap/25-unsupervised-recap.pdf
--- a/lecture_material/25-unsupervised-recap/25-unsupervised-recap.ppt
+++ b/lecture_material/25-unsupervised-recap/25-unsupervised-recap.ppt
--- a/lecture_material/26-parallelism/26-parallelism.pdf
+++ b/lecture_material/26-parallelism/26-parallelism.pdf
--- a/lecture_material/26-parallelism/26-parallelism.pptx
+++ b/lecture_material/26-parallelism/26-parallelism.pptx