# Clustering

In [None]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

# new import statements
from sklearn.cluster import KMeans, AgglomerativeClustering

# Unsupervised Machine Learning: Clustering

- In classification (supervised), we try to find boundaries/rules to separate points according to pre-determined labels.
- In clustering, the algorithm chooses the labels. The goal is to choose labels so that similar rows get the same label.

### K-Means Clustering

- K: number of clusters:
 - 3-Means => 3 clusters
 - 4-Means => 4 clusters, and so on
- Means: we will find centroids (aka means aka averages) to create clusters

- import statement:
```python
from sklearn.cluster import KMeans
```

#### Iterative algorithm for K-Means

In [None]:
# Generate random data
x, y = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
df = pd.DataFrame(x, columns=["x0", "x1"])
df.head()

In [None]:
def km_scatter(df, **kwargs):
    """
    Produces scatter plot visualizations with x0 on x-axis and x1 on y-axis.
    It can also plot the centroids for clusters.
    Expected columns in df:
    x0 => x-axis
    x1 => y-axis
    label (optional) => marker type
    """
    ax = kwargs.pop("ax", None)
    # Unlabeled data: draw every point with a "?" marker
    if "label" not in df.columns:
        return df.plot.scatter(x="x0", y="x1", marker="$?$", ax=ax, **kwargs)

    # Labeled data: draw each label with its own marker on the same axes
    for marker in set(df["label"]):
        sub_df = df[df["label"] == marker]
        ax = sub_df.plot.scatter(x="x0", y="x1", marker=marker, ax=ax, **kwargs)
    return ax

ax = km_scatter(df, s=100, c="0.7")

### Hard Problem

Finding the best answer. What is the answer? Determining the centroids of the clusters.

### Easier Problem

Take a random answer and make it a little better. Then repeat!
Downside? If randomization leads to a very bad initial choice of centroids, the result might be a bad clustering (for example, a centroid that ends up with no points, giving fewer clusters).

In [None]:
clusters = np.random.uniform(-5, 5, size=(3, 2))
clusters = pd.DataFrame(clusters, columns=["x0", "x1"])
clusters["label"] = ["o", "+", "x"]

ax = km_scatter(df, s=100, c="0.7")
km_scatter(clusters, s=200, c="red", ax=ax)

Two variables for us to deal with:
1. clusters: contains location of centroids and a label for them
2. df: contains the actual data points

In [None]:
clusters

In [None]:
df.head()

In [None]:
class KM:
    def __init__(self, df, clusters):
        # We make copies because we are going to keep changing the dataframe to
        # identify better clusters
        self.df = df.copy()
        self.clusters = clusters.copy()
        self.labels = clusters["label"].values

    def plot(self):
        ax = km_scatter(self.df, color="0.7", s=100)
        km_scatter(self.clusters, ax=ax, color="red", s=200)

    def assign_points(self):
        """
        compute Euclidean distance between each point and each centroid
        """
        pass

    def update_centers(self):
        """
        update centroids by taking mean of the points that are nearest to that
        particular centroid
        """
        pass

"""
High-level algorithm:
1. Start with random locations for centroids
2. Iterate over each data point:
    1. Find the distance (Euclidean distance) between current data point and each centroid.
    2. Find the minimum of those distances and the corresponding label.
    3. Assign current data point to the closest cluster centroid label.
3. Once all points are assigned, compute new centroid for each cluster. Iterate over
   each cluster:
    1. Extract subset of data points which got assigned to current cluster label.
    2. Compute mean of all the assigned data points.
    3. Update cluster centroid.
4. Repeat steps 2 to 3 many times (iterative improvement).
"""

# Creating object instance
km = KM(df, clusters)
km.plot()
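
The two empty methods above can be sketched outside the class. The version below is a minimal illustration, assuming plain NumPy arrays instead of the `df` / `clusters` DataFrames; the function names mirror the `KM` methods, but the signatures and the sample data are hypothetical, not the lecture's implementation.

```python
import numpy as np

def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # Pairwise Euclidean distances, shape (n_points, n_centroids)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update_centers(points, assignments, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = points[assignments == k]
        if len(members) > 0:  # a centroid with no points stays where it is
            new_centroids[k] = members.mean(axis=0)
    return new_centroids

rng = np.random.default_rng(0)
# Three hypothetical groups of 2D points
points = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (-4, 0, 4)])
# Step 1: random starting centroids
centroids = rng.uniform(-5, 5, size=(3, 2))

# Steps 2 to 4: alternate assignment and update a fixed number of times
for _ in range(10):
    assignments = assign_points(points, centroids)
    centroids = update_centers(points, assignments, centroids)

print(centroids)
```

Each pass can only keep or lower the total squared distance between points and their assigned centroids, which is why repeating the two steps improves a random start.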

### `sklearn KMeans`

- import statement:
```python
from sklearn.cluster import KMeans
```
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

**Instantiation:**
`KMeans(n_clusters=<num>, n_init=<num>, max_iter=<num>)`
- `n_clusters`: number of clusters to be formed
- `n_init`: number of initial random seeds to try (to avoid downside of bad initial random choices)
- `max_iter`: maximum number of iterations for a single K-means run (single starting seed)

In [None]:
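# 3 clusters to match the 3 blobs generated above; n_init=10 is an assumed value
km_cluster = KMeans(n_clusters=3, n_init=10)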
km_cluster

In [None]:
df.head()

**Methods:**
1. `fit`: find good centroids
2. `transform`: give me the distances from each point to each centroid
3. `predict`: give me the chosen group labels

**Attributes:**
- `<km object>.cluster_centers_`: coordinates of cluster centers
- `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center

In [None]:
# `fit`: find good centroids
km_cluster.fit(df)
# coordinates of cluster centers
km_cluster.cluster_centers_

**Observation:** 3 rows (because we have 3 clusters), and 2 columns (because the df had 2 columns).

In [None]:
# `transform`: give me the distances from each point to each centroid
km_cluster.transform(df)

**Observations**: Each row corresponds to a row in df. 3 columns correspond to 3 distances to the centroids.

In [None]:
# `predict`: give me the chosen group labels
km_cluster.predict(df)
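
How `transform` and `predict` relate can be checked directly: the predicted label should be the column index of the smallest distance. The sketch below assumes the same `make_blobs` settings as earlier; the `random_state` and `n_init` values passed to `KMeans` are illustrative choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

distances = km.transform(x)   # shape (100, 3): one column per centroid
labels = km.predict(x)        # shape (100,): index of the chosen centroid

# The chosen label is the position of the smallest distance in each row
same = np.array_equal(labels, distances.argmin(axis=1))
print(same)
```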

### How many clusters do we need?

- metric: `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center

In [None]:
km_cluster.inertia_

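**Observation**: for a given number of clusters, we want "inertia" to be as small as possible. Adding more clusters generally keeps lowering inertia, so inertia alone cannot tell us how many clusters to use.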
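
To make the definition concrete, inertia can be recomputed by hand from the output of `transform`. This is a small sketch on the same kind of blob data; the `random_state` and `n_init` values are assumptions for reproducibility.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

x, _ = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

# Distance from each sample to its closest centroid, squared and summed
closest = km.transform(x).min(axis=1)
manual_inertia = (closest ** 2).sum()

print(manual_inertia, km.inertia_)
```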

### Elbow plot to determine `n_clusters`

In [None]:
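# create a series indexed by number of clusters (1 to 10) whose values are the corresponding inertia
s = pd.Series(dtype=float)

for num_clusters in range(1, 11):
    km = KMeans(num_clusters, n_init = 320)
    km.fit(df)
    s.at[num_clusters] = km.inertia_

s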

In [None]:
ax = s.plot.line(figsize=(6, 4))
ax.set_ylabel("Inertia")
ax.set_xlabel("Number of clusters")

**Observation**: there is an "elbow" around `n_clusters`=3.

#### Will we always have a clear "elbow"?

- Let's generate uniform random data

In [None]:
df2 = pd.DataFrame(np.random.uniform(0, 10, (100, 2)))
df2.plot.scatter(0, 1)

In [None]:
s = pd.Series(dtype=float)

for num_clusters in range(1, 11):
 km = KMeans(num_clusters, n_init = 320)
 km.fit(df2)
 s.at[num_clusters] = km.inertia_

ax = s.plot.line(figsize=(6, 4))
ax.set_ylabel("Inertia")
ax.set_xlabel("Number of clusters")

### K-Means use cases:

1. estimator
2. transformer:
 - sometimes we'll use an unsupervised learning technique (like k-means) to pre-process data, creating better inputs for a supervised learning technique (like logistic regression)
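
As a transformer, `KMeans` replaces the original columns with one distance column per centroid. The sketch below uses hypothetical uniform data just to show the change in shape; these distance columns are the inputs a later supervised model would receive.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(200, 2))   # hypothetical 2-column input

km = KMeans(n_clusters=10, n_init=10, random_state=0)
# fit_transform: find centroids, then return distance from each row to each centroid
features = km.fit_transform(X)

print(X.shape, "->", features.shape)
```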

In [None]:
def make_data():
 x, y = datasets.make_blobs(n_samples=250, centers=5, random_state=5)
 xcols = ["x0", "x1"]
 df1 = pd.DataFrame(x, columns=xcols)
 df1["y"] = y > 0

 df2 = pd.DataFrame(np.random.uniform(-10, 10, size=(250, 2)), columns=["x0", "x1"])
 df2["y"] = False

 return pd.concat((df1, df2))

train, test = train_test_split(make_data())

In [None]:
plt.rcParams["font.size"] = 16
fig, ax = plt.subplots(ncols=2, figsize=(10,4))
train.plot.scatter(x="x0", y="x1", c=train["y"], vmin=-1, ax=ax[0])
test.plot.scatter(x="x0", y="x1", c="red", ax=ax[1])
ax[0].set_title("Training Data")
ax[1].set_title("Test Data")
plt.subplots_adjust(wspace=0.4)

#### Objective: use `LogisticRegression` to classify points as "black" or "gray".

In [None]:
model = Pipeline([
 ("km", KMeans(10, n_init = 320)),
 ("lr", LogisticRegression()),
])
# fit the model with train columns "x0", "x1" and train column y
model.fit(train[["x0", "x1"]], train["y"])
# score the model with test columns "x0", "x1" and test column y
model.score(test[["x0", "x1"]], test["y"])


In [None]:
model = Pipeline([
 ("km", KMeans(10, n_init = 320)),
 ("std", StandardScaler()),
 ("lr", LogisticRegression()),
])
model.fit(train[["x0", "x1"]], train["y"])
model.score(test[["x0", "x1"]], test["y"])

### `StandardScaler` with `KMeans`

Recall that `StandardScaler` should always be applied after applying `PolynomialFeatures` (from last lecture).

In [None]:
x = datasets.make_blobs(centers=np.array([(0, 0), (0, 20), (3, 20)]))[0]
df = pd.DataFrame(x)
df.plot.scatter(x=0, y=1, figsize=(6, 4))

In [None]:
km_c = KMeans(2, n_init = 320)
km_c.fit(df)
km_c.predict(df)

#### `fit_predict(...)` is a shortcut for `fit` and `predict` method invocations.

In [None]:
KMeans(2, n_init = 320).fit_predict(df)

In [None]:
# -1 => white, 0 => gray, 1 => black
df.plot.scatter(x=0, y=1, figsize=(6, 4), c=KMeans(2, n_init = 320).fit_predict(df), vmin=-1, vmax=1)

**Observation**: the units for the columns are intentionally not specified.

In [None]:
df

Let's make a copy of the data. Assuming initial data for both columns is in "km", let's convert one column (`0`) into "meters". 

In [None]:
df2 = df.copy()
df2[0] *= 1000 # km => m
df2.head()

In [None]:
df2.plot.scatter(x=0, y=1, figsize=(6,4), c=KMeans(2, n_init = 320).fit_predict(df2), vmin=-1, vmax=1)

**Observations**:
- One would expect to see the same clusters, but that is not happening here. Why?
 - After the conversion, differences along the x-axis (in the thousands) dwarf differences along the y-axis (in the tens), so they dominate the Euclidean distance.
 - That is, KMeans only sees numbers: it has no way of knowing that the two columns are in different units.
- This is not too far off from realistic datasets. 
 - That is, real-world dataset columns might have different units. 
 - For example, one column might represent temperature whereas another might represent distance.

#### Conclusion: `StandardScaler` should be applied before `KMeans`
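
Why standardizing helps can be seen without any clustering: a unit conversion changes the raw values but not the standardized ones. The data below is hypothetical and only mimics the km-to-m conversion above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data with both columns in "km"
km_data = rng.normal(size=(50, 2)) * [1.0, 8.0] + [1.5, 10.0]
mixed = km_data.copy()
mixed[:, 0] *= 1000   # column 0 converted to meters

# Raw spread per column differs wildly after the conversion
print(km_data.std(axis=0), mixed.std(axis=0))

# After standardizing, both versions hold the same values
scaled_km = StandardScaler().fit_transform(km_data)
scaled_mixed = StandardScaler().fit_transform(mixed)
identical = np.allclose(scaled_km, scaled_mixed)
print(identical)
```

Because the standardized inputs match, any distance-based method applied after `StandardScaler` sees the same geometry regardless of the units chosen for each column.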

In [None]:
# pipeline with StandardScaler followed by KMeans with 2 clusters
model = Pipeline([
    ("std", StandardScaler()),
    ("km", KMeans(2, n_init = 320)),
])

df2.plot.scatter(x=0, y=1, figsize=(6, 4), c=model.fit_predict(df2), vmin=-1, vmax=1)

### Wisconsin counties example

In [None]:
df = gpd.read_file("counties.geojson")
df.head()

#### If we want to use "POP100", "AREALAND", "developed", "forest", "pasture", "crops" for clustering, what transformer should we use? 

- StandardScaler.

### Goal here: cluster counties based on similar land usage.

In [None]:
df.plot()

In [None]:
df.plot(column="crops")

In [None]:
df.plot(column="forest")

### KMeans

In [None]:
xcols = ["developed", "forest", "pasture", "crops"]

# instantiate
km_c = KMeans(4, n_init = 320)
# fit
km_c.fit(df[xcols])
# predict
clusters = km_c.predict(df[xcols])

print(km_c.inertia_)
print(clusters)

df.plot(column=clusters, cmap="tab10")

**Observation**: cluster numbers are arbitrary. That is, if you re-run the above cell, the same group of counties may be assigned a different number.
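
Since the numbers themselves carry no meaning, two runs should be compared by grouping rather than by label value. One option, not used in the original notebook, is `adjusted_rand_score` from `sklearn.metrics`; the label arrays below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Two hypothetical runs: identical grouping, different cluster numbers
labels_a = np.array([0, 0, 1, 1, 2, 2, 3])
labels_b = np.array([2, 2, 0, 0, 3, 3, 1])

# Element-wise comparison says the runs disagree everywhere
print((labels_a == labels_b).mean())

# The adjusted Rand score ignores label names and compares groupings only
score = adjusted_rand_score(labels_a, labels_b)
print(score)
```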

### Agglomerative clustering

- import statement
```python
from sklearn.cluster import AgglomerativeClustering
```

In [None]:
xcols = ["developed", "forest", "pasture", "crops"]

# instantiate
km_c = AgglomerativeClustering(4)
# fit
km_c.fit(df[xcols])
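# predict (raises AttributeError: AgglomerativeClustering has no `predict`)
clusters = km_c.predict(df[xcols])

# would also raise AttributeError: there is no `inertia_` attribute
print(km_c.inertia_)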
print(clusters)

df.plot(column=clusters, cmap="tab10")

**Observations**: 
- no centroids => no inertia => no elbow plots (how do we pick cluster count?):
 - AttributeError: 'AgglomerativeClustering' object has no attribute 'inertia_'
- no `predict` method, but there is `fit_predict`:
 - AttributeError: 'AgglomerativeClustering' object has no attribute 'predict'
 - why?
 - because adding a new point could lead to a completely different tree
 - remember that AgglomerativeClustering works bottom-up: it starts with every point as its own cluster and repeatedly merges the closest clusters, whereas KMeans assigns points to a fixed set of centroids
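
The difference between the two estimators can be checked without triggering the errors. This is a small sketch on hypothetical random data with four columns, standing in for the land-use columns.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))   # hypothetical 4-column data

agg = AgglomerativeClustering(n_clusters=4)
# fit_predict builds the merge tree and labels the rows it was fitted on
agg_labels = agg.fit_predict(X)

# KMeans can label new rows; AgglomerativeClustering cannot
print(hasattr(KMeans(n_clusters=4), "predict"))
print(hasattr(agg, "predict"))
# No centroids, so no inertia_ attribute even after fitting
print(hasattr(agg, "inertia_"))
```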

In [None]:
xcols = ["developed", "forest", "pasture", "crops"]

# instantiate
km_c = AgglomerativeClustering(4)
# fit_predict
clusters = km_c.fit_predict(df[xcols])

# print(km_c.inertia_)
print(clusters)

df.plot(column=clusters, cmap="tab10")