{ "cells": [ { "cell_type": "markdown", "id": "035e9e2c-9781-4b9c-8395-be9e55e4e082", "metadata": {}, "source": [ "# Clustering" ] }, { "cell_type": "code", "execution_count": null, "id": "cbd48a28", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import geopandas as gpd\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn import datasets\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LinearRegression, LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# new import statements\n", "from sklearn.cluster import KMeans, AgglomerativeClustering" ] }, { "cell_type": "markdown", "id": "e72ea2f4", "metadata": {}, "source": [ "# Unsupervised Machine Learning: Clustering\n", "\n", "- In classification (supervised), we try to find boundaries/rules to separate points according to pre-determined labels.\n", "- In clustering, the algorithm chooses the labels. Goal is to choose labels so that similar rows get labeled the same.\n", "\n", "### K-Means Clustering\n", "\n", "- K: number of clusters:\n", " - 3-Means => 3 clusters\n", " - 4-Means => 4 clusters, and so on\n", "- Means: we will find centroids (aka means aka averages) to create clusters\n", "\n", "- import statement:\n", "```python\n", "from sklearn.cluster import KMeans\n", "```" ] }, { "cell_type": "markdown", "id": "3a0ad5a5", "metadata": {}, "source": [ "#### Iterative algorithm for K-Means" ] }, { "cell_type": "code", "execution_count": null, "id": "0b83aaf3", "metadata": {}, "outputs": [], "source": [ "# Generate random data\n", "x, y = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)\n", "df = pd.DataFrame(x, columns=[\"x0\", \"x1\"])\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "fbced908", "metadata": {}, "outputs": [], "source": [ "def km_scatter(df, **kwargs):\n", " \"\"\"\n", " Produces scatter plot visualizations with x0 on x-axis and y0 on y-axis.\n", " It can also plot the centroids for clusters.\n", " Parameters:\n", " x0 => x-axis\n", " x1 => y-axis\n", " cluster => marker type\n", " \"\"\"\n", " ax = kwargs.pop(\"ax\", None)\n", " if not \"label\" in df.columns:\n", " return df.plot.scatter(x=\"x0\", y=\"x1\", marker=\"$?$\", ax=ax, **kwargs)\n", "\n", " for marker in set(df[\"label\"]):\n", " sub_df = df[df[\"label\"] == marker]\n", " ax = sub_df.plot.scatter(x=\"x0\", y=\"x1\", marker=marker, ax=ax, **kwargs)\n", " return ax\n", "\n", "ax = km_scatter(df, s=100, c=\"0.7\")" ] }, { "cell_type": "markdown", "id": "47686eee", "metadata": {}, "source": [ "### Hard Problem\n", "\n", "Finding the best answer. What is the answer? Determing the centroids of the clusters.\n", "\n", "### Easier Problem\n", "\n", "Taking a random answer and make it a little better. Then repeat!\n", "Downside? If randomization leads to very bad initial choice of centroids, that might lead to bad clustering (fewer clusters)." ] }, { "cell_type": "code", "execution_count": null, "id": "6f8bde9e", "metadata": {}, "outputs": [], "source": [ "clusters = np.random.uniform(-5, 5, size=(3, 2))\n", "clusters = pd.DataFrame(clusters, columns=[\"x0\", \"x1\"])\n", "clusters[\"label\"] = [\"o\", \"+\", \"x\"]\n", "\n", "ax = km_scatter(df, s=100, c=\"0.7\")\n", "km_scatter(clusters, s=200, c=\"red\", ax=ax)" ] }, { "cell_type": "markdown", "id": "a3fe986c", "metadata": {}, "source": [ "Two variables for us to deal with:\n", "1. clusters: contains location of centroids and a label for them\n", "2. df: contains the actual data points" ] }, { "cell_type": "code", "execution_count": null, "id": "cfa1f1aa", "metadata": {}, "outputs": [], "source": [ "clusters" ] }, { "cell_type": "code", "execution_count": null, "id": "f210c534", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "a28466ce", "metadata": {}, "outputs": [], "source": [ "class KM:\n", " def __init__(self, df, clusters):\n", " # We make copies because we are going to keep changing the dataframe to \n", " # identify better clusters\n", " self.df = df.copy()\n", " self.clusters = clusters.copy()\n", " self.labels = clusters[\"label\"].values\n", " \n", " def plot(self):\n", " ax = km_scatter(self.df, color=\"0.7\", s=100)\n", " km_scatter(self.clusters, ax=ax, color=\"red\", s=200)\n", " \n", " def assign_points(self):\n", " \"\"\"\n", " compute Euclidean distance between each point and each centroids\n", " \"\"\"\n", " pass\n", " \n", " def update_centers(self):\n", " \"\"\"\n", " update centroids by taking mean of the points that are nearest to that\n", " particular centroid\n", " \"\"\"\n", " pass\n", "\n", "\"\"\"\n", "High-level algorithm:\n", "1. Start with random locations for centroids\n", "2. Iterate over each data point:\n", " 1. Find the distance (Euclidean distance) between current data point and each centroid.\n", " 2. Find the minimum of those distances and the corresponding label.\n", " 3. Assign current data point to the closest cluster centroid label.\n", "4. Once all points are assigned, compute new centroid for each cluster. Iterate over \n", " each cluster:\n", " 1. Extract subset of data points which got assigned to curr cluster label.\n", " 2. Compute mean of all the assigned data points.\n", " 3. Update cluster centroid.\n", "5. Repeat steps 2 to 4 many times (iterative improvement).\n", "\"\"\"\n", "\n", "# Creating object instance\n", "km = KM(df, clusters)\n", "km.plot()" ] }, { "cell_type": "markdown", "id": "938a6cc5", "metadata": {}, "source": [ "### `sklearn KMeans`\n", "\n", "- import statement:\n", "```python\n", "from sklearn.cluster import KMeans\n", "```\n", "- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n", "\n", "**Instantiation:**\n", "`KMeans(n_clusters=<num>, n_init=<num>, max_iter=<num>)`\n", "- `n_clusters`: number of clusters to be formed\n", "- `n_init`: number of initial random seeds to try (to avoid downside of bad initial random choices)\n", "- `max_iter`: maximum number of iterations for a single K-means run (single starting seed)" ] }, { "cell_type": "code", "execution_count": null, "id": "caa96a1e", "metadata": {}, "outputs": [], "source": [ "km_cluster = \n", "km_cluster" ] }, { "cell_type": "code", "execution_count": null, "id": "ea51243c", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "84e59c4a", "metadata": {}, "source": [ "**Methods:**\n", "1. `fit`: find good centroids\n", "2. `transform`: give me the distances from each point to each centroid\n", "3. `predict`: give me the chosen group labels\n", "\n", "**Attributes:**\n", "- `<km object>.cluster_centers_`: coordinates of cluster centers\n", "- `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center" ] }, { "cell_type": "code", "execution_count": null, "id": "26be1744", "metadata": {}, "outputs": [], "source": [ "# `fit`: find good centroids\n", "km_cluster.fit(df)\n", "# coordinates of cluster centers\n", "km_cluster.cluster_centers_" ] }, { "cell_type": "markdown", "id": "6ce05e61", "metadata": {}, "source": [ "**Observeration:** 3 rows (because we have 3 clusters), and 2 columns (because the df had 2 columns)." ] }, { "cell_type": "code", "execution_count": null, "id": "2df977a4", "metadata": {}, "outputs": [], "source": [ "# `transform`: give me the distances from each point to each centroid\n", "km_cluster.transform(df)" ] }, { "cell_type": "markdown", "id": "7cd8409e", "metadata": {}, "source": [ "**Observations**: Each row corresponds to a row in df. 3 columns correspond to 3 distances to the centroids." ] }, { "cell_type": "code", "execution_count": null, "id": "6a65a976", "metadata": {}, "outputs": [], "source": [ "# `predict`: give me the chosen group labels\n", "km_cluster.predict(df)" ] }, { "cell_type": "markdown", "id": "240d995a", "metadata": {}, "source": [ "### How many clusters do we need?\n", "\n", "- metric: `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center" ] }, { "cell_type": "code", "execution_count": null, "id": "8bf73d2c", "metadata": {}, "outputs": [], "source": [ "km_cluster.inertia_" ] }, { "cell_type": "markdown", "id": "57b5ccc4", "metadata": {}, "source": [ "**Observation**: we want \"inertia\" to be as small as possible." ] }, { "cell_type": "markdown", "id": "ae70b416", "metadata": {}, "source": [ "### Elbow plot to determine `n_clusters`" ] }, { "cell_type": "code", "execution_count": null, "id": "607a96b0", "metadata": {}, "outputs": [], "source": [ "# create a series with clusters 1 to 10 and corresponding values are equal to intertia \n", "s = pd.Series(dtype=float)\n", "\n", "\n", "\n", "s" ] }, { "cell_type": "code", "execution_count": null, "id": "388cd23f", "metadata": {}, "outputs": [], "source": [ "ax = s.plot.line(figsize=(6, 4))\n", "ax.set_ylabel(\"Inertia\")\n", "ax.set_xlabel(\"Number of clusters\")" ] }, { "cell_type": "markdown", "id": "eab497cd", "metadata": {}, "source": [ "**Observation**: there is an \"elbow\" around `n_clusters`=3." ] }, { "cell_type": "markdown", "id": "8e763d1c", "metadata": {}, "source": [ "#### Will we always have a clear \"elbow\"?\n", "\n", "- Let's generate uniform random data" ] }, { "cell_type": "code", "execution_count": null, "id": "b5ad30ec", "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(np.random.uniform(0, 10, (100, 2)))\n", "\n", "s = pd.Series(dtype=float)\n", "\n", "for num_clusters in range(1, 11):\n", " km = KMeans(num_clusters, n_init = 320)\n", " km.fit(df2)\n", " s.at[num_clusters] = km.inertia_\n", "\n", "ax = s.plot.line(figsize=(6, 4))\n", "ax.set_ylabel(\"Inertia\")\n", "ax.set_xlabel(\"Number of clusters\")" ] }, { "cell_type": "markdown", "id": "c6decdab-74a3-45b5-a408-e7b54bd992d9", "metadata": {}, "source": [ "**Observation**: there is an \"elbow\" around `n_clusters`=3." ] }, { "cell_type": "markdown", "id": "3c6115b0-61df-4660-8355-3cd56bd94080", "metadata": {}, "source": [ "#### Will we always have a clear \"elbow\"?\n", "\n", "- Let's generate uniform random data" ] }, { "cell_type": "code", "execution_count": null, "id": "424743f8-41de-42b8-ab78-07901682ee84", "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(np.random.uniform(0, 10, (100, 2)))\n", "df2.plot.scatter(0, 1)" ] }, { "cell_type": "code", "execution_count": null, "id": "fba71303-d4c6-46d3-ae8e-f0e863d8e032", "metadata": {}, "outputs": [], "source": [ "s = pd.Series(dtype=float)\n", "\n", "for num_clusters in range(1, 11):\n", " km = KMeans(num_clusters, n_init = 320)\n", " km.fit(df2)\n", " s.at[num_clusters] = km.inertia_\n", "\n", "ax = s.plot.line(figsize=(6, 4))\n", "ax.set_ylabel(\"Inertia\")\n", "ax.set_xlabel(\"Number of clusters\")" ] }, { "cell_type": "markdown", "id": "8af299e7", "metadata": {}, "source": [ "### K-Means use cases:\n", "\n", "1. estimator\n", "2. transformer:\n", " - sometimes we'll use an unsupervised learning technique (like k-means) to pre-process data, creating better inputs for a supervised learning technique (like logistic regression)" ] }, { "cell_type": "code", "execution_count": null, "id": "6b99861d", "metadata": {}, "outputs": [], "source": [ "def make_data():\n", " x, y = datasets.make_blobs(n_samples=250, centers=5, random_state=5)\n", " xcols = [\"x0\", \"x1\"]\n", " df1 = pd.DataFrame(x, columns=xcols)\n", " df1[\"y\"] = y > 0\n", "\n", " df2 = pd.DataFrame(np.random.uniform(-10, 10, size=(250, 2)), columns=[\"x0\", \"x1\"])\n", " df2[\"y\"] = False\n", "\n", " return pd.concat((df1, df2))\n", "\n", "train, test = train_test_split(make_data())" ] }, { "cell_type": "code", "execution_count": null, "id": "c1a0353f", "metadata": {}, "outputs": [], "source": [ "plt.rcParams[\"font.size\"] = 16\n", "fig, ax = plt.subplots(ncols=2, figsize=(10,4))\n", "train.plot.scatter(x=\"x0\", y=\"x1\", c=train[\"y\"], vmin=-1, ax=ax[0])\n", "test.plot.scatter(x=\"x0\", y=\"x1\", c=\"red\", ax=ax[1])\n", "ax[0].set_title(\"Training Data\")\n", "ax[1].set_title(\"Test Data\")\n", "plt.subplots_adjust(wspace=0.4)" ] }, { "cell_type": "markdown", "id": "57800660", "metadata": {}, "source": [ "#### Objective: use `LogisticRegression` to classify points as \"black\" or \"gray\"." ] }, { "cell_type": "code", "execution_count": null, "id": "cba5b0b6", "metadata": {}, "outputs": [], "source": [ "model = Pipeline([\n", " (\"km\", KMeans(10, n_init = 320)),\n", " (\"lr\", LogisticRegression()),\n", "])\n", "# TO DO: fit the model with train columns \"x0\", \"x1\" and test column y\n", "\n", "# TO DO: score the model with test columns \"x0\", \"x1\" and test column y\n" ] }, { "cell_type": "code", "execution_count": null, "id": "e78a788c", "metadata": {}, "outputs": [], "source": [ "model = Pipeline([\n", " (\"km\", KMeans(10, n_init = 320)),\n", " (\"std\", StandardScaler()),\n", " (\"lr\", LogisticRegression()),\n", "])\n", "model.fit(train[[\"x0\", \"x1\"]], train[\"y\"])\n", "model.score(test[[\"x0\", \"x1\"]], test[\"y\"])" ] }, { "cell_type": "markdown", "id": "0d506007", "metadata": {}, "source": [ "### `StandardScaler` with `KMeans`\n", "\n", "Recall that `StandardScaler` should always be applied after applying `PolynomialFeatures` (from last lecture)." ] }, { "cell_type": "code", "execution_count": null, "id": "1229aad1", "metadata": {}, "outputs": [], "source": [ "x = datasets.make_blobs(centers=np.array([(0, 0), (0, 20), (3, 20)]))[0]\n", "df = pd.DataFrame(x)\n", "df.plot.scatter(x=0, y=1, figsize=(6, 4))" ] }, { "cell_type": "code", "execution_count": null, "id": "9f21a66d", "metadata": {}, "outputs": [], "source": [ "km_c = KMeans(2, n_init = 320)\n", "km_c.fit(df)\n", "km_c.predict(df)" ] }, { "cell_type": "markdown", "id": "7dc700d4", "metadata": {}, "source": [ "#### `fit_predict(...)` is a shortcut for `fit` and `predict` method invocations." ] }, { "cell_type": "code", "execution_count": null, "id": "7bc1d18d", "metadata": {}, "outputs": [], "source": [ "KMeans(2, n_init = 320).fit_predict(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "b2dfa6bd", "metadata": {}, "outputs": [], "source": [ "# -1 => white, 0 => gray, 1 => black\n", "df.plot.scatter(x=0, y=1, figsize=(6, 4), c=KMeans(2, n_init = 320).fit_predict(df), vmin=-1, vmax=1)" ] }, { "cell_type": "markdown", "id": "762f2882", "metadata": {}, "source": [ "**Observation**: scale for columns are intentionally not specified." ] }, { "cell_type": "code", "execution_count": null, "id": "4f99dfeb", "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "id": "d444185c", "metadata": {}, "source": [ "Let's make a copy of the data. Assuming initial data for both columns is in \"km\", let's convert one column (`0`) into \"meters\". " ] }, { "cell_type": "code", "execution_count": null, "id": "2e437218", "metadata": {}, "outputs": [], "source": [ "df2 = df.copy()\n", "df2[0] *= 1000 # km => m\n", "df2.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "c99315a5", "metadata": {}, "outputs": [], "source": [ "df2.plot.scatter(x=0, y=1, figsize=(6,4), c=KMeans(2, n_init = 320).fit_predict(df2), vmin=-1, vmax=1)" ] }, { "cell_type": "markdown", "id": "3966ddd3", "metadata": {}, "source": [ "**Observations**:\n", "- One would expect to see the same clusters, but that is not happening here. Why?\n", " - x-axis difference is too high when compared to the y-axis difference\n", " - That is, KMeans doesn't get that x-axis has scaled data, whereas y-axis doesn't have scaled data\n", "- This is not too far off from realistic datasets. \n", " - That is, real-world dataset columns might have difference units. \n", " - For example, one column might be representing temperature data where as another might be representing distance." ] }, { "cell_type": "markdown", "id": "b63c9d9b", "metadata": {}, "source": [ "#### Conclusion: `StandardScaler` should be applied before `KMeans`" ] }, { "cell_type": "code", "execution_count": null, "id": "2c81ed04", "metadata": {}, "outputs": [], "source": [ "# TO DO: write a pipeline with StandardScaler and KMeans with 2 clusters\n", "\n", "\n", "\n", "df2.plot.scatter(x=0, y=1, figsize=(6, 4), c=model.fit_predict(df2), vmin=-1, vmax=1)" ] }, { "cell_type": "markdown", "id": "359dd107", "metadata": {}, "source": [ "### Wisconsin counties example" ] }, { "cell_type": "code", "execution_count": null, "id": "8847306e", "metadata": {}, "outputs": [], "source": [ "df = gpd.read_file(\"counties.geojson\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "377e9bb0", "metadata": {}, "source": [ "#### If we want to use \"POP100\", \"AREALAND\", \"developed\", \"forest\", \"pasture\", \"crops\" for clustering, what transformer should we use? \n", "\n", "- StandardScaler." ] }, { "cell_type": "markdown", "id": "15f0b21c", "metadata": {}, "source": [ "### Goal here: cluster counties based on similar land usage." ] }, { "cell_type": "code", "execution_count": null, "id": "55013d0a", "metadata": {}, "outputs": [], "source": [ "df.plot()" ] }, { "cell_type": "code", "execution_count": null, "id": "a199a2af", "metadata": {}, "outputs": [], "source": [ "df.plot(column=\"crops\")" ] }, { "cell_type": "code", "execution_count": null, "id": "8735a7bb", "metadata": {}, "outputs": [], "source": [ "df.plot(column=\"forest\")" ] }, { "cell_type": "markdown", "id": "a42394cb", "metadata": {}, "source": [ "### KMeans" ] }, { "cell_type": "code", "execution_count": null, "id": "a8b3c831", "metadata": {}, "outputs": [], "source": [ "xcols = [\"developed\", \"forest\", \"pasture\", \"crops\"]\n", "\n", "# instantiate\n", "km_c = KMeans(4, n_init = 320)\n", "# fit\n", "km_c.fit(df[xcols])\n", "# predict\n", "clusters = km_c.predict(df[xcols])\n", "\n", "print(km_c.inertia_)\n", "print(clusters)\n", "\n", "df.plot(column=clusters, cmap=\"tab10\")" ] }, { "cell_type": "markdown", "id": "01189d61", "metadata": {}, "source": [ "**Observation**: cluster number can be random. That is, if you re-run the above cell twice, you will get different number for each cluster." ] }, { "cell_type": "markdown", "id": "1764b5a6", "metadata": {}, "source": [ "### Agglomerative clustering\n", "\n", "- import statement\n", "```python\n", "from sklearn.cluster import AgglomerativeClustering\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "3d7954a3", "metadata": {}, "outputs": [], "source": [ "xcols = [\"developed\", \"forest\", \"pasture\", \"crops\"]\n", "\n", "# instantiate\n", "km_c = AgglomerativeClustering(4)\n", "# fit\n", "km_c.fit(df[xcols])\n", "# predict\n", "clusters = km_c.predict(df[xcols])\n", "\n", "print(km_c.inertia_)\n", "print(clusters)\n", "\n", "df.plot(column=clusters, cmap=\"tab10\")" ] }, { "cell_type": "markdown", "id": "6f825891", "metadata": {}, "source": [ "**Observations**: \n", "- no centroids => no inertia => no elbow plots (how do we pick cluster count?):\n", " - AttributeError: 'AgglomerativeClustering' object has no attribute 'predict'\n", "- no `predict` method, but there is `fit_predict`:\n", " - AttributeError: 'AgglomerativeClustering' object has no attribute 'predict'\n", " - why?\n", " - because each point could lead to a completely different tree\n", " - remember unlike KMeans (which is top-down), AgglomerativeClustering is bottom-up" ] }, { "cell_type": "code", "execution_count": null, "id": "e16c6131", "metadata": {}, "outputs": [], "source": [ "xcols = [\"developed\", \"forest\", \"pasture\", \"crops\"]\n", "\n", "# instantiate\n", "km_c = AgglomerativeClustering(4)\n", "# fit_predict\n", "clusters = km_c.fit_predict(df[xcols])\n", "\n", "# print(km_c.inertia_)\n", "print(clusters)\n", "\n", "df.plot(column=clusters, cmap=\"tab10\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5fc2bf08-a0e3-49a3-b089-af0f99f83613", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }