{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "035e9e2c-9781-4b9c-8395-be9e55e4e082",
   "metadata": {},
   "source": [
    "# Clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbd48a28",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.linear_model import LinearRegression, LogisticRegression\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# new import statements\n",
    "from sklearn.cluster import KMeans, AgglomerativeClustering"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e72ea2f4",
   "metadata": {},
   "source": [
    "# Unsupervised Machine Learning: Clustering\n",
    "\n",
    "- In classification (supervised), we try to find boundaries/rules to separate points according to pre-determined labels.\n",
    "- In clustering, the algorithm chooses the labels.  Goal is to choose labels so that similar rows get labeled the same.\n",
    "\n",
    "### K-Means Clustering\n",
    "\n",
    "- K: number of clusters:\n",
    "    - 3-Means => 3 clusters\n",
    "    - 4-Means => 4 clusters, and so on\n",
    "- Means: we will find centroids (aka means aka averages) to create clusters\n",
    "\n",
    "- import statement:\n",
    "```python\n",
    "from sklearn.cluster import KMeans\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a0ad5a5",
   "metadata": {},
   "source": [
    "#### Iterative algorithm for K-Means"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b83aaf3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate random data\n",
    "x, y = datasets.make_blobs(n_samples=100, centers=3, cluster_std=1.2, random_state=3)\n",
    "df = pd.DataFrame(x, columns=[\"x0\", \"x1\"])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbced908",
   "metadata": {},
   "outputs": [],
   "source": [
    "def km_scatter(df, **kwargs):\n",
    "    \"\"\"\n",
    "    Produces scatter plot visualizations with x0 on x-axis and y0 on y-axis.\n",
    "    It can also plot the centroids for clusters.\n",
    "    Parameters:\n",
    "        x0 => x-axis\n",
    "        x1 => y-axis\n",
    "        cluster => marker type\n",
    "    \"\"\"\n",
    "    ax = kwargs.pop(\"ax\", None)\n",
    "    if not \"label\" in df.columns:\n",
    "        return df.plot.scatter(x=\"x0\", y=\"x1\", marker=\"$?$\", ax=ax, **kwargs)\n",
    "\n",
    "    for marker in set(df[\"label\"]):\n",
    "        sub_df = df[df[\"label\"] == marker]\n",
    "        ax = sub_df.plot.scatter(x=\"x0\", y=\"x1\", marker=marker, ax=ax, **kwargs)\n",
    "    return ax\n",
    "\n",
    "ax = km_scatter(df, s=100, c=\"0.7\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47686eee",
   "metadata": {},
   "source": [
    "### Hard Problem\n",
    "\n",
    "Finding the best answer. What is the answer? Determing the centroids of the clusters.\n",
    "\n",
    "### Easier Problem\n",
    "\n",
    "Taking a random answer and make it a little better. Then repeat!\n",
    "Downside? If randomization leads to very bad initial choice of centroids, that might lead to bad clustering (fewer clusters)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f8bde9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters = np.random.uniform(-5, 5, size=(3, 2))\n",
    "clusters = pd.DataFrame(clusters, columns=[\"x0\", \"x1\"])\n",
    "clusters[\"label\"] = [\"o\", \"+\", \"x\"]\n",
    "\n",
    "ax = km_scatter(df, s=100, c=\"0.7\")\n",
    "km_scatter(clusters, s=200, c=\"red\", ax=ax)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3fe986c",
   "metadata": {},
   "source": [
    "Two variables for us to deal with:\n",
    "1. clusters: contains location of centroids and a label for them\n",
    "2. df: contains the actual data points"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cfa1f1aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "clusters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f210c534",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a28466ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "class KM:\n",
    "    def __init__(self, df, clusters):\n",
    "        # We make copies because we are going to keep changing the dataframe to \n",
    "        # identify better clusters\n",
    "        self.df = df.copy()\n",
    "        self.clusters = clusters.copy()\n",
    "        self.labels = clusters[\"label\"].values\n",
    "        \n",
    "    def plot(self):\n",
    "        ax = km_scatter(self.df, color=\"0.7\", s=100)\n",
    "        km_scatter(self.clusters, ax=ax, color=\"red\", s=200)\n",
    "        \n",
    "    def assign_points(self):\n",
    "        \"\"\"\n",
    "        compute Euclidean distance between each point and each centroids\n",
    "        \"\"\"\n",
    "        pass\n",
    "    \n",
    "    def update_centers(self):\n",
    "        \"\"\"\n",
    "        update centroids by taking mean of the points that are nearest to that\n",
    "        particular centroid\n",
    "        \"\"\"\n",
    "        pass\n",
    "\n",
    "\"\"\"\n",
    "High-level algorithm:\n",
    "1. Start with random locations for centroids\n",
    "2. Iterate over each data point:\n",
    "    1. Find the distance (Euclidean distance) between current data point and each centroid.\n",
    "    2. Find the minimum of those distances and the corresponding label.\n",
    "    3. Assign current data point to the closest cluster centroid label.\n",
    "4. Once all points are assigned, compute new centroid for each cluster. Iterate over \n",
    "   each cluster:\n",
    "    1. Extract subset of data points which got assigned to curr cluster label.\n",
    "    2. Compute mean of all the assigned data points.\n",
    "    3. Update cluster centroid.\n",
    "5. Repeat steps 2 to 4 many times (iterative improvement).\n",
    "\"\"\"\n",
    "\n",
    "# Creating object instance\n",
    "km = KM(df, clusters)\n",
    "km.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "938a6cc5",
   "metadata": {},
   "source": [
    "### `sklearn KMeans`\n",
    "\n",
    "- import statement:\n",
    "```python\n",
    "from sklearn.cluster import KMeans\n",
    "```\n",
    "- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n",
    "\n",
    "**Instantiation:**\n",
    "`KMeans(n_clusters=<num>, n_init=<num>, max_iter=<num>)`\n",
    "- `n_clusters`: number of clusters to be formed\n",
    "- `n_init`: number of initial random seeds to try (to avoid downside of bad initial random choices)\n",
    "- `max_iter`: maximum number of iterations for a single K-means run (single starting seed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caa96a1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "km_cluster = \n",
    "km_cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea51243c",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84e59c4a",
   "metadata": {},
   "source": [
    "**Methods:**\n",
    "1. `fit`: find good centroids\n",
    "2. `transform`: give me the distances from each point to each centroid\n",
    "3. `predict`: give me the chosen group labels\n",
    "\n",
    "**Attributes:**\n",
    "- `<km object>.cluster_centers_`: coordinates of cluster centers\n",
    "- `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26be1744",
   "metadata": {},
   "outputs": [],
   "source": [
    "# `fit`: find good centroids\n",
    "km_cluster.fit(df)\n",
    "# coordinates of cluster centers\n",
    "km_cluster.cluster_centers_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ce05e61",
   "metadata": {},
   "source": [
    "**Observeration:** 3 rows (because we have 3 clusters), and 2 columns (because the df had 2 columns)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2df977a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# `transform`: give me the distances from each point to each centroid\n",
    "km_cluster.transform(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cd8409e",
   "metadata": {},
   "source": [
    "**Observations**: Each row corresponds to a row in df. 3 columns correspond to 3 distances to the centroids."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a65a976",
   "metadata": {},
   "outputs": [],
   "source": [
    "# `predict`: give me the chosen group labels\n",
    "km_cluster.predict(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "240d995a",
   "metadata": {},
   "source": [
    "### How many clusters do we need?\n",
    "\n",
    "- metric: `<km object>.inertia_`: sum of squared distances of samples to their closest cluster center"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bf73d2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "km_cluster.inertia_"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57b5ccc4",
   "metadata": {},
   "source": [
    "**Observation**: we want \"inertia\" to be as small as possible."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae70b416",
   "metadata": {},
   "source": [
    "### Elbow plot to determine `n_clusters`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "607a96b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a series with clusters 1 to 10 and corresponding values are equal to intertia \n",
    "s = pd.Series(dtype=float)\n",
    "\n",
    "\n",
    "\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "388cd23f",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = s.plot.line(figsize=(6, 4))\n",
    "ax.set_ylabel(\"Inertia\")\n",
    "ax.set_xlabel(\"Number of clusters\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eab497cd",
   "metadata": {},
   "source": [
    "**Observation**: there is an \"elbow\" around `n_clusters`=3."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e763d1c",
   "metadata": {},
   "source": [
    "#### Will we always have a clear \"elbow\"?\n",
    "\n",
    "- Let's generate uniform random data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5ad30ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.DataFrame(np.random.uniform(0, 10, (100, 2)))\n",
    "\n",
    "s = pd.Series(dtype=float)\n",
    "\n",
    "for num_clusters in range(1, 11):\n",
    "    km = KMeans(num_clusters, n_init = 320)\n",
    "    km.fit(df2)\n",
    "    s.at[num_clusters] = km.inertia_\n",
    "\n",
    "ax = s.plot.line(figsize=(6, 4))\n",
    "ax.set_ylabel(\"Inertia\")\n",
    "ax.set_xlabel(\"Number of clusters\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6decdab-74a3-45b5-a408-e7b54bd992d9",
   "metadata": {},
   "source": [
    "**Observation**: there is an \"elbow\" around `n_clusters`=3."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c6115b0-61df-4660-8355-3cd56bd94080",
   "metadata": {},
   "source": [
    "#### Will we always have a clear \"elbow\"?\n",
    "\n",
    "- Let's generate uniform random data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "424743f8-41de-42b8-ab78-07901682ee84",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.DataFrame(np.random.uniform(0, 10, (100, 2)))\n",
    "df2.plot.scatter(0, 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fba71303-d4c6-46d3-ae8e-f0e863d8e032",
   "metadata": {},
   "outputs": [],
   "source": [
    "s = pd.Series(dtype=float)\n",
    "\n",
    "for num_clusters in range(1, 11):\n",
    "    km = KMeans(num_clusters, n_init = 320)\n",
    "    km.fit(df2)\n",
    "    s.at[num_clusters] = km.inertia_\n",
    "\n",
    "ax = s.plot.line(figsize=(6, 4))\n",
    "ax.set_ylabel(\"Inertia\")\n",
    "ax.set_xlabel(\"Number of clusters\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8af299e7",
   "metadata": {},
   "source": [
    "### K-Means use cases:\n",
    "\n",
    "1. estimator\n",
    "2. transformer:\n",
    "    - sometimes we'll use an unsupervised learning technique (like k-means) to pre-process data, creating better inputs for a supervised learning technique (like logistic regression)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b99861d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_data():\n",
    "    x, y = datasets.make_blobs(n_samples=250, centers=5, random_state=5)\n",
    "    xcols = [\"x0\", \"x1\"]\n",
    "    df1 = pd.DataFrame(x, columns=xcols)\n",
    "    df1[\"y\"] = y > 0\n",
    "\n",
    "    df2 = pd.DataFrame(np.random.uniform(-10, 10, size=(250, 2)), columns=[\"x0\", \"x1\"])\n",
    "    df2[\"y\"] = False\n",
    "\n",
    "    return pd.concat((df1, df2))\n",
    "\n",
    "train, test = train_test_split(make_data())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1a0353f",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.rcParams[\"font.size\"] = 16\n",
    "fig, ax = plt.subplots(ncols=2, figsize=(10,4))\n",
    "train.plot.scatter(x=\"x0\", y=\"x1\", c=train[\"y\"], vmin=-1, ax=ax[0])\n",
    "test.plot.scatter(x=\"x0\", y=\"x1\", c=\"red\", ax=ax[1])\n",
    "ax[0].set_title(\"Training Data\")\n",
    "ax[1].set_title(\"Test Data\")\n",
    "plt.subplots_adjust(wspace=0.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57800660",
   "metadata": {},
   "source": [
    "#### Objective: use `LogisticRegression` to classify points as \"black\" or \"gray\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cba5b0b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Pipeline([\n",
    "    (\"km\", KMeans(10, n_init = 320)),\n",
    "    (\"lr\", LogisticRegression()),\n",
    "])\n",
    "# TO DO: fit the model with train columns \"x0\", \"x1\" and test column y\n",
    "\n",
    "# TO DO: score the model with test columns \"x0\", \"x1\" and test column y\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e78a788c",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Pipeline([\n",
    "    (\"km\", KMeans(10, n_init = 320)),\n",
    "    (\"std\", StandardScaler()),\n",
    "    (\"lr\", LogisticRegression()),\n",
    "])\n",
    "model.fit(train[[\"x0\", \"x1\"]], train[\"y\"])\n",
    "model.score(test[[\"x0\", \"x1\"]], test[\"y\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d506007",
   "metadata": {},
   "source": [
    "### `StandardScaler` with `KMeans`\n",
    "\n",
    "Recall that `StandardScaler` should always be applied after applying `PolynomialFeatures` (from last lecture)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1229aad1",
   "metadata": {},
   "outputs": [],
   "source": [
    "x = datasets.make_blobs(centers=np.array([(0, 0), (0, 20), (3, 20)]))[0]\n",
    "df = pd.DataFrame(x)\n",
    "df.plot.scatter(x=0, y=1, figsize=(6, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f21a66d",
   "metadata": {},
   "outputs": [],
   "source": [
    "km_c = KMeans(2, n_init = 320)\n",
    "km_c.fit(df)\n",
    "km_c.predict(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dc700d4",
   "metadata": {},
   "source": [
    "#### `fit_predict(...)` is a shortcut for `fit` and `predict` method invocations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bc1d18d",
   "metadata": {},
   "outputs": [],
   "source": [
    "KMeans(2, n_init = 320).fit_predict(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2dfa6bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# -1 => white, 0 => gray, 1 => black\n",
    "df.plot.scatter(x=0, y=1, figsize=(6, 4), c=KMeans(2, n_init = 320).fit_predict(df), vmin=-1, vmax=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "762f2882",
   "metadata": {},
   "source": [
    "**Observation**: scale for columns are intentionally not specified."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f99dfeb",
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d444185c",
   "metadata": {},
   "source": [
    "Let's make a copy of the data. Assuming initial data for both columns is in \"km\", let's convert one column (`0`) into \"meters\". "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2e437218",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = df.copy()\n",
    "df2[0] *= 1000 # km => m\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c99315a5",
   "metadata": {},
   "outputs": [],
   "source": [
    "df2.plot.scatter(x=0, y=1, figsize=(6,4), c=KMeans(2, n_init = 320).fit_predict(df2), vmin=-1, vmax=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3966ddd3",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "- One would expect to see the same clusters, but that is not happening here. Why?\n",
    "    - x-axis difference is too high when compared to the y-axis difference\n",
    "    - That is, KMeans doesn't get that x-axis has scaled data, whereas y-axis doesn't have scaled data\n",
    "- This is not too far off from realistic datasets. \n",
    "    - That is, real-world dataset columns might have difference units. \n",
    "    - For example, one column might be representing temperature data where as another might be representing distance."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b63c9d9b",
   "metadata": {},
   "source": [
    "#### Conclusion: `StandardScaler` should be applied before `KMeans`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c81ed04",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TO DO: write a pipeline with StandardScaler and KMeans with 2 clusters\n",
    "\n",
    "\n",
    "\n",
    "df2.plot.scatter(x=0, y=1, figsize=(6, 4), c=model.fit_predict(df2), vmin=-1, vmax=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "359dd107",
   "metadata": {},
   "source": [
    "### Wisconsin counties example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8847306e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = gpd.read_file(\"counties.geojson\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "377e9bb0",
   "metadata": {},
   "source": [
    "#### If we want to use \"POP100\", \"AREALAND\", \"developed\", \"forest\", \"pasture\", \"crops\" for clustering, what transformer should we use? \n",
    "\n",
    "- StandardScaler."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15f0b21c",
   "metadata": {},
   "source": [
    "### Goal here: cluster counties based on similar land usage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55013d0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a199a2af",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(column=\"crops\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8735a7bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(column=\"forest\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a42394cb",
   "metadata": {},
   "source": [
    "### KMeans"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8b3c831",
   "metadata": {},
   "outputs": [],
   "source": [
    "xcols = [\"developed\", \"forest\", \"pasture\", \"crops\"]\n",
    "\n",
    "# instantiate\n",
    "km_c = KMeans(4, n_init = 320)\n",
    "# fit\n",
    "km_c.fit(df[xcols])\n",
    "# predict\n",
    "clusters = km_c.predict(df[xcols])\n",
    "\n",
    "print(km_c.inertia_)\n",
    "print(clusters)\n",
    "\n",
    "df.plot(column=clusters, cmap=\"tab10\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01189d61",
   "metadata": {},
   "source": [
    "**Observation**: cluster number can be random. That is, if you re-run the above cell twice, you will get different number for each cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1764b5a6",
   "metadata": {},
   "source": [
    "### Agglomerative clustering\n",
    "\n",
    "- import statement\n",
    "```python\n",
    "from sklearn.cluster import AgglomerativeClustering\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d7954a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "xcols = [\"developed\", \"forest\", \"pasture\", \"crops\"]\n",
    "\n",
    "# instantiate\n",
    "km_c = AgglomerativeClustering(4)\n",
    "# fit\n",
    "km_c.fit(df[xcols])\n",
    "# predict\n",
    "clusters = km_c.predict(df[xcols])\n",
    "\n",
    "print(km_c.inertia_)\n",
    "print(clusters)\n",
    "\n",
    "df.plot(column=clusters, cmap=\"tab10\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f825891",
   "metadata": {},
   "source": [
    "**Observations**: \n",
    "- no centroids => no inertia => no elbow plots (how do we pick cluster count?):\n",
    "    - AttributeError: 'AgglomerativeClustering' object has no attribute 'predict'\n",
    "- no `predict` method, but there is `fit_predict`:\n",
    "    - AttributeError: 'AgglomerativeClustering' object has no attribute 'predict'\n",
    "    - why?\n",
    "        - because each point could lead to a completely different tree\n",
    "        - remember unlike KMeans (which is top-down), AgglomerativeClustering is bottom-up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e16c6131",
   "metadata": {},
   "outputs": [],
   "source": [
    "xcols = [\"developed\", \"forest\", \"pasture\", \"crops\"]\n",
    "\n",
    "# instantiate\n",
    "km_c = AgglomerativeClustering(4)\n",
    "# fit_predict\n",
    "clusters = km_c.fit_predict(df[xcols])\n",
    "\n",
    "# print(km_c.inertia_)\n",
    "print(clusters)\n",
    "\n",
    "df.plot(column=clusters, cmap=\"tab10\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fc2bf08-a0e3-49a3-b089-af0f99f83613",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}