finished lec 37 adv pandas

Commit ed64bcea authored 4 months ago by LOUIS TYRRELL OLIPHANT
--- a/f24/Louis_Lecture_Notes/37_AdvPandas/Lec37_AdvPandas.ipynb
+++ b/f24/Louis_Lecture_Notes/37_AdvPandas/Lec37_AdvPandas.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "CeWtFirwteFY"
+   },
+   "outputs": [],
+   "source": [
+    "# known import statements\n",
+    "import pandas as pd\n",
+    "import sqlite3\n",
+    "import os\n",
+    "\n",
+    "# new import statement\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Get the Piazza data from 'piazza.db'\n",
+    "\n",
+    "db_name = \"piazza.db\"\n",
+    "assert os.path.exists(db_name)\n",
+    "conn = sqlite3.connect(db_name)\n",
+    "\n",
+    "def qry(sql):\n",
+    "    return pd.read_sql(sql, conn)\n",
+    "\n",
+    "df = qry(\"\"\"\n",
+    "    SELECT *\n",
+    "    FROM sqlite_master\n",
+    "    WHERE type='table'\n",
+    "\"\"\")\n",
+    "print(df.iloc[0]['sql'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "piazza_df = pd.read_sql(\"\"\"\n",
+    "    SELECT *\n",
+    "    FROM piazza\n",
+    "\"\"\", conn)\n",
+    "piazza_df.head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Warmup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 1: Set the student id column as the index\n",
+    "piazza_df = piazza_df.set_index(\"student_id\")\n",
+    "piazza_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 2a: Which 10 students post the most?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 2b: Can you plot their number of posts as a bar graph? Be sure to label your axes!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 2c: How about with their name rather than their student id?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 3a: Which people had more than 10 answers? Include all roles.  Store the results in a dataframe named top_answers\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 3b: Plot this as a bar graph.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 3c: Plot the contributions of the various roles as a bar graph.\n",
+    "top_answers[\"role\"].value_counts().plot.bar()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 3d: Can you get this same data using SQL?\n",
+    "qry(\"\"\"\n",
+    "\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 3e: What about their average # of days online as well?\n",
+    "qry(\"\"\"\n",
+    "\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Warmup 3f: Can we do that in Pandas as well?\n",
+    "# TODAY'S TOPIC"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "yoLGptrqhbBo"
+   },
+   "source": [
+    "# Advanced Pandas\n",
+    "\n",
+    "## Learning Objectives: \n",
+    "\n",
+    "* Set a column as the index of a pandas `DataFrame`\n",
+    "* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`\n",
+    "* Apply transformations to a `DataFrame`:\n",
+    "  * Use `apply` on a pandas `Series` to apply a transformation function\n",
+    "  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns\n",
+    "* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`\n",
+    "* Convert `groupby` examples to SQL\n",
+    "* Solve the same question using SQL and pandas `DataFrame` manipulations:\n",
+    "  * filtering, grouping, and aggregation / summarization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sort piazza_df by name column ... What do we notice?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Not a Number\n",
+    "\n",
+    "- `np.nan` is the floating-point representation of Not a Number\n",
+    "- You do not need to know / learn the details about the `numpy` package\n",
+    "\n",
+    "### Replacing / modifying values within the `DataFrame`\n",
+    "\n",
+    "Syntax: `df.replace(<TARGET>, <REPLACE>)`\n",
+    "\n",
+    "Let's now replace the missing values (empty strings) with `np.nan`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Let's replace these empty strings with this special value.\n",
+    "piazza_df = ...\n",
+    "piazza_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sort by name again... What do we notice?\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Checking for missing values\n",
+    "\n",
+    "Syntax: `Series.isna()`\n",
+    "- Returns a boolean Series"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run isna() on the name column\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How many people are missing a name?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How many people are missing an email?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How many people are missing both a name and email?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How many people are missing either a name or email?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# So... What do we do?\n",
+    "#  1. Drop those rows\n",
+    "#  2. Interpolate / Best Guess"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Option 1: Drop those rows.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Option 2a: Interpolate / Best Guess\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function to take an email (e.g. \"calm_star@wisc.edu\")\n",
+    "# and return the name (e.g. \"calm star\")\n",
+    "def parse_name_from_email(email):\n",
+    "    if pd.isna(email):\n",
+    "        return np.nan\n",
+    "    else:\n",
+    "        pass # TODO Parse out the name!\n",
+    "\n",
+    "# Test your function!\n",
+    "parse_name_from_email(\"calm_star@wisc.edu\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Review: `pandas.Series.apply(...)`\n",
+    "Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`\n",
+    "- Applies the input function to every element of the `Series`\n",
+    "- Returns a new `Series`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Now, apply that function to each value in email!\n",
+    "piazza_df[\"guessed_name\"] = ...  # TODO: replace ... with your code\n",
+    "piazza_df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function to take a name (e.g. \"calm star\")\n",
+    "# and return the email (e.g. \"calm_star@wisc.edu\")\n",
+    "def parse_email_from_name(name):\n",
+    "    pass\n",
+    "\n",
+    "# Test your function!\n",
+    "parse_email_from_name(\"calm star\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Now, apply that function to each value in name!\n",
+    "piazza_df[\"guessed_email\"] = ...  # TODO: replace ... with your code\n",
+    "piazza_df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
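+    "### `pandas.DataFrame.apply(...)`\n",
+    "Syntax: `DataFrame.apply(<FUNCTION OBJECT REFERENCE>, axis=1)`\n",
+    "- `axis=1` means the function is applied to each row (each row is passed in as a `Series`)\n",
+    "- Returns a new `Series` when the function returns a single value per row"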
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# If the name has a value, use it, otherwise use our best guess!\n",
+    "piazza_df[\"name\"] = piazza_df.apply(lambda r : r[\"guessed_name\"] if pd.isna(r[\"name\"]) else r[\"name\"], axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Same thing for email!\n",
+    "piazza_df[\"email\"] = piazza_df.apply(lambda r : r[\"guessed_email\"] if pd.isna(r[\"email\"]) else r[\"email\"], axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "help(piazza_df.drop)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Drop the guessing columns\n",
+    "piazza_df = piazza_df.drop(\"guessed_name\", axis=1)\n",
+    "piazza_df = piazza_df.drop(\"guessed_email\", axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "help(piazza_df.dropna)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How many rows are missing data now?\n",
+    "len(piazza_df) - len(piazza_df.dropna())"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "help(piazza_df.fillna)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Give a name of \"anonymous\" and email of \"anonymous@wisc.edu\"\n",
+    "# to anyone left with missing data.\n",
+    "piazza_df['name'] = piazza_df['name'].fillna('anonymous')\n",
+    "\n",
+    "# TODO: now do the email column\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### `pandas.DataFrame.groupby(...)`\n",
+    "\n",
+    "Syntax: `DataFrame.groupby(<COLUMN>)`\n",
+    "- Returns a `DataFrameGroupBy` object\n",
+    "- Apply an aggregation function (e.g. `mean`, `sum`, `count`) to the `DataFrameGroupBy` object to get results"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# What does this return?\n",
+    "piazza_df.groupby(\"role\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Try getting the \"mean\" of this groupby object.\n",
+    "piazza_df.groupby(\"role\").mean(numeric_only=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How many answers does the average instructor, student, and TA give?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How would we write this in SQL?\n",
+    "qry(\"\"\"\n",
+    "\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# What is the total number of days spent online for instructors, students, and TAs?\n",
+    "# Order your answer from lowest to highest\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How would we write this in SQL?\n",
+    "qry(\"\"\"\n",
+    "\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Of those individuals who spend less than 100 days online,\n",
+    "# how does their average number of posts compare to those that\n",
+    "# spend 100 days or more online? Do your analysis by role as well.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How would we write this in SQL?\n",
+    "qry(\"\"\"\n",
+    "\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# What percentage of instructors, students, and TAs did not write a single answer,\n",
+    "# followup, or reply to a followup?\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# How would we write this in SQL?\n",
+    "qry(\"\"\"\n",
+    "\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "conn.close()"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
+%% Cell type:code id: tags:
+
+``` python
+# known import statements
+import pandas as pd
+import sqlite3
+import os
+
+# new import statement
+import numpy as np
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Get the Piazza data from 'piazza.db'
+
+db_name = "piazza.db"
+assert os.path.exists(db_name)
+conn = sqlite3.connect(db_name)
+
+def qry(sql):
+    return pd.read_sql(sql, conn)
+
+df = qry("""
+    SELECT *
+    FROM sqlite_master
+    WHERE type='table'
+""")
+print(df.iloc[0]['sql'])
+```
+
+%% Cell type:code id: tags:
+
+``` python
+piazza_df = pd.read_sql("""
+    SELECT *
+    FROM piazza
+""", conn)
+piazza_df.head(5)
+```
+
+%% Cell type:markdown id: tags:
+
+## Warmup
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 1: Set the student id column as the index
+piazza_df = piazza_df.set_index("student_id")
+piazza_df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 2a: Which 10 students post the most?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 2b: Can you plot their number of posts as a bar graph? Be sure to label your axes!
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 2c: How about with their name rather than their student id?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 3a: Which people had more than 10 answers? Include all roles.  Store the results in a dataframe named top_answers
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 3b: Plot this as a bar graph.
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 3c: Plot the contributions of the various roles as a bar graph.
+top_answers["role"].value_counts().plot.bar()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 3d: Can you get this same data using SQL?
+qry("""
+
+""")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 3e: What about their average # of days online as well?
+qry("""
+
+""")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Warmup 3f: Can we do that in Pandas as well?
+# TODAY'S TOPIC
+```
+
+%% Cell type:markdown id: tags:
+
+# Advanced Pandas
+
+## Learning Objectives:
+
+* Set a column as the index of a pandas `DataFrame`
+* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
+* Apply transformations to a `DataFrame`:
+  * Use `apply` on a pandas `Series` to apply a transformation function
+  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
+* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
+* Convert `groupby` examples to SQL
+* Solve the same question using SQL and pandas `DataFrame` manipulations:
+  * filtering, grouping, and aggregation / summarization
+
+%% Cell type:code id: tags:
+
+``` python
+# Sort piazza_df by name column ... What do we notice?
+```
+
+%% Cell type:markdown id: tags:
+
+### Not a Number
+
+- `np.nan` is the floating-point representation of Not a Number
+- You do not need to know / learn the details about the `numpy` package
+
+### Replacing / modifying values within the `DataFrame`
+
+Syntax: `df.replace(<TARGET>, <REPLACE>)`
+
+Let's now replace the missing values (empty strings) with `np.nan`
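
A minimal sketch of the `replace` syntax above, run on a small made-up table rather than the real Piazza data. The names `toy_df` and `cleaned_df` and the sample rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Piazza table.
toy_df = pd.DataFrame({
    "name": ["calm star", "", "bold river"],
    "email": ["calm_star@wisc.edu", "quiet_moon@wisc.edu", ""],
})

# replace returns a new DataFrame; toy_df itself is left unchanged.
cleaned_df = toy_df.replace("", np.nan)

# Empty strings are now recognized as missing values.
missing_names = int(cleaned_df["name"].isna().sum())
missing_emails = int(cleaned_df["email"].isna().sum())
print(missing_names, missing_emails)
```

Because `replace` does not modify the original, the result has to be assigned back to a variable for the change to stick.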
+
+%% Cell type:code id: tags:
+
+``` python
+# Let's replace these empty strings with this special value.
+piazza_df = ...
+piazza_df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Sort by name again... What do we notice?
+```
+
+%% Cell type:markdown id: tags:
+
+### Checking for missing values
+
+Syntax: `Series.isna()`
+- Returns a boolean Series
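
As a rough illustration of how the boolean `Series` from `isna()` can be counted and combined, here is a sketch on made-up data. The table contents and variable names are assumptions, not the Piazza dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing names and emails.
toy_df = pd.DataFrame({
    "name": ["calm star", np.nan, np.nan, "bold river"],
    "email": ["calm_star@wisc.edu", np.nan, "quiet_moon@wisc.edu", np.nan],
})

name_missing = toy_df["name"].isna()    # boolean Series, True where missing
email_missing = toy_df["email"].isna()

# Summing a boolean Series counts the True values.
num_missing_name = int(name_missing.sum())

# & means "both conditions hold", | means "at least one holds".
num_missing_both = int((name_missing & email_missing).sum())
num_missing_either = int((name_missing | email_missing).sum())
print(num_missing_name, num_missing_both, num_missing_either)
```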
+
+%% Cell type:code id: tags:
+
+``` python
+# Run isna() on the name column
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How many people are missing a name?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How many people are missing an email?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How many people are missing both a name and email?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How many people are missing either a name or email?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# So... What do we do?
+#  1. Drop those rows
+#  2. Interpolate / Best Guess
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 1: Drop those rows.
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 2a: Interpolate / Best Guess
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Create a function to take an email (e.g. "calm_star@wisc.edu")
+# and return the name (e.g. "calm star")
+def parse_name_from_email(email):
+    if pd.isna(email):
+        return np.nan
+    else:
+        pass # TODO Parse out the name!
+
+# Test your function!
+parse_name_from_email("calm_star@wisc.edu")
+```
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.apply(...)`
+Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
+- Applies the input function to every element of the `Series`
+- Returns a new `Series`
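
A small sketch of `Series.apply` with a function that guards against missing values. The helper `email_domain` and the sample addresses are hypothetical and chosen only to show the pattern.

```python
import numpy as np
import pandas as pd

def email_domain(email):
    """Return the part after '@', or NaN when the email is missing."""
    if pd.isna(email):
        return np.nan
    return email.split("@")[1]

# Hypothetical sample values.
emails = pd.Series(["calm_star@wisc.edu", np.nan, "bold_river@example.org"])

# Pass the function object itself (no parentheses); apply calls it per element.
domains = emails.apply(email_domain)
print(domains)
```

The `pd.isna` check matters because `apply` hands missing elements to the function too, and calling `.split` on `NaN` would raise an error.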
+
+%% Cell type:code id: tags:
+
+``` python
+# Now, apply that function to each value in email!
+piazza_df["guessed_name"] = ...  # TODO: replace ... with your code
+piazza_df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Create a function to take a name (e.g. "calm star")
+# and return the email (e.g. "calm_star@wisc.edu")
+def parse_email_from_name(name):
+    pass
+
+# Test your function!
+parse_email_from_name("calm star")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Now, apply that function to each value in name!
+piazza_df["guessed_email"] = ...  # TODO: replace ... with your code
+piazza_df
+```
+
+%% Cell type:markdown id: tags:
+
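+### `pandas.DataFrame.apply(...)`
+Syntax: `DataFrame.apply(<FUNCTION OBJECT REFERENCE>, axis=1)`
+- `axis=1` means the function is applied to each row (each row is passed in as a `Series`)
+- Returns a new `Series` when the function returns a single value per row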
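
To make the row-wise behavior concrete, here is a sketch on a two-row made-up table. The function `pick_name` and the data are assumptions. The last line shows `Series.fillna` with another `Series` as one possible vectorized alternative, not necessarily the approach used in lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real name, one missing name with a guess.
toy_df = pd.DataFrame({
    "name": ["calm star", np.nan],
    "guessed_name": ["calm star", "quiet moon"],
})

def pick_name(row):
    # With axis=1, row is a Series indexed by column name.
    if pd.isna(row["name"]):
        return row["guessed_name"]
    return row["name"]

filled = toy_df.apply(pick_name, axis=1)

# Alternative: fill missing names from the guess column, aligned by index.
filled_alt = toy_df["name"].fillna(toy_df["guessed_name"])
print(list(filled))
```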
+
+%% Cell type:code id: tags:
+
+``` python
+# If the name has a value, use it, otherwise use our best guess!
+piazza_df["name"] = piazza_df.apply(lambda r : r["guessed_name"] if pd.isna(r["name"]) else r["name"], axis=1)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Same thing for email!
+piazza_df["email"] = piazza_df.apply(lambda r : r["guessed_email"] if pd.isna(r["email"]) else r["email"], axis=1)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+help(piazza_df.drop)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Drop the guessing columns
+piazza_df = piazza_df.drop("guessed_name", axis=1)
+piazza_df = piazza_df.drop("guessed_email", axis=1)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+help(piazza_df.dropna)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How many rows are missing data now?
+len(piazza_df) - len(piazza_df.dropna())
+```
+
+%% Cell type:code id: tags:
+
+``` python
+help(piazza_df.fillna)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Give a name of "anonymous" and email of "anonymous@wisc.edu"
+# to anyone left with missing data.
+piazza_df['name'] = piazza_df['name'].fillna('anonymous')
+
+# TODO: now do the email column
+```
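
For reference, a sketch of how `dropna` and `fillna` behave on a small made-up table. The data and variable names are assumptions. Passing a dictionary to `fillna` sets a separate fill value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing names and emails.
toy_df = pd.DataFrame({
    "name": ["calm star", np.nan, "bold river"],
    "email": ["calm_star@wisc.edu", np.nan, np.nan],
})

# dropna keeps only rows with no missing values, so the number of
# incomplete rows is the difference in length.
complete_rows = toy_df.dropna()
num_incomplete = len(toy_df) - len(complete_rows)

# fillna with a dict fills each column with its own default value.
filled = toy_df.fillna({"name": "anonymous", "email": "anonymous@wisc.edu"})
print(num_incomplete)
print(filled)
```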
+
+%% Cell type:markdown id: tags:
+
+### `pandas.DataFrame.groupby(...)`
+
+Syntax: `DataFrame.groupby(<COLUMN>)`
+- Returns a `DataFrameGroupBy` object
+- Apply an aggregation function (e.g. `mean`, `sum`, `count`) to the `DataFrameGroupBy` object to get results
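
A minimal sketch of grouping followed by aggregation. The column names `answers` and `posts` and all values are made up for illustration, since the real Piazza schema is not shown here.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions.
toy_df = pd.DataFrame({
    "role": ["student", "student", "ta", "instructor"],
    "answers": [2, 4, 10, 30],
    "posts": [5, 1, 0, 3],
})

grouped = toy_df.groupby("role")          # a DataFrameGroupBy, no numbers yet

# Select one column, then aggregate: a Series indexed by role.
mean_answers = grouped["answers"].mean()

# Select several columns, then aggregate: a DataFrame indexed by role.
totals = grouped[["answers", "posts"]].sum()
print(mean_answers)
print(totals)
```

By default the group keys become the index of the result and are sorted.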
+
+%% Cell type:code id: tags:
+
+``` python
+# What does this return?
+piazza_df.groupby("role")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Try getting the "mean" of this groupby object.
+piazza_df.groupby("role").mean(numeric_only=True)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How many answers does the average instructor, student, and TA give?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How would we write this in SQL?
+qry("""
+
+""")
+```
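
One way to see the correspondence between `groupby` and SQL `GROUP BY` is to run both on the same made-up table. This sketch uses an in-memory SQLite database. The table name `toy`, the connection name `toy_conn`, and the data are assumptions and are unrelated to `piazza.db`.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data; not the Piazza table.
toy_df = pd.DataFrame({
    "role": ["student", "student", "ta", "instructor"],
    "answers": [2, 4, 10, 30],
})

# Load the toy data into a throwaway in-memory database.
toy_conn = sqlite3.connect(":memory:")
toy_df.to_sql("toy", toy_conn, index=False)

# SQL: GROUP BY picks the groups, AVG is the aggregation.
sql_result = pd.read_sql("""
    SELECT role, AVG(answers) AS avg_answers
    FROM toy
    GROUP BY role
    ORDER BY role
""", toy_conn)
toy_conn.close()

# pandas: groupby picks the groups, mean is the aggregation.
pandas_result = toy_df.groupby("role")["answers"].mean()
print(sql_result)
print(pandas_result)
```

The main visible difference is where the group label ends up: as an ordinary column in the SQL result and as the index in the pandas result.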
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the total number of days spent online for instructors, students, and TAs?
+# Order your answer from lowest to highest
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How would we write this in SQL?
+qry("""
+
+""")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Of those individuals who spend less than 100 days online,
+# how does their average number of posts compare to those that
+# spend 100 days or more online? Do your analysis by role as well.
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How would we write this in SQL?
+qry("""
+
+""")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# What percentage of instructors, students, and TAs did not write a single answer,
+# followup, or reply to a followup?
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# How would we write this in SQL?
+qry("""
+
+""")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+conn.close()
+```
--- a/f24/Louis_Lecture_Notes/37_AdvPandas/Lec37_AdvPandas_Solution.ipynb
+++ b/f24/Louis_Lecture_Notes/37_AdvPandas/Lec37_AdvPandas_Solution.ipynb
--- a/f24/Louis_Lecture_Notes/37_AdvPandas/piazza.db
+++ b/f24/Louis_Lecture_Notes/37_AdvPandas/piazza.db