Compare revisions

eef9c6fd
--- a/f22/andy_lec_notes/lec_35/lec35_plotting1_complete.ipynb
+++ b/f22/andy_lec_notes/lec_35/lec35_plotting1_complete.ipynb
--- a/f22/andy_lec_notes/lec_35/lec35_plotting1_template.ipynb
+++ b/f22/andy_lec_notes/lec_35/lec35_plotting1_template.ipynb
--- a/f22/andy_lec_notes/lec_35/readme.md
+++ b/f22/andy_lec_notes/lec_35/readme.md
--- a/f22/andy_lec_notes/lec_36/iris-flowers.db
+++ b/f22/andy_lec_notes/lec_36/iris-flowers.db
--- a/f22/andy_lec_notes/lec_36/iris.csv
+++ b/f22/andy_lec_notes/lec_36/iris.csv
--- a/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots.ipynb
+++ b/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots.ipynb
--- a/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots_template.ipynb
+++ b/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots_template.ipynb
+%% Cell type:code id: tags:
+
+``` python
+# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
+from IPython.display import display, HTML
+display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+%matplotlib inline
+```
+
+%% Cell type:code id: tags:
+
+``` python
+import pandas as pd
+from pandas import DataFrame, Series
+
+import sqlite3
+import os
+
+import matplotlib
+# new import statement
+from matplotlib import pyplot as plt
+
+import requests
+matplotlib.rcParams["font.size"] = 12
+```
+
+%% Cell type:markdown id: tags:
+
+#### Wrapping up bus dataset example
+
+%% Cell type:markdown id: tags:
+
+#### What are the top routes, and how many people ride them daily?
+
+%% Cell type:code id: tags:
+
+``` python
+path = "bus.db"
+# assert existence of path
+assert os.path.exists(path)
+
+# establish connection to bus.db
+conn = sqlite3.connect(path)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.read_sql("""
+SELECT Route, SUM(DailyBoardings) AS daily
+FROM boarding
+GROUP BY Route
+ORDER BY daily DESC
+""", conn)
+
+df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's extract daily column from df
+df["daily"]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's create a bar plot from daily column Series
+df["daily"].plot.bar()
+
+# Oops wrong x-axis labels!
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = ???
+
+# let's plot for top 5 routes alone
+???
+```
+
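A minimal sketch of one way to fill in the cell above, assuming the goal is to label each bar with its route. The route names and boarding counts are hypothetical stand-ins for the `bus.db` query result, and `top5` is an illustrative name that does not appear in the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running outside a notebook; skip with %matplotlib inline
import pandas as pd

# Hypothetical stand-in for the query result, already sorted by daily DESC
df = pd.DataFrame({
    "Route": ["A", "B", "C", "D", "E", "F", "G"],
    "daily": [9000, 5000, 4500, 4000, 2500, 2000, 1500],
})

# With Route as the index, the bar plot labels each bar with its route
df = df.set_index("Route")

# The rows are sorted, so the first five rows are the top five routes
top5 = df["daily"].iloc[:5]
ax = top5.plot.bar()
ax.set_ylabel("Rides / Day")
```

Setting the index once keeps the labels attached to the values through any later slicing.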
+%% Cell type:code id: tags:
+
+``` python
+# let's use slicing to aggregate the rest of the data
+```
+
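One possible way to do the aggregation, sketched under the assumption that the Series is indexed by route and sorted in descending order. The values and the `"other"` label are hypothetical; only the name `s` is taken from the plotting cell that follows.

```python
import pandas as pd

# Hypothetical daily boardings per route, sorted from busiest to quietest
daily = pd.Series(
    [9000, 5000, 4500, 4000, 2500, 2000, 1500],
    index=["A", "B", "C", "D", "E", "F", "G"],
)

# Keep the top five routes as they are
s = daily.iloc[:5].copy()

# Collapse every remaining route into a single "other" entry
s["other"] = daily.iloc[5:].sum()
```

Because the two slices do not overlap and together cover every row, the total ridership is unchanged by the aggregation.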
+%% Cell type:code id: tags:
+
+``` python
+# let's plot the bars
+ax = (s / 1000).plot.bar(color = "k")
+ax.set_ylabel("Rides / Day (Thousands)")
+None
+```
+
+%% Cell type:code id: tags:
+
+``` python
+conn.close()
+```
+
+%% Cell type:markdown id: tags:
+
+### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
+- This dataset is commonly used in introductory machine learning courses
+- You can train an ML algorithm to predict the class of iris from the measurements
+- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 1:  Downloading IRIS dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
+
+%% Cell type:code id: tags:
+
+``` python
+# use requests to get this URL
+url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
+response = ???
+
+# check that the request was successful
+???
+
+# open a file called "iris.csv" for writing the data locally
+file_obj = open("iris.csv", ???)
+
+# write the text of response to the file object
+file_obj.write(???)
+
+# close the file object
+file_obj.close()
+
+# Look at the file you downloaded. What's wrong with it?
+```
+
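The download steps above could be wrapped up roughly as follows. This is a sketch, not the notebook's solution: `download_text` and its `timeout` parameter are hypothetical names, and a `with` block stands in for the explicit `close()` call.

```python
import requests

def download_text(url, path, timeout=30):
    """Fetch url and save the response body as text at path (illustrative helper)."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx status
    response.raise_for_status()
    # "w" opens the file for writing text; the with block closes it automatically
    with open(path, "w") as file_obj:
        file_obj.write(response.text)
    return path

# Example call (needs network access):
# download_text("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "iris.csv")
```

Saving the file once and reading the local copy afterwards avoids requesting the same data from the server on every run.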
+%% Cell type:markdown id: tags:
+
+#### Warmup 2: Making a DataFrame
+
+%% Cell type:code id: tags:
+
+``` python
+# read the "iris.csv" file into a Pandas dataframe
+iris_df = ???
+
+# display the head of the data frame
+iris_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 3: Our CSV file has no header. Let's add column names.
+- Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
+
+%% Cell type:code id: tags:
+
+``` python
+# Attribute Information:
+# 1. sepal length in cm
+# 2. sepal width in cm
+# 3. petal length in cm
+# 4. petal width in cm
+# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
+
+# These should be our headers
+# ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
+
+iris_df = pd.read_csv("iris.csv",
+                 ???)
+iris_df.head()
+```
+
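A small sketch of how `names` behaves in `pd.read_csv`, using a few inline rows in the dataset's format instead of the downloaded file. The `sample_csv` and `column_names` variables are illustrative.

```python
import io
import pandas as pd

# A few header-less rows in the same layout as iris.csv
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

column_names = ["sep-length", "sep-width", "pet-length", "pet-width", "class"]

# Passing names tells pandas the file has no header row, so the first line stays data
iris_df = pd.read_csv(sample_csv, names=column_names)
```

Without `names`, pandas would treat the first flower's measurements as the column headers and the DataFrame would be one row short.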
+%% Cell type:markdown id: tags:
+
+#### Warmup 4: Connect to our database version of this data!
+
+%% Cell type:code id: tags:
+
+``` python
+iris_conn = sqlite3.connect("iris-flowers.db")
+pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
+Break any ties by ordering by the shortest sepal width.
+
+%% Cell type:code id: tags:
+
+``` python
+pd.read_sql("""
+    SELECT
+    FROM
+    WHERE
+    ORDER BY
+    LIMIT 10
+""", iris_conn)
+```
+
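The query skeleton above might be completed along these lines. This is a sketch against a throwaway in-memory database: the table name `iris` and the `class` column match the later query in this notebook, but the `sep-length` and `sep-width` column names and all row values are assumptions about `iris-flowers.db`.

```python
import sqlite3
import pandas as pd

# Hypothetical rows standing in for the real table
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "sep-length": [5.8, 5.7, 5.8, 5.7, 7.0],
    "sep-width": [4.0, 4.4, 3.5, 3.8, 3.2],
    "class": ["Iris-setosa"] * 4 + ["Iris-versicolor"],
}).to_sql("iris", conn, index=False)

# Column names containing a hyphen must be quoted in SQL
top_setosa = pd.read_sql("""
    SELECT *
    FROM iris
    WHERE class = 'Iris-setosa'
    ORDER BY "sep-length" DESC, "sep-width" ASC
    LIMIT 10
""", conn)
conn.close()
```

The second `ORDER BY` key only matters for rows that tie on the first, which is how the tie-break on sepal width is expressed.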
+%% Cell type:markdown id: tags:
+
+# Lecture 36:  Scatter Plots
+**Learning Objectives**
+- Set the marker, color, and size of scatter plot data
+- Calculate correlation between DataFrame columns
+- Use subplots to group scatterplot data
+
+%% Cell type:markdown id: tags:
+
+## Set the marker, color, and size of scatter plot data
+
+To start, let's look at some made-up data about trees.
+The city of Madison maintains a database of all the trees they care for.
+
+%% Cell type:code id: tags:
+
+``` python
+trees = [
+    {"age": 1, "height": 1.5, "diameter": 0.8},
+    {"age": 1, "height": 1.9, "diameter": 1.2},
+    {"age": 1, "height": 1.8, "diameter": 1.4},
+    {"age": 2, "height": 1.8, "diameter": 0.9},
+    {"age": 2, "height": 2.5, "diameter": 1.5},
+    {"age": 2, "height": 3, "diameter": 1.8},
+    {"age": 2, "height": 2.9, "diameter": 1.7},
+    {"age": 3, "height": 3.2, "diameter": 2.1},
+    {"age": 3, "height": 3, "diameter": 2},
+    {"age": 3, "height": 2.4, "diameter": 2.2},
+    {"age": 2, "height": 3.1, "diameter": 2.9},
+    {"age": 4, "height": 2.5, "diameter": 3.1},
+    {"age": 4, "height": 3.9, "diameter": 3.1},
+    {"age": 4, "height": 4.9, "diameter": 2.8},
+    {"age": 4, "height": 5.2, "diameter": 3.5},
+    {"age": 4, "height": 4.8, "diameter": 4},
+]
+trees_df = DataFrame(trees)
+trees_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### Scatter Plots
+We can make a scatter plot of a DataFrame using the following method call:
+
+`df_name.plot.scatter(x = "x_col_name", y = "y_col_name", \
+                     color = "red", marker = "*", s = 50)`
+
+%% Cell type:markdown id: tags:
+
+Plot the trees data comparing a tree's age to its height...
+ - What is `df_name`?
+ - What is `x_col_name`?
+ - What is `y_col_name`?
+
+%% Cell type:code id: tags:
+
+``` python
+# TODO: change y to diameter
+```
+
+%% Cell type:markdown id: tags:
+
+Now plot with a little more beautification...
+ - Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
+ - Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
+ - Change the size (any int)
+
+%% Cell type:code id: tags:
+
+``` python
+# Plot with some more beautification options.
+trees_df.plot.scatter(x = "age", y = "height", color = "r",  marker = "D", s = 50)
+# D for diamond
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Add a title to your plot.
+ax = trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
+# D for diamond
+ax.set_title("Tree Age vs Height")
+```
+
+%% Cell type:markdown id: tags:
+
+#### Correlation
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the correlation between our DataFrame columns?
+corr_df = trees_df.corr()
+corr_df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the correlation between age and height? (don't use .iloc)
+# Indexing by column and row label here isn't considered hardcoding
+corr_df['age']['height']
+```
+
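To show what `.corr()` is computing, here is a sketch that works out the Pearson correlation by hand on a subset of the tree rows and compares it with the pandas result. The names `manual_r` and `pandas_r` are illustrative.

```python
import pandas as pd

# A subset of the tree rows from the lecture data
trees_df = pd.DataFrame({
    "age": [1, 1, 2, 2, 3, 4, 4],
    "height": [1.5, 1.9, 1.8, 2.5, 3.2, 3.9, 5.2],
})

# Deviation of each value from its column mean
age_dev = trees_df["age"] - trees_df["age"].mean()
height_dev = trees_df["height"] - trees_df["height"].mean()

# Pearson r: sum of products of deviations, divided by the product of their magnitudes
manual_r = (age_dev * height_dev).sum() / (
    (age_dev ** 2).sum() * (height_dev ** 2).sum()
) ** 0.5

# The same number, looked up by label in the correlation matrix
pandas_r = trees_df.corr().loc["age", "height"]
```

The result always lies between -1 and 1, and the matrix is symmetric, so `.loc["height", "age"]` gives the same value.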
+%% Cell type:markdown id: tags:
+
+### Varying Stylistic Parameters
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 1:
+trees_df.plot.scatter(x = "age", y = "height",  marker = "H", s = "diameter")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 2:
+# this way allows you to make it bigger
+trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = trees_df["diameter"] * 50)
+```
+
+%% Cell type:markdown id: tags:
+
+## Use subplots to group scatterplot data
+
+%% Cell type:markdown id: tags:
+
+### Re-visit the Iris Data
+
+%% Cell type:code id: tags:
+
+``` python
+iris_df
+```
+
+%% Cell type:markdown id: tags:
+
+### How do we create a *scatter plot* for various *class types*?
+First, gather all the class types.
+
+%% Cell type:code id: tags:
+
+``` python
+# In Pandas
+varietes = list(set(iris_df["class"]))
+varietes
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# In SQL
+varietes = list(pd.read_sql("""
+    SELECT DISTINCT class
+    FROM iris
+""", iris_conn)["class"])
+varietes
+```
+
+%% Cell type:markdown id: tags:
+
+In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
+
+%% Cell type:code id: tags:
+
+``` python
+# If you want to continue using SQL instead, don't close the connection!
+iris_conn.close()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Change this scatter plot so that the data is only for class ='Iris-setosa'
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Write a for loop that iterates through each variety in varietes
+# and makes a plot for only that class
+
+# For each class add a color and a marker style
+colors = ["blue", "green", "red"]
+markers = ["o", "^", "v"]
+
+for i in range(len(varietes)):
+    ???
+```
+
+%% Cell type:markdown id: tags:
+
+Did you notice that it made 3 plots? What's deceiving about this?
+
+%% Cell type:markdown id: tags:
+
+### We can draw several plots in one plot area, called an AxesSubplot, using the keyword argument `ax`
+1. If an AxesSubplot is passed as `ax`, the data is plotted in that subplot
+2. If `ax` is None, a new AxesSubplot is created
+3. The AxesSubplot that was used is returned
+
+%% Cell type:code id: tags:
+
+``` python
+# complete this code to make 3 plots in one
+
+plot_area = None   # don't change this...look at this variable in line 12
+colors = ["blue", "green", "red"]
+markers = ["o", "^", "v"]
+for i in range(len(varietes)):
+    ???
+```
+
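One way the loop could be completed, sketched with a tiny hypothetical DataFrame in place of the full iris data. The choice of `pet-width` and `pet-length` as axes and the `label` argument are assumptions; `varietes` keeps the notebook's spelling.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running outside a notebook; skip with %matplotlib inline
import pandas as pd

# Hypothetical rows: two flowers per class
iris_df = pd.DataFrame({
    "pet-width": [0.2, 0.3, 1.4, 1.5, 2.1, 2.3],
    "pet-length": [1.4, 1.5, 4.7, 4.5, 5.9, 5.4],
    "class": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})
varietes = sorted(set(iris_df["class"]))

plot_area = None  # no subplot exists before the first iteration
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
for i in range(len(varietes)):
    variety_df = iris_df[iris_df["class"] == varietes[i]]
    # First pass: ax is None, so a new subplot is created and returned.
    # Later passes: the returned subplot is passed back in and reused.
    plot_area = variety_df.plot.scatter(
        x="pet-width", y="pet-length",
        color=colors[i], marker=markers[i],
        label=varietes[i], ax=plot_area,
    )
```

Since all three classes share one set of axes, their positions can be compared directly, which separate plots with independently scaled axes do not allow.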
+%% Cell type:markdown id: tags:
+
+### Let's focus on "Iris-virginica" data
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica = ???
+assert len(iris_virginica) == 50
+iris_virginica.head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica.plot.scatter(x = "pet-width", y = "pet-length")
+```
+
+%% Cell type:markdown id: tags:
+
+### Let's learn about *xlim* and *ylim*
+- Allows us to set x-axis and y-axis limits
+- Takes either a single value (LOWER-BOUND) or a tuple containing two values (LOWER-BOUND, UPPER-BOUND)
+- You need to be careful about setting the UPPER-BOUND
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica.plot.scatter(x = "pet-width", y = "pet-length", xlim = ???, ylim = ???)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                    xlim = (0, 6), ylim = (0, 6),
+                    figsize = (3, 3))
+
+# What is wrong with this plot?
+```
+
+%% Cell type:markdown id: tags:
+
+What is the maximum pet-length?
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax.get_ylim()
+```
+
+%% Cell type:markdown id: tags:
+
+Let's include assert statements to make sure we don't crop the plot!
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                     xlim = (0, 6), ylim = (0, 6),
+                     figsize = (3, 3))
+assert iris_virginica["pet-length"].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:markdown id: tags:
+
+### Now let's try all 4 assert statements
+
+```
+assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
+assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
+assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
+assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                     xlim = (0, 7), ylim = (0, 7),
+                     figsize = (3, 3))
+assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
+assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
+assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
+assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
+```
+
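The four assertions can be bundled into a reusable check. The sketch below assumes, as the cell above does, that the axis labels equal the plotted column names; `assert_not_cropped` and the `flowers` data are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running outside a notebook; skip with %matplotlib inline
import pandas as pd

def assert_not_cropped(ax, df):
    """Fail if any point of df falls outside the axis limits of ax (illustrative helper)."""
    x_col, y_col = ax.get_xlabel(), ax.get_ylabel()
    x_low, x_high = ax.get_xlim()
    y_low, y_high = ax.get_ylim()
    assert x_low <= df[x_col].min() and df[x_col].max() <= x_high
    assert y_low <= df[y_col].min() and df[y_col].max() <= y_high

# Hypothetical measurements; the largest pet-length is above 6
flowers = pd.DataFrame({
    "pet-width": [1.8, 2.5, 2.0],
    "pet-length": [5.1, 6.9, 6.0],
})

ax = flowers.plot.scatter(x="pet-width", y="pet-length",
                          xlim=(0, 7), ylim=(0, 7), figsize=(3, 3))
assert_not_cropped(ax, flowers)  # passes: every point lies within (0, 7)
```

With `ylim = (0, 6)` the same call would raise an `AssertionError`, because the point at 6.9 would be silently left off the plot.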
+%% Cell type:markdown id: tags:
+
+### Time-Permitting
+Plot this data in an interesting/meaningful way & identify any correlations.
+
+%% Cell type:code id: tags:
+
+``` python
+students = pd.DataFrame({
+    "name": [
+        "Cole",
+        "Cynthia",
+        "Alice",
+        "Seth"
+    ],
+    "grade": [
+        "C",
+        "AB",
+        "B",
+        "BC"
+    ],
+    "gpa": [
+        2.0,
+        3.5,
+        3.0,
+        2.5
+    ],
+    "attendance": [
+        4,
+        11,
+        10,
+        6
+    ],
+    "height": [
+        68,
+        66,
+        60,
+        72
+    ]
+})
+students
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Min, Max, and Overall Difference in Student Height
+min_height = students["height"].min()
+max_height = students["height"].max()
+diff_height = max_height - min_height
+
+# Normalize student heights to the range [0, 1] (black to white)
+height_colors = (students["height"] - min_height) / diff_height
+
+# Rescale to the range [0, 0.5] (black to gray)
+height_colors = height_colors / 2
+
+# A grayscale color must be given as a string (e.g. c='0.34')
+height_colors = height_colors.astype("string")
+
+height_colors
+```
+
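As a worked example of the arithmetic, the sketch below traces the four student heights through the same min-max scaling. Rounding to two decimals is an addition for readability, and the variable names are illustrative.

```python
import pandas as pd

heights = pd.Series([68, 66, 60, 72])

# Min-max scaling: the shortest student maps to 0, the tallest to 1
scaled = (heights - heights.min()) / (heights.max() - heights.min())

# Halve the range so the tallest student is mid-gray instead of white
gray_levels = scaled / 2

# matplotlib reads a string such as "0.25" as a shade of gray ("0.0" is black)
gray_strings = gray_levels.round(2).astype(str)
```

For a height of 66 this gives (66 - 60) / (72 - 60) = 0.5, which is then halved to 0.25.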
+%% Cell type:code id: tags:
+
+``` python
+students.plot.scatter(x="attendance", y="gpa", c=height_colors)
+```
+
+%% Cell type:code id: tags:
+
+``` python
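+# numeric_only=True skips the text columns (name, grade), which cannot be correlated
+students.corr(numeric_only=True)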
+```
--- a/f22/andy_lec_notes/lec_36/readme.md
+++ b/f22/andy_lec_notes/lec_36/readme.md
--- a/f22/andy_lec_notes/lec_37/fire_hydrants.csv
+++ b/f22/andy_lec_notes/lec_37/fire_hydrants.csv
--- a/f22/andy_lec_notes/lec_37/lec37_plotting3_complete.ipynb
+++ b/f22/andy_lec_notes/lec_37/lec37_plotting3_complete.ipynb
--- a/f22/andy_lec_notes/lec_37/lec37_plotting3_template.ipynb
+++ b/f22/andy_lec_notes/lec_37/lec37_plotting3_template.ipynb
--- a/f22/andy_lec_notes/lec_37/readme.md
+++ b/f22/andy_lec_notes/lec_37/readme.md
--- a/f22/andy_lec_notes/lec_38/lec38_plotting4_complete.ipynb
+++ b/f22/andy_lec_notes/lec_38/lec38_plotting4_complete.ipynb
--- a/f22/andy_lec_notes/lec_38/lec38_plotting4_template.ipynb
+++ b/f22/andy_lec_notes/lec_38/lec38_plotting4_template.ipynb
--- a/f22/andy_lec_notes/lec_38/readme.md
+++ b/f22/andy_lec_notes/lec_38/readme.md
--- a/f22/andy_lec_notes/lec_36/lec36_plotting2_850.ipynb
+++ b/f22/andy_lec_notes/lec_36/lec36_plotting2_850.ipynb
-%% Cell type:code id: tags:
-
-``` python
-# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
-from IPython.core.display import display, HTML
-display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
-```
-
-%% Output
-
-
-%% Cell type:code id: tags:
-
-``` python
-import pandas as pd
-from pandas import DataFrame, Series
-
-import sqlite3
-import os
-
-import matplotlib
-from matplotlib import pyplot as plt
-
-import requests
-matplotlib.rcParams["font.size"] = 12
-```
-
-%% Cell type:markdown id: tags:
-
-### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This set of data is used in beginning Machine Learning Courses
- You can train a ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 1:  Requests and file writing
-
-# use requests to get this file  "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
-response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
-
-# check that the request was successful
-response.raise_for_status()
-
-# open a file called "iris.csv" for writing the data locally to avoid spamming their server
-file_obj = open("iris.csv", "w")
-
-# write the text of response to the file object
-file_obj.write(response.text)
-
-# close the file object
-file_obj.close()
-
-# Look at the file you downloaded. What's wrong with it?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 2:  Making a DataFrame
-
-# read the "iris.csv" file into a Pandas dataframe
-
-# display the head of the data frame
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 3: Our CSV file has no header....let's add column names.
-#           Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
-
-# Attribute Information:
-# 1. sepal length in cm
-# 2. sepal width in cm
-# 3. petal length in cm
-# 4. petal width in cm
-# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
-
-# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 4: Connect to our database version of this data
-iris_conn = sqlite3.connect("iris-flowers.db")
-
-# find out the name of the table
-pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
-#           Break any ties by ordering by the shortest sepal width.
-
-pd.read_sql("""
-
-""", iris_conn)
-```
-
-%% Cell type:markdown id: tags:
-
-# Lecture 36:  Scatter Plots
-**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
-
-%% Cell type:markdown id: tags:
-
-## Set the marker, color, and size of scatter plot data
-
-To start, let's look at some made-up data about Trees.
-The city of Madison maintains a database of all the trees they care for.
-
-%% Cell type:code id: tags:
-
-``` python
-trees = [
-    {"age": 1, "height": 1.5, "diameter": 0.8},
-    {"age": 1, "height": 1.9, "diameter": 1.2},
-    {"age": 1, "height": 1.8, "diameter": 1.4},
-    {"age": 2, "height": 1.8, "diameter": 0.9},
-    {"age": 2, "height": 2.5, "diameter": 1.5},
-    {"age": 2, "height": 3, "diameter": 1.8},
-    {"age": 2, "height": 2.9, "diameter": 1.7},
-    {"age": 3, "height": 3.2, "diameter": 2.1},
-    {"age": 3, "height": 3, "diameter": 2},
-    {"age": 3, "height": 2.4, "diameter": 2.2},
-    {"age": 2, "height": 3.1, "diameter": 2.9},
-    {"age": 4, "height": 2.5, "diameter": 3.1},
-    {"age": 4, "height": 3.9, "diameter": 3.1},
-    {"age": 4, "height": 4.9, "diameter": 2.8},
-    {"age": 4, "height": 5.2, "diameter": 3.5},
-    {"age": 4, "height": 4.8, "diameter": 4},
-]
-trees_df = DataFrame(trees)
-trees_df.head()
-```
-
-%% Cell type:markdown id: tags:
-
-### Scatter Plots
-We can make a scatter plot of a DataFrame using the following function...
-
-`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
-
-Plot the trees data comparing a tree's age to its height...
- - What is `df_name`?
- - What is `x_col_name`?
- - What is `y_col_name`?
-
-%% Cell type:code id: tags:
-
-``` python
-```
-
-%% Cell type:markdown id: tags:
-
-Now plot with a little more beautification...
- - Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- - Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- - Change the size (any int)
-
-%% Cell type:code id: tags:
-
-``` python
-# Plot with some more beautification options.
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Add a title to your plot.
-```
-
-%% Cell type:markdown id: tags:
-
-#### Correlation
-
-%% Cell type:code id: tags:
-
-``` python
-# What is the correlation between our DataFrame columns?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# What is the correlation between age and height (don't use .iloc)
-```
-
-%% Cell type:markdown id: tags:
-
-### The Size can be based on a DataFrame value
-
-%% Cell type:code id: tags:
-
-``` python
-# Option 1:
-trees_df.plot.scatter(x="age", y="height",  marker="H", s="diameter")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Option 2:
-trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
-```
-
-%% Cell type:markdown id: tags:
-
-## Use subplots to group scatterplot data
-
-%% Cell type:markdown id: tags:
-
-### Re-visit the Iris Data
-
-%% Cell type:code id: tags:
-
-``` python
-iris_df
-```
-
-%% Cell type:markdown id: tags:
-
-### How do we create a *scatter plot* for various *class types*?
-First, gather all the class types.
-
-%% Cell type:code id: tags:
-
-``` python
-# In Pandas
-varieties = ???
-varieties
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# In SQL
-varietes = pd.read_sql("""
-
-""", iris_conn)
-varietes
-```
-
-%% Cell type:markdown id: tags:
-
-In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
-
-%% Cell type:code id: tags:
-
-``` python
-# If you want to continue using SQL instead, don't close the connection!
-iris_conn.close()
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Change this scatter plot so that the data is only for class ='Iris-setosa'
-iris_df.plot.scatter(x = "pet-width", y = "pet-length")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Write a for loop that iterates through each variety in varieties
-# and makes a plot for only that class
-
-for i in range(len(varieties)):
-    variety = varieties[i]
-    pass
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# copy/paste the code above, but this time make each plot a different color
-colors = ["blue", "green", "red"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# copy/paste the code above, but this time make each plot a different color AND marker
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Did you notice that it made 3 plots?!?! What's deceiving about this?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Have to be VERY careful to not crop out data.
-# We'll talk about this next lecture.
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Better yet, we could combine these.
-```
-
-%% Cell type:markdown id: tags:
-
-### We can make Subplots in plots, called an AxesSubplot, keyword ax
-1. if an AxesSubplot ax is passed, then plot in that subplot
-2. if ax is None, create a new AxesSubplot
-3. return AxesSubplot that was used
-
-%% Cell type:code id: tags:
-
-``` python
-# complete this code to make 3 plots in one
-
-plot_area = None   # don't change this...look at this variable in line 12
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:markdown id: tags:
-
-### Time-Permitting
-Plot this data in an interesting/meaningful way & identify any correlations.
-
-%% Cell type:code id: tags:
-
-``` python
-students = pd.DataFrame({
-    "name": [
-        "Cole",
-        "Cynthia",
-        "Alice",
-        "Seth"
-    ],
-    "grade": [
-        "C",
-        "AB",
-        "B",
-        "BC"
-    ],
-    "gpa": [
-        2.0,
-        3.5,
-        3.0,
-        2.5
-    ],
-    "attendance": [
-        4,
-        11,
-        10,
-        6
-    ],
-    "height": [
-        68,
-        66,
-        60,
-        72
-    ]
-})
-students
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Min, Max, and Overall Difference in Student Height
-min_height = students["height"].min()
-max_height = students["height"].max()
-diff_height = max_height - min_height
-
-# Normalize students heights on a scale of [0, 1] (black to white)
-height_colors = (students["height"] - min_height) / diff_height
-
-# Normalize students heights on a scale of [0, 0.5] (black to gray)
-height_colors = height_colors / 2
-
-# Color must be a string (e.g. c='0.34')
-height_colors = height_colors.astype("string")
-
-height_colors
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Plot!
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# What are the correlations?
-```
-
-%% Cell type:markdown id: tags:
-
-![image.png](attachment:image.png)
-
-%% Cell type:markdown id: tags:
-
-https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500
-%% Cell type:code id: tags:
-
-``` python
-# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
-from IPython.core.display import display, HTML
-display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
-```
-
-%% Output
-
-
-%% Cell type:code id: tags:
-
-``` python
-import pandas as pd
-from pandas import DataFrame, Series
-
-import sqlite3
-import os
-
-import matplotlib
-from matplotlib import pyplot as plt
-
-import requests
-matplotlib.rcParams["font.size"] = 12
-```
-
-%% Cell type:markdown id: tags:
-
-### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This set of data is used in beginning Machine Learning Courses
- You can train a ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 1:  Requests and file writing
-
-# use requests to get this file  "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
-response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
-
-# check that the request was successful
-response.raise_for_status()
-
-# open a file called "iris.csv" for writing the data locally to avoid spamming their server
-file_obj = open("iris.csv", "w")
-
-# write the text of response to the file object
-file_obj.write(response.text)
-
-# close the file object
-file_obj.close()
-
-# Look at the file you downloaded. What's wrong with it?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 2:  Making a DataFrame
-
-# read the "iris.csv" file into a Pandas dataframe
-
-# display the head of the data frame
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 3: Our CSV file has no header....let's add column names.
-#           Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
-
-# Attribute Information:
-# 1. sepal length in cm
-# 2. sepal width in cm
-# 3. petal length in cm
-# 4. petal width in cm
-# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
-
-# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 4: Connect to our database version of this data
-iris_conn = sqlite3.connect("iris-flowers.db")
-
-# find out the name of the table
-pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
-#           Break any ties by ordering by the shortest sepal width.
-
-pd.read_sql("""
-
-""", iris_conn)
-```
-
-%% Cell type:markdown id: tags:
-
-# Lecture 36:  Scatter Plots
-**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
-
-%% Cell type:markdown id: tags:
-
-## Set the marker, color, and size of scatter plot data
-
-To start, let's look at some made-up data about Trees.
-The city of Madison maintains a database of all the trees they care for.
-
-%% Cell type:code id: tags:
-
-``` python
-trees = [
-    {"age": 1, "height": 1.5, "diameter": 0.8},
-    {"age": 1, "height": 1.9, "diameter": 1.2},
-    {"age": 1, "height": 1.8, "diameter": 1.4},
-    {"age": 2, "height": 1.8, "diameter": 0.9},
-    {"age": 2, "height": 2.5, "diameter": 1.5},
-    {"age": 2, "height": 3, "diameter": 1.8},
-    {"age": 2, "height": 2.9, "diameter": 1.7},
-    {"age": 3, "height": 3.2, "diameter": 2.1},
-    {"age": 3, "height": 3, "diameter": 2},
-    {"age": 3, "height": 2.4, "diameter": 2.2},
-    {"age": 2, "height": 3.1, "diameter": 2.9},
-    {"age": 4, "height": 2.5, "diameter": 3.1},
-    {"age": 4, "height": 3.9, "diameter": 3.1},
-    {"age": 4, "height": 4.9, "diameter": 2.8},
-    {"age": 4, "height": 5.2, "diameter": 3.5},
-    {"age": 4, "height": 4.8, "diameter": 4},
-]
-trees_df = DataFrame(trees)
-trees_df.head()
-```
-
-%% Cell type:markdown id: tags:
-
-### Scatter Plots
-We can make a scatter plot of a DataFrame using the following function...
-
-`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
-
-Plot the trees data comparing a tree's age to its height...
- - What is `df_name`?
- - What is `x_col_name`?
- - What is `y_col_name`?
-
-%% Cell type:code id: tags:
-
-``` python
-```
-
-%% Cell type:markdown id: tags:
-
-Now plot with a little more beautification...
- - Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- - Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- - Change the size (any int)
-
-%% Cell type:code id: tags:
-
-``` python
-# Plot with some more beautification options.
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Add a title to your plot.
-```
-
-%% Cell type:markdown id: tags:
-
-#### Correlation
-
-%% Cell type:code id: tags:
-
-``` python
-# What is the correlation between our DataFrame columns?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# What is the correlation between age and height (don't use .iloc)
-```
-
-%% Cell type:markdown id: tags:
-
-### The Size can be based on a DataFrame value
-
-%% Cell type:code id: tags:
-
-``` python
-# Option 1:
-trees_df.plot.scatter(x="age", y="height",  marker="H", s="diameter")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Option 2:
-trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
-```
-
-%% Cell type:markdown id: tags:
-
-## Use subplots to group scatterplot data
-
-%% Cell type:markdown id: tags:
-
-### Re-visit the Iris Data
-
-%% Cell type:code id: tags:
-
-``` python
-iris_df
-```
-
-%% Cell type:markdown id: tags:
-
-### How do we create a *scatter plot* for various *class types*?
-First, gather all the class types.
-
-%% Cell type:code id: tags:
-
-``` python
-# In Pandas
-varieties = ???
-varieties
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# In SQL
-varietes = pd.read_sql("""
-
-""", iris_conn)
-varietes
-```
-
-%% Cell type:markdown id: tags:
-
-In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
-
-%% Cell type:code id: tags:
-
-``` python
-# If you want to continue using SQL instead, don't close the connection!
-iris_conn.close()
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Change this scatter plot so that the data is only for class ='Iris-setosa'
-iris_df.plot.scatter(x = "pet-width", y = "pet-length")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Write a for loop that iterates through each variety in varieties
-# and makes a plot for only that class
-
-for i in range(len(varieties)):
-    variety = varieties[i]
-    pass
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# copy/paste the code above, but this time make each plot a different color
-colors = ["blue", "green", "red"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# copy/paste the code above, but this time make each plot a different color AND marker
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Did you notice that it made 3 plots?!?! What's deceiving about this?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Have to be VERY careful to not crop out data.
-# We'll talk about this next lecture.
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Better yet, we could combine these.
-```
-
-%% Cell type:markdown id: tags:
-
-### We can make Subplots in plots, called an AxesSubplot, keyword ax
-1. if an AxesSubplot ax is passed, then plot in that subplot
-2. if ax is None, create a new AxesSubplot
-3. return AxesSubplot that was used
-
-%% Cell type:code id: tags:
-
-``` python
-# complete this code to make 3 plots in one
-
-plot_area = None   # don't change this...look at this variable in line 12
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:markdown id: tags:
-
-### Time-Permitting
-Plot this data in an interesting/meaningful way & identify any correlations.
-
-%% Cell type:code id: tags:
-
-``` python
-students = pd.DataFrame({
-    "name": [
-        "Cole",
-        "Cynthia",
-        "Alice",
-        "Seth"
-    ],
-    "grade": [
-        "C",
-        "AB",
-        "B",
-        "BC"
-    ],
-    "gpa": [
-        2.0,
-        3.5,
-        3.0,
-        2.5
-    ],
-    "attendance": [
-        4,
-        11,
-        10,
-        6
-    ],
-    "height": [
-        68,
-        66,
-        60,
-        72
-    ]
-})
-students
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Min, Max, and Overall Difference in Student Height
-min_height = students["height"].min()
-max_height = students["height"].max()
-diff_height = max_height - min_height
-
-# Normalize students heights on a scale of [0, 1] (black to white)
-height_colors = (students["height"] - min_height) / diff_height
-
-# Normalize students heights on a scale of [0, 0.5] (black to gray)
-height_colors = height_colors / 2
-
-# Color must be a string (e.g. c='0.34')
-height_colors = height_colors.astype("string")
-
-height_colors
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Plot!
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# What are the correlations?
-```
-
-%% Cell type:markdown id: tags:
-
-![image.png](attachment:image.png)
-
-%% Cell type:markdown id: tags:
-
-https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500
--- a/f22/meena_lec_notes/lec-35/lec_35_pandas3_data_transformation.ipynb
+++ b/f22/meena_lec_notes/lec-35/lec_35_pandas3_data_transformation.ipynb
+%% Cell type:code id: tags:
+
+``` python
+# known import statements
+import pandas as pd
+import sqlite3 as sql # note that we are renaming to sql
+import os
+
+# new import statement
+import numpy as np
+```
+
+%% Cell type:markdown id: tags:
+
+# Lecture 35 Pandas 3: Data Transformation
+* Data transformation is the process of changing the format, structure, or values of data.
+* Often needed during data cleaning and sometimes during data analysis
+
+%% Cell type:markdown id: tags:
+
+# Today's Learning Objectives:
+
+* Set a column as the index of a pandas `DataFrame`
+* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
+* Apply transformations to a `DataFrame`:
+  * Use `apply` on a pandas `Series` to apply a transformation function
+  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
+* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
+* Convert `.groupby` examples to SQL
+* Solve the same question using SQL and pandas `DataFrame` manipulations:
+  * filtering, grouping, and aggregation / summarization
+
+%% Cell type:markdown id: tags:
+
+# The dataset: Spotify songs
+Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
+
+If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 1: Establish a connection to the spotify.db database
+
+%% Cell type:code id: tags:
+
+``` python
+# open up the spotify database
+db_pathname = "spotify.db"
+assert os.path.exists(db_pathname)
+conn = sql.connect(db_pathname)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def qry(sql):
+    return pd.read_sql(sql, conn)
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 2: Identify the table name(s) inside the database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * from sqlite_master")
+df
+```
+
+%% Output
+
+        type                        name tbl_name  rootpage  \
+    0  table                     spotify  spotify      1527
+    1  index  sqlite_autoindex_spotify_1  spotify      1528
+    
+                                                     sql
+    0  CREATE TABLE spotify(\nid TEXT PRIMARY KEY,\nt...
+    1                                               None
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 3: Extract the "sql" column with a pandas lookup expression, then use `.iloc` to display the full query
+
+%% Cell type:code id: tags:
+
+``` python
+print(df["sql"].iloc[0])
+```
+
+%% Output
+
+    CREATE TABLE spotify(
+    id TEXT PRIMARY KEY,
+    title BLOB,
+    song_name BLOB,
+    genre TEXT,
+    duration_ms INTEGER,
+    key INTEGER,
+    mode INTEGER,
+    time_signature INTEGER,
+    tempo REAL,
+    acousticness REAL,
+    danceability REAL,
+    energy REAL,
+    instrumentalness REAL,
+    liveness REAL,
+    loudness REAL,
+    speechiness REAL,
+    valence REAL)
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * FROM spotify")
+df
+```
+
+%% Output
+
+                               id                        title  song_name  \
+    0      7pgJBLVz5VmnL7uGHmRj6p                               Pathology
+    1      0vSWgAlfpye0WCGeNmuNhy                                Symbiote
+    2      7EL7ifncK2PWFYThJjzR25                               BRAINFOOD
+    3      1umsRbM7L4ju7rn9aU8Ju6                               Sacrifice
+    4      4SKqOHKYU5pgHr5UiVKiQN                                Backpack
+    ...                       ...                          ...        ...
+    35872  46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle
+    35873  0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist
+    35874  72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020
+    35875  6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle
+    35876  6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020
+    
+               genre  duration_ms  key  mode  time_signature    tempo  \
+    0      Dark Trap       224427    8     1               4  115.080
+    1      Dark Trap        98821    5     1               4  218.050
+    2      Dark Trap       101172    8     1               4  189.938
+    3      Dark Trap        96062   10     0               4  139.990
+    4      Dark Trap       135079    5     1               4  128.014
+    ...          ...          ...  ...   ...             ...      ...
+    35872  hardstyle       269208    4     1               4  150.013
+    35873  hardstyle       210112    0     0               4  149.928
+    35874  hardstyle       234823    8     1               4  154.935
+    35875  hardstyle       323200    6     0               4  150.042
+    35876  hardstyle       162161    9     1               4  155.047
+    
+           acousticness  danceability  energy  instrumentalness  liveness  \
+    0          0.401000         0.719   0.493          0.000000    0.1180
+    1          0.013800         0.850   0.893          0.000004    0.3720
+    2          0.187000         0.864   0.365          0.000000    0.1160
+    3          0.145000         0.767   0.576          0.000003    0.0968
+    4          0.007700         0.765   0.726          0.000000    0.6190
+    ...             ...           ...     ...               ...       ...
+    35872      0.031500         0.528   0.693          0.000345    0.1210
+    35873      0.022500         0.517   0.768          0.000018    0.2050
+    35874      0.026000         0.361   0.821          0.000242    0.3850
+    35875      0.000551         0.477   0.921          0.029600    0.0575
+    35876      0.001890         0.529   0.945          0.000055    0.4140
+    
+           loudness  speechiness  valence
+    0        -7.230       0.0794   0.1240
+    1        -4.783       0.0623   0.0391
+    2       -10.219       0.0655   0.0478
+    3        -9.683       0.2560   0.1870
+    4        -5.580       0.1910   0.2700
+    ...         ...          ...      ...
+    35872    -5.148       0.0304   0.3940
+    35873    -7.922       0.0479   0.3830
+    35874    -3.102       0.0505   0.1240
+    35875    -4.777       0.0392   0.4880
+    35876    -5.862       0.0615   0.1340
+    
+    [35877 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+### Setting a column as row indices for the `DataFrame`
+
+- Syntax: `df.set_index("<COLUMN>")`
+- Returns a new `DataFrame`; the original `DataFrame` is not modified.
+- WARNING: executing this twice raises a `KeyError`. Once a column becomes the row index, it is no longer a column of the `DataFrame`. If that happens, re-run the cell above to reload `df`, then execute the cell below exactly once.
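
The behavior described in the list above can be checked on a tiny table. This is a minimal sketch, assuming a hypothetical three-row `songs` DataFrame standing in for the real `spotify` data; the names `songs`, `indexed`, and `second_call_failed` are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for the spotify table
songs = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "song_name": ["Pathology", "Symbiote", "BRAINFOOD"],
    "tempo": [115.080, 218.050, 189.938],
})

# set_index returns a new DataFrame; songs itself keeps "id" as a column
indexed = songs.set_index("id")
print(list(songs.columns))
print(list(indexed.columns))

# Rows can now be looked up by their id label
print(indexed.loc["b2", "song_name"])

# A second set_index("id") on the result fails: "id" is no longer a column
try:
    indexed.set_index("id")
    second_call_failed = False
except KeyError:
    second_call_failed = True
print(second_call_failed)
```

Keeping the result in a separate variable, as in this sketch, is one way to make a cell safe to re-run.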
+
+%% Cell type:code id: tags:
+
+``` python
+# Set the id column as row indices
+df = df.set_index("id")
+df
+```
+
+%% Output
+
+                                                  title  song_name      genre  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p                               Pathology  Dark Trap
+    0vSWgAlfpye0WCGeNmuNhy                                Symbiote  Dark Trap
+    7EL7ifncK2PWFYThJjzR25                               BRAINFOOD  Dark Trap
+    1umsRbM7L4ju7rn9aU8Ju6                               Sacrifice  Dark Trap
+    4SKqOHKYU5pgHr5UiVKiQN                                Backpack  Dark Trap
+    ...                                             ...        ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle             hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist             hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020             hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle             hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020             hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p       224427    8     1               4  115.080
+    0vSWgAlfpye0WCGeNmuNhy        98821    5     1               4  218.050
+    7EL7ifncK2PWFYThJjzR25       101172    8     1               4  189.938
+    1umsRbM7L4ju7rn9aU8Ju6        96062   10     0               4  139.990
+    4SKqOHKYU5pgHr5UiVKiQN       135079    5     1               4  128.014
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p      0.401000         0.719   0.493          0.000000
+    0vSWgAlfpye0WCGeNmuNhy      0.013800         0.850   0.893          0.000004
+    7EL7ifncK2PWFYThJjzR25      0.187000         0.864   0.365          0.000000
+    1umsRbM7L4ju7rn9aU8Ju6      0.145000         0.767   0.576          0.000003
+    4SKqOHKYU5pgHr5UiVKiQN      0.007700         0.765   0.726          0.000000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    0.1180    -7.230       0.0794   0.1240
+    0vSWgAlfpye0WCGeNmuNhy    0.3720    -4.783       0.0623   0.0391
+    7EL7ifncK2PWFYThJjzR25    0.1160   -10.219       0.0655   0.0478
+    1umsRbM7L4ju7rn9aU8Ju6    0.0968    -9.683       0.2560   0.1870
+    4SKqOHKYU5pgHr5UiVKiQN    0.6190    -5.580       0.1910   0.2700
+    ...                          ...       ...          ...      ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304   0.3940
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479   0.3830
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505   0.1240
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392   0.4880
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615   0.1340
+    
+    [35877 rows x 16 columns]
+
+%% Cell type:markdown id: tags:
+
+### Not a Number
+
+- `np.nan` is the floating-point representation of Not a Number (older NumPy releases also accept the spelling `np.NaN`)
+- You do not need to know / learn the details about the `numpy` package
+
+### Replacing / modifying values within the `DataFrame`
+
+Syntax: `df.replace(<TARGET>, <REPLACE>)`
+- Your target can be `str`, `int`, `float`, or `None` (there are other possibilities, but those are too advanced for this course)
+- Returns a new `DataFrame`; the original `DataFrame` is not modified.
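
As a small illustration of the syntax above, here is a sketch on a hypothetical two-column `albums` DataFrame (not the real dataset) in which empty strings mark missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical data: empty strings stand for missing entries
albums = pd.DataFrame({
    "title": ["", "Euphoric Hardstyle", ""],
    "song_name": ["Pathology", "", "BRAINFOOD"],
})

# replace returns a new DataFrame with every "" swapped for NaN
cleaned = albums.replace("", np.nan)
print(cleaned)

# The original DataFrame still holds the empty strings
print(albums.loc[0, "title"] == "")
```

Because `replace` does not modify its input, the result has to be assigned to a variable (or back to the same name) to be kept.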
+
+Let's now replace the missing values (empty strings) with `np.nan`.
+
+%% Cell type:code id: tags:
+
+``` python
+df = df.replace("", np.nan)
+df.head(10) # title is the album name
+```
+
+%% Output
+
+                           title             song_name      genre  duration_ms  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p   NaN             Pathology  Dark Trap       224427
+    0vSWgAlfpye0WCGeNmuNhy   NaN              Symbiote  Dark Trap        98821
+    7EL7ifncK2PWFYThJjzR25   NaN             BRAINFOOD  Dark Trap       101172
+    1umsRbM7L4ju7rn9aU8Ju6   NaN             Sacrifice  Dark Trap        96062
+    4SKqOHKYU5pgHr5UiVKiQN   NaN              Backpack  Dark Trap       135079
+    3uE1swbcRp5BrO64UNy6Ex   NaN     TakingOutTheTrash  Dark Trap       192833
+    3KJrwOuqiEwHq6QTreZT61   NaN           Io sono qui  Dark Trap       180880
+    4QhUXx4ON40TIBrZIlnIke   NaN                Murder  Dark Trap       186261
+    09320vyX4qHd4GjHIpy5w0   NaN        High 'N Mighty  Dark Trap       124676
+    6xEnbXM1us9fDJy2LC0lru   NaN  Bang Ya Fucking Head  Dark Trap       154929
+    
+                            key  mode  time_signature    tempo  acousticness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    8     1               4  115.080        0.4010
+    0vSWgAlfpye0WCGeNmuNhy    5     1               4  218.050        0.0138
+    7EL7ifncK2PWFYThJjzR25    8     1               4  189.938        0.1870
+    1umsRbM7L4ju7rn9aU8Ju6   10     0               4  139.990        0.1450
+    4SKqOHKYU5pgHr5UiVKiQN    5     1               4  128.014        0.0077
+    3uE1swbcRp5BrO64UNy6Ex   11     1               4  120.004        0.1720
+    3KJrwOuqiEwHq6QTreZT61   10     0               4  128.066        0.0987
+    4QhUXx4ON40TIBrZIlnIke    0     1               4  114.956        0.0343
+    09320vyX4qHd4GjHIpy5w0    7     1               5  111.958        0.1120
+    6xEnbXM1us9fDJy2LC0lru    1     1               4  125.013        0.0525
+    
+                            danceability  energy  instrumentalness  liveness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p         0.719   0.493          0.000000    0.1180
+    0vSWgAlfpye0WCGeNmuNhy         0.850   0.893          0.000004    0.3720
+    7EL7ifncK2PWFYThJjzR25         0.864   0.365          0.000000    0.1160
+    1umsRbM7L4ju7rn9aU8Ju6         0.767   0.576          0.000003    0.0968
+    4SKqOHKYU5pgHr5UiVKiQN         0.765   0.726          0.000000    0.6190
+    3uE1swbcRp5BrO64UNy6Ex         0.814   0.575          0.000291    0.1090
+    3KJrwOuqiEwHq6QTreZT61         0.812   0.813          0.000150    0.0758
+    4QhUXx4ON40TIBrZIlnIke         0.602   0.578          0.000000    0.1640
+    09320vyX4qHd4GjHIpy5w0         0.876   0.768          0.000012    0.2830
+    6xEnbXM1us9fDJy2LC0lru         0.690   0.760          0.000000    0.1340
+    
+                            loudness  speechiness  valence
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    -7.230       0.0794   0.1240
+    0vSWgAlfpye0WCGeNmuNhy    -4.783       0.0623   0.0391
+    7EL7ifncK2PWFYThJjzR25   -10.219       0.0655   0.0478
+    1umsRbM7L4ju7rn9aU8Ju6    -9.683       0.2560   0.1870
+    4SKqOHKYU5pgHr5UiVKiQN    -5.580       0.1910   0.2700
+    3uE1swbcRp5BrO64UNy6Ex    -9.635       0.0635   0.2880
+    3KJrwOuqiEwHq6QTreZT61    -5.583       0.0984   0.3480
+    4QhUXx4ON40TIBrZIlnIke    -5.610       0.0283   0.1560
+    09320vyX4qHd4GjHIpy5w0    -6.606       0.2010   0.7200
+    6xEnbXM1us9fDJy2LC0lru    -5.431       0.0895   0.0797
+
+%% Cell type:markdown id: tags:
+
+### Checking for missing values
+
+Syntax: `Series.isna()`
+- Returns a boolean Series
+
+Let's check whether any `song_name` values are missing.
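
A minimal sketch of why `isna` is the tool for this check, using a hypothetical four-element `song_names` Series rather than the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries
song_names = pd.Series(["Pathology", np.nan, "BRAINFOOD", np.nan])

# isna returns a boolean Series: True wherever the value is missing
missing = song_names.isna()
print(missing.tolist())

# NaN never compares equal to itself, so == is not a reliable test
print(np.nan == np.nan)

# True counts as 1, so the sum is the number of missing entries
print(missing.sum())
```

Summing the boolean Series is one quick way to count the missing values.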
+
+%% Cell type:code id: tags:
+
+``` python
+df["song_name"].isna()
+```
+
+%% Output
+
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    False
+    0vSWgAlfpye0WCGeNmuNhy    False
+    7EL7ifncK2PWFYThJjzR25    False
+    1umsRbM7L4ju7rn9aU8Ju6    False
+    4SKqOHKYU5pgHr5UiVKiQN    False
+                              ...
+    46bXU7Sgj7104ZoXxzz9tM     True
+    0he2ViGMUO3ajKTxLOfWVT     True
+    72DAt9Lbpy9EUS29OzQLob     True
+    6HXgExFVuE1c3cq9QjFCcU     True
+    6MAAMZImxcvYhRnxDLTufD     True
+    Name: song_name, Length: 35877, dtype: bool
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.value_counts()`
+- Returns a new `Series` whose index holds the unique values of the original `Series` and whose values are the counts of those unique values.
+- The returned `Series` is sorted in descending order of counts.
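
As a minimal sketch of how `isna` and `value_counts` combine, the snippet below uses a tiny made-up `Series` (the `names` data is hypothetical, not taken from the Spotify table). Because `True` counts as 1, summing the boolean mask is another way to count the missing entries.

```python
import pandas as pd

# Hypothetical miniature stand-in for the "song_name" column
names = pd.Series(["Pathology", None, "Symbiote", None, None])

missing_mask = names.isna()                    # boolean Series: True where the value is missing
missing_counts = missing_mask.value_counts()   # how many True and how many False
num_missing = int(missing_mask.sum())          # True counts as 1, so the sum is the number missing

print(missing_counts)
print(num_missing)
```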
+
+%% Cell type:code id: tags:
+
+``` python
+# count the number of missing values for song name
+df["song_name"].isna().value_counts()
+```
+
+%% Output
+
+    False    18342
+    True     17535
+    Name: song_name, dtype: int64
+
+%% Cell type:markdown id: tags:
+
+### Missing value manipulation
+Syntax: `df.fillna(<REPLACE>)` or `Series.fillna(<REPLACE>)`
+- Returns a new `DataFrame` (or `Series`) object with the missing values replaced; the original object is not modified.
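
A small sketch of the "returns a new object" behavior, using a made-up `Series` (the `names` data is an assumption for illustration): the call to `fillna` leaves the original untouched until the result is assigned back.

```python
import pandas as pd

# Hypothetical data with one missing value
names = pd.Series(["Pathology", None, "Symbiote"])

filled = names.fillna("No Song Name")     # new Series; `names` is unchanged
still_missing = int(names.isna().sum())   # the original still has its missing value

names = names.fillna("No Song Name")      # assign the result back to keep the change
missing_after = int(names.isna().sum())
```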
+
+%% Cell type:code id: tags:
+
+``` python
+# .fillna returns a new Series with the missing values replaced; df itself is unchanged here
+df["song_name"].fillna("No Song Name")
+
+# to update the DataFrame's column, assign the returned Series back to it
+df["song_name"] = df["song_name"].fillna("No Song Name")
+df
+```
+
+%% Output
+
+                                                  title     song_name      genre  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p                          NaN     Pathology  Dark Trap
+    0vSWgAlfpye0WCGeNmuNhy                          NaN      Symbiote  Dark Trap
+    7EL7ifncK2PWFYThJjzR25                          NaN     BRAINFOOD  Dark Trap
+    1umsRbM7L4ju7rn9aU8Ju6                          NaN     Sacrifice  Dark Trap
+    4SKqOHKYU5pgHr5UiVKiQN                          NaN      Backpack  Dark Trap
+    ...                                             ...           ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle  No Song Name  hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist  No Song Name  hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020  No Song Name  hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle  No Song Name  hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020  No Song Name  hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p       224427    8     1               4  115.080
+    0vSWgAlfpye0WCGeNmuNhy        98821    5     1               4  218.050
+    7EL7ifncK2PWFYThJjzR25       101172    8     1               4  189.938
+    1umsRbM7L4ju7rn9aU8Ju6        96062   10     0               4  139.990
+    4SKqOHKYU5pgHr5UiVKiQN       135079    5     1               4  128.014
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p      0.401000         0.719   0.493          0.000000
+    0vSWgAlfpye0WCGeNmuNhy      0.013800         0.850   0.893          0.000004
+    7EL7ifncK2PWFYThJjzR25      0.187000         0.864   0.365          0.000000
+    1umsRbM7L4ju7rn9aU8Ju6      0.145000         0.767   0.576          0.000003
+    4SKqOHKYU5pgHr5UiVKiQN      0.007700         0.765   0.726          0.000000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    0.1180    -7.230       0.0794   0.1240
+    0vSWgAlfpye0WCGeNmuNhy    0.3720    -4.783       0.0623   0.0391
+    7EL7ifncK2PWFYThJjzR25    0.1160   -10.219       0.0655   0.0478
+    1umsRbM7L4ju7rn9aU8Ju6    0.0968    -9.683       0.2560   0.1870
+    4SKqOHKYU5pgHr5UiVKiQN    0.6190    -5.580       0.1910   0.2700
+    ...                          ...       ...          ...      ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304   0.3940
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479   0.3830
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505   0.1240
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392   0.4880
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615   0.1340
+    
+    [35877 rows x 16 columns]
+
+%% Cell type:markdown id: tags:
+
+### Dropping missing values
+Syntax: `df.dropna()`
+- Returns a new `DataFrame` object without the rows that contain at least one missing value; the original is not modified.
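
Since `dropna()` checks every column, a row is dropped even when only its "title" is missing. The sketch below, on a made-up three-row table (the `songs` data is hypothetical), contrasts that default with the `subset` parameter, which restricts the check to the listed columns.

```python
import pandas as pd

# Hypothetical miniature version of the songs table
songs = pd.DataFrame({
    "title": [None, None, "Euphoric Hardstyle"],
    "song_name": ["Pathology", None, "No Song Name"],
    "genre": ["Dark Trap", "Dark Trap", "hardstyle"],
})

any_missing_dropped = songs.dropna()                        # drops rows with NaN in any column
name_missing_dropped = songs.dropna(subset=["song_name"])   # only looks at "song_name"

print(len(any_missing_dropped), len(name_missing_dropped))
```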
+
+%% Cell type:code id: tags:
+
+``` python
+# .dropna drops every row that contains NaN in any column
+df.dropna()
+```
+
+%% Output
+
+                                                  title     song_name      genre  \
+    id
+    5LzAV6KfjN8VhWCedeygfY            Dirtybird Players  No Song Name  techhouse
+    3TsCb6ueD678XBJDiRrvhr                   tech house  No Song Name  techhouse
+    6Y0Fy2buEis7bEOlG0QET1           Tech House Bangerz  No Song Name  techhouse
+    4EJI2XGViSQp6WscLKgYDD                   tech house  No Song Name  techhouse
+    4x6VzOQTLIrkkCWcDPh5Y0           blanc | Tech House  No Song Name  techhouse
+    ...                                             ...           ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle  No Song Name  hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist  No Song Name  hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020  No Song Name  hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle  No Song Name  hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020  No Song Name  hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    5LzAV6KfjN8VhWCedeygfY       197499    7     1               4  127.997
+    3TsCb6ueD678XBJDiRrvhr       206000   10     1               4  124.994
+    6Y0Fy2buEis7bEOlG0QET1       199839    4     0               4  124.006
+    4EJI2XGViSQp6WscLKgYDD       173861    8     1               4  125.031
+    4x6VzOQTLIrkkCWcDPh5Y0       394960    8     0               4  127.029
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    5LzAV6KfjN8VhWCedeygfY      0.000957         0.806   0.950          0.920000
+    3TsCb6ueD678XBJDiRrvhr      0.062300         0.729   0.978          0.908000
+    6Y0Fy2buEis7bEOlG0QET1      0.019100         0.724   0.792          0.812000
+    4EJI2XGViSQp6WscLKgYDD      0.053000         0.700   0.898          0.418000
+    4x6VzOQTLIrkkCWcDPh5Y0      0.000301         0.803   0.919          0.926000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence
+    id
+    5LzAV6KfjN8VhWCedeygfY    0.1130    -6.782       0.0811    0.580
+    3TsCb6ueD678XBJDiRrvhr    0.0353    -6.645       0.0420    0.778
+    6Y0Fy2buEis7bEOlG0QET1    0.1080    -8.555       0.0405    0.346
+    4EJI2XGViSQp6WscLKgYDD    0.5740    -6.099       0.2570    0.791
+    4x6VzOQTLIrkkCWcDPh5Y0    0.1020    -8.667       0.0702    0.754
+    ...                          ...       ...          ...      ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304    0.394
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479    0.383
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505    0.124
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392    0.488
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615    0.134
+    
+    [17529 rows x 16 columns]
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.apply(...)`
+Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
+- Applies the input function to every element of the `Series`.
+- Returns a new `Series` object instance reference.
+
+Let's apply a transformation function to the `mode` column `Series`:
+- mode = 1 means major modality (sounds happy)
+- mode = 0 means minor modality (sounds sad)
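
For a fixed value-to-label translation like this one, a dictionary passed to `Series.map` is a possible alternative to `apply`. This is only a sketch on made-up data (`mode` and `mode_names` are hypothetical names), and the two approaches differ for unexpected values: `map` yields `NaN` for a value that is not a key in the dictionary, whereas an `if`/`else` function sends everything that is not 1 to "minor".

```python
import pandas as pd

# Hypothetical stand-in for the "mode" column
mode = pd.Series([1, 1, 0, 1])

mode_names = {1: "major", 0: "minor"}   # lookup table: numeric mode -> label
mapped = mode.map(mode_names)           # dictionary-based translation
applied = mode.apply(lambda m: "major" if m == 1 else "minor")

same_labels = list(mapped) == list(applied)
```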
+
+%% Cell type:code id: tags:
+
+``` python
+def replace_mode(m):
+    if m == 1:
+        return "major"
+    else:
+        return "minor"
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"].apply(replace_mode)
+```
+
+%% Output
+
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    major
+    0vSWgAlfpye0WCGeNmuNhy    major
+    7EL7ifncK2PWFYThJjzR25    major
+    1umsRbM7L4ju7rn9aU8Ju6    minor
+    4SKqOHKYU5pgHr5UiVKiQN    major
+                              ...
+    46bXU7Sgj7104ZoXxzz9tM    major
+    0he2ViGMUO3ajKTxLOfWVT    minor
+    72DAt9Lbpy9EUS29OzQLob    major
+    6HXgExFVuE1c3cq9QjFCcU    minor
+    6MAAMZImxcvYhRnxDLTufD    major
+    Name: mode, Length: 35877, dtype: object
+
+%% Cell type:markdown id: tags:
+
+### `lambda`
+
+Let's pass a `lambda` function to `apply` instead of defining the `replace_mode` function.
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"].apply(lambda m: "major" if m == 1 else "minor")
+```
+
+%% Output
+
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    major
+    0vSWgAlfpye0WCGeNmuNhy    major
+    7EL7ifncK2PWFYThJjzR25    major
+    1umsRbM7L4ju7rn9aU8Ju6    minor
+    4SKqOHKYU5pgHr5UiVKiQN    major
+                              ...
+    46bXU7Sgj7104ZoXxzz9tM    major
+    0he2ViGMUO3ajKTxLOfWVT    minor
+    72DAt9Lbpy9EUS29OzQLob    major
+    6HXgExFVuE1c3cq9QjFCcU    minor
+    6MAAMZImxcvYhRnxDLTufD    major
+    Name: mode, Length: 35877, dtype: object
+
+%% Cell type:markdown id: tags:
+
+Typically, transformed columns are added as new columns of the `DataFrame`.
+Let's add a new `modified_mode` column.
+
+%% Cell type:code id: tags:
+
+``` python
+df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
+df
+```
+
+%% Output
+
+                                                  title     song_name      genre  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p                          NaN     Pathology  Dark Trap
+    0vSWgAlfpye0WCGeNmuNhy                          NaN      Symbiote  Dark Trap
+    7EL7ifncK2PWFYThJjzR25                          NaN     BRAINFOOD  Dark Trap
+    1umsRbM7L4ju7rn9aU8Ju6                          NaN     Sacrifice  Dark Trap
+    4SKqOHKYU5pgHr5UiVKiQN                          NaN      Backpack  Dark Trap
+    ...                                             ...           ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle  No Song Name  hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist  No Song Name  hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020  No Song Name  hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle  No Song Name  hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020  No Song Name  hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p       224427    8     1               4  115.080
+    0vSWgAlfpye0WCGeNmuNhy        98821    5     1               4  218.050
+    7EL7ifncK2PWFYThJjzR25       101172    8     1               4  189.938
+    1umsRbM7L4ju7rn9aU8Ju6        96062   10     0               4  139.990
+    4SKqOHKYU5pgHr5UiVKiQN       135079    5     1               4  128.014
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p      0.401000         0.719   0.493          0.000000
+    0vSWgAlfpye0WCGeNmuNhy      0.013800         0.850   0.893          0.000004
+    7EL7ifncK2PWFYThJjzR25      0.187000         0.864   0.365          0.000000
+    1umsRbM7L4ju7rn9aU8Ju6      0.145000         0.767   0.576          0.000003
+    4SKqOHKYU5pgHr5UiVKiQN      0.007700         0.765   0.726          0.000000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence modified_mode
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    0.1180    -7.230       0.0794   0.1240         major
+    0vSWgAlfpye0WCGeNmuNhy    0.3720    -4.783       0.0623   0.0391         major
+    7EL7ifncK2PWFYThJjzR25    0.1160   -10.219       0.0655   0.0478         major
+    1umsRbM7L4ju7rn9aU8Ju6    0.0968    -9.683       0.2560   0.1870         minor
+    4SKqOHKYU5pgHr5UiVKiQN    0.6190    -5.580       0.1910   0.2700         major
+    ...                          ...       ...          ...      ...           ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304   0.3940         major
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479   0.3830         minor
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505   0.1240         major
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392   0.4880         minor
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615   0.1340         major
+    
+    [35877 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+#### Let's go back to the original table from the SQL database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * FROM spotify")
+df
+```
+
+%% Output
+
+                               id                        title  song_name  \
+    0      7pgJBLVz5VmnL7uGHmRj6p                               Pathology
+    1      0vSWgAlfpye0WCGeNmuNhy                                Symbiote
+    2      7EL7ifncK2PWFYThJjzR25                               BRAINFOOD
+    3      1umsRbM7L4ju7rn9aU8Ju6                               Sacrifice
+    4      4SKqOHKYU5pgHr5UiVKiQN                                Backpack
+    ...                       ...                          ...        ...
+    35872  46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle
+    35873  0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist
+    35874  72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020
+    35875  6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle
+    35876  6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020
+    
+               genre  duration_ms  key  mode  time_signature    tempo  \
+    0      Dark Trap       224427    8     1               4  115.080
+    1      Dark Trap        98821    5     1               4  218.050
+    2      Dark Trap       101172    8     1               4  189.938
+    3      Dark Trap        96062   10     0               4  139.990
+    4      Dark Trap       135079    5     1               4  128.014
+    ...          ...          ...  ...   ...             ...      ...
+    35872  hardstyle       269208    4     1               4  150.013
+    35873  hardstyle       210112    0     0               4  149.928
+    35874  hardstyle       234823    8     1               4  154.935
+    35875  hardstyle       323200    6     0               4  150.042
+    35876  hardstyle       162161    9     1               4  155.047
+    
+           acousticness  danceability  energy  instrumentalness  liveness  \
+    0          0.401000         0.719   0.493          0.000000    0.1180
+    1          0.013800         0.850   0.893          0.000004    0.3720
+    2          0.187000         0.864   0.365          0.000000    0.1160
+    3          0.145000         0.767   0.576          0.000003    0.0968
+    4          0.007700         0.765   0.726          0.000000    0.6190
+    ...             ...           ...     ...               ...       ...
+    35872      0.031500         0.528   0.693          0.000345    0.1210
+    35873      0.022500         0.517   0.768          0.000018    0.2050
+    35874      0.026000         0.361   0.821          0.000242    0.3850
+    35875      0.000551         0.477   0.921          0.029600    0.0575
+    35876      0.001890         0.529   0.945          0.000055    0.4140
+    
+           loudness  speechiness  valence
+    0        -7.230       0.0794   0.1240
+    1        -4.783       0.0623   0.0391
+    2       -10.219       0.0655   0.0478
+    3        -9.683       0.2560   0.1870
+    4        -5.580       0.1910   0.2700
+    ...         ...          ...      ...
+    35872    -5.148       0.0304   0.3940
+    35873    -7.922       0.0479   0.3830
+    35874    -3.102       0.0505   0.1240
+    35875    -4.777       0.0392   0.4880
+    35876    -5.862       0.0615   0.1340
+    
+    [35877 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+Extract just the "genre" and "duration_ms" columns from `df`.
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Output
+
+               genre  duration_ms
+    0      Dark Trap       224427
+    1      Dark Trap        98821
+    2      Dark Trap       101172
+    3      Dark Trap        96062
+    4      Dark Trap       135079
+    ...          ...          ...
+    35872  hardstyle       269208
+    35873  hardstyle       210112
+    35874  hardstyle       234823
+    35875  hardstyle       323200
+    35876  hardstyle       162161
+    
+    [35877 rows x 2 columns]
+
+%% Cell type:markdown id: tags:
+
+### `pandas.DataFrame.groupby(...)`
+
+Syntax: `DataFrame.groupby(<COLUMN>)`
+- Returns a `DataFrameGroupBy` object instance reference
+- You need to call an aggregation method (for example, `mean`, `max`, or `count`) on the return value of `groupby` to get a result
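
A minimal sketch of that two-step pattern on a made-up table (the `songs` data is hypothetical): `groupby` only records how the rows are split, and the aggregation call produces the actual numbers.

```python
import pandas as pd

# Hypothetical miniature songs table
songs = pd.DataFrame({
    "genre": ["trap", "trap", "techno"],
    "duration_ms": [200000, 220000, 400000],
})

groups = songs.groupby("genre")                # DataFrameGroupBy: nothing is aggregated yet
group_sizes = groups.size()                    # number of rows in each genre
mean_duration = groups["duration_ms"].mean()   # one average per genre

print(group_sizes)
print(mean_duration)
```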
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]].groupby("genre")
+```
+
+%% Output
+
+    <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbc472bad90>
+
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, in decreasing order of the averages?
+#### v1: using `df` (`pandas`) to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]].groupby("genre").mean()
+```
+
+%% Output
+
+                       duration_ms
+    genre
+    Dark Trap        196059.938997
+    Emo              218370.989519
+    Hiphop           227885.028411
+    Pop              211558.052980
+    Rap              200816.798836
+    RnB              225628.556955
+    Trap Metal       145940.519467
+    Underground Rap  175506.191224
+    dnb              288860.138811
+    hardstyle        232828.626542
+    psytrance        445770.492075
+    techhouse        298395.587596
+    techno           399123.187453
+    trance           288729.366262
+    trap             225149.277731
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]].groupby("genre").mean().sort_values(by = "duration_ms", ascending = False)
+```
+
+%% Output
+
+                       duration_ms
+    genre
+    psytrance        445770.492075
+    techno           399123.187453
+    techhouse        298395.587596
+    dnb              288860.138811
+    trance           288729.366262
+    hardstyle        232828.626542
+    Hiphop           227885.028411
+    RnB              225628.556955
+    trap             225149.277731
+    Emo              218370.989519
+    Pop              211558.052980
+    Rap              200816.798836
+    Dark Trap        196059.938997
+    Underground Rap  175506.191224
+    Trap Metal       145940.519467
+
+%% Cell type:markdown id: tags:
+
+One way to check the groups that `groupby` produces is to call `value_counts` on the same column `Series`: each genre should appear in both results.
+
+%% Cell type:code id: tags:
+
+``` python
+df["genre"].value_counts()
+```
+
+%% Output
+
+    Underground Rap    4330
+    Dark Trap          3590
+    Hiphop             3027
+    trance             2804
+    psytrance          2650
+    techno             2646
+    dnb                2507
+    trap               2362
+    hardstyle          2351
+    techhouse          2209
+    RnB                1905
+    Trap Metal         1875
+    Emo                1622
+    Rap                1546
+    Pop                 453
+    Name: genre, dtype: int64
+
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, in decreasing order of the averages?
+#### v2: using SQL query to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+avg_duration_per_genre = qry("""
+SELECT genre, AVG(duration_ms) as avg_duration
+FROM spotify
+GROUP BY genre
+ORDER BY avg_duration DESC
+""")
+
+# How can we make the SQL query output look like the df.groupby output? Set "genre" as the index
+avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
+avg_duration_per_genre
+```
+
+%% Output
+
+                      avg_duration
+    genre
+    psytrance        445770.492075
+    techno           399123.187453
+    techhouse        298395.587596
+    dnb              288860.138811
+    trance           288729.366262
+    hardstyle        232828.626542
+    Hiphop           227885.028411
+    RnB              225628.556955
+    trap             225149.277731
+    Emo              218370.989519
+    Pop              211558.052980
+    Rap              200816.798836
+    Dark Trap        196059.938997
+    Underground Rap  175506.191224
+    Trap Metal       145940.519467
+
+%% Cell type:markdown id: tags:
+
+### What is the average speechiness for each mode, time signature pair?
+#### v1: pandas
+
+%% Cell type:code id: tags:
+
+``` python
+# use a list to indicate all the columns you want to groupby
+df[["mode", "time_signature", "speechiness"]].groupby(["mode", "time_signature"]).mean()
+```
+
+%% Output
+
+                         speechiness
+    mode time_signature
+    0    1                  0.181224
+         3                  0.121837
+         4                  0.126688
+         5                  0.204890
+    1    1                  0.173138
+         3                  0.129512
+         4                  0.139170
+         5                  0.220177
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+qry("""
+SELECT mode, time_signature, AVG(speechiness) as avg_speechiness
+FROM spotify
+GROUP BY mode, time_signature
+""")
+```
+
+%% Output
+
+       mode  time_signature  avg_speechiness
+    0     0               1         0.181224
+    1     0               3         0.121837
+    2     0               4         0.126688
+    3     0               5         0.204890
+    4     1               1         0.173138
+    5     1               3         0.129512
+    6     1               4         0.139170
+    7     1               5         0.220177
+
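
The pandas result above carries `mode` and `time_signature` in the index, while the SQL result has them as ordinary columns. As a sketch on made-up data (the `songs` table here is hypothetical), `reset_index` moves the group keys back into columns so the two layouts line up; the `rename` call only mirrors the `avg_speechiness` alias used in the query.

```python
import pandas as pd

# Hypothetical miniature table with the three relevant columns
songs = pd.DataFrame({
    "mode": [0, 0, 1, 1],
    "time_signature": [4, 4, 4, 3],
    "speechiness": [0.1, 0.3, 0.2, 0.4],
})

grouped = songs.groupby(["mode", "time_signature"]).mean()   # group keys end up in the index
flat = grouped.reset_index().rename(columns={"speechiness": "avg_speechiness"})

print(flat)
```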
+%% Cell type:markdown id: tags:
+
+### Self-practice
+
+%% Cell type:markdown id: tags:
+
+### Which songs have a tempo greater than 150, and what are their genres?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+fast_songs = df[df["tempo"] > 150]
+fast_songs[["song_name", "genre"]]
+```
+
+%% Output
+
+                   song_name      genre
+    1               Symbiote  Dark Trap
+    2              BRAINFOOD  Dark Trap
+    18     FunnyToSeeYouHere  Dark Trap
+    19                Killer  Dark Trap
+    20                   608  Dark Trap
+    ...                  ...        ...
+    35871                     hardstyle
+    35872                     hardstyle
+    35874                     hardstyle
+    35875                     hardstyle
+    35876                     hardstyle
+    
+    [13753 rows x 2 columns]
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+
+qry("""
+SELECT song_name, genre
+FROM spotify
+WHERE tempo > 150
+""")
+```
+
+%% Output
+
+                   song_name      genre
+    0               Symbiote  Dark Trap
+    1              BRAINFOOD  Dark Trap
+    2      FunnyToSeeYouHere  Dark Trap
+    3                 Killer  Dark Trap
+    4                    608  Dark Trap
+    ...                  ...        ...
+    13748                     hardstyle
+    13749                     hardstyle
+    13750                     hardstyle
+    13751                     hardstyle
+    13752                     hardstyle
+    
+    [13753 rows x 2 columns]
+
+%% Cell type:markdown id: tags:
+
+### What is the sum of danceability and liveness for "Hiphop" genre songs?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+hiphop_songs = df[df["genre"] == "Hiphop"]
+hiphop_songs["danceability"] + hiphop_songs["liveness"]
+```
+
+%% Output
+
+    15321    0.8416
+    15322    0.9201
+    15323    0.8580
+    15324    0.8240
+    15325    0.9348
+              ...
+    18343    0.6690
+    18344    0.5370
+    18345    0.8850
+    18346    0.8770
+    18347    0.8703
+    Length: 3027, dtype: float64
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+hiphop_songs = qry("""
+SELECT danceability + liveness as song_score
+FROM spotify
+WHERE genre = 'Hiphop'
+""")
+hiphop_songs["song_score"]
+```
+
+%% Output
+
+    0       0.8416
+    1       0.9201
+    2       0.8580
+    3       0.8240
+    4       0.9348
+             ...
+    3022    0.6690
+    3023    0.5370
+    3024    0.8850
+    3025    0.8770
+    3026    0.8703
+    Name: song_score, Length: 3027, dtype: float64
+
+%% Cell type:markdown id: tags:
+
+### Find all song_name values in ascending order of duration_ms. Eliminate songs that don't have a song_name
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+songs_by_duration = list(df.sort_values(by = "duration_ms")["song_name"])
+# [song for song in songs_by_duration if song != ""] # uncomment to see the output
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+songs_by_duration = qry("""
+SELECT song_name
+FROM spotify
+ORDER BY duration_ms
+""")
+songs_by_duration = list(songs_by_duration["song_name"])
+# [song for song in songs_by_duration if song != ""] # uncomment to see the output
+```
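
Both versions above leave the "no song_name" filter as a commented-out list comprehension. One possible way to do the filtering inside pandas or inside the query is sketched below. It assumes, as the commented-out filter does, that a missing name is stored as an empty string; the in-memory database, the `conn_demo` connection, and the three rows are made up for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database with a tiny "spotify" table
conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "song_name": ["Backpack", "", "Killer"],
    "duration_ms": [135079, 90000, 100000],
}).to_sql("spotify", conn_demo, index=False)

# pandas: keep rows with a non-empty name, then sort by duration
demo = pd.read_sql("SELECT * FROM spotify", conn_demo)
named = demo[demo["song_name"] != ""]
pandas_result = list(named.sort_values(by="duration_ms")["song_name"])

# SQL: the WHERE clause removes the unnamed rows before ordering
sql_result = list(pd.read_sql("""
SELECT song_name
FROM spotify
WHERE song_name != ''
ORDER BY duration_ms
""", conn_demo)["song_name"])

conn_demo.close()
```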
+
+%% Cell type:markdown id: tags:
+
+### How many distinct "genre"s are there in the dataset?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+list(set(list(df["genre"])))
+```
+
+%% Output
+
+    ['trance',
+     'techno',
+     'dnb',
+     'Trap Metal',
+     'RnB',
+     'Pop',
+     'psytrance',
+     'techhouse',
+     'trap',
+     'Dark Trap',
+     'Emo',
+     'Underground Rap',
+     'Rap',
+     'Hiphop',
+     'hardstyle']
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+genres = qry("""
+SELECT DISTINCT genre
+FROM spotify
+""")
+list(genres["genre"])
+```
+
+%% Output
+
+    ['Dark Trap',
+     'Underground Rap',
+     'Trap Metal',
+     'Emo',
+     'Rap',
+     'RnB',
+     'Pop',
+     'Hiphop',
+     'techhouse',
+     'techno',
+     'trance',
+     'psytrance',
+     'trap',
+     'dnb',
+     'hardstyle']
+
+%% Cell type:markdown id: tags:
+
+### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+high_energy_songs = df[df["energy"] > 0.5]
+genre_groups = high_energy_songs[["genre", "energy"]].groupby("genre")
+max_energy = genre_groups.max()
+max_energy["energy"]
+```
+
+%% Output
+
+    genre
+    Dark Trap          0.998
+    Emo                0.995
+    Hiphop             0.978
+    Pop                0.977
+    Rap                0.980
+    RnB                0.974
+    Trap Metal         0.999
+    Underground Rap    0.997
+    dnb                0.999
+    hardstyle          0.999
+    psytrance          0.999
+    techhouse          0.999
+    techno             1.000
+    trance             1.000
+    trap               1.000
+    Name: energy, dtype: float64
+
+%% Cell type:code id: tags:
+
+``` python
+genre_counts = genre_groups.count()
+genre_counts["energy_max"] = max_energy["energy"]
+filtered_genre_counts = genre_counts[genre_counts["energy"] > 2000]
+filtered_genre_counts
+```
+
+%% Output
+
+                     energy  energy_max
+    genre
+    Dark Trap          2757       0.998
+    Hiphop             2497       0.978
+    Underground Rap    3420       0.997
+    dnb                2496       0.999
+    hardstyle          2345       0.999
+    psytrance          2642       0.999
+    techhouse          2164       0.999
+    techno             2534       1.000
+    trance             2786       1.000
+    trap               2346       1.000
+
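
The count and the maximum can also be computed in a single `agg` call using pandas named aggregation, which gives the result columns the same names as the SQL aliases. This is a sketch on made-up data (the `songs` table is hypothetical, and the count threshold is lowered to 1 to suit the toy data), not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical miniature table
songs = pd.DataFrame({
    "genre": ["trap", "trap", "trap", "Pop", "Pop"],
    "energy": [0.9, 0.7, 0.4, 0.8, 0.3],
})

high_energy = songs[songs["energy"] > 0.5]       # like WHERE energy > 0.5
summary = high_energy.groupby("genre").agg(      # like GROUP BY genre
    song_count=("energy", "count"),
    energy_max=("energy", "max"),
)
filtered = summary[summary["song_count"] > 1]    # like HAVING song_count > 1

print(filtered)
```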
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+qry("""
+SELECT genre, COUNT(*) as song_count, MAX("energy") as energy_max
+FROM spotify
+WHERE energy > 0.5
+GROUP BY genre
+HAVING song_count > 2000
+""")
+```
+
+%% Output
+
+                 genre  song_count  energy_max
+    0        Dark Trap        2757       0.998
+    1           Hiphop        2497       0.978
+    2  Underground Rap        3420       0.997
+    3              dnb        2496       0.999
+    4        hardstyle        2345       0.999
+    5        psytrance        2642       0.999
+    6        techhouse        2164       0.999
+    7           techno        2534       1.000
+    8           trance        2786       1.000
+    9             trap        2346       1.000
+
+%% Cell type:code id: tags:
+
+``` python
+# Close the database connection here
+conn.close()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# known import statements
+import pandas as pd
+import sqlite3 as sql # note that we are renaming to sql
+import os
+
+# new import statement
+import numpy as np
+```
+
+%% Cell type:markdown id: tags:
+
+# Lecture 35 Pandas 3: Data Transformation
+* Data transformation is the process of changing the format, structure, or values of data.
+* Often needed during data cleaning and sometimes during data analysis
+
+%% Cell type:markdown id: tags:
+
+# Today's Learning Objectives:
+
+* Set a column as the index of a pandas `DataFrame`
+* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
+* Apply transformations to a `DataFrame`:
+  * Use `apply` on a pandas `Series` to apply a transformation function
+  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
+* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
+* Convert `groupby` examples to SQL
+* Solve the same question using SQL and pandas `DataFrame` manipulations:
+  * filtering, grouping, and aggregation / summarization
+
+%% Cell type:markdown id: tags:
+
+# The dataset: Spotify songs
+Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
+
+If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 1: Establish a connection to the spotify.db database
+
+%% Cell type:code id: tags:
+
+``` python
+# open up the spotify database
+db_pathname = "spotify.db"
+assert os.path.exists(db_pathname)
+conn = sql.connect(db_pathname)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def qry(query):
+    # the parameter is named query so that it does not shadow the sqlite3 alias sql
+    return pd.read_sql(query, conn)
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 2: Identify the table name(s) inside the database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * from sqlite_master")
+df
+```
+
+%% Output
+
+        type                        name tbl_name  rootpage  \
+    0  table                     spotify  spotify      1527
+    1  index  sqlite_autoindex_spotify_1  spotify      1528
+    
+                                                     sql
+    0  CREATE TABLE spotify(\nid TEXT PRIMARY KEY,\nt...
+    1                                               None
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 3: Use a pandas lookup expression to extract the "sql" column, then display the full query using an `.iloc` lookup
+
+%% Cell type:code id: tags:
+
+``` python
+print(df["sql"].iloc[0])
+```
+
+%% Output
+
+    CREATE TABLE spotify(
+    id TEXT PRIMARY KEY,
+    title BLOB,
+    song_name BLOB,
+    genre TEXT,
+    duration_ms INTEGER,
+    key INTEGER,
+    mode INTEGER,
+    time_signature INTEGER,
+    tempo REAL,
+    acousticness REAL,
+    danceability REAL,
+    energy REAL,
+    instrumentalness REAL,
+    liveness REAL,
+    loudness REAL,
+    speechiness REAL,
+    valence REAL)
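The two warmups above can be tried without `spotify.db`. The following is a minimal sketch, assuming a made-up two-column `songs` table in an in-memory SQLite database; it lists the schema objects and prints the `CREATE TABLE` statement of the first table.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical table, so no file is needed.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE songs(id TEXT PRIMARY KEY, tempo REAL)")

# sqlite_master has one row per schema object (tables, indexes, ...).
schema = pd.read_sql("SELECT type, name, tbl_name, sql FROM sqlite_master", demo_conn)
demo_conn.close()

# Boolean indexing keeps the table rows; .iloc[0] picks the first one by position.
tables = schema[schema["type"] == "table"]
create_stmt = tables["sql"].iloc[0]
print(tables["name"].tolist())
print(create_stmt)
```

Filtering on `type` first is a small safeguard: `sqlite_master` also lists indexes, and their `sql` entry can be `None`, as in the output of WARMUP 2.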
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * FROM spotify")
+df
+```
+
+%% Output
+
+                               id                        title  song_name  \
+    0      7pgJBLVz5VmnL7uGHmRj6p                               Pathology
+    1      0vSWgAlfpye0WCGeNmuNhy                                Symbiote
+    2      7EL7ifncK2PWFYThJjzR25                               BRAINFOOD
+    3      1umsRbM7L4ju7rn9aU8Ju6                               Sacrifice
+    4      4SKqOHKYU5pgHr5UiVKiQN                                Backpack
+    ...                       ...                          ...        ...
+    35872  46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle
+    35873  0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist
+    35874  72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020
+    35875  6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle
+    35876  6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020
+    
+               genre  duration_ms  key  mode  time_signature    tempo  \
+    0      Dark Trap       224427    8     1               4  115.080
+    1      Dark Trap        98821    5     1               4  218.050
+    2      Dark Trap       101172    8     1               4  189.938
+    3      Dark Trap        96062   10     0               4  139.990
+    4      Dark Trap       135079    5     1               4  128.014
+    ...          ...          ...  ...   ...             ...      ...
+    35872  hardstyle       269208    4     1               4  150.013
+    35873  hardstyle       210112    0     0               4  149.928
+    35874  hardstyle       234823    8     1               4  154.935
+    35875  hardstyle       323200    6     0               4  150.042
+    35876  hardstyle       162161    9     1               4  155.047
+    
+           acousticness  danceability  energy  instrumentalness  liveness  \
+    0          0.401000         0.719   0.493          0.000000    0.1180
+    1          0.013800         0.850   0.893          0.000004    0.3720
+    2          0.187000         0.864   0.365          0.000000    0.1160
+    3          0.145000         0.767   0.576          0.000003    0.0968
+    4          0.007700         0.765   0.726          0.000000    0.6190
+    ...             ...           ...     ...               ...       ...
+    35872      0.031500         0.528   0.693          0.000345    0.1210
+    35873      0.022500         0.517   0.768          0.000018    0.2050
+    35874      0.026000         0.361   0.821          0.000242    0.3850
+    35875      0.000551         0.477   0.921          0.029600    0.0575
+    35876      0.001890         0.529   0.945          0.000055    0.4140
+    
+           loudness  speechiness  valence
+    0        -7.230       0.0794   0.1240
+    1        -4.783       0.0623   0.0391
+    2       -10.219       0.0655   0.0478
+    3        -9.683       0.2560   0.1870
+    4        -5.580       0.1910   0.2700
+    ...         ...          ...      ...
+    35872    -5.148       0.0304   0.3940
+    35873    -7.922       0.0479   0.3830
+    35874    -3.102       0.0505   0.1240
+    35875    -4.777       0.0392   0.4880
+    35876    -5.862       0.0615   0.1340
+    
+    [35877 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+### Setting a column as row indices for the `DataFrame`
+
+- Syntax: `df.set_index("<COLUMN>")`
+- Returns a new DataFrame object instance reference.
+- WARNING: executing the cell below twice raises a `KeyError`. Once you set a column as the row index, it is no longer a column within the `DataFrame`. If this happens, go back and re-execute the cell above to reload `df`, then execute the cell below exactly once.
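The behaviour described in the warning can be reproduced safely on a small frame. This is a sketch with an invented three-row `demo` DataFrame. `reset_index` is not covered in this lecture; it is shown only as one possible way to turn the index back into a column.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the spotify data.
demo = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "genre": ["techno", "trap", "dnb"],
})

indexed = demo.set_index("id")   # returns a new DataFrame; demo is unchanged
print(list(demo.columns))        # "id" is still a column here
print(list(indexed.columns))     # "id" has moved to the index

# A second call fails because "id" is no longer a column of indexed.
try:
    indexed.set_index("id")
    second_call_failed = False
except KeyError:
    second_call_failed = True
print(second_call_failed)

# reset_index moves the index back into a regular column.
restored = indexed.reset_index()
print(list(restored.columns))
```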
+
+%% Cell type:code id: tags:
+
+``` python
+# Set the id column as row indices
+df = df.set_index("id")
+df
+```
+
+%% Output
+
+                                                  title  song_name      genre  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p                               Pathology  Dark Trap
+    0vSWgAlfpye0WCGeNmuNhy                                Symbiote  Dark Trap
+    7EL7ifncK2PWFYThJjzR25                               BRAINFOOD  Dark Trap
+    1umsRbM7L4ju7rn9aU8Ju6                               Sacrifice  Dark Trap
+    4SKqOHKYU5pgHr5UiVKiQN                                Backpack  Dark Trap
+    ...                                             ...        ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle             hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist             hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020             hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle             hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020             hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p       224427    8     1               4  115.080
+    0vSWgAlfpye0WCGeNmuNhy        98821    5     1               4  218.050
+    7EL7ifncK2PWFYThJjzR25       101172    8     1               4  189.938
+    1umsRbM7L4ju7rn9aU8Ju6        96062   10     0               4  139.990
+    4SKqOHKYU5pgHr5UiVKiQN       135079    5     1               4  128.014
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p      0.401000         0.719   0.493          0.000000
+    0vSWgAlfpye0WCGeNmuNhy      0.013800         0.850   0.893          0.000004
+    7EL7ifncK2PWFYThJjzR25      0.187000         0.864   0.365          0.000000
+    1umsRbM7L4ju7rn9aU8Ju6      0.145000         0.767   0.576          0.000003
+    4SKqOHKYU5pgHr5UiVKiQN      0.007700         0.765   0.726          0.000000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    0.1180    -7.230       0.0794   0.1240
+    0vSWgAlfpye0WCGeNmuNhy    0.3720    -4.783       0.0623   0.0391
+    7EL7ifncK2PWFYThJjzR25    0.1160   -10.219       0.0655   0.0478
+    1umsRbM7L4ju7rn9aU8Ju6    0.0968    -9.683       0.2560   0.1870
+    4SKqOHKYU5pgHr5UiVKiQN    0.6190    -5.580       0.1910   0.2700
+    ...                          ...       ...          ...      ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304   0.3940
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479   0.3830
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505   0.1240
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392   0.4880
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615   0.1340
+    
+    [35877 rows x 16 columns]
+
+%% Cell type:markdown id: tags:
+
+### Not a Number
+
+- `np.nan` is the floating point representation of Not a Number
+- You do not need to know / learn the details about the `numpy` package
+
+### Replacing / modifying values within the `DataFrame`
+
+Syntax: `df.replace(<TARGET>, <REPLACE>)`
+- Your target can be `str`, `int`, `float`, `None` (there are other possibilities, but those are too advanced for this course)
+- Returns a new DataFrame object instance reference.
+
+Let's now replace the missing values (empty strings) with `np.nan`
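To see what `replace` does before running it on the full table, here is a minimal sketch on an invented two-column frame. The column names mirror the spotify table, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which empty strings mark missing entries.
demo = pd.DataFrame({
    "title": ["", "Best of 2020", ""],
    "song_name": ["Pathology", "", "Symbiote"],
})

cleaned = demo.replace("", np.nan)   # new DataFrame; demo keeps its empty strings
print(cleaned)
print(cleaned.isna().sum())          # number of missing values per column
```

Only cells whose entire value equals the target are replaced, so a non-empty title such as "Best of 2020" is left untouched.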
+
+%% Cell type:code id: tags:
+
+``` python
+# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
+df = df.replace("", np.nan)
+df.head(10) # title is the album name
+```
+
+%% Output
+
+                           title             song_name      genre  duration_ms  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p   NaN             Pathology  Dark Trap       224427
+    0vSWgAlfpye0WCGeNmuNhy   NaN              Symbiote  Dark Trap        98821
+    7EL7ifncK2PWFYThJjzR25   NaN             BRAINFOOD  Dark Trap       101172
+    1umsRbM7L4ju7rn9aU8Ju6   NaN             Sacrifice  Dark Trap        96062
+    4SKqOHKYU5pgHr5UiVKiQN   NaN              Backpack  Dark Trap       135079
+    3uE1swbcRp5BrO64UNy6Ex   NaN     TakingOutTheTrash  Dark Trap       192833
+    3KJrwOuqiEwHq6QTreZT61   NaN           Io sono qui  Dark Trap       180880
+    4QhUXx4ON40TIBrZIlnIke   NaN                Murder  Dark Trap       186261
+    09320vyX4qHd4GjHIpy5w0   NaN        High 'N Mighty  Dark Trap       124676
+    6xEnbXM1us9fDJy2LC0lru   NaN  Bang Ya Fucking Head  Dark Trap       154929
+    
+                            key  mode  time_signature    tempo  acousticness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    8     1               4  115.080        0.4010
+    0vSWgAlfpye0WCGeNmuNhy    5     1               4  218.050        0.0138
+    7EL7ifncK2PWFYThJjzR25    8     1               4  189.938        0.1870
+    1umsRbM7L4ju7rn9aU8Ju6   10     0               4  139.990        0.1450
+    4SKqOHKYU5pgHr5UiVKiQN    5     1               4  128.014        0.0077
+    3uE1swbcRp5BrO64UNy6Ex   11     1               4  120.004        0.1720
+    3KJrwOuqiEwHq6QTreZT61   10     0               4  128.066        0.0987
+    4QhUXx4ON40TIBrZIlnIke    0     1               4  114.956        0.0343
+    09320vyX4qHd4GjHIpy5w0    7     1               5  111.958        0.1120
+    6xEnbXM1us9fDJy2LC0lru    1     1               4  125.013        0.0525
+    
+                            danceability  energy  instrumentalness  liveness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p         0.719   0.493          0.000000    0.1180
+    0vSWgAlfpye0WCGeNmuNhy         0.850   0.893          0.000004    0.3720
+    7EL7ifncK2PWFYThJjzR25         0.864   0.365          0.000000    0.1160
+    1umsRbM7L4ju7rn9aU8Ju6         0.767   0.576          0.000003    0.0968
+    4SKqOHKYU5pgHr5UiVKiQN         0.765   0.726          0.000000    0.6190
+    3uE1swbcRp5BrO64UNy6Ex         0.814   0.575          0.000291    0.1090
+    3KJrwOuqiEwHq6QTreZT61         0.812   0.813          0.000150    0.0758
+    4QhUXx4ON40TIBrZIlnIke         0.602   0.578          0.000000    0.1640
+    09320vyX4qHd4GjHIpy5w0         0.876   0.768          0.000012    0.2830
+    6xEnbXM1us9fDJy2LC0lru         0.690   0.760          0.000000    0.1340
+    
+                            loudness  speechiness  valence
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    -7.230       0.0794   0.1240
+    0vSWgAlfpye0WCGeNmuNhy    -4.783       0.0623   0.0391
+    7EL7ifncK2PWFYThJjzR25   -10.219       0.0655   0.0478
+    1umsRbM7L4ju7rn9aU8Ju6    -9.683       0.2560   0.1870
+    4SKqOHKYU5pgHr5UiVKiQN    -5.580       0.1910   0.2700
+    3uE1swbcRp5BrO64UNy6Ex    -9.635       0.0635   0.2880
+    3KJrwOuqiEwHq6QTreZT61    -5.583       0.0984   0.3480
+    4QhUXx4ON40TIBrZIlnIke    -5.610       0.0283   0.1560
+    09320vyX4qHd4GjHIpy5w0    -6.606       0.2010   0.7200
+    6xEnbXM1us9fDJy2LC0lru    -5.431       0.0895   0.0797
+
+%% Cell type:markdown id: tags:
+
+### Checking for missing values
+
+Syntax: `Series.isna()`
+- Returns a boolean Series
+
+Let's check whether any of the "song_name" values are missing
+
+%% Cell type:code id: tags:
+
+``` python
+df["song_name"].isna()
+```
+
+%% Output
+
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    False
+    0vSWgAlfpye0WCGeNmuNhy    False
+    7EL7ifncK2PWFYThJjzR25    False
+    1umsRbM7L4ju7rn9aU8Ju6    False
+    4SKqOHKYU5pgHr5UiVKiQN    False
+                              ...
+    46bXU7Sgj7104ZoXxzz9tM     True
+    0he2ViGMUO3ajKTxLOfWVT     True
+    72DAt9Lbpy9EUS29OzQLob     True
+    6HXgExFVuE1c3cq9QjFCcU     True
+    6MAAMZImxcvYhRnxDLTufD     True
+    Name: song_name, Length: 35877, dtype: bool
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.value_counts()`
+- Returns a new `Series` with unique values from the original `Series` as keys and the count of those unique values as values.
+- The returned `Series` is sorted in descending order of counts
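Combining `isna` with `value_counts` can be sketched on a short, invented `Series`. The `.sum()` shortcut at the end is an extra that is not part of the lecture cells.

```python
import numpy as np
import pandas as pd

# Hypothetical song names with three missing entries.
names = pd.Series(["Pathology", np.nan, "Symbiote", np.nan, np.nan])

missing = names.isna()            # boolean Series, True where the value is missing
counts = missing.value_counts()   # counts of True / False, most frequent first
print(counts)

# True counts as 1 when summed, so .sum() gives the number of missing values directly.
n_missing = missing.sum()
print(n_missing)
```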
+
+%% Cell type:code id: tags:
+
+``` python
+# count the number of missing values for song name
+df["song_name"].isna().value_counts()
+```
+
+%% Output
+
+    False    18342
+    True     17535
+    Name: song_name, dtype: int64
+
+%% Cell type:markdown id: tags:
+
+### Missing value manipulation
+Syntax: `df.fillna(<REPLACE>)`
+- Returns a new DataFrame object instance reference (or a new `Series` when called on a single column, as in the cell below).
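Because `fillna` returns a new object, the original column only changes once the result is assigned back. A minimal sketch with an invented one-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing song name.
demo = pd.DataFrame({"song_name": ["Pathology", np.nan, "Symbiote"]})

# The result is a new Series; demo itself still contains the NaN at this point.
filled = demo["song_name"].fillna("No Song Name")
missing_before = int(demo["song_name"].isna().sum())
print(missing_before)

# Assigning the result back to the column is what updates the DataFrame.
demo["song_name"] = filled
missing_after = int(demo["song_name"].isna().sum())
print(missing_after, demo["song_name"].tolist())
```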
+
+%% Cell type:code id: tags:
+
+``` python
+# use .fillna to replace missing values
+df["song_name"].fillna("No Song Name")
+
+# to replace the original DataFrame's column, you need to explicitly update that object instance
+df["song_name"] = df["song_name"].fillna("No Song Name")
+df
+```
+
+%% Output
+
+                                                  title     song_name      genre  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p                          NaN     Pathology  Dark Trap
+    0vSWgAlfpye0WCGeNmuNhy                          NaN      Symbiote  Dark Trap
+    7EL7ifncK2PWFYThJjzR25                          NaN     BRAINFOOD  Dark Trap
+    1umsRbM7L4ju7rn9aU8Ju6                          NaN     Sacrifice  Dark Trap
+    4SKqOHKYU5pgHr5UiVKiQN                          NaN      Backpack  Dark Trap
+    ...                                             ...           ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle  No Song Name  hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist  No Song Name  hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020  No Song Name  hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle  No Song Name  hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020  No Song Name  hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p       224427    8     1               4  115.080
+    0vSWgAlfpye0WCGeNmuNhy        98821    5     1               4  218.050
+    7EL7ifncK2PWFYThJjzR25       101172    8     1               4  189.938
+    1umsRbM7L4ju7rn9aU8Ju6        96062   10     0               4  139.990
+    4SKqOHKYU5pgHr5UiVKiQN       135079    5     1               4  128.014
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p      0.401000         0.719   0.493          0.000000
+    0vSWgAlfpye0WCGeNmuNhy      0.013800         0.850   0.893          0.000004
+    7EL7ifncK2PWFYThJjzR25      0.187000         0.864   0.365          0.000000
+    1umsRbM7L4ju7rn9aU8Ju6      0.145000         0.767   0.576          0.000003
+    4SKqOHKYU5pgHr5UiVKiQN      0.007700         0.765   0.726          0.000000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    0.1180    -7.230       0.0794   0.1240
+    0vSWgAlfpye0WCGeNmuNhy    0.3720    -4.783       0.0623   0.0391
+    7EL7ifncK2PWFYThJjzR25    0.1160   -10.219       0.0655   0.0478
+    1umsRbM7L4ju7rn9aU8Ju6    0.0968    -9.683       0.2560   0.1870
+    4SKqOHKYU5pgHr5UiVKiQN    0.6190    -5.580       0.1910   0.2700
+    ...                          ...       ...          ...      ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304   0.3940
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479   0.3830
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505   0.1240
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392   0.4880
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615   0.1340
+    
+    [35877 rows x 16 columns]
+
+%% Cell type:markdown id: tags:
+
+### Dropping missing values
+Syntax: `df.dropna()`
+- Returns a new DataFrame object instance reference.
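By default `dropna` removes a row as soon as any one of its columns is missing, which is why so many rows disappear from the spotify table in the cell below. The sketch uses an invented three-row frame. The `subset` parameter is a standard `dropna` option that this lecture does not otherwise use; it is shown as one way to restrict the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two different columns.
demo = pd.DataFrame({
    "title": [np.nan, "Best of 2020", np.nan],
    "song_name": ["Pathology", "No Song Name", "Symbiote"],
    "tempo": [115.08, 150.01, np.nan],
})

all_complete = demo.dropna()                # drops a row if any column is NaN
has_tempo = demo.dropna(subset=["tempo"])   # only checks the "tempo" column
print(len(demo), len(all_complete), len(has_tempo))
```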
+
+%% Cell type:code id: tags:
+
+``` python
+# .dropna will drop all rows that contain NaN in them
+df.dropna()
+```
+
+%% Output
+
+                                                  title     song_name      genre  \
+    id
+    5LzAV6KfjN8VhWCedeygfY            Dirtybird Players  No Song Name  techhouse
+    3TsCb6ueD678XBJDiRrvhr                   tech house  No Song Name  techhouse
+    6Y0Fy2buEis7bEOlG0QET1           Tech House Bangerz  No Song Name  techhouse
+    4EJI2XGViSQp6WscLKgYDD                   tech house  No Song Name  techhouse
+    4x6VzOQTLIrkkCWcDPh5Y0           blanc | Tech House  No Song Name  techhouse
+    ...                                             ...           ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle  No Song Name  hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist  No Song Name  hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020  No Song Name  hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle  No Song Name  hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020  No Song Name  hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    5LzAV6KfjN8VhWCedeygfY       197499    7     1               4  127.997
+    3TsCb6ueD678XBJDiRrvhr       206000   10     1               4  124.994
+    6Y0Fy2buEis7bEOlG0QET1       199839    4     0               4  124.006
+    4EJI2XGViSQp6WscLKgYDD       173861    8     1               4  125.031
+    4x6VzOQTLIrkkCWcDPh5Y0       394960    8     0               4  127.029
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    5LzAV6KfjN8VhWCedeygfY      0.000957         0.806   0.950          0.920000
+    3TsCb6ueD678XBJDiRrvhr      0.062300         0.729   0.978          0.908000
+    6Y0Fy2buEis7bEOlG0QET1      0.019100         0.724   0.792          0.812000
+    4EJI2XGViSQp6WscLKgYDD      0.053000         0.700   0.898          0.418000
+    4x6VzOQTLIrkkCWcDPh5Y0      0.000301         0.803   0.919          0.926000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence
+    id
+    5LzAV6KfjN8VhWCedeygfY    0.1130    -6.782       0.0811    0.580
+    3TsCb6ueD678XBJDiRrvhr    0.0353    -6.645       0.0420    0.778
+    6Y0Fy2buEis7bEOlG0QET1    0.1080    -8.555       0.0405    0.346
+    4EJI2XGViSQp6WscLKgYDD    0.5740    -6.099       0.2570    0.791
+    4x6VzOQTLIrkkCWcDPh5Y0    0.1020    -8.667       0.0702    0.754
+    ...                          ...       ...          ...      ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304    0.394
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479    0.383
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505    0.124
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392    0.488
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615    0.134
+    
+    [17529 rows x 16 columns]
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.apply(...)`
+Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
+- Applies the input function to every element of the `Series`.
+- Returns a new `Series` object instance reference.
+
+Let's apply a transformation function to the `mode` column `Series`:
+- mode = 1 means major modality (sounds happy)
+- mode = 0 means minor modality (sounds sad)
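One thing to watch with an if/else transformation like the one in the next cell: every value that is not 1 ends up in the `else` branch, including missing values. The sketch below uses an invented `modes` Series with one `NaN` to show this, and contrasts it with `Series.map` and a dictionary, which is an alternative not used in this lecture.

```python
import numpy as np
import pandas as pd

def replace_mode(m):
    if m == 1:
        return "major"
    else:
        return "minor"

modes = pd.Series([1, 0, np.nan])   # the NaN entry is invented for illustration
labels = modes.apply(replace_mode)
print(labels.tolist())              # the missing mode falls into the else branch

# Mapping through a dictionary leaves values without a matching key as NaN.
mapped = modes.map({1: "major", 0: "minor"})
print(mapped.tolist())
```

In the spotify table the `mode` column has no missing values, so both approaches would give the same labels there.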
+
+%% Cell type:code id: tags:
+
+``` python
+def replace_mode(m):
+    if m == 1:
+        return "major"
+    else:
+        return "minor"
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"].apply(replace_mode)
+```
+
+%% Output
+
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    major
+    0vSWgAlfpye0WCGeNmuNhy    major
+    7EL7ifncK2PWFYThJjzR25    major
+    1umsRbM7L4ju7rn9aU8Ju6    minor
+    4SKqOHKYU5pgHr5UiVKiQN    major
+                              ...
+    46bXU7Sgj7104ZoXxzz9tM    major
+    0he2ViGMUO3ajKTxLOfWVT    minor
+    72DAt9Lbpy9EUS29OzQLob    major
+    6HXgExFVuE1c3cq9QjFCcU    minor
+    6MAAMZImxcvYhRnxDLTufD    major
+    Name: mode, Length: 35877, dtype: object
+
+%% Cell type:markdown id: tags:
+
+### `lambda`
+
+Let's write a `lambda` function instead of the `replace_mode` function
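A quick way to convince yourself that the `lambda` does the same job as the named function is to run both on the same data and compare. This is a sketch on an invented four-element `Series`.

```python
import pandas as pd

def replace_mode(m):
    if m == 1:
        return "major"
    else:
        return "minor"

modes = pd.Series([1, 0, 1, 0])

with_def = modes.apply(replace_mode)
with_lambda = modes.apply(lambda m: "major" if m == 1 else "minor")

# Series.equals checks that both results hold the same values in the same order.
print(with_def.equals(with_lambda))
```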
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"].apply(lambda m: "major" if m == 1 else "minor")
+```
+
+%% Output
+
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    major
+    0vSWgAlfpye0WCGeNmuNhy    major
+    7EL7ifncK2PWFYThJjzR25    major
+    1umsRbM7L4ju7rn9aU8Ju6    minor
+    4SKqOHKYU5pgHr5UiVKiQN    major
+                              ...
+    46bXU7Sgj7104ZoXxzz9tM    major
+    0he2ViGMUO3ajKTxLOfWVT    minor
+    72DAt9Lbpy9EUS29OzQLob    major
+    6HXgExFVuE1c3cq9QjFCcU    minor
+    6MAAMZImxcvYhRnxDLTufD    major
+    Name: mode, Length: 35877, dtype: object
+
+%% Cell type:markdown id: tags:
+
+Typically, transformed columns are added as new columns within the `DataFrame` rather than overwriting the original column.
+Let's add a new `modified_mode` column.
+
+%% Cell type:code id: tags:
+
+``` python
+df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
+df
+```
+
+%% Output
+
+                                                  title     song_name      genre  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p                          NaN     Pathology  Dark Trap
+    0vSWgAlfpye0WCGeNmuNhy                          NaN      Symbiote  Dark Trap
+    7EL7ifncK2PWFYThJjzR25                          NaN     BRAINFOOD  Dark Trap
+    1umsRbM7L4ju7rn9aU8Ju6                          NaN     Sacrifice  Dark Trap
+    4SKqOHKYU5pgHr5UiVKiQN                          NaN      Backpack  Dark Trap
+    ...                                             ...           ...        ...
+    46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle  No Song Name  hardstyle
+    0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist  No Song Name  hardstyle
+    72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020  No Song Name  hardstyle
+    6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle  No Song Name  hardstyle
+    6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020  No Song Name  hardstyle
+    
+                            duration_ms  key  mode  time_signature    tempo  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p       224427    8     1               4  115.080
+    0vSWgAlfpye0WCGeNmuNhy        98821    5     1               4  218.050
+    7EL7ifncK2PWFYThJjzR25       101172    8     1               4  189.938
+    1umsRbM7L4ju7rn9aU8Ju6        96062   10     0               4  139.990
+    4SKqOHKYU5pgHr5UiVKiQN       135079    5     1               4  128.014
+    ...                             ...  ...   ...             ...      ...
+    46bXU7Sgj7104ZoXxzz9tM       269208    4     1               4  150.013
+    0he2ViGMUO3ajKTxLOfWVT       210112    0     0               4  149.928
+    72DAt9Lbpy9EUS29OzQLob       234823    8     1               4  154.935
+    6HXgExFVuE1c3cq9QjFCcU       323200    6     0               4  150.042
+    6MAAMZImxcvYhRnxDLTufD       162161    9     1               4  155.047
+    
+                            acousticness  danceability  energy  instrumentalness  \
+    id
+    7pgJBLVz5VmnL7uGHmRj6p      0.401000         0.719   0.493          0.000000
+    0vSWgAlfpye0WCGeNmuNhy      0.013800         0.850   0.893          0.000004
+    7EL7ifncK2PWFYThJjzR25      0.187000         0.864   0.365          0.000000
+    1umsRbM7L4ju7rn9aU8Ju6      0.145000         0.767   0.576          0.000003
+    4SKqOHKYU5pgHr5UiVKiQN      0.007700         0.765   0.726          0.000000
+    ...                              ...           ...     ...               ...
+    46bXU7Sgj7104ZoXxzz9tM      0.031500         0.528   0.693          0.000345
+    0he2ViGMUO3ajKTxLOfWVT      0.022500         0.517   0.768          0.000018
+    72DAt9Lbpy9EUS29OzQLob      0.026000         0.361   0.821          0.000242
+    6HXgExFVuE1c3cq9QjFCcU      0.000551         0.477   0.921          0.029600
+    6MAAMZImxcvYhRnxDLTufD      0.001890         0.529   0.945          0.000055
+    
+                            liveness  loudness  speechiness  valence modified_mode
+    id
+    7pgJBLVz5VmnL7uGHmRj6p    0.1180    -7.230       0.0794   0.1240         major
+    0vSWgAlfpye0WCGeNmuNhy    0.3720    -4.783       0.0623   0.0391         major
+    7EL7ifncK2PWFYThJjzR25    0.1160   -10.219       0.0655   0.0478         major
+    1umsRbM7L4ju7rn9aU8Ju6    0.0968    -9.683       0.2560   0.1870         minor
+    4SKqOHKYU5pgHr5UiVKiQN    0.6190    -5.580       0.1910   0.2700         major
+    ...                          ...       ...          ...      ...           ...
+    46bXU7Sgj7104ZoXxzz9tM    0.1210    -5.148       0.0304   0.3940         major
+    0he2ViGMUO3ajKTxLOfWVT    0.2050    -7.922       0.0479   0.3830         minor
+    72DAt9Lbpy9EUS29OzQLob    0.3850    -3.102       0.0505   0.1240         major
+    6HXgExFVuE1c3cq9QjFCcU    0.0575    -4.777       0.0392   0.4880         minor
+    6MAAMZImxcvYhRnxDLTufD    0.4140    -5.862       0.0615   0.1340         major
+    
+    [35877 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+#### Let's go back to the original table from the SQL database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * FROM spotify")
+df
+```
+
+%% Output
+
+                               id                        title  song_name  \
+    0      7pgJBLVz5VmnL7uGHmRj6p                               Pathology
+    1      0vSWgAlfpye0WCGeNmuNhy                                Symbiote
+    2      7EL7ifncK2PWFYThJjzR25                               BRAINFOOD
+    3      1umsRbM7L4ju7rn9aU8Ju6                               Sacrifice
+    4      4SKqOHKYU5pgHr5UiVKiQN                                Backpack
+    ...                       ...                          ...        ...
+    35872  46bXU7Sgj7104ZoXxzz9tM           Euphoric Hardstyle
+    35873  0he2ViGMUO3ajKTxLOfWVT  Greatest Hardstyle Playlist
+    35874  72DAt9Lbpy9EUS29OzQLob       Best of Hardstyle 2020
+    35875  6HXgExFVuE1c3cq9QjFCcU           Euphoric Hardstyle
+    35876  6MAAMZImxcvYhRnxDLTufD       Best of Hardstyle 2020
+    
+               genre  duration_ms  key  mode  time_signature    tempo  \
+    0      Dark Trap       224427    8     1               4  115.080
+    1      Dark Trap        98821    5     1               4  218.050
+    2      Dark Trap       101172    8     1               4  189.938
+    3      Dark Trap        96062   10     0               4  139.990
+    4      Dark Trap       135079    5     1               4  128.014
+    ...          ...          ...  ...   ...             ...      ...
+    35872  hardstyle       269208    4     1               4  150.013
+    35873  hardstyle       210112    0     0               4  149.928
+    35874  hardstyle       234823    8     1               4  154.935
+    35875  hardstyle       323200    6     0               4  150.042
+    35876  hardstyle       162161    9     1               4  155.047
+    
+           acousticness  danceability  energy  instrumentalness  liveness  \
+    0          0.401000         0.719   0.493          0.000000    0.1180
+    1          0.013800         0.850   0.893          0.000004    0.3720
+    2          0.187000         0.864   0.365          0.000000    0.1160
+    3          0.145000         0.767   0.576          0.000003    0.0968
+    4          0.007700         0.765   0.726          0.000000    0.6190
+    ...             ...           ...     ...               ...       ...
+    35872      0.031500         0.528   0.693          0.000345    0.1210
+    35873      0.022500         0.517   0.768          0.000018    0.2050
+    35874      0.026000         0.361   0.821          0.000242    0.3850
+    35875      0.000551         0.477   0.921          0.029600    0.0575
+    35876      0.001890         0.529   0.945          0.000055    0.4140
+    
+           loudness  speechiness  valence
+    0        -7.230       0.0794   0.1240
+    1        -4.783       0.0623   0.0391
+    2       -10.219       0.0655   0.0478
+    3        -9.683       0.2560   0.1870
+    4        -5.580       0.1910   0.2700
+    ...         ...          ...      ...
+    35872    -5.148       0.0304   0.3940
+    35873    -7.922       0.0479   0.3830
+    35874    -3.102       0.0505   0.1240
+    35875    -4.777       0.0392   0.4880
+    35876    -5.862       0.0615   0.1340
+    
+    [35877 rows x 17 columns]
+
+%% Cell type:markdown id: tags:
+
+Extract just the "genre" and "duration_ms" columns from `df`.
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Output
+
+               genre  duration_ms
+    0      Dark Trap       224427
+    1      Dark Trap        98821
+    2      Dark Trap       101172
+    3      Dark Trap        96062
+    4      Dark Trap       135079
+    ...          ...          ...
+    35872  hardstyle       269208
+    35873  hardstyle       210112
+    35874  hardstyle       234823
+    35875  hardstyle       323200
+    35876  hardstyle       162161
+    
+    [35877 rows x 2 columns]
+
+%% Cell type:markdown id: tags:
+
+### `pandas.DataFrame.groupby(...)`
+
+Syntax: `DataFrame.groupby(<COLUMN>)`
+- Returns a reference to a `DataFrameGroupBy` object instance
+- The return value of `groupby` is only useful once you apply an aggregation method (such as `mean`, `max`, or `count`) to it
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]].groupby("genre")
+```
+
+%% Output
+
+    <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbc472bad90>
+
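A `DataFrameGroupBy` object only records how the rows are partitioned; nothing is computed until an aggregation is applied. A minimal sketch of that behavior, assuming a made-up three-row table (the `toy` DataFrame and its values are illustrative and not part of the Spotify data):

```python
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "genre": ["Pop", "Rap", "Pop"],
    "duration_ms": [200000, 180000, 220000],
})

groups = toy.groupby("genre")        # no statistics are computed yet
print(type(groups).__name__)

# Aggregating turns the groups into a DataFrame with one row per genre
means = groups.mean()
print(means)
```

The group keys become the row index of the aggregated result, which is why `genre` shows up as the index in the lecture output.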
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, sorted in decreasing order of average duration?
+#### v1: using `df` (`pandas`) to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]].groupby("genre").mean()
+```
+
+%% Output
+
+                       duration_ms
+    genre
+    Dark Trap        196059.938997
+    Emo              218370.989519
+    Hiphop           227885.028411
+    Pop              211558.052980
+    Rap              200816.798836
+    RnB              225628.556955
+    Trap Metal       145940.519467
+    Underground Rap  175506.191224
+    dnb              288860.138811
+    hardstyle        232828.626542
+    psytrance        445770.492075
+    techhouse        298395.587596
+    techno           399123.187453
+    trance           288729.366262
+    trap             225149.277731
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]].groupby("genre").mean().sort_values(by = "duration_ms", ascending = False)
+```
+
+%% Output
+
+                       duration_ms
+    genre
+    psytrance        445770.492075
+    techno           399123.187453
+    techhouse        298395.587596
+    dnb              288860.138811
+    trance           288729.366262
+    hardstyle        232828.626542
+    Hiphop           227885.028411
+    RnB              225628.556955
+    trap             225149.277731
+    Emo              218370.989519
+    Pop              211558.052980
+    Rap              200816.798836
+    Dark Trap        196059.938997
+    Underground Rap  175506.191224
+    Trap Metal       145940.519467
+
+%% Cell type:markdown id: tags:
+
+To sanity-check the groups that `groupby` forms, call `value_counts` on the same column `Series`: it lists every genre along with the number of rows that fall into it.
+
+%% Cell type:code id: tags:
+
+``` python
+df["genre"].value_counts()
+```
+
+%% Output
+
+    Underground Rap    4330
+    Dark Trap          3590
+    Hiphop             3027
+    trance             2804
+    psytrance          2650
+    techno             2646
+    dnb                2507
+    trap               2362
+    hardstyle          2351
+    techhouse          2209
+    RnB                1905
+    Trap Metal         1875
+    Emo                1622
+    Rap                1546
+    Pop                 453
+    Name: genre, dtype: int64
+
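The check can be made explicit by comparing `value_counts` with the group sizes reported by `groupby`. This is only a sketch on hypothetical toy data; the variable names are assumptions introduced for illustration:

```python
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "genre": ["Pop", "Rap", "Pop", "Emo", "Rap", "Pop"],
    "duration_ms": [1, 2, 3, 4, 5, 6],
})

counts_from_value_counts = toy["genre"].value_counts()   # sorted by count, descending
counts_from_groupby = toy.groupby("genre").size()        # sorted by group key

# The row order differs, so compare the label -> count mappings instead
same = counts_from_value_counts.to_dict() == counts_from_groupby.to_dict()
print(same)
```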
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, sorted in decreasing order of average duration?
+#### v2: using SQL query to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+avg_duration_per_genre = qry("""
+SELECT genre, AVG(duration_ms) as avg_duration
+FROM spotify
+GROUP BY genre
+ORDER BY avg_duration DESC
+""")
+
+# How can we make the SQL query output exactly the same as the df.groupby output?
+avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
+avg_duration_per_genre
+```
+
+%% Output
+
+                      avg_duration
+    genre
+    psytrance        445770.492075
+    techno           399123.187453
+    techhouse        298395.587596
+    dnb              288860.138811
+    trance           288729.366262
+    hardstyle        232828.626542
+    Hiphop           227885.028411
+    RnB              225628.556955
+    trap             225149.277731
+    Emo              218370.989519
+    Pop              211558.052980
+    Rap              200816.798836
+    Dark Trap        196059.938997
+    Underground Rap  175506.191224
+    Trap Metal       145940.519467
+
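The correspondence between `GROUP BY ... ORDER BY` and `groupby(...).mean().sort_values(...)` can be tried without `spotify.db`. The sketch below assumes a throwaway in-memory SQLite database and a hypothetical `songs` table built from made-up rows:

```python
import sqlite3
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "genre": ["Pop", "Rap", "Pop", "Rap"],
    "duration_ms": [200000, 180000, 220000, 100000],
})

conn = sqlite3.connect(":memory:")            # throwaway in-memory database
toy.to_sql("songs", conn, index=False)        # "songs" is a made-up table name

from_sql = pd.read_sql("""
SELECT genre, AVG(duration_ms) AS avg_duration
FROM songs
GROUP BY genre
ORDER BY avg_duration DESC
""", conn).set_index("genre")

from_pandas = (
    toy.groupby("genre")
       .mean()
       .sort_values(by="duration_ms", ascending=False)
       .rename(columns={"duration_ms": "avg_duration"})
)
conn.close()

print(from_sql)
print(from_pandas)
```

Renaming the column on the pandas side is what makes the two results line up, since SQL names the aggregate through `AS` while `mean()` keeps the original column name.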
+%% Cell type:markdown id: tags:
+
+### What is the average speechiness for each mode, time signature pair?
+#### v1: pandas
+
+%% Cell type:code id: tags:
+
+``` python
+# use a list to indicate all the columns you want to groupby
+df[["mode", "time_signature", "speechiness"]].groupby(["mode", "time_signature"]).mean()
+```
+
+%% Output
+
+                         speechiness
+    mode time_signature
+    0    1                  0.181224
+         3                  0.121837
+         4                  0.126688
+         5                  0.204890
+    1    1                  0.173138
+         3                  0.129512
+         4                  0.139170
+         5                  0.220177
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+qry("""
+SELECT mode, time_signature, AVG(speechiness) as avg_speechiness
+FROM spotify
+GROUP BY mode, time_signature
+""")
+```
+
+%% Output
+
+       mode  time_signature  avg_speechiness
+    0     0               1         0.181224
+    1     0               3         0.121837
+    2     0               4         0.126688
+    3     0               5         0.204890
+    4     1               1         0.173138
+    5     1               3         0.129512
+    6     1               4         0.139170
+    7     1               5         0.220177
+
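Grouping by two columns gives a two-level row index in pandas, whereas SQL returns the group keys as ordinary columns. One way to reconcile the two shapes is `reset_index()`; the following is a sketch on hypothetical toy values, not the lecture's data:

```python
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "mode": [0, 0, 1, 1],
    "time_signature": [4, 4, 3, 4],
    "speechiness": [0.1, 0.3, 0.2, 0.4],
})

grouped = toy.groupby(["mode", "time_signature"]).mean()   # rows indexed by (mode, time_signature)

# Move the group keys back into columns and mirror the SQL alias
flat = grouped.reset_index().rename(columns={"speechiness": "avg_speechiness"})
print(flat)
```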
+%% Cell type:markdown id: tags:
+
+### Self-practice
+
+%% Cell type:markdown id: tags:
+
+### Which songs have a tempo greater than 150 and what are their genres?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+fast_songs = df[df["tempo"] > 150]
+fast_songs[["song_name", "genre"]]
+```
+
+%% Output
+
+                   song_name      genre
+    1               Symbiote  Dark Trap
+    2              BRAINFOOD  Dark Trap
+    18     FunnyToSeeYouHere  Dark Trap
+    19                Killer  Dark Trap
+    20                   608  Dark Trap
+    ...                  ...        ...
+    35871                     hardstyle
+    35872                     hardstyle
+    35874                     hardstyle
+    35875                     hardstyle
+    35876                     hardstyle
+    
+    [13753 rows x 2 columns]
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+
+qry("""
+SELECT song_name, genre
+FROM spotify
+WHERE tempo > 150
+""")
+```
+
+%% Output
+
+                   song_name      genre
+    0               Symbiote  Dark Trap
+    1              BRAINFOOD  Dark Trap
+    2      FunnyToSeeYouHere  Dark Trap
+    3                 Killer  Dark Trap
+    4                    608  Dark Trap
+    ...                  ...        ...
+    13748                     hardstyle
+    13749                     hardstyle
+    13750                     hardstyle
+    13751                     hardstyle
+    13752                     hardstyle
+    
+    [13753 rows x 2 columns]
+
+%% Cell type:markdown id: tags:
+
+### What is the sum of danceability and liveness for "Hiphop" genre songs?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+hiphop_songs = df[df["genre"] == "Hiphop"]
+hiphop_songs["danceability"] + hiphop_songs["liveness"]
+```
+
+%% Output
+
+    15321    0.8416
+    15322    0.9201
+    15323    0.8580
+    15324    0.8240
+    15325    0.9348
+              ...
+    18343    0.6690
+    18344    0.5370
+    18345    0.8850
+    18346    0.8770
+    18347    0.8703
+    Length: 3027, dtype: float64
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+hiphop_songs = qry("""
+SELECT danceability + liveness as song_score
+FROM spotify
+WHERE genre = 'Hiphop'
+""")
+hiphop_songs["song_score"]
+```
+
+%% Output
+
+    0       0.8416
+    1       0.9201
+    2       0.8580
+    3       0.8240
+    4       0.9348
+             ...
+    3022    0.6690
+    3023    0.5370
+    3024    0.8850
+    3025    0.8770
+    3026    0.8703
+    Name: song_score, Length: 3027, dtype: float64
+
+%% Cell type:markdown id: tags:
+
+### Find every `song_name`, ordered by ascending `duration_ms`. Eliminate songs that don't have a `song_name`
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+songs_by_duration = list(df.sort_values(by = "duration_ms")["song_name"])
+# [song for song in songs_by_duration if song != ""] # uncomment to see the output
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+songs_by_duration = qry("""
+SELECT song_name
+FROM spotify
+ORDER BY duration_ms
+""")
+songs_by_duration = list(songs_by_duration["song_name"])
+# [song for song in songs_by_duration if song != ""] # uncomment to see the output
+```
+
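The elimination step can also be done before the list is built, rather than with a list comprehension afterwards. A minimal sketch, assuming (as in this dataset) that a missing name is stored as an empty string; the toy rows and the `songs` table name are hypothetical:

```python
import sqlite3
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "song_name": ["B", "", "A", "C"],
    "duration_ms": [300, 100, 200, 400],
})

# pandas: boolean indexing drops the empty names, then sort by duration
named = toy[toy["song_name"] != ""]
pandas_result = list(named.sort_values(by="duration_ms")["song_name"])

# SQL: the same filter expressed as a WHERE clause
conn = sqlite3.connect(":memory:")
toy.to_sql("songs", conn, index=False)
sql_result = list(pd.read_sql("""
SELECT song_name
FROM songs
WHERE song_name != ''
ORDER BY duration_ms
""", conn)["song_name"])
conn.close()

print(pandas_result)
print(sql_result)
```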
+%% Cell type:markdown id: tags:
+
+### How many distinct "genre"s are there in the dataset?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+list(set(list(df["genre"])))
+```
+
+%% Output
+
+    ['trance',
+     'techno',
+     'dnb',
+     'Trap Metal',
+     'RnB',
+     'Pop',
+     'psytrance',
+     'techhouse',
+     'trap',
+     'Dark Trap',
+     'Emo',
+     'Underground Rap',
+     'Rap',
+     'Hiphop',
+     'hardstyle']
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+genres = qry("""
+SELECT DISTINCT genre
+FROM spotify
+""")
+list(genres["genre"])
+```
+
+%% Output
+
+    ['Dark Trap',
+     'Underground Rap',
+     'Trap Metal',
+     'Emo',
+     'Rap',
+     'RnB',
+     'Pop',
+     'Hiphop',
+     'techhouse',
+     'techno',
+     'trance',
+     'psytrance',
+     'trap',
+     'dnb',
+     'hardstyle']
+
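The two cells above list the distinct genres; to answer "how many" directly, the count itself can be computed. A sketch on hypothetical toy data, using `Series.nunique()` on the pandas side and `COUNT(DISTINCT ...)` on the SQL side:

```python
import sqlite3
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({"genre": ["Pop", "Rap", "Pop", "Emo"]})

n_pandas = toy["genre"].nunique()             # number of distinct values

conn = sqlite3.connect(":memory:")
toy.to_sql("songs", conn, index=False)        # "songs" is a made-up table name
n_sql = pd.read_sql("SELECT COUNT(DISTINCT genre) AS n FROM songs", conn)["n"].iloc[0]
conn.close()

print(n_pandas, n_sql)
```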
+%% Cell type:markdown id: tags:
+
+### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+high_energy_songs = df[df["energy"] > 0.5]
+genre_groups = high_energy_songs[["genre", "energy"]].groupby("genre")
+max_energy = genre_groups.max()
+max_energy["energy"]
+```
+
+%% Output
+
+    genre
+    Dark Trap          0.998
+    Emo                0.995
+    Hiphop             0.978
+    Pop                0.977
+    Rap                0.980
+    RnB                0.974
+    Trap Metal         0.999
+    Underground Rap    0.997
+    dnb                0.999
+    hardstyle          0.999
+    psytrance          0.999
+    techhouse          0.999
+    techno             1.000
+    trance             1.000
+    trap               1.000
+    Name: energy, dtype: float64
+
+%% Cell type:code id: tags:
+
+``` python
+genre_counts = genre_groups.count()
+genre_counts["energy_max"] = max_energy["energy"]
+filtered_genre_counts = genre_counts[genre_counts["energy"] > 2000]
+filtered_genre_counts
+```
+
+%% Output
+
+                     energy  energy_max
+    genre
+    Dark Trap          2757       0.998
+    Hiphop             2497       0.978
+    Underground Rap    3420       0.997
+    dnb                2496       0.999
+    hardstyle          2345       0.999
+    psytrance          2642       0.999
+    techhouse          2164       0.999
+    techno             2534       1.000
+    trance             2786       1.000
+    trap               2346       1.000
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+qry("""
+SELECT genre, COUNT(*) as song_count, MAX("energy") as energy_max
+FROM spotify
+WHERE energy > 0.5
+GROUP BY genre
+HAVING song_count > 2000
+""")
+```
+
+%% Output
+
+                 genre  song_count  energy_max
+    0        Dark Trap        2757       0.998
+    1           Hiphop        2497       0.978
+    2  Underground Rap        3420       0.997
+    3              dnb        2496       0.999
+    4        hardstyle        2345       0.999
+    5        psytrance        2642       0.999
+    6        techhouse        2164       0.999
+    7           techno        2534       1.000
+    8           trance        2786       1.000
+    9             trap        2346       1.000
+
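The pandas version above needs two cells because the count and the maximum are computed separately. As an alternative sketch (not the approach used in the lecture), named aggregation with `agg` produces both columns in one step, which maps clause by clause onto the SQL query. The toy rows and the threshold of 1 are assumptions chosen to keep the example small:

```python
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "genre":  ["Pop", "Pop", "Pop", "Rap", "Rap", "Emo"],
    "energy": [0.9, 0.6, 0.4, 0.8, 0.7, 0.95],
})

summary = (
    toy[toy["energy"] > 0.5]                   # WHERE energy > 0.5
    .groupby("genre")                          # GROUP BY genre
    .agg(song_count=("energy", "count"),       # COUNT(*), assuming no missing energy values
         energy_max=("energy", "max"))         # MAX(energy)
)
result = summary[summary["song_count"] > 1]    # HAVING song_count > 1
print(result)
```

The order of operations matters in both languages: the row filter runs before grouping, and the group filter runs after aggregation.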
+%% Cell type:code id: tags:
+
+``` python
+# Close the database connection here
+conn.close()
+```
--- a/f22/meena_lec_notes/lec-35/lec_35_pandas3_data_transformation_template.ipynb
+++ b/f22/meena_lec_notes/lec-35/lec_35_pandas3_data_transformation_template.ipynb
+%% Cell type:code id: tags:
+
+``` python
+# known import statements
+import pandas as pd
+import sqlite3 as sql # note that we are renaming to sql
+import os
+
+# new import statement
+import numpy as np
+```
+
+%% Cell type:markdown id: tags:
+
+# Lecture 35 Pandas 3: Data Transformation
+* Data transformation is the process of changing the format, structure, or values of data.
+* Often needed during data cleaning and sometimes during data analysis
+
+%% Cell type:markdown id: tags:
+
+# Today's Learning Objectives:
+
+* Set a column as the index of a pandas `DataFrame`
+* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
+* Apply transformations to a `DataFrame`:
+  * Use `apply` on a pandas `Series` to apply a transformation function
+  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
+* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
+* Convert `groupby` examples to SQL
+* Solve the same question using SQL and pandas `DataFrame` manipulations:
+  * filtering, grouping, and aggregation / summarization
+
+%% Cell type:markdown id: tags:
+
+# The dataset: Spotify songs
+Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
+
+If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 1: Establish a connection to the spotify.db database
+
+%% Cell type:code id: tags:
+
+``` python
+# open up the spotify database
+db_pathname = "spotify.db"
+assert ???
+conn = sql.connect(db_pathname)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def qry(query):
+    # the parameter is named `query` so that it does not shadow the `sql` module alias
+    return pd.read_sql(query, conn)
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 2: Identify the table name(s) inside the database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 3: Use a pandas lookup expression to extract the "sql" column, then display the full query using an `.iloc` lookup
+
+%% Cell type:code id: tags:
+
+``` python
+print()
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+### Setting a column as row indices for the `DataFrame`
+
+- Syntax: `df.set_index("<COLUMN>")`
+- Returns a new DataFrame object instance reference.
+- WARNING: executing this twice raises a `KeyError`. Once you set a column as the row index, it is no longer a column within the `DataFrame`. If that happens, re-run the cell above to reload `df`, then run the cell below exactly once.
+
+%% Cell type:code id: tags:
+
+``` python
+# Set the id column as row indices
+df =
+df
+```
+
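The `KeyError` described in the warning can be reproduced safely on a small table. This is a sketch on hypothetical toy data; the `id` values are made up:

```python
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({"id": ["a1", "b2"], "tempo": [120.0, 150.0]})

indexed = toy.set_index("id")        # "id" moves from the columns to the row index
print(list(indexed.columns))

try:
    indexed.set_index("id")          # "id" is no longer a column
    raised = False
except KeyError:
    raised = True
print(raised)
```

Because `set_index` returns a new `DataFrame`, `toy` itself still has its `id` column; the error only appears when the call is repeated on the already-indexed result.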
+%% Cell type:markdown id: tags:
+
+### Not a Number
+
+- `np.nan` is the floating point representation of Not a Number
+- You do not need to know / learn the details about the `numpy` package
+
+### Replacing / modifying values within the `DataFrame`
+
+Syntax: `df.replace(<TARGET>, <REPLACE>)`
+- Your target can be `str`, `int`, `float`, `None` (there are other possibilities, but those are too advanced for this course)
+- Returns a new DataFrame object instance reference.
+
+Let's now replace the missing values (empty strings) with `np.nan`
+
+%% Cell type:code id: tags:
+
+``` python
+df =
+df.head(10) # title is the album name
+```
+
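To see what `replace` does to empty strings without touching the real table, here is a minimal sketch on hypothetical toy data (the song and album names are only placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "song_name": ["Pathology", "", "Backpack"],
    "title": ["", "Best of 2020", ""],
})

cleaned = toy.replace("", np.nan)    # returns a new DataFrame; toy is unchanged
print(cleaned)
```

Marking missing entries as `np.nan` is what lets `isna`, `fillna`, and `dropna` recognize them; an empty string is an ordinary value as far as those methods are concerned.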
+%% Cell type:markdown id: tags:
+
+### Checking for missing values
+
+Syntax: `Series.isna()`
+- Returns a boolean Series
+
+Let's check whether any "song_name" values are missing
+
+%% Cell type:code id: tags:
+
+``` python
+df["song_name"]
+```
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.value_counts()`
+- Returns a new `Series` whose index holds the unique values from the original `Series` and whose values are the counts of those unique values.
+- The returned `Series` is sorted in descending order of counts.
+
+%% Cell type:code id: tags:
+
+``` python
+# count the number of missing values for song name
+df["song_name"]
+```
+
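Chaining `isna` with `value_counts` gives the number of missing and non-missing entries at once. A sketch on a hypothetical five-element `Series` (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
names = pd.Series(["Pathology", np.nan, "Backpack", np.nan, np.nan], name="song_name")

missing = names.isna()               # boolean Series: True where the value is NaN
tally = missing.value_counts()       # how many True and how many False
n_missing = int(missing.sum())       # True counts as 1, so the sum is the number missing
print(tally)
print(n_missing)
```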
+%% Cell type:markdown id: tags:
+
+### Missing value manipulation
+Syntax: `df.fillna(<REPLACE>)`
+- Returns a new DataFrame object instance reference.
+
+%% Cell type:code id: tags:
+
+``` python
+# use .fillna to replace missing values
+df["song_name"]
+
+# to replace the original DataFrame's column, you need to explicitly update that object instance
+# TODO: uncomment the below lines and update the code
+#df["song_name"] = ???
+#df
+```
+
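Since `fillna` returns a new object, the original column keeps its `NaN` values until the result is assigned back. A minimal sketch on hypothetical toy data; the placeholder text `"No Song Name"` is an assumption chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({"song_name": ["Pathology", np.nan, "Backpack"]})

filled = toy["song_name"].fillna("No Song Name")      # new Series with the gap filled
missing_before = int(toy["song_name"].isna().sum())   # toy itself is still unchanged

toy["song_name"] = filled                             # explicit update of the column
missing_after = int(toy["song_name"].isna().sum())
print(missing_before, missing_after)
```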
+%% Cell type:markdown id: tags:
+
+### Dropping missing values
+Syntax: `df.dropna()`
+- Returns a new DataFrame object instance reference.
+
+%% Cell type:code id: tags:
+
+``` python
+# .dropna will drop all rows that contain NaN in them
+df.dropna()
+```
+
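By default `dropna` removes a row if any of its columns is missing, which can discard more than intended. The `subset` parameter restricts the check to chosen columns. A sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from spotify.db
toy = pd.DataFrame({
    "song_name": ["A", np.nan, "C"],
    "title": [np.nan, "T", "U"],
})

all_complete = toy.dropna()                       # keeps rows with no NaN in any column
named_only = toy.dropna(subset=["song_name"])     # only requires song_name to be present
print(len(all_complete), len(named_only))
```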
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.apply(...)`
+Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
+- Applies the input function to every element of the `Series`.
+- Returns a new `Series` object instance reference.
+
+Let's apply a transformation function to the `mode` column `Series`:
+- mode = 1 means major modality (sounds happy)
+- mode = 0 means minor modality (sounds sad)
+
+%% Cell type:code id: tags:
+
+``` python
+def replace_mode(m):
+    if m == 1:
+        return "major"
+    else:
+        return "minor"
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"]
+```
+
+%% Cell type:markdown id: tags:
+
+### `lambda`
+
+Let's write a `lambda` function instead of the `replace_mode` function
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"].apply(???)
+```
+
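A named function and a `lambda` are interchangeable as arguments to `apply`, as long as they take one element and return one value. A short sketch on a hypothetical three-element `Series`:

```python
import pandas as pd

def replace_mode(m):
    # 1 means major modality; anything else is treated as minor
    if m == 1:
        return "major"
    else:
        return "minor"

# Hypothetical toy data, not taken from spotify.db
modes = pd.Series([1, 0, 1])

by_function = modes.apply(replace_mode)     # pass the function object, without calling it
by_lambda = modes.apply(lambda m: "major" if m == 1 else "minor")
print(list(by_function), list(by_lambda))
```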
+%% Cell type:markdown id: tags:
+
+Typically transformed columns are added as new columns within the DataFrame.
+Let's add a new `modified_mode` column.
+
+%% Cell type:code id: tags:
+
+``` python
+df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+#### Let's go back to the original table from the SQL database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * FROM spotify")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+Extract just the "genre" and "duration_ms" columns from `df`.
+
+%% Cell type:code id: tags:
+
+``` python
+df[???]
+```
+
+%% Cell type:markdown id: tags:
+
+### `pandas.DataFrame.groupby(...)`
+
+Syntax: `DataFrame.groupby(<COLUMN>)`
+- Returns a reference to a `DataFrameGroupBy` object instance
+- The return value of `groupby` is only useful once you apply an aggregation method (such as `mean`, `max`, or `count`) to it
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, sorted in decreasing order of average duration?
+#### v1: using `df` (`pandas`) to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Cell type:markdown id: tags:
+
+To sanity-check the groups that `groupby` forms, call `value_counts` on the same column `Series`: it lists every genre along with the number of rows that fall into it.
+
+%% Cell type:code id: tags:
+
+``` python
+df["genre"].value_counts()
+```
+
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, sorted in decreasing order of average duration?
+#### v2: using SQL query to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+avg_duration_per_genre = qry("""
+
+""")
+
+# How can we make the SQL query output exactly the same as the df.groupby output?
+avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
+avg_duration_per_genre
+```
+
+%% Cell type:markdown id: tags:
+
+### What is the average speechiness for each mode, time signature pair?
+#### v1: pandas
+
+%% Cell type:code id: tags:
+
+``` python
+# use a list to indicate all the columns you want to groupby
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+qry("""
+
+""")
+```
+
+%% Cell type:markdown id: tags:
+
+### Self-practice
+
+%% Cell type:markdown id: tags:
+
+### Which songs have a tempo greater than 150 and what are their genres?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+fast_songs =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+
+qry("""
+
+""")
+```
+
+%% Cell type:markdown id: tags:
+
+### What is the sum of danceability and liveness for "Hiphop" genre songs?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+hiphop_songs =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+hiphop_songs = qry("""
+
+""")
+hiphop_songs
+```
+
+%% Cell type:markdown id: tags:
+
+### Find every `song_name`, ordered by ascending `duration_ms`. Eliminate songs that don't have a `song_name`
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+songs_by_duration =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+songs_by_duration = qry("""
+
+""")
+songs_by_duration
+```
+
+%% Cell type:markdown id: tags:
+
+### How many distinct "genre"s are there in the dataset?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+genres = qry("""
+
+""")
+```
+
+%% Cell type:markdown id: tags:
+
+### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
+
+%% Cell type:code id: tags:
+
+``` python
+genre_groups =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+high_energy_songs = ???
+genre_groups = ???
+max_energy = ???
+max_energy["energy"]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+genre_counts = ???
+genre_counts["energy_max"] = max_energy["energy"]
+filtered_genre_counts = ???
+filtered_genre_counts
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+qry("""
+
+""")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Close the database connection here
+```
+%% Cell type:code id: tags:
+
+``` python
+# known import statements
+import pandas as pd
+import sqlite3 as sql # note that we are renaming to sql
+import os
+
+# new import statement
+import numpy as np
+```
+
+%% Cell type:markdown id: tags:
+
+# Lecture 35 Pandas 3: Data Transformation
+* Data transformation is the process of changing the format, structure, or values of data.
+* Often needed during data cleaning and sometimes during data analysis
+
+%% Cell type:markdown id: tags:
+
+# Today's Learning Objectives:
+
+* Set a column as the index of a pandas `DataFrame`
+* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
+* Apply transformations to a `DataFrame`:
+  * Use `apply` on a pandas `Series` to apply a transformation function
+  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
+* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
+* Convert `groupby` examples to SQL
+* Solve the same question using SQL and pandas `DataFrame` manipulations:
+  * filtering, grouping, and aggregation / summarization
+
+%% Cell type:markdown id: tags:
+
+# The dataset: Spotify songs
+Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
+
+If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 1: Establish a connection to the spotify.db database
+
+%% Cell type:code id: tags:
+
+``` python
+# open up the spotify database
+db_pathname = "spotify.db"
+assert ???
+conn = sql.connect(db_pathname)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def qry(query):
+    # the parameter is named `query` so that it does not shadow the `sql` module alias
+    return pd.read_sql(query, conn)
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 2: Identify the table name(s) inside the database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 3: Use a pandas lookup expression to extract the "sql" column, then display the full query using an `.iloc` lookup
+
+%% Cell type:code id: tags:
+
+``` python
+print()
+```
+
+%% Cell type:markdown id: tags:
+
+### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+### Setting a column as row indices for the `DataFrame`
+
+- Syntax: `df.set_index("<COLUMN>")`
+- Returns a new DataFrame object instance reference.
+- WARNING: executing this twice raises a `KeyError`. Once you set a column as the row index, it is no longer a column within the `DataFrame`. If that happens, re-run the cell above to reload `df`, then run the cell below exactly once.
+
+%% Cell type:code id: tags:
+
+``` python
+# Set the id column as row indices
+df =
+df
+```
+
+%% Cell type:markdown id: tags:
+
+### Not a Number
+
+- `np.nan` is the floating point representation of Not a Number
+- You do not need to know / learn the details about the `numpy` package
+
+### Replacing / modifying values within the `DataFrame`
+
+Syntax: `df.replace(<TARGET>, <REPLACE>)`
+- Your target can be `str`, `int`, `float`, `None` (there are other possibilities, but those are too advanced for this course)
+- Returns a new DataFrame object instance reference.
+
+Let's now replace the missing values (empty strings) with `np.nan`
+
+%% Cell type:code id: tags:
+
+``` python
+df =
+df.head(10) # title is the album name
+```
+
+%% Cell type:markdown id: tags:
+
+### Checking for missing values
+
+Syntax: `Series.isna()`
+- Returns a boolean Series
+
+Let's check whether any "song_name" values are missing
+
+%% Cell type:code id: tags:
+
+``` python
+df["song_name"]
+```
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.value_counts()`
+- Returns a new `Series` whose index holds the unique values from the original `Series` and whose values are the counts of those unique values.
+- The returned `Series` is sorted in descending order of counts.
+
+%% Cell type:code id: tags:
+
+``` python
+# count the number of missing values for song name
+df["song_name"]
+```
+
+%% Cell type:markdown id: tags:
+
+### Missing value manipulation
+Syntax: `df.fillna(<REPLACE>)`
+- Returns a new DataFrame object instance reference.
+
+%% Cell type:code id: tags:
+
+``` python
+# use .fillna to replace missing values
+df["song_name"]
+
+# to replace the original DataFrame's column, you need to explicitly update that object instance
+# TODO: uncomment the below lines and update the code
+#df["song_name"] = ???
+#df
+```
+
+%% Cell type:markdown id: tags:
+
+### Dropping missing values
+Syntax: `df.dropna()`
+- Returns a new DataFrame object instance reference.
+
+%% Cell type:code id: tags:
+
+``` python
+# .dropna will drop all rows that contain NaN in them
+df.dropna()
+```
+
+%% Cell type:markdown id: tags:
+
+### Review: `pandas.Series.apply(...)`
+Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
+- Applies the input function to every element of the `Series`.
+- Returns a new `Series` object instance reference.
+
+Let's apply a transformation function to the `mode` column `Series`:
+- mode = 1 means major modality (sounds happy)
+- mode = 0 means minor modality (sounds sad)
+
+%% Cell type:code id: tags:
+
+``` python
+def replace_mode(m):
+    if m == 1:
+        return "major"
+    else:
+        return "minor"
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"]
+```
+
+%% Cell type:markdown id: tags:
+
+### `lambda`
+
+Let's write a `lambda` function instead of the `replace_mode` function
+
+%% Cell type:code id: tags:
+
+``` python
+df["mode"].apply(???)
+```
+
+%% Cell type:markdown id: tags:
+
+Typically transformed columns are added as new columns within the DataFrame.
+Let's add a new `modified_mode` column.
+
+%% Cell type:code id: tags:
+
+``` python
+df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+#### Let's go back to the original table from the SQL database
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("SELECT * FROM spotify")
+df
+```
+
+%% Cell type:markdown id: tags:
+
+Extract just the "genre" and "duration_ms" columns from `df`.
+
+%% Cell type:code id: tags:
+
+``` python
+df[???]
+```
+
+%% Cell type:markdown id: tags:
+
+### `pandas.DataFrame.groupby(...)`
+
+Syntax: `DataFrame.groupby(<COLUMN>)`
+- Returns a reference to a `DataFrameGroupBy` object instance
+- The return value of `groupby` is only useful once you apply an aggregation method (such as `mean`, `max`, or `count`) to it
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, sorted in decreasing order of average duration?
+#### v1: using `df` (`pandas`) to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df[["genre", "duration_ms"]]
+```
+
+%% Cell type:markdown id: tags:
+
+To sanity-check the groups that `groupby` forms, call `value_counts` on the same column `Series`: it lists every genre along with the number of rows that fall into it.
+
+%% Cell type:code id: tags:
+
+``` python
+df["genre"].value_counts()
+```
+
+%% Cell type:markdown id: tags:
+
+### What is the average duration for each genre, sorted in decreasing order of average duration?
+#### v2: using SQL query to answer the question
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+avg_duration_per_genre = qry("""
+
+""")
+
+# How can we make the SQL query output exactly the same as the df.groupby output?
+avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
+avg_duration_per_genre
+```
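For comparison, here is one way the same question might be written in SQL, run against an in-memory SQLite database so the sketch is self-contained. The toy rows are made up, and the `qry` helper below is only an assumed stand-in for the notebook's own `qry`, which is defined elsewhere.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory "spotify" table; the values are made up for illustration.
conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "genre": ["Rap", "Pop", "Rap", "Pop", "Emo"],
    "duration_ms": [200000, 180000, 240000, 160000, 300000],
})
toy.to_sql("spotify", conn, index=False)

def qry(sql):
    # Assumed shape of the notebook's helper: run a query, return a DataFrame.
    return pd.read_sql(sql, conn)

avg_sql = qry("""
    SELECT genre, AVG(duration_ms) AS duration_ms
    FROM spotify
    GROUP BY genre
    ORDER BY AVG(duration_ms) DESC
""")

# SQL returns genre as an ordinary column; groupby returns it as the index.
# set_index closes that gap.
avg_sql = avg_sql.set_index("genre")
print(avg_sql)

conn.close()
```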
+
+%% Cell type:markdown id: tags:
+
+### What is the average speechiness for each mode, time signature pair?
+#### v1: pandas
+
+%% Cell type:code id: tags:
+
+``` python
+# use a list to name all the columns you want to group by
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# SQL equivalent query of the above Pandas query
+qry("""
+
+""")
+```
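Grouping by a pair of columns might look like the sketch below, shown in both pandas and SQL on made-up rows. The `toy` table and the `qry` helper are assumptions for illustration, not the notebook's own objects.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the spotify table; the values are made up for illustration.
toy = pd.DataFrame({
    "mode": [1, 1, 0, 0, 1],
    "time_signature": [4, 4, 4, 3, 3],
    "speechiness": [0.25, 0.75, 0.5, 0.5, 1.0],
})

# pandas: a list of column names produces one group per distinct pair.
pandas_avg = toy.groupby(["mode", "time_signature"])["speechiness"].mean()
print(pandas_avg)

# SQL: list both columns after GROUP BY.
conn = sqlite3.connect(":memory:")
toy.to_sql("spotify", conn, index=False)

def qry(sql):
    # Assumed shape of the notebook's helper: run a query, return a DataFrame.
    return pd.read_sql(sql, conn)

sql_avg = qry("""
    SELECT mode, time_signature, AVG(speechiness) AS speechiness
    FROM spotify
    GROUP BY mode, time_signature
""").set_index(["mode", "time_signature"])
print(sql_avg)

conn.close()
```

The pandas result is indexed by a two-level `MultiIndex`, which is why the SQL result is given the same two-column index before comparing.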
+
+%% Cell type:markdown id: tags:
+
+### Self-practice
+
+%% Cell type:markdown id: tags:
+
+### Which songs have a tempo greater than 150, and what are their genres?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+fast_songs =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+
+qry("""
+
+""")
+```
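One possible approach, sketched on made-up rows rather than the real table: filter with a boolean mask in pandas, or with a `WHERE` clause in SQL. The `toy` data and the `qry` helper are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the spotify table; the values are made up for illustration.
toy = pd.DataFrame({
    "song_name": ["a", "b", "c", "d"],
    "genre": ["Rap", "Pop", "Rap", "Emo"],
    "tempo": [120.0, 160.5, 150.0, 175.0],
})

# pandas: the comparison builds a True/False Series used to select rows.
fast_pandas = toy[toy["tempo"] > 150][["song_name", "genre"]]
print(fast_pandas)

# SQL: the same condition goes in the WHERE clause.
conn = sqlite3.connect(":memory:")
toy.to_sql("spotify", conn, index=False)

def qry(sql):
    # Assumed shape of the notebook's helper: run a query, return a DataFrame.
    return pd.read_sql(sql, conn)

fast_sql = qry("""
    SELECT song_name, genre
    FROM spotify
    WHERE tempo > 150
""")
print(fast_sql)

conn.close()
```

Note that a tempo of exactly 150 is excluded in both versions because the comparison is strict.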
+
+%% Cell type:markdown id: tags:
+
+### What is the sum of danceability and liveness for "Hiphop" genre songs?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+hiphop_songs =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+hiphop_songs = qry("""
+
+""")
+hiphop_songs
+```
+
+%% Cell type:markdown id: tags:
+
+### Find every song_name, ordered by ascending duration_ms. Exclude songs that don't have a song_name.
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+songs_by_duration =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+songs_by_duration = qry("""
+
+""")
+songs_by_duration
+```
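A hedged sketch of one way to do this on made-up rows: drop missing names with `notna` (pandas) or `IS NOT NULL` (SQL), then sort. The `toy` data and the `qry` helper are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the spotify table; one row deliberately has no song_name.
toy = pd.DataFrame({
    "song_name": ["a", None, "b", "c"],
    "duration_ms": [300, 100, 200, 400],
})

# pandas: keep rows whose song_name is present, then sort ascending.
named = toy[toy["song_name"].notna()]
by_duration_pandas = named.sort_values(by="duration_ms")[["song_name", "duration_ms"]]
print(by_duration_pandas)

# SQL: missing values are stored as NULL, so filter with IS NOT NULL.
conn = sqlite3.connect(":memory:")
toy.to_sql("spotify", conn, index=False)

def qry(sql):
    # Assumed shape of the notebook's helper: run a query, return a DataFrame.
    return pd.read_sql(sql, conn)

by_duration_sql = qry("""
    SELECT song_name, duration_ms
    FROM spotify
    WHERE song_name IS NOT NULL
    ORDER BY duration_ms ASC
""")
print(by_duration_sql)

conn.close()
```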
+
+%% Cell type:markdown id: tags:
+
+### How many distinct "genre" values are there in the dataset?
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+genres = qry("""
+
+""")
+```
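Counting distinct values could be sketched as follows, again on made-up rows. The `toy` data and the `qry` helper are assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the spotify table; the values are made up for illustration.
toy = pd.DataFrame({"genre": ["Rap", "Pop", "Rap", "Emo", "Pop"]})

# pandas: nunique counts the distinct non-missing values in a Series.
num_genres_pandas = toy["genre"].nunique()

# SQL: COUNT(DISTINCT ...) does the same inside the database.
conn = sqlite3.connect(":memory:")
toy.to_sql("spotify", conn, index=False)

def qry(sql):
    # Assumed shape of the notebook's helper: run a query, return a DataFrame.
    return pd.read_sql(sql, conn)

genres = qry("""
    SELECT COUNT(DISTINCT genre) AS num_genres
    FROM spotify
""")
# The query returns a one-row, one-column DataFrame; pull out the single value.
num_genres_sql = int(genres.iloc[0, 0])

print(num_genres_pandas, num_genres_sql)
conn.close()
```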
+
+%% Cell type:markdown id: tags:
+
+### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
+
+%% Cell type:code id: tags:
+
+``` python
+genre_groups =
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v1: pandas
+high_energy_songs = ???
+genre_groups = ???
+max_energy = ???
+max_energy["energy"]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+genre_counts = ???
+genre_counts["energy_max"] = max_energy["energy"]
+filtered_genre_counts = ???
+filtered_genre_counts
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# v2: SQL
+qry("""
+
+""")
+```
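The sketch below shows one reading of this question on made-up rows: filter rows first, then group, then keep only groups that are large enough. It assumes the song count refers to the songs that remain after the energy filter, and it uses a threshold of 2 instead of 2000 so the toy data produces a result. The `toy` data, `MIN_SONGS`, and the `qry` helper are all assumptions for illustration.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the spotify table; the values are made up for illustration.
toy = pd.DataFrame({
    "genre": ["Rap", "Rap", "Rap", "Pop", "Pop", "Emo", "Emo", "Emo"],
    "energy": [0.9, 0.7, 0.6, 0.8, 0.3, 0.95, 0.4, 0.2],
})
MIN_SONGS = 2  # stands in for 2000 on the real dataset

# pandas: row filter, then per-genre count and max, then group filter.
high_energy = toy[toy["energy"] > 0.5]
summary = high_energy.groupby("genre")["energy"].agg(["count", "max"])
result_pandas = summary[summary["count"] > MIN_SONGS]
print(result_pandas)

# SQL: WHERE filters rows before grouping, HAVING filters groups after.
conn = sqlite3.connect(":memory:")
toy.to_sql("spotify", conn, index=False)

def qry(sql):
    # Assumed shape of the notebook's helper: run a query, return a DataFrame.
    return pd.read_sql(sql, conn)

result_sql = qry("""
    SELECT genre, COUNT(*) AS song_count, MAX(energy) AS energy_max
    FROM spotify
    WHERE energy > 0.5
    GROUP BY genre
    HAVING COUNT(*) > 2
""")
print(result_sql)

conn.close()
```

The split between `WHERE` and `HAVING` mirrors the two pandas filters: one applies to individual songs, the other to whole genres.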
+
+%% Cell type:code id: tags:
+
+``` python
+# Close the database connection here
+```
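Closing is a single method call on the connection object. The sketch below uses a throwaway in-memory database (an assumption, standing in for the notebook's `conn`) to show that queries fail once the connection is closed.

```python
import sqlite3

# Throwaway in-memory database standing in for the notebook's connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# Release the database handle once all queries are done.
conn.close()

# Any further use of the closed connection raises an error.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(still_open)
```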
--- a/f22/meena_lec_notes/lec-35/spotify.db
+++ b/f22/meena_lec_notes/lec-35/spotify.db
--- a/f22/meena_lec_notes/lec-35/lec_35_plotting1_bar_plots.ipynb
+++ b/f22/meena_lec_notes/lec-35/lec_35_plotting1_bar_plots.ipynb