Compare revisions

f3bdeb7c
--- a/f22/andy_lec_notes/lec_35/lec35_plotting1_complete.ipynb
+++ b/f22/andy_lec_notes/lec_35/lec35_plotting1_complete.ipynb
--- a/f22/andy_lec_notes/lec_35/lec35_plotting1_template.ipynb
+++ b/f22/andy_lec_notes/lec_35/lec35_plotting1_template.ipynb
--- a/f22/andy_lec_notes/lec_35/readme.md
+++ b/f22/andy_lec_notes/lec_35/readme.md
--- a/f22/andy_lec_notes/lec_36/iris-flowers.db
+++ b/f22/andy_lec_notes/lec_36/iris-flowers.db
--- a/f22/andy_lec_notes/lec_36/iris.csv
+++ b/f22/andy_lec_notes/lec_36/iris.csv
--- a/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots.ipynb
+++ b/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots.ipynb
--- a/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots_template.ipynb
+++ b/f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots_template.ipynb
+%% Cell type:code id: tags:
+
+``` python
+# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
+from IPython.display import display, HTML
+display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+%matplotlib inline
+```
+
+%% Cell type:code id: tags:
+
+``` python
+import pandas as pd
+from pandas import DataFrame, Series
+
+import sqlite3
+import os
+
+import matplotlib
+# new import statement
+from matplotlib import pyplot as plt
+
+import requests
+matplotlib.rcParams["font.size"] = 12
+```
+
+%% Cell type:markdown id: tags:
+
+#### Wrapping up bus dataset example
+
+%% Cell type:markdown id: tags:
+
+#### What are the top routes, and how many people ride them daily?
+
+%% Cell type:code id: tags:
+
+``` python
+path = "bus.db"
+# assert existence of path
+assert os.path.exists(path)
+
+# establish connection to bus.db
+conn = sqlite3.connect(path)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.read_sql("""
+SELECT Route, SUM(DailyBoardings) AS daily
+FROM boarding
+GROUP BY Route
+ORDER BY daily DESC
+""", conn)
+
+df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's extract daily column from df
+df["daily"]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's create a bar plot from daily column Series
+df["daily"].plot.bar()
+
+# Oops, wrong x-axis labels! The bars are labeled by row index, not by Route.
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = ???
+
+# let's plot for top 5 routes alone
+???
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's use slicing to aggregate the rest of the data
+```
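
One possible way to finish the bus example is sketched below with made-up boarding numbers, since the contents of `bus.db` are not shown here. The idea is to index the data by `Route` so the bars get route labels, keep the top 5 rows, and sum everything after them into a single `other` entry. The route numbers and counts are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-in for the query result, already sorted by daily boardings (descending)
df = pd.DataFrame({
    "Route": [80, 2, 6, 10, 3, 4, 56],
    "daily": [10211, 4808, 4537, 4425, 2708, 2656, 2181],
})

# Use Route as the index so a bar plot labels bars by route, not by row number
df = df.set_index("Route")

# Slice off the top 5 routes, then aggregate every remaining row into one value
s = df["daily"].iloc[:5].copy()
s["other"] = df["daily"].iloc[5:].sum()

print(s)
```

The resulting Series `s` has six entries and can be passed to the bar-plotting cell that follows.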
+
+%% Cell type:code id: tags:
+
+``` python
+# let's plot the bars
+ax = (s / 1000).plot.bar(color = "k")
+ax.set_ylabel("Rides / Day (Thousands)")
+None
+```
+
+%% Cell type:code id: tags:
+
+``` python
+conn.close()
+```
+
+%% Cell type:markdown id: tags:
+
+### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
+- This dataset is commonly used in introductory machine learning courses
+- You can train an ML algorithm to use the values to predict the class of iris
+- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 1:  Downloading IRIS dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
+
+%% Cell type:code id: tags:
+
+``` python
+# use requests to get this URL
+url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
+response = ???
+
+# check that the request was successful
+???
+
+# open a file called "iris.csv" for writing the data locally
+file_obj = open("iris.csv", ???)
+
+# write the text of response to the file object
+file_obj.write(???)
+
+# close the file object
+file_obj.close()
+
+# Look at the file you downloaded. What's wrong with it?
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 2: Making a DataFrame
+
+%% Cell type:code id: tags:
+
+``` python
+# read the "iris.csv" file into a Pandas dataframe
+iris_df = ???
+
+# display the head of the data frame
+iris_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 3: Our CSV file has no header. Let's add column names.
+- Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
+
+%% Cell type:code id: tags:
+
+``` python
+# Attribute Information:
+# 1. sepal length in cm
+# 2. sepal width in cm
+# 3. petal length in cm
+# 4. petal width in cm
+# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
+
+# These should be our headers
+# ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
+
+iris_df = pd.read_csv("iris.csv",
+                 ???)
+iris_df.head()
+```
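
As a hedged sketch of how `read_csv` can handle a headerless file, the example below reads two rows in the same layout as `iris.data` from an in-memory string instead of the downloaded file. It uses the `names` parameter from the pandas documentation linked above; the variable names `raw`, `cols`, and `sample_df` are made up for this illustration.

```python
import io
import pandas as pd

# Two sample rows in the same headerless layout as iris.data
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

cols = ["sep-length", "sep-width", "pet-length", "pet-width", "class"]

# Without names=..., read_csv would consume the first data row as the header.
# When names is passed and header is left unset, no row is treated as a header.
sample_df = pd.read_csv(io.StringIO(raw), names=cols)

print(sample_df)
```

Both data rows survive, and the columns carry the labels from `cols`.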
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 4: Connect to our database version of this data!
+
+%% Cell type:code id: tags:
+
+``` python
+iris_conn = sqlite3.connect("iris-flowers.db")
+pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
+Break any ties by ordering by the shortest sepal width.
+
+%% Cell type:code id: tags:
+
+``` python
+pd.read_sql("""
+    SELECT
+    FROM
+    WHERE
+    ORDER BY
+    LIMIT 10
+""", iris_conn)
+```
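
The tie-breaking rule can be sketched with a small in-memory database. The table name `iris` and the `class` column match the queries used later in this notebook, but the column names `sep-length` and `sep-width` and all of the rows are assumptions, because the schema of `iris-flowers.db` is not shown here.

```python
import sqlite3

# In-memory stand-in for iris-flowers.db; column names and rows are assumed
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE iris ("sep-length" REAL, "sep-width" REAL, class TEXT)')
rows = [
    (5.8, 4.0, "Iris-setosa"),
    (5.7, 4.4, "Iris-setosa"),
    (5.7, 3.8, "Iris-setosa"),
    (5.1, 3.5, "Iris-setosa"),
    (7.0, 3.2, "Iris-versicolor"),
]
conn.executemany("INSERT INTO iris VALUES (?, ?, ?)", rows)

# Longest sepal first; among equal lengths, the narrowest sepal comes first
query = """
    SELECT *
    FROM iris
    WHERE class = 'Iris-setosa'
    ORDER BY "sep-length" DESC, "sep-width" ASC
    LIMIT 10
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

Column names that contain a hyphen have to be double-quoted in SQLite, which is why `"sep-length"` and `"sep-width"` are quoted in the query.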
+
+%% Cell type:markdown id: tags:
+
+# Lecture 36:  Scatter Plots
+**Learning Objectives**
+- Set the marker, color, and size of scatter plot data
+- Calculate correlation between DataFrame columns
+- Use subplots to group scatterplot data
+
+%% Cell type:markdown id: tags:
+
+## Set the marker, color, and size of scatter plot data
+
+To start, let's look at some made-up data about trees.
+The city of Madison maintains a database of all the trees it cares for.
+
+%% Cell type:code id: tags:
+
+``` python
+trees = [
+    {"age": 1, "height": 1.5, "diameter": 0.8},
+    {"age": 1, "height": 1.9, "diameter": 1.2},
+    {"age": 1, "height": 1.8, "diameter": 1.4},
+    {"age": 2, "height": 1.8, "diameter": 0.9},
+    {"age": 2, "height": 2.5, "diameter": 1.5},
+    {"age": 2, "height": 3, "diameter": 1.8},
+    {"age": 2, "height": 2.9, "diameter": 1.7},
+    {"age": 3, "height": 3.2, "diameter": 2.1},
+    {"age": 3, "height": 3, "diameter": 2},
+    {"age": 3, "height": 2.4, "diameter": 2.2},
+    {"age": 2, "height": 3.1, "diameter": 2.9},
+    {"age": 4, "height": 2.5, "diameter": 3.1},
+    {"age": 4, "height": 3.9, "diameter": 3.1},
+    {"age": 4, "height": 4.9, "diameter": 2.8},
+    {"age": 4, "height": 5.2, "diameter": 3.5},
+    {"age": 4, "height": 4.8, "diameter": 4},
+]
+trees_df = DataFrame(trees)
+trees_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### Scatter Plots
+We can make a scatter plot of a DataFrame using the following function...
+
+`df_name.plot.scatter(x = "x_col_name", y = "y_col_name", \
+                     color = "red", marker = "*", s = 50)`
+
+%% Cell type:markdown id: tags:
+
+Plot the trees data comparing a tree's age to its height...
+ - What is `df_name`?
+ - What is `x_col_name`?
+ - What is `y_col_name`?
+
+%% Cell type:code id: tags:
+
+``` python
+# TODO: change y to diameter
+```
+
+%% Cell type:markdown id: tags:
+
+Now plot with a little more beautification...
+ - Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
+ - Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
+ - Change the size (any int)
+
+%% Cell type:code id: tags:
+
+``` python
+# Plot with some more beautification options.
+trees_df.plot.scatter(x = "age", y = "height", color = "r",  marker = "D", s = 50)
+# D for diamond
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Add a title to your plot.
+ax = trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
+# D for diamond
+ax.set_title("Tree Age vs Height")
+```
+
+%% Cell type:markdown id: tags:
+
+#### Correlation
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the correlation between our DataFrame columns?
+corr_df = trees_df.corr()
+corr_df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the correlation between age and height? (don't use .iloc)
+# Looking the value up by column and row labels is not considered hardcoding here
+corr_df['age']['height']
+```
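
To make the number returned by `corr()` less of a black box, here is a minimal sketch that computes the Pearson correlation by hand for four of the tree rows and compares it with the pandas result. Pearson is the default method of `DataFrame.corr`; the names `small_df`, `manual_r`, and `pandas_r` are made up for this illustration.

```python
import pandas as pd

# Four of the tree rows from the data above
small_df = pd.DataFrame({
    "age":    [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
})

x, y = small_df["age"], small_df["height"]
dx, dy = x - x.mean(), y - y.mean()

# Pearson r: sum of products of deviations, divided by the product of
# the square roots of the summed squared deviations
manual_r = (dx * dy).sum() / ((dx ** 2).sum() ** 0.5 * (dy ** 2).sum() ** 0.5)

# Label-based lookup in the correlation matrix
pandas_r = small_df.corr().loc["age", "height"]

print(manual_r, pandas_r)
```

The two values agree up to floating-point error, and both are close to 1 because height rises steadily with age in these rows.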
+
+%% Cell type:markdown id: tags:
+
+### Varying Stylistic Parameters
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 1:
+trees_df.plot.scatter(x = "age", y = "height",  marker = "H", s = "diameter")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 2:
+# passing a Series instead of a column name lets you scale the marker sizes
+trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = trees_df["diameter"] * 50)
+```
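
The sketch below shows what the scaled sizes look like once they reach matplotlib, using three of the tree rows. It assumes a non-interactive backend (`Agg`) so that it also runs outside a notebook; inside a notebook with `%matplotlib inline`, that line can be left out. `small_df` and `sizes` are names made up for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Three of the tree rows from the data above
small_df = pd.DataFrame({
    "age": [1, 2, 4],
    "height": [1.5, 2.5, 4.8],
    "diameter": [0.8, 1.5, 4.0],
})

# s is the marker area in points squared, so raw diameters would give tiny markers;
# multiplying the column spreads the sizes out enough to see the differences
sizes = small_df["diameter"] * 50
ax = small_df.plot.scatter(x="age", y="height", marker="H", s=sizes)

# The sizes actually stored on the plotted point collection
print(ax.collections[0].get_sizes())
```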
+
+%% Cell type:markdown id: tags:
+
+## Use subplots to group scatterplot data
+
+%% Cell type:markdown id: tags:
+
+### Re-visit the Iris Data
+
+%% Cell type:code id: tags:
+
+``` python
+iris_df
+```
+
+%% Cell type:markdown id: tags:
+
+### How do we create a *scatter plot* for various *class types*?
+First, gather all the class types.
+
+%% Cell type:code id: tags:
+
+``` python
+# In Pandas
+varietes = list(set(iris_df["class"]))
+varietes
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# In SQL
+varietes = list(pd.read_sql("""
+    SELECT DISTINCT class
+    FROM iris
+""", iris_conn)["class"])
+varietes
+```
+
+%% Cell type:markdown id: tags:
+
+In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
+
+%% Cell type:code id: tags:
+
+``` python
+# If you want to continue using SQL instead, don't close the connection!
+iris_conn.close()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Change this scatter plot so that the data is only for class ='Iris-setosa'
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Write a for loop that iterates through each variety in varietes
+# and makes a plot for only that class
+
+# For each class add a color and a marker style
+colors = ["blue", "green", "red"]
+markers = ["o", "^", "v"]
+
+for i in range(len(varietes)):
+    ???
+```
+
+%% Cell type:markdown id: tags:
+
+Did you notice that it made 3 plots?!?! What's deceiving about this?
+
+%% Cell type:markdown id: tags:
+
+### Plotting methods draw on a subplot (an AxesSubplot), which can be passed with the keyword `ax`
+1. If an AxesSubplot is passed as `ax`, the plot is drawn in that subplot
+2. If `ax` is None, a new AxesSubplot is created
+3. The AxesSubplot that was used is returned
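
The three rules above can be sketched on a small made-up DataFrame. This is one possible pattern, not necessarily the one used in lecture: start with `None`, and feed the returned subplot back into the next call. The data, the `group` column, and the `colors` mapping are assumptions for illustration, and the `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

demo_df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [1, 4, 9, 16],
    "group": ["a", "a", "b", "b"],
})

plot_area = None  # no subplot yet, so the first call creates one
colors = {"a": "blue", "b": "red"}
for name in ["a", "b"]:
    subset = demo_df[demo_df["group"] == name]
    # Passing ax=plot_area reuses the subplot returned by the previous call
    plot_area = subset.plot.scatter(x="x", y="y", color=colors[name],
                                    label=name, ax=plot_area)

# One point collection per group, all on the same subplot
print(len(plot_area.collections))
```

Because every call after the first receives the same subplot, both groups share one set of axes and can be compared directly.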
+
+%% Cell type:code id: tags:
+
+``` python
+# complete this code to make 3 plots in one
+
+plot_area = None   # don't change this...look at this variable in line 12
+colors = ["blue", "green", "red"]
+markers = ["o", "^", "v"]
+for i in range(len(varietes)):
+    ???
+```
+
+%% Cell type:markdown id: tags:
+
+### Let's focus on "Iris-virginica" data
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica = ???
+assert len(iris_virginica) == 50
+iris_virginica.head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica.plot.scatter(x = "pet-width", y = "pet-length")
+```
+
+%% Cell type:markdown id: tags:
+
+### Let's learn about *xlim* and *ylim*
+- Allows us to set x-axis and y-axis limits
+- Takes either a single value (LOWER-BOUND) or a tuple containing two values (LOWER-BOUND, UPPER-BOUND)
+- You need to be careful about setting the UPPER-BOUND
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica.plot.scatter(x = "pet-width", y = "pet-length", xlim = ???, ylim = ???)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                    xlim = (0, 6), ylim = (0, 6),
+                    figsize = (3, 3))
+
+# What is wrong with this plot?
+```
+
+%% Cell type:markdown id: tags:
+
+What is the maximum `pet-length`?
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax.get_ylim()
+```
+
+%% Cell type:markdown id: tags:
+
+Let's include assert statements to make sure we don't crop the plot!
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                     xlim = (0, 6), ylim = (0, 6),
+                     figsize = (3, 3))
+assert iris_virginica["pet-length"].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:markdown id: tags:
+
+### Now let's try all 4 assert statements
+
+```
+assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
+assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
+assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
+assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                     xlim = (0, 7), ylim = (0, 7),
+                     figsize = (3, 3))
+assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
+assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
+assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
+assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
+```
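
The four checks can also be wrapped in a small helper. The sketch below is one way to do that, using a hypothetical function `is_cropped` and two made-up rows rather than the real `iris_virginica` data. The `Agg` backend line is only there so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

def is_cropped(ax, df, x_col, y_col):
    """Return True if any point in df falls outside the axis limits of ax."""
    x_lo, x_hi = ax.get_xlim()
    y_lo, y_hi = ax.get_ylim()
    return bool(df[x_col].min() < x_lo or df[x_col].max() > x_hi
                or df[y_col].min() < y_lo or df[y_col].max() > y_hi)

# Made-up rows; the second point has a pet-length above 6
demo_df = pd.DataFrame({"pet-width": [1.8, 2.5], "pet-length": [5.1, 6.9]})

tight = demo_df.plot.scatter(x="pet-width", y="pet-length", xlim=(0, 6), ylim=(0, 6))
roomy = demo_df.plot.scatter(x="pet-width", y="pet-length", xlim=(0, 7), ylim=(0, 7))

print(is_cropped(tight, demo_df, "pet-width", "pet-length"))
print(is_cropped(roomy, demo_df, "pet-width", "pet-length"))
```

Unlike the version that looks columns up through `ax.get_xlabel()`, this helper takes the column names explicitly, so it keeps working if the axis labels are later changed to something more readable.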
+
+%% Cell type:markdown id: tags:
+
+### Time-Permitting
+Plot this data in an interesting/meaningful way & identify any correlations.
+
+%% Cell type:code id: tags:
+
+``` python
+students = pd.DataFrame({
+    "name": [
+        "Cole",
+        "Cynthia",
+        "Alice",
+        "Seth"
+    ],
+    "grade": [
+        "C",
+        "AB",
+        "B",
+        "BC"
+    ],
+    "gpa": [
+        2.0,
+        3.5,
+        3.0,
+        2.5
+    ],
+    "attendance": [
+        4,
+        11,
+        10,
+        6
+    ],
+    "height": [
+        68,
+        66,
+        60,
+        72
+    ]
+})
+students
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Min, Max, and Overall Difference in Student Height
+min_height = students["height"].min()
+max_height = students["height"].max()
+diff_height = max_height - min_height
+
+# Normalize student heights to a scale of [0, 1] (black to white)
+height_colors = (students["height"] - min_height) / diff_height
+
+# Rescale the normalized heights to [0, 0.5] (black to gray)
+height_colors = height_colors / 2
+
+# Color must be a string (e.g. c='0.34')
+height_colors = height_colors.astype("string")
+
+height_colors
+```
+
+%% Cell type:code id: tags:
+
+``` python
+students.plot.scatter(x="attendance", y="gpa", c=height_colors)
+```
+
+%% Cell type:code id: tags:
+
+``` python
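+# corr() needs numeric columns; pandas 2.0+ raises an error on string columns such as name and grade
+students[["gpa", "attendance", "height"]].corr()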
+```
+%% Cell type:code id: tags:
+
+``` python
+# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
+from IPython.display import display, HTML
+display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
+```
+
+%% Cell type:code id: tags:
+
+``` python
+%matplotlib inline
+```
+
+%% Cell type:code id: tags:
+
+``` python
+import pandas as pd
+from pandas import DataFrame, Series
+
+import sqlite3
+import os
+
+import matplotlib
+# new import statement
+from matplotlib import pyplot as plt
+
+import requests
+matplotlib.rcParams["font.size"] = 12
+```
+
+%% Cell type:markdown id: tags:
+
+#### Wrapping up bus dataset example
+
+%% Cell type:markdown id: tags:
+
+#### What are the top routes, and how many people ride them daily?
+
+%% Cell type:code id: tags:
+
+``` python
+path = "bus.db"
+# assert existence of path
+assert os.path.exists(path)
+
+# establish connection to bus.db
+conn = sqlite3.connect(path)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = pd.read_sql("""
+SELECT Route, SUM(DailyBoardings) AS daily
+FROM boarding
+GROUP BY Route
+ORDER BY daily DESC
+""", conn)
+
+df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's extract daily column from df
+df["daily"]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's create a bar plot from daily column Series
+df["daily"].plot.bar()
+
+# Oops, wrong x-axis labels! The bars are labeled by row index, not by Route.
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+df = ???
+
+# let's plot for top 5 routes alone
+???
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's use slicing to aggregate the rest of the data
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# let's plot the bars
+ax = (s / 1000).plot.bar(color = "k")
+ax.set_ylabel("Rides / Day (Thousands)")
+None
+```
+
+%% Cell type:code id: tags:
+
+``` python
+conn.close()
+```
+
+%% Cell type:markdown id: tags:
+
+### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
+- This dataset is commonly used in introductory machine learning courses
+- You can train an ML algorithm to use the values to predict the class of iris
+- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 1:  Downloading IRIS dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
+
+%% Cell type:code id: tags:
+
+``` python
+# use requests to get this URL
+url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
+response = ???
+
+# check that the request was successful
+???
+
+# open a file called "iris.csv" for writing the data locally
+file_obj = open("iris.csv", ???)
+
+# write the text of response to the file object
+file_obj.write(???)
+
+# close the file object
+file_obj.close()
+
+# Look at the file you downloaded. What's wrong with it?
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 2: Making a DataFrame
+
+%% Cell type:code id: tags:
+
+``` python
+# read the "iris.csv" file into a Pandas dataframe
+iris_df = ???
+
+# display the head of the data frame
+iris_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 3: Our CSV file has no header. Let's add column names.
+- Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
+
+%% Cell type:code id: tags:
+
+``` python
+# Attribute Information:
+# 1. sepal length in cm
+# 2. sepal width in cm
+# 3. petal length in cm
+# 4. petal width in cm
+# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
+
+# These should be our headers
+# ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
+
+iris_df = pd.read_csv("iris.csv",
+                 ???)
+iris_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 4: Connect to our database version of this data!
+
+%% Cell type:code id: tags:
+
+``` python
+iris_conn = sqlite3.connect("iris-flowers.db")
+pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
+```
+
+%% Cell type:markdown id: tags:
+
+#### Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
+Break any ties by ordering by the shortest sepal width.
+
+%% Cell type:code id: tags:
+
+``` python
+pd.read_sql("""
+    SELECT
+    FROM
+    WHERE
+    ORDER BY
+    LIMIT 10
+""", iris_conn)
+```
+
+%% Cell type:markdown id: tags:
+
+# Lecture 36:  Scatter Plots
+**Learning Objectives**
+- Set the marker, color, and size of scatter plot data
+- Calculate correlation between DataFrame columns
+- Use subplots to group scatterplot data
+
+%% Cell type:markdown id: tags:
+
+## Set the marker, color, and size of scatter plot data
+
+To start, let's look at some made-up data about trees.
+The city of Madison maintains a database of all the trees it cares for.
+
+%% Cell type:code id: tags:
+
+``` python
+trees = [
+    {"age": 1, "height": 1.5, "diameter": 0.8},
+    {"age": 1, "height": 1.9, "diameter": 1.2},
+    {"age": 1, "height": 1.8, "diameter": 1.4},
+    {"age": 2, "height": 1.8, "diameter": 0.9},
+    {"age": 2, "height": 2.5, "diameter": 1.5},
+    {"age": 2, "height": 3, "diameter": 1.8},
+    {"age": 2, "height": 2.9, "diameter": 1.7},
+    {"age": 3, "height": 3.2, "diameter": 2.1},
+    {"age": 3, "height": 3, "diameter": 2},
+    {"age": 3, "height": 2.4, "diameter": 2.2},
+    {"age": 2, "height": 3.1, "diameter": 2.9},
+    {"age": 4, "height": 2.5, "diameter": 3.1},
+    {"age": 4, "height": 3.9, "diameter": 3.1},
+    {"age": 4, "height": 4.9, "diameter": 2.8},
+    {"age": 4, "height": 5.2, "diameter": 3.5},
+    {"age": 4, "height": 4.8, "diameter": 4},
+]
+trees_df = DataFrame(trees)
+trees_df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### Scatter Plots
+We can make a scatter plot of a DataFrame using the following function...
+
+`df_name.plot.scatter(x = "x_col_name", y = "y_col_name", \
+                     color = "red", marker = "*", s = 50)`
+
+%% Cell type:markdown id: tags:
+
+Plot the trees data comparing a tree's age to its height...
+ - What is `df_name`?
+ - What is `x_col_name`?
+ - What is `y_col_name`?
+
+%% Cell type:code id: tags:
+
+``` python
+# TODO: change y to diameter
+```
+
+%% Cell type:markdown id: tags:
+
+Now plot with a little more beautification...
+ - Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
+ - Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
+ - Change the size (any int)
+
+%% Cell type:code id: tags:
+
+``` python
+# Plot with some more beautification options.
+trees_df.plot.scatter(x = "age", y = "height", color = "r",  marker = "D", s = 50)
+# D for diamond
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Add a title to your plot.
+ax = trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
+# D for diamond
+ax.set_title("Tree Age vs Height")
+```
+
+%% Cell type:markdown id: tags:
+
+#### Correlation
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the correlation between our DataFrame columns?
+corr_df = trees_df.corr()
+corr_df
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# What is the correlation between age and height? (don't use .iloc)
+# Looking the value up by column and row labels is not considered hardcoding here
+corr_df['age']['height']
+```
+
+%% Cell type:markdown id: tags:
+
+### Varying Stylistic Parameters
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 1:
+trees_df.plot.scatter(x = "age", y = "height",  marker = "H", s = "diameter")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Option 2:
+# passing a Series instead of a column name lets you scale the marker sizes
+trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = trees_df["diameter"] * 50)
+```
+
+%% Cell type:markdown id: tags:
+
+## Use subplots to group scatterplot data
+
+%% Cell type:markdown id: tags:
+
+### Re-visit the Iris Data
+
+%% Cell type:code id: tags:
+
+``` python
+iris_df
+```
+
+%% Cell type:markdown id: tags:
+
+### How do we create a *scatter plot* for various *class types*?
+First, gather all the class types.
+
+%% Cell type:code id: tags:
+
+``` python
+# In Pandas
+varietes = list(set(iris_df["class"]))
+varietes
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# In SQL
+varietes = list(pd.read_sql("""
+    SELECT DISTINCT class
+    FROM iris
+""", iris_conn)["class"])
+varietes
+```
+
+%% Cell type:markdown id: tags:
+
+In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
+
+%% Cell type:code id: tags:
+
+``` python
+# If you want to continue using SQL instead, don't close the connection!
+iris_conn.close()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Change this scatter plot so that the data is only for class ='Iris-setosa'
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Write a for loop that iterates through each variety in varietes
+# and makes a plot for only that class
+
+# For each class add a color and a marker style
+colors = ["blue", "green", "red"]
+markers = ["o", "^", "v"]
+
+for i in range(len(varietes)):
+    ???
+```
+
+%% Cell type:markdown id: tags:
+
+Did you notice that it made 3 plots?!?! What's deceiving about this?
+
+%% Cell type:markdown id: tags:
+
+### Plotting methods draw on a subplot (an AxesSubplot), which can be passed with the keyword `ax`
+1. If an AxesSubplot is passed as `ax`, the plot is drawn in that subplot
+2. If `ax` is None, a new AxesSubplot is created
+3. The AxesSubplot that was used is returned
+
+%% Cell type:code id: tags:
+
+``` python
+# complete this code to make 3 plots in one
+
+plot_area = None   # don't change this...look at this variable in line 12
+colors = ["blue", "green", "red"]
+markers = ["o", "^", "v"]
+for i in range(len(varietes)):
+    ???
+```
+
+%% Cell type:markdown id: tags:
+
+### Let's focus on "Iris-virginica" data
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica = ???
+assert len(iris_virginica) == 50
+iris_virginica.head()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica.plot.scatter(x = "pet-width", y = "pet-length")
+```
+
+%% Cell type:markdown id: tags:
+
+### Let's learn about *xlim* and *ylim*
+- Allows us to set x-axis and y-axis limits
+- Takes either a single value (LOWER-BOUND) or a tuple containing two values (LOWER-BOUND, UPPER-BOUND)
+- You need to be careful about setting the UPPER-BOUND
+
+%% Cell type:code id: tags:
+
+``` python
+iris_virginica.plot.scatter(x = "pet-width", y = "pet-length", xlim = ???, ylim = ???)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                    xlim = (0, 6), ylim = (0, 6),
+                    figsize = (3, 3))
+
+# What is wrong with this plot?
+```
+
+%% Cell type:markdown id: tags:
+
+What is the maximum `pet-length`?
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax.get_ylim()
+```
+
+%% Cell type:markdown id: tags:
+
+Let's include assert statements to make sure we don't crop the plot!
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                     xlim = (0, 6), ylim = (0, 6),
+                     figsize = (3, 3))
+assert iris_virginica["pet-length"].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:markdown id: tags:
+
+### Now let's try all 4 assert statements
+
+```
+assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
+assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
+assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
+assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:code id: tags:
+
+``` python
+ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
+                     xlim = (0, 7), ylim = (0, 7),
+                     figsize = (3, 3))
+assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
+assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
+assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
+assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
+```
+
+%% Cell type:markdown id: tags:
+
+### Time-Permitting
+Plot this data in an interesting/meaningful way & identify any correlations.
+
+%% Cell type:code id: tags:
+
+``` python
+students = pd.DataFrame({
+    "name": [
+        "Cole",
+        "Cynthia",
+        "Alice",
+        "Seth"
+    ],
+    "grade": [
+        "C",
+        "AB",
+        "B",
+        "BC"
+    ],
+    "gpa": [
+        2.0,
+        3.5,
+        3.0,
+        2.5
+    ],
+    "attendance": [
+        4,
+        11,
+        10,
+        6
+    ],
+    "height": [
+        68,
+        66,
+        60,
+        72
+    ]
+})
+students
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# Min, Max, and Overall Difference in Student Height
+min_height = students["height"].min()
+max_height = students["height"].max()
+diff_height = max_height - min_height
+
+# Normalize student heights to a scale of [0, 1] (black to white)
+height_colors = (students["height"] - min_height) / diff_height
+
+# Rescale the normalized heights to [0, 0.5] (black to gray)
+height_colors = height_colors / 2
+
+# Color must be a string (e.g. c='0.34')
+height_colors = height_colors.astype("string")
+
+height_colors
+```
+
+%% Cell type:code id: tags:
+
+``` python
+students.plot.scatter(x="attendance", y="gpa", c=height_colors)
+```
+
+%% Cell type:code id: tags:
+
+``` python
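+# corr() needs numeric columns; pandas 2.0+ raises an error on string columns such as name and grade
+students[["gpa", "attendance", "height"]].corr()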
+```
--- a/f22/andy_lec_notes/lec_36/readme.md
+++ b/f22/andy_lec_notes/lec_36/readme.md
--- a/f22/andy_lec_notes/lec_37/fire_hydrants.csv
+++ b/f22/andy_lec_notes/lec_37/fire_hydrants.csv
--- a/f22/andy_lec_notes/lec_37/lec37_plotting3_complete.ipynb
+++ b/f22/andy_lec_notes/lec_37/lec37_plotting3_complete.ipynb
--- a/f22/andy_lec_notes/lec_37/lec37_plotting3_template.ipynb
+++ b/f22/andy_lec_notes/lec_37/lec37_plotting3_template.ipynb
--- a/f22/andy_lec_notes/lec_37/readme.md
+++ b/f22/andy_lec_notes/lec_37/readme.md
--- a/f22/andy_lec_notes/lec_38/lec38_plotting4_complete.ipynb
+++ b/f22/andy_lec_notes/lec_38/lec38_plotting4_complete.ipynb
--- a/f22/andy_lec_notes/lec_38/lec38_plotting4_template.ipynb
+++ b/f22/andy_lec_notes/lec_38/lec38_plotting4_template.ipynb
--- a/f22/andy_lec_notes/lec_38/readme.md
+++ b/f22/andy_lec_notes/lec_38/readme.md
--- a/f22/andy_lec_notes/lec_36/lec36_plotting2_850.ipynb
+++ b/f22/andy_lec_notes/lec_36/lec36_plotting2_850.ipynb
-%% Cell type:code id: tags:
-
-``` python
-# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
-from IPython.core.display import display, HTML
-display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
-```
-
-%% Output
-
-
-%% Cell type:code id: tags:
-
-``` python
-import pandas as pd
-from pandas import DataFrame, Series
-
-import sqlite3
-import os
-
-import matplotlib
-from matplotlib import pyplot as plt
-
-import requests
-matplotlib.rcParams["font.size"] = 12
-```
-
-%% Cell type:markdown id: tags:
-
-### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This set of data is used in beginning Machine Learning Courses
- You can train a ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 1:  Requests and file writing
-
-# use requests to get this file  "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
-response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
-
-# check that the request was successful
-response.raise_for_status()
-
-# open a file called "iris.csv" for writing the data locally to avoid spamming their server
-file_obj = open("iris.csv", "w")
-
-# write the text of response to the file object
-file_obj.write(response.text)
-
-# close the file object
-file_obj.close()
-
-# Look at the file you downloaded. What's wrong with it?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 2:  Making a DataFrame
-
-# read the "iris.csv" file into a Pandas dataframe
-
-# display the head of the data frame
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 3: Our CSV file has no header....let's add column names.
-#           Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
-
-# Attribute Information:
-# 1. sepal length in cm
-# 2. sepal width in cm
-# 3. petal length in cm
-# 4. petal width in cm
-# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
-
-# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 4: Connect to our database version of this data
-iris_conn = sqlite3.connect("iris-flowers.db")
-
-# find out the name of the table
-pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
-#           Break any ties by ordering by the shortest sepal width.
-
-pd.read_sql("""
-
-""", iris_conn)
-```
-
-%% Cell type:markdown id: tags:
-
-# Lecture 36:  Scatter Plots
-**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
-
-%% Cell type:markdown id: tags:
-
-## Set the marker, color, and size of scatter plot data
-
-To start, let's look at some made-up data about Trees.
-The city of Madison maintains a database of all the trees they care for.
-
-%% Cell type:code id: tags:
-
-``` python
-trees = [
-    {"age": 1, "height": 1.5, "diameter": 0.8},
-    {"age": 1, "height": 1.9, "diameter": 1.2},
-    {"age": 1, "height": 1.8, "diameter": 1.4},
-    {"age": 2, "height": 1.8, "diameter": 0.9},
-    {"age": 2, "height": 2.5, "diameter": 1.5},
-    {"age": 2, "height": 3, "diameter": 1.8},
-    {"age": 2, "height": 2.9, "diameter": 1.7},
-    {"age": 3, "height": 3.2, "diameter": 2.1},
-    {"age": 3, "height": 3, "diameter": 2},
-    {"age": 3, "height": 2.4, "diameter": 2.2},
-    {"age": 2, "height": 3.1, "diameter": 2.9},
-    {"age": 4, "height": 2.5, "diameter": 3.1},
-    {"age": 4, "height": 3.9, "diameter": 3.1},
-    {"age": 4, "height": 4.9, "diameter": 2.8},
-    {"age": 4, "height": 5.2, "diameter": 3.5},
-    {"age": 4, "height": 4.8, "diameter": 4},
-]
-trees_df = DataFrame(trees)
-trees_df.head()
-```
-
-%% Cell type:markdown id: tags:
-
-### Scatter Plots
-We can make a scatter plot of a DataFrame using the following function...
-
-`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
-
-Plot the trees data comparing a tree's age to its height...
- - What is `df_name`?
- - What is `x_col_name`?
- - What is `y_col_name`?
-
-%% Cell type:code id: tags:
-
-``` python
-```
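A minimal sketch of the call described above, using a shortened copy of the trees data so the cell runs on its own (`trees_small` is a name made up for this example). Here `df_name` is the DataFrame, `x_col_name` is `"age"`, and `y_col_name` is `"height"`.

```python
import pandas as pd

# A shortened copy of the trees data (four of the rows from the list above)
trees_small = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
    "diameter": [0.8, 1.5, 2.1, 2.8],
})

# plot.scatter returns the matplotlib Axes it drew on
ax = trees_small.plot.scatter(x="age", y="height", color="peachpuff")
```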
-
-%% Cell type:markdown id: tags:
-
-Now plot with a little more beautification...
- - Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- - Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- - Change the size (any int)
-
-%% Cell type:code id: tags:
-
-``` python
-# Plot with some more beautification options.
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Add a title to your plot.
-```
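One possible version of the two cells above, again on a shortened copy of the trees data (`trees_small` is made up for this example). The color, marker, size, and title text are arbitrary choices; any named color, marker code, or int size should work the same way.

```python
import pandas as pd

# A shortened copy of the trees data (four of the rows from the list above)
trees_small = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
    "diameter": [0.8, 1.5, 2.1, 2.8],
})

# marker and s (size) are passed through to matplotlib's scatter
ax = trees_small.plot.scatter(x="age", y="height",
                              color="forestgreen", marker="^", s=80)

# The returned Axes object is what we use to add a title
ax.set_title("Tree height by age")
```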
-
-%% Cell type:markdown id: tags:
-
-#### Correlation
-
-%% Cell type:code id: tags:
-
-``` python
-# What is the correlation between our DataFrame columns?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# What is the correlation between age and height (don't use .iloc)
-```
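A short sketch of both correlation questions, assuming a shortened copy of the trees data (`trees_small` is made up for this example). `DataFrame.corr()` gives the full matrix, and `Series.corr()` gives a single pairwise value without needing `.iloc`.

```python
import pandas as pd

# A shortened copy of the trees data (four of the rows from the list above)
trees_small = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
    "diameter": [0.8, 1.5, 2.1, 2.8],
})

# Correlation between every pair of columns (a 3 x 3 DataFrame here)
corr_matrix = trees_small.corr()
print(corr_matrix)

# Correlation between two specific columns, looked up by name
age_height = trees_small["age"].corr(trees_small["height"])
print(age_height)
```

The same value can also be read out of the matrix by label with `corr_matrix.loc["age", "height"]`.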
-
-%% Cell type:markdown id: tags:
-
-### The Size can be based on a DataFrame value
-
-%% Cell type:code id: tags:
-
-``` python
-# Option 1:
-trees_df.plot.scatter(x="age", y="height",  marker="H", s="diameter")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Option 2:
-trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
-```
-
-%% Cell type:markdown id: tags:
-
-## Use subplots to group scatterplot data
-
-%% Cell type:markdown id: tags:
-
-### Re-visit the Iris Data
-
-%% Cell type:code id: tags:
-
-``` python
-iris_df
-```
-
-%% Cell type:markdown id: tags:
-
-### How do we create a *scatter plot* for various *class types*?
-First, gather all the class types.
-
-%% Cell type:code id: tags:
-
-``` python
-# In Pandas
-varieties = ???
-varieties
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# In SQL
-varietes = pd.read_sql("""
-
-""", iris_conn)
-varietes
-```
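A sketch of both approaches on a small stand-in table, so it runs without the real files. The table name `iris` is an assumption, and `demo_df`, `demo_conn`, `varieties_pd`, and `varieties_sql` are names made up for this example.

```python
import sqlite3

import pandas as pd

# Small stand-in for the iris data
demo_df = pd.DataFrame({
    "pet-width": [0.2, 0.4, 1.4, 1.5, 2.5],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica"],
})

# In Pandas: unique() keeps one copy of each class name
varieties_pd = sorted(demo_df["class"].unique())

# In SQL: DISTINCT does the same job inside the query
demo_conn = sqlite3.connect(":memory:")
demo_df.to_sql("iris", demo_conn, index=False)
varieties_sql = pd.read_sql("SELECT DISTINCT class FROM iris ORDER BY class", demo_conn)
demo_conn.close()

print(varieties_pd)
print(varieties_sql)
```

Note that the Pandas version ends up as a list, while the SQL version comes back as a one-column DataFrame.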
-
-%% Cell type:markdown id: tags:
-
-In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
-
-%% Cell type:code id: tags:
-
-``` python
-# If you want to continue using SQL instead, don't close the connection!
-iris_conn.close()
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Change this scatter plot so that the data is only for class ='Iris-setosa'
-iris_df.plot.scatter(x = "pet-width", y = "pet-length")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Write a for loop that iterates through each variety in varieties
-# and makes a plot for only that class
-
-for i in range(len(varieties)):
-    variety = varieties[i]
-    pass
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# copy/paste the code above, but this time make each plot a different color
-colors = ["blue", "green", "red"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# copy/paste the code above, but this time make each plot a different color AND marker
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Did you notice that it made 3 separate plots? What's deceptive about this?
-```
-
-%% Cell type:code id: tags:
-
-``` python
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Have to be VERY careful to not crop out data.
-# We'll talk about this next lecture.
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Better yet, we could combine these.
-```
-
-%% Cell type:markdown id: tags:
-
-### We can make subplots within a plot; each one is an AxesSubplot, passed with the keyword ax
-1. if an AxesSubplot is passed as ax, then plot in that subplot
-2. if ax is None, create a new AxesSubplot
-3. return the AxesSubplot that was used
-
-%% Cell type:code id: tags:
-
-``` python
-# complete this code to make 3 plots in one
-
-plot_area = None   # don't change this...look at this variable in line 12
-colors = ["blue", "green", "red"]
-markers = ["o", "^", "v"]
-```
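One way the cell above could be completed, sketched on a small stand-in for the iris data (the values and the names `demo_df`, `demo_varieties`, and `demo_area` are made up for this example). The first call gets `ax=None` and creates a new AxesSubplot; each later call receives the AxesSubplot returned by the previous one, so all three classes land in the same plot.

```python
import pandas as pd

# Small stand-in for the iris data
demo_df = pd.DataFrame({
    "pet-width": [0.2, 0.4, 1.4, 1.5, 2.5, 1.9],
    "pet-length": [1.4, 1.7, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

demo_varieties = sorted(demo_df["class"].unique())
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]

demo_area = None  # no AxesSubplot yet, so the first plot creates one
for i in range(len(demo_varieties)):
    variety = demo_varieties[i]
    # keep only the rows for this class
    subset = demo_df[demo_df["class"] == variety]
    # reuse the AxesSubplot returned by the previous iteration
    demo_area = subset.plot.scatter(x="pet-width", y="pet-length",
                                    color=colors[i], marker=markers[i],
                                    label=variety, ax=demo_area)
```

Because every class shares one set of axes, the axis limits cover all of the data, which avoids the misleading per-plot scales from the three separate plots.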
-
-%% Cell type:markdown id: tags:
-
-### Time-Permitting
-Plot this data in an interesting/meaningful way & identify any correlations.
-
-%% Cell type:code id: tags:
-
-``` python
-students = pd.DataFrame({
-    "name": [
-        "Cole",
-        "Cynthia",
-        "Alice",
-        "Seth"
-    ],
-    "grade": [
-        "C",
-        "AB",
-        "B",
-        "BC"
-    ],
-    "gpa": [
-        2.0,
-        3.5,
-        3.0,
-        2.5
-    ],
-    "attendance": [
-        4,
-        11,
-        10,
-        6
-    ],
-    "height": [
-        68,
-        66,
-        60,
-        72
-    ]
-})
-students
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Min, Max, and Overall Difference in Student Height
-min_height = students["height"].min()
-max_height = students["height"].max()
-diff_height = max_height - min_height
-
-# Normalize student heights on a scale of [0, 1] (black to white)
-height_colors = (students["height"] - min_height) / diff_height
-
-# Rescale the normalized heights to [0, 0.5] (black to gray)
-height_colors = height_colors / 2
-
-# Color must be a string (e.g. c='0.34')
-height_colors = height_colors.astype("string")
-
-height_colors
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# Plot!
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# What are the correlations?
-```
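A sketch of the correlation step, using only the numeric columns from the `students` data above (`students_num` and `student_corr` are names made up for this example). Selecting the numeric columns first avoids asking `corr()` to handle the text columns.

```python
import pandas as pd

# Numeric columns copied from the students DataFrame above
students_num = pd.DataFrame({
    "gpa": [2.0, 3.5, 3.0, 2.5],
    "attendance": [4, 11, 10, 6],
    "height": [68, 66, 60, 72],
})

# Pairwise correlations between the numeric columns
student_corr = students_num.corr()
print(student_corr)
```

With only four students, treat any correlation here as an illustration rather than evidence.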
-
-%% Cell type:markdown id: tags:
-
-![image.png](attachment:image.png)
-
-%% Cell type:markdown id: tags:
-
-https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500
--- a/f22/meena_lec_notes/lec-32/.ipynb_checkpoints/demo_lec_31-checkpoint.ipynb
+++ b/f22/meena_lec_notes/lec-32/.ipynb_checkpoints/demo_lec_31-checkpoint.ipynb
-%% Cell type:code id: tags:
-
-``` python
-from IPython.display import display, HTML
-display(HTML("<style>.container { width:100% !important; }</style>"))
-```
-
-%% Output
-
-
-%% Cell type:code id: tags:
-
-``` python
-import csv
-import os
-```
-
-%% Cell type:code id: tags:
-
-``` python
-# adapted from https://automatetheboringstuff.com/2e/chapter16/
-def process_csv(filename):
-    # the with block closes the file even if reading fails
-    with open(filename, newline="") as exampleFile:
-        exampleReader = csv.reader(exampleFile)
-        exampleData = list(exampleReader)
-    return exampleData
-```
-
-%% Cell type:markdown id: tags:
-
-## Example 1: List Visualization
-### Write a gen_html function
- Input: shopping_list and path to shopping.html
- Outcome: create shopping.html file
-
-### Pseudocode
-1. Open "shopping.html" in write mode.
-2. Write \<ul\> tag into the html file
-3. Iterate over each item in shopping list.
-4. Write each item with \<li\> tag.
-5. After you are done iterating, write \</ul\> tag.
-6. Close the file object.
-
-%% Cell type:code id: tags:
-
-``` python
-def gen_html(shopping_list, html_path):
-    f = open(html_path, "w")
-    f.write("<ul>\n")
-    for item in shopping_list:
-        f.write("<li>" + str(item) + "\n")
-    f.write("</ul>\n")
-    f.close()
-
-gen_html(["apples", "oranges", "milk", "banana"], "shopping.html")
-```
-
-%% Cell type:markdown id: tags:
-
-## Example 2: Dictionary Visualization
-### Write a csv_to_html function
- Input: path to review1.csv and path to reviews.html
- Outcome 1: create an HTML file for each review
- Outcome 2: create a reviews.html file containing a link to the HTML file for each review
-
-### Pseudocode
-1. Create data_html folder using os.mkdir. Make sure to use try ... except blocks to catch FileExistsError
-2. Use process_csv function to read csv data and split the header and the data
-3. For each review, extract review id, review title, review text.
-4. Generate the \<rid\>.html for each review inside the data_html folder.
-   - Open \<rid\>.html in write mode
-   - Add review title using \<h1\> tag
-   - Add review text inside \<p\> tag
-   - Close \<rid\>.html file object
-5. Generate a reviews.html file that links to each review html page \<rid\>.html
-   - Open reviews.html file in write mode
-   - Add each \<rid\>.html as hyperlink using \<a\> tag.
-   - Close reviews.html file
-
-%% Cell type:code id: tags:
-
-``` python
-def csv_to_html(csv_path, html_path):
-    try:
-        os.mkdir("data_html")
-    except FileExistsError:
-        pass
-
-    reviews_data = process_csv(csv_path)
-    reviews_header = reviews_data[0]
-    reviews_data = reviews_data[1:]
-
-    reviews_file = open(html_path, "w")
-    reviews_file.write("<ul>\n")
-
-    for row in reviews_data:
-        rid = row[reviews_header.index("review id")]
-        title = row[reviews_header.index("review title")]
-        text = row[reviews_header.index("review text")]
-
-        # STEP 4: generate the <rid>.html for each review inside the data_html folder
-        review_path = os.path.join("data_html", str(rid) + ".html")
-        html_file = open(review_path, "w")
-        html_file.write("<h1>{}</h1><p>{}</p>".format(title, text))
-        html_file.close()
-
-        # STEP 5: generate a reviews.html file which has link to each review html page <rid>.html
-        reviews_file.write('<li><a href = "{}">{}</a>'.format(review_path, str(rid) + ":" + str(title)) + "<br>\n")
-
-    reviews_file.write("</ul>\n")
-    reviews_file.close()
-
-csv_to_html(os.path.join("data", "review1.csv"), "reviews.html")
-```
-
-%% Cell type:code id: tags:
-
-``` python
-```
--- a/f22/meena_lec_notes/lec-32/.ipynb_checkpoints/demo_lec_31_template-checkpoint.ipynb
+++ b/f22/meena_lec_notes/lec-32/.ipynb_checkpoints/demo_lec_31_template-checkpoint.ipynb
-%% Cell type:code id: tags:
-
-``` python
-from IPython.display import display, HTML
-display(HTML("<style>.container { width:100% !important; }</style>"))
-```
-
-%% Cell type:code id: tags:
-
-``` python
-import csv
-import os
-```
-
-%% Cell type:markdown id: tags:
-
-## Example 1: List Visualization
-### Write a gen_html function
- Input: shopping_list and path to shopping.html
- Outcome: create shopping.html file
-
-### Pseudocode
-1. Open "shopping.html" in write mode.
-2. Write \<ul\> tag into the html file
-3. Iterate over each item in shopping list.
-4. Write each item with \<li\> tag.
-5. After you are done iterating, write \</ul\> tag.
-6. Close the file object.
-
-%% Cell type:code id: tags:
-
-``` python
-```
-
-%% Cell type:markdown id: tags:
-
-## Example 2: Dictionary Visualization
-### Write a csv_to_html function
- Input: path to review1.csv and path to reviews.html
- Outcome 1: create an HTML file for each review
- Outcome 2: create a reviews.html file containing a link to the HTML file for each review
-
-### Pseudocode
-1. Create data_html folder using os.mkdir. Make sure to use try ... except blocks to catch FileExistsError
-2. Use process_csv function to read csv data and split the header and the data
-3. For each review, extract review id, review title, review text.
-4. Generate the \<rid\>.html for each review inside the data_html folder.
-   - Open \<rid\>.html in write mode
-   - Add review title using \<h1\> tag
-   - Add review text inside \<p\> tag
-   - Close \<rid\>.html file object
-5. Generate a reviews.html file that links to each review html page \<rid\>.html
-   - Open reviews.html file in write mode
-   - Add each \<rid\>.html as hyperlink using \<a\> tag.
-   - Close reviews.html file
-
-%% Cell type:code id: tags:
-
-``` python
-```
-
-%% Cell type:code id: tags:
-
-``` python
-```
--- a/f22/meena_lec_notes/lec-33/.ipynb_checkpoints/web3-checkpoint.ipynb
+++ b/f22/meena_lec_notes/lec-33/.ipynb_checkpoints/web3-checkpoint.ipynb
-%% Cell type:markdown id: tags:
-
-# Web 3
- HTML parsing using BeautifulSoup
-
-%% Cell type:code id: tags:
-
-``` python
-from IPython.display import display, HTML
-display(HTML("<style>.container { width:100% !important; }</style>"))
-```
-
-%% Output
-
-
-%% Cell type:code id: tags:
-
-``` python
-import requests                #For downloading the HTML content using HTTP GET request
-from bs4 import BeautifulSoup  #For parsing the HTML content and searching through the HTML
-import os
-import pandas as pd
-```
-
-%% Cell type:markdown id: tags:
-
-# STAGE 1: extract all state URLs from the states page
-## Stage 1 pseudocode
-1. Use requests module to send a GET request to https://simple.wikipedia.org/wiki/List_of_U.S._states
-2. Don't forget to raise_for_status to ensure you are getting 200 OK status code
-3. Explore what r.text gives you
-
-%% Cell type:code id: tags:
-
-``` python
-url = "https://simple.wikipedia.org/wiki/List_of_U.S._states"
-r = requests.get(url)
-r.raise_for_status()
-#print(r.text) #Uncomment this line to see the output
-```
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-4. Check out what type you are getting from r.text
-
-%% Cell type:code id: tags:
-
-``` python
-print(type(r.text))
-```
-
-%% Output
-
-    <class 'str'>
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-5. Create BeautifulSoup object by passing r.text, "html.parser" as arguments and capture return value into a variable called doc
-6. Try prettify() method call --- still not that pretty, right?
-
-%% Cell type:code id: tags:
-
-``` python
-doc = BeautifulSoup(r.text, "html.parser")
-#print(doc.prettify()) #Uncomment this line to see the output
-```
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-7. (Not a code step) Open "https://simple.wikipedia.org/wiki/List_of_U.S._states" on Google Chrome.
-    - Right click on one of the state pages
-    - Click on "Inspect" --- this opens developer tools
-    - This tool lets you explore the html source code
-    - Explore the \<table\> and sub tags like \<th\>, \<tr\>, \<td\>
-    - Let's go back to coding
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-7. Find all "table" elements in the document by using doc.find_all(...) function and capture return value into a variable "tables"
-    - explore the length of the value returned from find_all(...) function
-    - check out the type of the value returned from find_all(...) function
-8. Add an assert to check that there is only one table; this guards against future changes to the html format on the website
-9. Extract the first table into tbl variable
-    - explore type of tbl
-    - try printing the content of tbl --- looks like just a string
-
-%% Cell type:code id: tags:
-
-``` python
-tables = doc.find_all("table")
-print(len(tables)) # only one table on the states page!
-print(type(tables))
-#Futuristic assert to make sure the html format hasn't changed on the website
-assert len(tables) == 1
-tbl = tables[0]
-print(type(tbl))
-```
-
-%% Output
-
-    1
-    <class 'bs4.element.ResultSet'>
-    <class 'bs4.element.Tag'>
-
-%% Cell type:code id: tags:
-
-``` python
-#print(tbl) #Uncomment this line to see the output
-```
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-10. Find all the tr elements by using tbl.find_all(...) function and capture return value into a variable trs.
-    - explore length of trs, type of trs
-    - Add an assert checking that length of trs is at least 50 (For 50 US states)
-
-%% Cell type:code id: tags:
-
-``` python
-trs = tbl.find_all("tr")
-print(len(trs))
-print(type(trs))
-assert len(trs) >= 50
-```
-
-%% Output
-
-    52
-    <class 'bs4.element.ResultSet'>
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-11. Iterate over each item in trs (going to be a lengthy step!)
-    - print each item (tr tag)
-    - call tr.find("th") to get the first th element inside the current tr element
-    - capture the return value in a variable called th
-    - print th and explore what you are getting
-    - find all hyperlinks within the th element: call th.find_all("a") and capture the return value in a variable called links
-    - explore the length of links by printing it. Some of the states have 2 links; go back and explore why that is the case and figure out which link you want
-        - some have 0 links, skip over those entries!
-        - extract the first of the hyperlinks into a variable called link
-        - print link to confirm you are able to extract the correct link
-        - explore the type of link
-        - print link.get_text() and inspect the attributes of link with link.attrs (a dict)
-        - capture link.get_text() in a variable called state
-        - capture the link target in a variable called state_url. link.attrs["href"] is a relative URL, but we need a full URL: define a prefix variable holding "https://simple.wikipedia.org" and concatenate prefix + link.attrs["href"]
-        - create a new dictionary called state_links. We are going to use this dict to track each state and its URL. Think carefully about where you have to create this empty dict.
-
-#### Congrats :) stage 1 is done
-
-%% Cell type:code id: tags:
-
-``` python
-prefix = "https://simple.wikipedia.org"
-state_links = {} #KEY: state name; VALUE: link to state page
-
-for tr in trs:
-    th = tr.find("th")
-    links = th.find_all("a")
-    #print(len(links))
-    #print(th.get_text())
-    if len(links) == 0:
-        continue
-    link = links[0]
-    #print(type(link), link)
-    #print(link.get_text(), link.attrs) #link.attrs is a dict
-    state = link.get_text()
-    state_url = prefix + link.attrs["href"]
-    state_links[state] = state_url
-
-state_links
-```
-
-%% Output
-
-    {'postal abbs.': 'https://simple.wikipedia.org/wiki/List_of_U.S._state_abbreviations',
-     'Alabama': 'https://simple.wikipedia.org/wiki/Alabama',
-     'Alaska': 'https://simple.wikipedia.org/wiki/Alaska',
-     'Arizona': 'https://simple.wikipedia.org/wiki/Arizona',
-     'Arkansas': 'https://simple.wikipedia.org/wiki/Arkansas',
-     'California': 'https://simple.wikipedia.org/wiki/California',
-     'Colorado': 'https://simple.wikipedia.org/wiki/Colorado',
-     'Connecticut': 'https://simple.wikipedia.org/wiki/Connecticut',
-     'Delaware': 'https://simple.wikipedia.org/wiki/Delaware',
-     'Florida': 'https://simple.wikipedia.org/wiki/Florida',
-     'Georgia': 'https://simple.wikipedia.org/wiki/Georgia_(U.S._state)',
-     'Hawaii': 'https://simple.wikipedia.org/wiki/Hawaii',
-     'Idaho': 'https://simple.wikipedia.org/wiki/Idaho',
-     'Illinois': 'https://simple.wikipedia.org/wiki/Illinois',
-     'Indiana': 'https://simple.wikipedia.org/wiki/Indiana',
-     'Iowa': 'https://simple.wikipedia.org/wiki/Iowa',
-     'Kansas': 'https://simple.wikipedia.org/wiki/Kansas',
-     'Kentucky': 'https://simple.wikipedia.org/wiki/Kentucky',
-     'Louisiana': 'https://simple.wikipedia.org/wiki/Louisiana',
-     'Maine': 'https://simple.wikipedia.org/wiki/Maine',
-     'Maryland': 'https://simple.wikipedia.org/wiki/Maryland',
-     'Massachusetts': 'https://simple.wikipedia.org/wiki/Massachusetts',
-     'Michigan': 'https://simple.wikipedia.org/wiki/Michigan',
-     'Minnesota': 'https://simple.wikipedia.org/wiki/Minnesota',
-     'Mississippi': 'https://simple.wikipedia.org/wiki/Mississippi',
-     'Missouri': 'https://simple.wikipedia.org/wiki/Missouri',
-     'Montana': 'https://simple.wikipedia.org/wiki/Montana',
-     'Nebraska': 'https://simple.wikipedia.org/wiki/Nebraska',
-     'Nevada': 'https://simple.wikipedia.org/wiki/Nevada',
-     'New Hampshire': 'https://simple.wikipedia.org/wiki/New_Hampshire',
-     'New Jersey': 'https://simple.wikipedia.org/wiki/New_Jersey',
-     'New Mexico': 'https://simple.wikipedia.org/wiki/New_Mexico',
-     'New York': 'https://simple.wikipedia.org/wiki/New_York_(state)',
-     'North Carolina': 'https://simple.wikipedia.org/wiki/North_Carolina',
-     'North Dakota': 'https://simple.wikipedia.org/wiki/North_Dakota',
-     'Ohio': 'https://simple.wikipedia.org/wiki/Ohio',
-     'Oklahoma': 'https://simple.wikipedia.org/wiki/Oklahoma',
-     'Oregon': 'https://simple.wikipedia.org/wiki/Oregon',
-     'Pennsylvania': 'https://simple.wikipedia.org/wiki/Pennsylvania',
-     'Rhode Island': 'https://simple.wikipedia.org/wiki/Rhode_Island',
-     'South Carolina': 'https://simple.wikipedia.org/wiki/South_Carolina',
-     'South Dakota': 'https://simple.wikipedia.org/wiki/South_Dakota',
-     'Tennessee': 'https://simple.wikipedia.org/wiki/Tennessee',
-     'Texas': 'https://simple.wikipedia.org/wiki/Texas',
-     'Utah': 'https://simple.wikipedia.org/wiki/Utah',
-     'Vermont': 'https://simple.wikipedia.org/wiki/Vermont',
-     'Virginia': 'https://simple.wikipedia.org/wiki/Virginia',
-     'Washington': 'https://simple.wikipedia.org/wiki/Washington',
-     'West Virginia': 'https://simple.wikipedia.org/wiki/West_Virginia',
-     'Wisconsin': 'https://simple.wikipedia.org/wiki/Wisconsin',
-     'Wyoming': 'https://simple.wikipedia.org/wiki/Wyoming'}
-
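The loop above builds each full URL by concatenating prefix with the link's href. As a minimal sketch of an alternative, assuming the href values look like "/wiki/Alabama" (as the output suggests), the standard library's urllib.parse.urljoin resolves a link against the site address and leaves an already-absolute link unchanged. The helper name to_full_url is hypothetical and not part of the lecture code.

```python
from urllib.parse import urljoin

prefix = "https://simple.wikipedia.org"

def to_full_url(href):
    # Hypothetical helper: resolve href against the site prefix.
    # Site-relative links get the prefix; absolute links are returned as-is.
    return urljoin(prefix, href)

print(to_full_url("/wiki/Alabama"))
```

Plain concatenation works for this page because every link is site-relative; urljoin only matters if a link ever comes back as a full URL.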
-%% Cell type:markdown id: tags:
-
-# STAGE 2: download the html page for each state
-## Stage 2 pseudocode
-1. Create a directory called "html_files_for_states". Use a try/except block to catch the FileExistsError raised when the directory already exists
-2. Initially convert the keys of the state_links dict into a list and work with just the first 3 items in the list of keys
-3. Iterate over each key (initially just use 3):
-    1. If the key is "postal abbs.", skip processing. What keyword allows you to skip the current iteration of the loop?
-    2. To create each state's html file name, join the directory name "html_files_for_states" with the current key and add ".html" to the end.
-    3. Add the html file name to a new dictionary called "state_files". Think carefully about where you have to create this empty dict.
-    4. Use the requests module's get(...) function to download the contents of the state URL page and capture the response in a variable r.
-    5. Open the state html file in write mode and write r.text into the state html file.
-
-#### Congrats :) stage 2 is done
-
-%% Cell type:code id: tags:
-
-``` python
-html_dir = "html_files_for_states"
-state_files = {} #KEY: state; VALUE: state file
-
-try:
-    os.mkdir(html_dir)
-except FileExistsError:
-    pass
-
-#for state in list(state_links.keys())[:3]: # Use this for initial testing
-for state in state_links.keys():
-    if state == "postal abbs.":
-        continue
-    state_url = state_links[state]
-
-    #html file name
-    state_file = os.path.join(html_dir, state + ".html")
-    state_files[state] = state_file
-
-    #Optimization: if state file already exists, you can perhaps skip downloading it again
-    if os.path.exists(state_file):
-        continue
-
-    #Download
-    r = requests.get(state_url)
-    r.raise_for_status() # must be called; raises an HTTPError if the download failed
-    print(state_file)
-
-    #Save to a file
-    f = open(state_file, "w", encoding = "utf-8")
-    f.write(r.text)
-    f.close()
-```
-
-%% Output
-
-    html_files_for_states/Alabama.html
-    html_files_for_states/Alaska.html
-    html_files_for_states/Arizona.html
-    html_files_for_states/Arkansas.html
-    html_files_for_states/California.html
-    html_files_for_states/Colorado.html
-    html_files_for_states/Connecticut.html
-    html_files_for_states/Delaware.html
-    html_files_for_states/Florida.html
-    html_files_for_states/Georgia.html
-    html_files_for_states/Hawaii.html
-    html_files_for_states/Idaho.html
-    html_files_for_states/Illinois.html
-    html_files_for_states/Indiana.html
-    html_files_for_states/Iowa.html
-    html_files_for_states/Kansas.html
-    html_files_for_states/Kentucky.html
-    html_files_for_states/Louisiana.html
-    html_files_for_states/Maine.html
-    html_files_for_states/Maryland.html
-    html_files_for_states/Massachusetts.html
-    html_files_for_states/Michigan.html
-    html_files_for_states/Minnesota.html
-    html_files_for_states/Mississippi.html
-    html_files_for_states/Missouri.html
-    html_files_for_states/Montana.html
-    html_files_for_states/Nebraska.html
-    html_files_for_states/Nevada.html
-    html_files_for_states/New Hampshire.html
-    html_files_for_states/New Jersey.html
-    html_files_for_states/New Mexico.html
-    html_files_for_states/New York.html
-    html_files_for_states/North Carolina.html
-    html_files_for_states/North Dakota.html
-    html_files_for_states/Ohio.html
-    html_files_for_states/Oklahoma.html
-    html_files_for_states/Oregon.html
-    html_files_for_states/Pennsylvania.html
-    html_files_for_states/Rhode Island.html
-    html_files_for_states/South Carolina.html
-    html_files_for_states/South Dakota.html
-    html_files_for_states/Tennessee.html
-    html_files_for_states/Texas.html
-    html_files_for_states/Utah.html
-    html_files_for_states/Vermont.html
-    html_files_for_states/Virginia.html
-    html_files_for_states/Washington.html
-    html_files_for_states/West Virginia.html
-    html_files_for_states/Wisconsin.html
-    html_files_for_states/Wyoming.html
-
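The directory creation and the "skip if already downloaded" check from stage 2 can be sketched without any network access. This is a hedged illustration, not the lecture's implementation: the helper save_text_once is hypothetical, os.makedirs(..., exist_ok=True) is used as a shorter stand-in for the try/except around os.mkdir, and a with block closes the file automatically. The demo writes into a temporary directory so it does not touch the real html_files_for_states folder.

```python
import os
import tempfile

def save_text_once(directory, name, text):
    """Write text to directory/name.html unless that file already exists.

    Returns (path, written); written is True only when a new file was created.
    """
    os.makedirs(directory, exist_ok=True)  # no error if the directory exists
    path = os.path.join(directory, name + ".html")
    if os.path.exists(path):
        return path, False  # cached copy found, skip the write
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path, True

# Demo in a throwaway directory with made-up page contents.
demo_dir = os.path.join(tempfile.mkdtemp(), "html_files_for_states")
first = save_text_once(demo_dir, "Wisconsin", "<html>demo</html>")
second = save_text_once(demo_dir, "Wisconsin", "<html>changed</html>")
print(first[1], second[1])
```

The second call leaves the file untouched, which is the same behavior as the os.path.exists check in the download loop.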
-%% Cell type:markdown id: tags:
-
-# STAGE 3: extract details from each state page
-## Stage 3 pseudocode
-1. Write a function state_stats. Input: path to one state file. Output: dict of stats for that state
-2. Open the state html file and read its content.
-3. Create a BeautifulSoup object called doc.
-4. Call doc.find_all("tr") and capture the return value in a variable called trs
-5. Create an empty dict called stats, then iterate over each tr element
-    1. You can retrieve the cells of a row with: cells = tr.find_all(["th", "td"])
-    2. Explore the length of cells. Notice that some rows do not have exactly 2 cells. Let's skip over those.
-    3. For the remaining rows, add an entry to stats where the key is the first cell's text (the th element) and the value is the second cell's text (the td element)
-6. Don't forget to return the stats dict
-7. Call state_stats with state_files["Wisconsin"]
-7. Call state_stats with state_files["Wisconsin"]
-
-%% Cell type:code id: tags:
-
-``` python
-def state_stats(path):
-    stats = {}
-    f = open(path, encoding = "utf-8")
-    html_string = f.read()
-    f.close()
-
-    doc = BeautifulSoup(html_string, "html.parser")
-    trs = doc.find_all("tr")
-    for tr in trs:
-        cells = tr.find_all(["th", "td"])
-        if len(cells) == 2:
-            key = cells[0].get_text()
-            value = cells[1].get_text()
-            stats[key] = value
-    return stats
-
-wi_stats = state_stats(state_files["Wisconsin"])
-print("WI state drink:", wi_stats["Beverage"])
-print("WI state dance:", wi_stats["Dance"])
-```
-
-%% Output
-
-    WI state drink: Milk
-    WI state dance: Polka
-
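The filtering step inside state_stats can be looked at in isolation. In this minimal sketch, each inner list stands in for the texts of the cells that tr.find_all(["th", "td"]) would return for one row. The rows are made up for illustration, the helper rows_to_stats is hypothetical, and the .strip() calls are an addition that the lecture code does not make.

```python
def rows_to_stats(rows):
    # Keep only rows with exactly two cells: a label and a value.
    stats = {}
    for cells in rows:
        if len(cells) != 2:
            continue  # heading rows and wider rows are skipped
        key, value = cells
        # If a label repeats, the later row overwrites the earlier one.
        stats[key.strip()] = value.strip()
    return stats

demo_rows = [
    ["Capital", "Madison\n"],  # kept
    ["Heading only"],          # skipped: one cell
    ["a", "b", "c"],           # skipped: three cells
    ["Dance", " Polka"],       # kept
]
demo_stats = rows_to_stats(demo_rows)
print(demo_stats)
```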
-%% Cell type:markdown id: tags:
-
-## Stage 3 pseudocode continued
-8. Iterate over all the state files, call the state_stats function, and save the return value into a variable.
-9. Keep track of each state's stats in a dict called states_details
-10. Create a pandas DataFrame from the states_details dict
-11. Explore the DataFrame.
-
-%% Cell type:code id: tags:
-
-``` python
-states_details = {}
-
-for state in state_files.keys():
-    stats = state_stats(state_files[state])
-    states_details[state] = stats
-```
-
-%% Cell type:code id: tags:
-
-``` python
-states_df = pd.DataFrame(states_details)
-states_df
-```
-
-%% Output
-
-                                                                         Alabama  \
-    Country                                                        United States
-    Before statehood                                           Alabama Territory
-    Admitted to the Union                               December 14, 1819 (22nd)
-    Capital                                                           Montgomery
-    Largest city                                                      Birmingham
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                                        Alaska  \
-    Country                                                      United States
-    Before statehood                                       Territory of Alaska
-    Admitted to the Union                               January 3, 1959 (49th)
-    Capital                                                             Juneau
-    Largest city                                                     Anchorage
-    ...                                                                    ...
-    Largest cities (pop. over 50,000)                                      NaN
-    Smaller cities (pop. 15,000 to 50,000)                                 NaN
-    Largest villages (pop. over 15,000)                                    NaN
-    Highest elevation (Gannett Peak[2][3][4])                              NaN
-    Lowest elevation (Belle Fourche River at South ...                     NaN
-    
-                                                                         Arizona  \
-    Country                                                        United States
-    Before statehood                                           Arizona Territory
-    Admitted to the Union                               February 14, 1912 (48th)
-    Capital                                                                  NaN
-    Largest city                                                             NaN
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                                    Arkansas  \
-    Country                                                    United States
-    Before statehood                                      Arkansas Territory
-    Admitted to the Union                               June 15, 1836 (25th)
-    Capital                                                              NaN
-    Largest city                                                         NaN
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                                   California  \
-    Country                                                                     United States
-    Before statehood                                    Mexican Cession unorganized territory
-    Admitted to the Union                                            September 9, 1850 (31st)
-    Capital                                                                     Sacramento[1]
-    Largest city                                                                  Los Angeles
-    ...                                                                                   ...
-    Largest cities (pop. over 50,000)                                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                                NaN
-    Largest villages (pop. over 15,000)                                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                                             NaN
-    Lowest elevation (Belle Fourche River at South ...                                    NaN
-    
-                                                                     Colorado  \
-    Country                                                     United States
-    Before statehood                                                      NaN
-    Admitted to the Union                               August 1, 1876 (38th)
-    Capital                                                               NaN
-    Largest city                                                          NaN
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                             NaN
-    Lowest elevation (Belle Fourche River at South ...                    NaN
-    
-                                                                  Connecticut  \
-    Country                                                     United States
-    Before statehood                                       Connecticut Colony
-    Admitted to the Union                               January 9, 1788 (5th)
-    Capital                                                       Hartford[1]
-    Largest city                                                   Bridgeport
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                             NaN
-    Lowest elevation (Belle Fourche River at South ...                    NaN
-    
-                                                                                           Delaware  \
-    Country                                                                           United States
-    Before statehood                                    Delaware Colony, New Netherland, New Sweden
-    Admitted to the Union                                                    December 7, 1787 (1st)
-    Capital                                                                                   Dover
-    Largest city                                                                         Wilmington
-    ...                                                                                         ...
-    Largest cities (pop. over 50,000)                                                           NaN
-    Smaller cities (pop. 15,000 to 50,000)                                                      NaN
-    Largest villages (pop. over 15,000)                                                         NaN
-    Highest elevation (Gannett Peak[2][3][4])                                                   NaN
-    Lowest elevation (Belle Fourche River at South ...                                          NaN
-    
-                                                                     Florida  \
-    Country                                                    United States
-    Before statehood                                       Florida Territory
-    Admitted to the Union                               March 3, 1845 (27th)
-    Capital                                                   Tallahassee[1]
-    Largest city                                             Jacksonville[5]
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                      Georgia  \
-    Country                                                     United States
-    Before statehood                                      Province of Georgia
-    Admitted to the Union                               January 2, 1788 (4th)
-    Capital                                                               NaN
-    Largest city                                                          NaN
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                             NaN
-    Lowest elevation (Belle Fourche River at South ...                    NaN
-    
-                                                        ...  \
-    Country                                             ...
-    Before statehood                                    ...
-    Admitted to the Union                               ...
-    Capital                                             ...
-    Largest city                                        ...
-    ...                                                 ...
-    Largest cities (pop. over 50,000)                   ...
-    Smaller cities (pop. 15,000 to 50,000)              ...
-    Largest villages (pop. over 15,000)                 ...
-    Highest elevation (Gannett Peak[2][3][4])           ...
-    Lowest elevation (Belle Fourche River at South ...  ...
-    
-                                                                           South Dakota  \
-    Country                                                               United States
-    Before statehood                                                   Dakota Territory
-    Admitted to the Union                               November 2, 1889 (39th or 40th)
-    Capital                                                                      Pierre
-    Largest city                                                            Sioux Falls
-    ...                                                                             ...
-    Largest cities (pop. over 50,000)                                               NaN
-    Smaller cities (pop. 15,000 to 50,000)                                          NaN
-    Largest villages (pop. over 15,000)                                             NaN
-    Highest elevation (Gannett Peak[2][3][4])                                       NaN
-    Lowest elevation (Belle Fourche River at South ...                              NaN
-    
-                                                                  Tennessee  \
-    Country                                                   United States
-    Before statehood                                    Southwest Territory
-    Admitted to the Union                               June 1, 1796 (16th)
-    Capital                                                             NaN
-    Largest city                                                        NaN
-    ...                                                                 ...
-    Largest cities (pop. over 50,000)                                   NaN
-    Smaller cities (pop. 15,000 to 50,000)                              NaN
-    Largest villages (pop. over 15,000)                                 NaN
-    Highest elevation (Gannett Peak[2][3][4])                           NaN
-    Lowest elevation (Belle Fourche River at South ...                  NaN
-    
-                                                                           Texas  \
-    Country                                                        United States
-    Before statehood                                           Republic of Texas
-    Admitted to the Union                               December 29, 1845 (28th)
-    Capital                                                               Austin
-    Largest city                                                         Houston
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                                          Utah  \
-    Country                                                      United States
-    Before statehood                                            Utah Territory
-    Admitted to the Union                               January 4, 1896 (45th)
-    Capital                                                                NaN
-    Largest city                                                           NaN
-    ...                                                                    ...
-    Largest cities (pop. over 50,000)                                      NaN
-    Smaller cities (pop. 15,000 to 50,000)                                 NaN
-    Largest villages (pop. over 15,000)                                    NaN
-    Highest elevation (Gannett Peak[2][3][4])                              NaN
-    Lowest elevation (Belle Fourche River at South ...                     NaN
-    
-                                                                     Vermont  \
-    Country                                                    United States
-    Before statehood                                        Vermont Republic
-    Admitted to the Union                               March 4, 1791 (14th)
-    Capital                                                       Montpelier
-    Largest city                                                  Burlington
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                    Virginia  \
-    Country                                                    United States
-    Before statehood                                      Colony of Virginia
-    Admitted to the Union                               June 25, 1788 (10th)
-    Capital                                                         Richmond
-    Largest city                                              Virginia Beach
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                      Washington  \
-    Country                                                        United States
-    Before statehood                                        Washington Territory
-    Admitted to the Union                               November 11, 1889 (42nd)
-    Capital                                                              Olympia
-    Largest city                                                         Seattle
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                               West Virginia  \
-    Country                                                    United States
-    Before statehood                                        Part of Virginia
-    Admitted to the Union                               June 20, 1863 (35th)
-    Capital                                                              NaN
-    Largest city                                                         NaN
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                                                Wisconsin  \
-    Country                                                                                 United States
-    Before statehood                                                                  Wisconsin Territory
-    Admitted to the Union                                                             May 29, 1848 (30th)
-    Capital                                                                                       Madison
-    Largest city                                                                                Milwaukee
-    ...                                                                                               ...
-    Largest cities (pop. over 50,000)                   \nAppleton\nEau Claire\nGreen Bay\nJanesville\...
-    Smaller cities (pop. 15,000 to 50,000)              \nBeaver Dam\nBeloit\nBrookfield\nCudahy\nDe P...
-    Largest villages (pop. over 15,000)                 \nAshwaubenon\nBellevue\nCaledonia\nFox Crossi...
-    Highest elevation (Gannett Peak[2][3][4])                                                         NaN
-    Lowest elevation (Belle Fourche River at South ...                                                NaN
-    
-                                                                      Wyoming
-    Country                                                     United States
-    Before statehood                                        Wyoming Territory
-    Admitted to the Union                                July 10, 1890 (44th)
-    Capital                                                               NaN
-    Largest city                                                          NaN
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])           13,809 ft (4,209.1 m)
-    Lowest elevation (Belle Fourche River at South ...       3,101 ft (945 m)
-    
-    [327 rows x 50 columns]
-
-%% Cell type:code id: tags:
-
-``` python
-states_df.loc["Capital"]
-```
-
-%% Output
-
-    Alabama               Montgomery
-    Alaska                    Juneau
-    Arizona                      NaN
-    Arkansas                     NaN
-    California         Sacramento[1]
-    Colorado                     NaN
-    Connecticut          Hartford[1]
-    Delaware                   Dover
-    Florida           Tallahassee[1]
-    Georgia                      NaN
-    Hawaii                       NaN
-    Idaho                        NaN
-    Illinois                     NaN
-    Indiana                      NaN
-    Iowa                         NaN
-    Kansas                    Topeka
-    Kentucky               Frankfort
-    Louisiana            Baton Rouge
-    Maine                    Augusta
-    Maryland               Annapolis
-    Massachusetts                NaN
-    Michigan                 Lansing
-    Minnesota             Saint Paul
-    Mississippi                  NaN
-    Missouri          Jefferson City
-    Montana                   Helena
-    Nebraska                 Lincoln
-    Nevada               Carson City
-    New Hampshire            Concord
-    New Jersey               Trenton
-    New Mexico              Santa Fe
-    New York                  Albany
-    North Carolina           Raleigh
-    North Dakota            Bismarck
-    Ohio                         NaN
-    Oklahoma                     NaN
-    Oregon                     Salem
-    Pennsylvania          Harrisburg
-    Rhode Island                 NaN
-    South Carolina          Columbia
-    South Dakota              Pierre
-    Tennessee                    NaN
-    Texas                     Austin
-    Utah                         NaN
-    Vermont               Montpelier
-    Virginia                Richmond
-    Washington               Olympia
-    West Virginia                NaN
-    Wisconsin                Madison
-    Wyoming                      NaN
-    Name: Capital, dtype: object
-
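Several capitals in the output above carry footnote markers copied from the Wikipedia infobox (for example `Sacramento[1]`). A minimal sketch of one way to strip them, assuming the markers always look like a number in square brackets. The small `capitals` Series is a hand-made stand-in for `states_df.loc["Capital"]`, and `clean_capitals` is a name introduced only for this illustration:

```python
import pandas as pd

# Hand-made stand-in for states_df.loc["Capital"] (values copied from the output above)
capitals = pd.Series({"Alabama": "Montgomery",
                      "California": "Sacramento[1]",
                      "Florida": "Tallahassee[1]",
                      "Arizona": None})

# Remove bracketed footnote numbers such as "[1]"; missing values stay missing
clean_capitals = capitals.str.replace(r"\[\d+\]", "", regex=True)
print(clean_capitals)
```

The same call could be applied to the real Series, but whether every footnote on the state pages follows this pattern is an assumption worth checking first.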
-%% Cell type:code id: tags:
-
-``` python
-states_df.T.loc["Wisconsin"]
-```
-
-%% Output
-
-    Country                                                                                                    United States
-    Before statehood                                                                                     Wisconsin Territory
-    Admitted to the Union                                                                                May 29, 1848 (30th)
-    Capital                                                                                                          Madison
-    Largest city                                                                                                   Milwaukee
-                                                                                                 ...
-    Largest cities (pop. over 50,000)                                      \nAppleton\nEau Claire\nGreen Bay\nJanesville\...
-    Smaller cities (pop. 15,000 to 50,000)                                 \nBeaver Dam\nBeloit\nBrookfield\nCudahy\nDe P...
-    Largest villages (pop. over 15,000)                                    \nAshwaubenon\nBellevue\nCaledonia\nFox Crossi...
-    Highest elevation (Gannett Peak[2][3][4])                                                                            NaN
-    Lowest elevation (Belle Fourche River at South Dakota border[3][4])                                                  NaN
-    Name: Wisconsin, Length: 327, dtype: object
-%% Cell type:markdown id: tags:
-
-# Web 3
- HTML parsing using BeautifulSoup
-
-%% Cell type:code id: tags:
-
-``` python
-from IPython.display import display, HTML
-display(HTML("<style>.container { width:100% !important; }</style>"))
-```
-
-%% Output
-
-
-%% Cell type:code id: tags:
-
-``` python
-import requests                #For downloading the HTML content using HTTP GET request
-from bs4 import BeautifulSoup  #For parsing the HTML content and searching through the HTML
-import os
-import pandas as pd
-```
-
-%% Cell type:markdown id: tags:
-
-# STAGE 1: extract all state URLs from the states page
-## Stage 1 pseudocode
-1. Use requests module to send a GET request to https://simple.wikipedia.org/wiki/List_of_U.S._states
-2. Call r.raise_for_status() so that an error status code (4xx or 5xx) raises an exception instead of going unnoticed
-3. Explore what r.text gives you
-
-%% Cell type:code id: tags:
-
-``` python
-url = "https://simple.wikipedia.org/wiki/List_of_U.S._states"
-r = requests.get(url)
-r.raise_for_status()
-#print(r.text) #Uncomment this line to see the output
-```
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-4. Check out what type you are getting from r.text
-
-%% Cell type:code id: tags:
-
-``` python
-print(type(r.text))
-```
-
-%% Output
-
-    <class 'str'>
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-5. Create BeautifulSoup object by passing r.text, "html.parser" as arguments and capture return value into a variable called doc
-6. Try prettify() method call --- still not that pretty, right?
-
-%% Cell type:code id: tags:
-
-``` python
-doc = BeautifulSoup(r.text, "html.parser")
-#print(doc.prettify()) #Uncomment this line to see the output
-```
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-7. (Not a code step) Open "https://simple.wikipedia.org/wiki/List_of_U.S._states" on Google Chrome.
-    - Right click on one of the state names in the table
-    - Click on "Inspect" --- this opens developer tools
-    - This tool lets you explore the html source code
-    - Explore the \<table\> and sub tags like \<th\>, \<tr\>, \<td\>
-    - Let's go back to coding
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-7. Find all "table" elements in the document by using doc.find_all(...) function and capture return value into a variable "tables"
-    - explore the length of the value returned from find_all(...) function
-    - check out the type of the value returned from find_all(...) function
-8. Add an assert to check that there is only one table --- a future-proofing assert that fails loudly if the html format of the website changes
-9. Extract the first table into tbl variable
-    - explore type of tbl
-    - try printing the content of tbl --- looks like just a string
-
-%% Cell type:code id: tags:
-
-``` python
-tables = doc.find_all("table")
-print(len(tables)) # only one table on the states page!
-print(type(tables))
-#Future-proofing assert: fails if the html format of the website changes
-assert len(tables) == 1
-tbl = tables[0]
-print(type(tbl))
-```
-
-%% Output
-
-    1
-    <class 'bs4.element.ResultSet'>
-    <class 'bs4.element.Tag'>
-
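To see what `find_all` returns without downloading anything, here is a small self-contained sketch. The HTML string is hand-written for illustration (an assumption, not the actual Wikipedia markup), and it relies on the `bs4` package being installed:

```python
from bs4 import BeautifulSoup

# Tiny hand-written page (an assumption for illustration, not the Wikipedia HTML)
html = """
<table>
  <tr><th><a href="/wiki/Alabama">Alabama</a></th><td>Montgomery</td></tr>
  <tr><th><a href="/wiki/Alaska">Alaska</a></th><td>Juneau</td></tr>
</table>
"""

doc = BeautifulSoup(html, "html.parser")
tables = doc.find_all("table")   # ResultSet: behaves like a list of Tag objects
tbl = tables[0]                  # a single Tag
rows = tbl.find_all("tr")        # searches only inside tbl, not the whole document

print(len(tables), len(rows))            # 1 2
print(rows[0].find("td").get_text())     # Montgomery
```

Calling `find_all` on a Tag rather than on the whole document is what lets the later steps restrict the search to one table.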
-%% Cell type:code id: tags:
-
-``` python
-#print(tbl) #Uncomment this line to see the output
-```
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-10. Find all the tr elements by using tbl.find_all(...) function and capture return value into a variable trs.
-    - explore length of trs, type of trs
-    - Add an assert checking that length of trs is at least 50 (For 50 US states)
-
-%% Cell type:code id: tags:
-
-``` python
-trs = tbl.find_all("tr")
-print(len(trs))
-print(type(trs))
-assert len(trs) >= 50
-```
-
-%% Output
-
-    52
-    <class 'bs4.element.ResultSet'>
-
-%% Cell type:markdown id: tags:
-
-## Stage 1 pseudocode continued...
-11. Iterate over each item in trs (going to be a lengthy step!)
-    - print each item (tr tag)
-    - call tr.find(..) to find "th" elements --- this finds th element for every tr element.
-    - capture return value into a variable called th
-    - print th and explore what you are getting.
-    - find all hyperlinks within each th element: call th.find_all("a") and capture return value into a variable called links
-    - explore length of links by printing it --- some of the states have 2 links; go back and explore why that is the case and figure out which link you want
-        - some have 0 links, skip over those entries!
-        - extract first of the hyperlinks into a variable called link
-        - print link to confirm you are able to extract the correct link
-        - explore type of link
-        - print link.get_text() and look at the attributes of link via link.attrs (a dict)
-        - capture link.get_text() into a variable state
-        - capture the link's URL into a variable state_url --- link.attrs["href"] is only a relative path, and we need a full URL. Define a prefix variable holding "https://simple.wikipedia.org" and concatenate prefix + link.attrs["href"]
-        - create a new dictionary called state_links --- we are going to use this dict to track each state and its URL. Think carefully about where you have to create this empty dict.
-
-#### Congrats :) stage 1 is done
-
-%% Cell type:code id: tags:
-
-``` python
-prefix = "https://simple.wikipedia.org"
-state_links = {} #KEY: state name; VALUE: link to state page
-
-for tr in trs:
-    th = tr.find("th")
-    links = th.find_all("a")
-    #print(len(links))
-    #print(th.get_text())
-    if len(links) == 0:
-        continue
-    link = links[0]
-    #print(type(link), link)
-    #print(link.get_text(), link.attrs) #link.attrs is a dict
-    state = link.get_text()
-    state_url = prefix + link.attrs["href"]
-    state_links[state] = state_url
-
-state_links
-```
-
-%% Output
-
-    {'postal abbs.': 'https://simple.wikipedia.org/wiki/List_of_U.S._state_abbreviations',
-     'Alabama': 'https://simple.wikipedia.org/wiki/Alabama',
-     'Alaska': 'https://simple.wikipedia.org/wiki/Alaska',
-     'Arizona': 'https://simple.wikipedia.org/wiki/Arizona',
-     'Arkansas': 'https://simple.wikipedia.org/wiki/Arkansas',
-     'California': 'https://simple.wikipedia.org/wiki/California',
-     'Colorado': 'https://simple.wikipedia.org/wiki/Colorado',
-     'Connecticut': 'https://simple.wikipedia.org/wiki/Connecticut',
-     'Delaware': 'https://simple.wikipedia.org/wiki/Delaware',
-     'Florida': 'https://simple.wikipedia.org/wiki/Florida',
-     'Georgia': 'https://simple.wikipedia.org/wiki/Georgia_(U.S._state)',
-     'Hawaii': 'https://simple.wikipedia.org/wiki/Hawaii',
-     'Idaho': 'https://simple.wikipedia.org/wiki/Idaho',
-     'Illinois': 'https://simple.wikipedia.org/wiki/Illinois',
-     'Indiana': 'https://simple.wikipedia.org/wiki/Indiana',
-     'Iowa': 'https://simple.wikipedia.org/wiki/Iowa',
-     'Kansas': 'https://simple.wikipedia.org/wiki/Kansas',
-     'Kentucky': 'https://simple.wikipedia.org/wiki/Kentucky',
-     'Louisiana': 'https://simple.wikipedia.org/wiki/Louisiana',
-     'Maine': 'https://simple.wikipedia.org/wiki/Maine',
-     'Maryland': 'https://simple.wikipedia.org/wiki/Maryland',
-     'Massachusetts': 'https://simple.wikipedia.org/wiki/Massachusetts',
-     'Michigan': 'https://simple.wikipedia.org/wiki/Michigan',
-     'Minnesota': 'https://simple.wikipedia.org/wiki/Minnesota',
-     'Mississippi': 'https://simple.wikipedia.org/wiki/Mississippi',
-     'Missouri': 'https://simple.wikipedia.org/wiki/Missouri',
-     'Montana': 'https://simple.wikipedia.org/wiki/Montana',
-     'Nebraska': 'https://simple.wikipedia.org/wiki/Nebraska',
-     'Nevada': 'https://simple.wikipedia.org/wiki/Nevada',
-     'New Hampshire': 'https://simple.wikipedia.org/wiki/New_Hampshire',
-     'New Jersey': 'https://simple.wikipedia.org/wiki/New_Jersey',
-     'New Mexico': 'https://simple.wikipedia.org/wiki/New_Mexico',
-     'New York': 'https://simple.wikipedia.org/wiki/New_York_(state)',
-     'North Carolina': 'https://simple.wikipedia.org/wiki/North_Carolina',
-     'North Dakota': 'https://simple.wikipedia.org/wiki/North_Dakota',
-     'Ohio': 'https://simple.wikipedia.org/wiki/Ohio',
-     'Oklahoma': 'https://simple.wikipedia.org/wiki/Oklahoma',
-     'Oregon': 'https://simple.wikipedia.org/wiki/Oregon',
-     'Pennsylvania': 'https://simple.wikipedia.org/wiki/Pennsylvania',
-     'Rhode Island': 'https://simple.wikipedia.org/wiki/Rhode_Island',
-     'South Carolina': 'https://simple.wikipedia.org/wiki/South_Carolina',
-     'South Dakota': 'https://simple.wikipedia.org/wiki/South_Dakota',
-     'Tennessee': 'https://simple.wikipedia.org/wiki/Tennessee',
-     'Texas': 'https://simple.wikipedia.org/wiki/Texas',
-     'Utah': 'https://simple.wikipedia.org/wiki/Utah',
-     'Vermont': 'https://simple.wikipedia.org/wiki/Vermont',
-     'Virginia': 'https://simple.wikipedia.org/wiki/Virginia',
-     'Washington': 'https://simple.wikipedia.org/wiki/Washington',
-     'West Virginia': 'https://simple.wikipedia.org/wiki/West_Virginia',
-     'Wisconsin': 'https://simple.wikipedia.org/wiki/Wisconsin',
-     'Wyoming': 'https://simple.wikipedia.org/wiki/Wyoming'}
-
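The loop above assumes every `tr` contains a `th`. If a row had none, `tr.find("th")` would return `None` and the next call would raise an `AttributeError`. A hedged sketch of a more defensive variant on hand-written HTML (not the Wikipedia page); using `urljoin` in place of string concatenation is a swapped-in choice here, not what the lecture code does:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hand-written rows (an assumption for illustration): one without a th,
# one th without a link, and one normal state row
html = """
<table>
  <tr><td>row without a th</td></tr>
  <tr><th>Flag</th></tr>
  <tr><th><a href="/wiki/Wisconsin">Wisconsin</a></th></tr>
</table>
"""
base = "https://simple.wikipedia.org"
state_links = {}  # KEY: state name; VALUE: full URL

for tr in BeautifulSoup(html, "html.parser").find_all("tr"):
    th = tr.find("th")
    if th is None:        # find returns None when nothing matches
        continue
    link = th.find("a")   # first link inside th, or None if there are no links
    if link is None:
        continue
    state_links[link.get_text()] = urljoin(base, link.get("href"))

print(state_links)
```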
-%% Cell type:markdown id: tags:
-
-# STAGE 2: download the html page for each state
-## Stage 2 pseudocode
-1. Create a directory called "html_files_for_states". Make sure to use try except block to catch FileExistsError exception
-2. Initially convert the keys of state_links dict into a list and work with just first 3 items in the list of keys
-3. Iterate over each key (initially just use 3):
-    1. If key is "postal abbs.", skip processing. What keyword allows you to skip current iteration of the loop?
-    2. To create each state's html file name, concatenate the directory name "html_files_for_states" with current key and add a ".html" to the end.
-    3. Add the html file name into a new dictionary called "state_files". Think carefully about where you have to create this empty dict.
-    4. Use requests module get(...) function call to download the contents of the state URL page.
-    5. Open the state html file in write mode and write r.text into the state html file.
-
-#### Congrats :) stage 2 is done
-
-%% Cell type:code id: tags:
-
-``` python
-html_dir = "html_files_for_states"
-state_files = {} #KEY: state; VALUE: state file
-
-try:
-    os.mkdir(html_dir)
-except FileExistsError:
-    pass
-
-#for state in list(state_links.keys())[:3]: # Use this for initial testing
-for state in state_links.keys():
-    if state == "postal abbs.":
-        continue
-    state_url = state_links[state]
-
-    #html file name
-    state_file = os.path.join(html_dir, state + ".html")
-    state_files[state] = state_file
-
-    #Optimization: skip the download if the state file already exists
-    if os.path.exists(state_file):
-        continue
-
-    #Download
-    r = requests.get(state_url)
-    r.raise_for_status()
-    print(state_file)
-
-    #Save to a file
-    with open(state_file, "w", encoding = "utf-8") as f:
-        f.write(r.text)
-```
-
-%% Output
-
-    html_files_for_states/Alabama.html
-    html_files_for_states/Alaska.html
-    html_files_for_states/Arizona.html
-    html_files_for_states/Arkansas.html
-    html_files_for_states/California.html
-    html_files_for_states/Colorado.html
-    html_files_for_states/Connecticut.html
-    html_files_for_states/Delaware.html
-    html_files_for_states/Florida.html
-    html_files_for_states/Georgia.html
-    html_files_for_states/Hawaii.html
-    html_files_for_states/Idaho.html
-    html_files_for_states/Illinois.html
-    html_files_for_states/Indiana.html
-    html_files_for_states/Iowa.html
-    html_files_for_states/Kansas.html
-    html_files_for_states/Kentucky.html
-    html_files_for_states/Louisiana.html
-    html_files_for_states/Maine.html
-    html_files_for_states/Maryland.html
-    html_files_for_states/Massachusetts.html
-    html_files_for_states/Michigan.html
-    html_files_for_states/Minnesota.html
-    html_files_for_states/Mississippi.html
-    html_files_for_states/Missouri.html
-    html_files_for_states/Montana.html
-    html_files_for_states/Nebraska.html
-    html_files_for_states/Nevada.html
-    html_files_for_states/New Hampshire.html
-    html_files_for_states/New Jersey.html
-    html_files_for_states/New Mexico.html
-    html_files_for_states/New York.html
-    html_files_for_states/North Carolina.html
-    html_files_for_states/North Dakota.html
-    html_files_for_states/Ohio.html
-    html_files_for_states/Oklahoma.html
-    html_files_for_states/Oregon.html
-    html_files_for_states/Pennsylvania.html
-    html_files_for_states/Rhode Island.html
-    html_files_for_states/South Carolina.html
-    html_files_for_states/South Dakota.html
-    html_files_for_states/Tennessee.html
-    html_files_for_states/Texas.html
-    html_files_for_states/Utah.html
-    html_files_for_states/Vermont.html
-    html_files_for_states/Virginia.html
-    html_files_for_states/Washington.html
-    html_files_for_states/West Virginia.html
-    html_files_for_states/Wisconsin.html
-    html_files_for_states/Wyoming.html
-
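The "skip if the file already exists" idea from Stage 2 can be isolated into a small helper. This is only a sketch: `save_if_missing` and `fake_fetch` are hypothetical names introduced for illustration, and the fetch step is passed in as a function so the example runs without network access. In the notebook, that function would wrap the `requests.get(...)` call.

```python
import os
import tempfile

def save_if_missing(path, fetch):
    """Write the text returned by fetch() to path unless the file already exists.
    Returns True if the file was written, False if it was skipped."""
    if os.path.exists(path):
        return False
    text = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

# Fake fetch function that records how often it is called (no network needed)
calls = []
def fake_fetch():
    calls.append(1)
    return "<html>Wisconsin</html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Wisconsin.html")
    first = save_if_missing(path, fake_fetch)    # file is missing, so it is written
    second = save_if_missing(path, fake_fetch)   # file exists now, so fetch is skipped

print(first, second, len(calls))  # True False 1
```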
-%% Cell type:markdown id: tags:
-
-# STAGE 3: extract details from each state page
-## Stage 3 pseudocode
-1. Write a function state_stats. Input path to 1 state file. Output dict of stats for that state
-2. Open state html file, read its content.
-3. Create a BeautifulSoup object called doc.
-4. doc.find_all("tr") - capture return value into a variable called trs
-5. Iterate over each tr element
-    1. You can retrieve a pair of elements by saying: cells = tr.find_all(["th", "td"])
-    2. Explore length of the cells. Notice that some rows do not have exactly 2 cells. Let's skip over those.
-    3. Add an entry to a dict called stats, where key is the th element's text and the value is td element's text. Think carefully about where you have to create this empty dict.
-6. Don't forget to return the stats dict
-7. Call state_stats with state_files["Wisconsin"]
-
-%% Cell type:code id: tags:
-
-``` python
-def state_stats(path):
-    stats = {}
-    with open(path, encoding = "utf-8") as f:
-        html_string = f.read()
-
-    doc = BeautifulSoup(html_string, "html.parser")
-    trs = doc.find_all("tr")
-    for tr in trs:
-        cells = tr.find_all(["th", "td"])
-        if len(cells) == 2:
-            key = cells[0].get_text()
-            value = cells[1].get_text()
-            stats[key] = value
-    return stats
-
-wi_stats = state_stats(state_files["Wisconsin"])
-print("WI state drink:", wi_stats["Beverage"])
-print("WI state dance:", wi_stats["Dance"])
-```
-
-%% Output
-
-    WI state drink: Milk
-    WI state dance: Polka
-
-%% Cell type:markdown id: tags:
-
-## Stage 3 pseudocode continued
-8. Iterate over all the state files, call state_stats function, and save the return value into a variable.
-9. Keep track of each state's stats in a dict called states_details
-10. Create a pandas DataFrame from the states_details dict
-11. Explore the DataFrame.
-
-%% Cell type:code id: tags:
-
-``` python
-states_details = {}
-
-for state in state_files.keys():
-    stats = state_stats(state_files[state])
-    states_details[state] = stats
-```
-
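When a dict of dicts is passed to `pd.DataFrame`, the outer keys become column labels and the inner keys become row labels. Any key a state lacks is filled with `NaN`. A minimal sketch with a hand-made stand-in for `states_details` (the `details` and `df` names are for illustration only; the values are taken from outputs in these notes):

```python
import pandas as pd

# Hand-made stand-in for states_details; each state has different keys
details = {
    "Wisconsin": {"Capital": "Madison", "Beverage": "Milk"},
    "Wyoming": {"Admitted to the Union": "July 10, 1890 (44th)"},
}

df = pd.DataFrame(details)
print(df.shape)               # (3, 2): 3 distinct stats as rows, 2 states as columns
print(df)                     # missing combinations show up as NaN
print(df.T.loc["Wisconsin"])  # transpose so that each state is a row
```

This is why the full DataFrame has so many `NaN` entries: each state page uses a slightly different set of row headers.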
-%% Cell type:code id: tags:
-
-``` python
-states_df = pd.DataFrame(states_details)
-states_df
-```
-
-%% Output
-
-                                                                         Alabama  \
-    Country                                                        United States
-    Before statehood                                           Alabama Territory
-    Admitted to the Union                               December 14, 1819 (22nd)
-    Capital                                                           Montgomery
-    Largest city                                                      Birmingham
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                                        Alaska  \
-    Country                                                      United States
-    Before statehood                                       Territory of Alaska
-    Admitted to the Union                               January 3, 1959 (49th)
-    Capital                                                             Juneau
-    Largest city                                                     Anchorage
-    ...                                                                    ...
-    Largest cities (pop. over 50,000)                                      NaN
-    Smaller cities (pop. 15,000 to 50,000)                                 NaN
-    Largest villages (pop. over 15,000)                                    NaN
-    Highest elevation (Gannett Peak[2][3][4])                              NaN
-    Lowest elevation (Belle Fourche River at South ...                     NaN
-    
-                                                                         Arizona  \
-    Country                                                        United States
-    Before statehood                                           Arizona Territory
-    Admitted to the Union                               February 14, 1912 (48th)
-    Capital                                                                  NaN
-    Largest city                                                             NaN
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                                    Arkansas  \
-    Country                                                    United States
-    Before statehood                                      Arkansas Territory
-    Admitted to the Union                               June 15, 1836 (25th)
-    Capital                                                              NaN
-    Largest city                                                         NaN
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                                   California  \
-    Country                                                                     United States
-    Before statehood                                    Mexican Cession unorganized territory
-    Admitted to the Union                                            September 9, 1850 (31st)
-    Capital                                                                     Sacramento[1]
-    Largest city                                                                  Los Angeles
-    ...                                                                                   ...
-    Largest cities (pop. over 50,000)                                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                                NaN
-    Largest villages (pop. over 15,000)                                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                                             NaN
-    Lowest elevation (Belle Fourche River at South ...                                    NaN
-    
-                                                                     Colorado  \
-    Country                                                     United States
-    Before statehood                                                      NaN
-    Admitted to the Union                               August 1, 1876 (38th)
-    Capital                                                               NaN
-    Largest city                                                          NaN
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                             NaN
-    Lowest elevation (Belle Fourche River at South ...                    NaN
-    
-                                                                  Connecticut  \
-    Country                                                     United States
-    Before statehood                                       Connecticut Colony
-    Admitted to the Union                               January 9, 1788 (5th)
-    Capital                                                       Hartford[1]
-    Largest city                                                   Bridgeport
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                             NaN
-    Lowest elevation (Belle Fourche River at South ...                    NaN
-    
-                                                                                           Delaware  \
-    Country                                                                           United States
-    Before statehood                                    Delaware Colony, New Netherland, New Sweden
-    Admitted to the Union                                                    December 7, 1787 (1st)
-    Capital                                                                                   Dover
-    Largest city                                                                         Wilmington
-    ...                                                                                         ...
-    Largest cities (pop. over 50,000)                                                           NaN
-    Smaller cities (pop. 15,000 to 50,000)                                                      NaN
-    Largest villages (pop. over 15,000)                                                         NaN
-    Highest elevation (Gannett Peak[2][3][4])                                                   NaN
-    Lowest elevation (Belle Fourche River at South ...                                          NaN
-    
-                                                                     Florida  \
-    Country                                                    United States
-    Before statehood                                       Florida Territory
-    Admitted to the Union                               March 3, 1845 (27th)
-    Capital                                                   Tallahassee[1]
-    Largest city                                             Jacksonville[5]
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                      Georgia  \
-    Country                                                     United States
-    Before statehood                                      Province of Georgia
-    Admitted to the Union                               January 2, 1788 (4th)
-    Capital                                                               NaN
-    Largest city                                                          NaN
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])                             NaN
-    Lowest elevation (Belle Fourche River at South ...                    NaN
-    
-                                                        ...  \
-    Country                                             ...
-    Before statehood                                    ...
-    Admitted to the Union                               ...
-    Capital                                             ...
-    Largest city                                        ...
-    ...                                                 ...
-    Largest cities (pop. over 50,000)                   ...
-    Smaller cities (pop. 15,000 to 50,000)              ...
-    Largest villages (pop. over 15,000)                 ...
-    Highest elevation (Gannett Peak[2][3][4])           ...
-    Lowest elevation (Belle Fourche River at South ...  ...
-    
-                                                                           South Dakota  \
-    Country                                                               United States
-    Before statehood                                                   Dakota Territory
-    Admitted to the Union                               November 2, 1889 (39th or 40th)
-    Capital                                                                      Pierre
-    Largest city                                                            Sioux Falls
-    ...                                                                             ...
-    Largest cities (pop. over 50,000)                                               NaN
-    Smaller cities (pop. 15,000 to 50,000)                                          NaN
-    Largest villages (pop. over 15,000)                                             NaN
-    Highest elevation (Gannett Peak[2][3][4])                                       NaN
-    Lowest elevation (Belle Fourche River at South ...                              NaN
-    
-                                                                  Tennessee  \
-    Country                                                   United States
-    Before statehood                                    Southwest Territory
-    Admitted to the Union                               June 1, 1796 (16th)
-    Capital                                                             NaN
-    Largest city                                                        NaN
-    ...                                                                 ...
-    Largest cities (pop. over 50,000)                                   NaN
-    Smaller cities (pop. 15,000 to 50,000)                              NaN
-    Largest villages (pop. over 15,000)                                 NaN
-    Highest elevation (Gannett Peak[2][3][4])                           NaN
-    Lowest elevation (Belle Fourche River at South ...                  NaN
-    
-                                                                           Texas  \
-    Country                                                        United States
-    Before statehood                                           Republic of Texas
-    Admitted to the Union                               December 29, 1845 (28th)
-    Capital                                                               Austin
-    Largest city                                                         Houston
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                                          Utah  \
-    Country                                                      United States
-    Before statehood                                            Utah Territory
-    Admitted to the Union                               January 4, 1896 (45th)
-    Capital                                                                NaN
-    Largest city                                                           NaN
-    ...                                                                    ...
-    Largest cities (pop. over 50,000)                                      NaN
-    Smaller cities (pop. 15,000 to 50,000)                                 NaN
-    Largest villages (pop. over 15,000)                                    NaN
-    Highest elevation (Gannett Peak[2][3][4])                              NaN
-    Lowest elevation (Belle Fourche River at South ...                     NaN
-    
-                                                                     Vermont  \
-    Country                                                    United States
-    Before statehood                                        Vermont Republic
-    Admitted to the Union                               March 4, 1791 (14th)
-    Capital                                                       Montpelier
-    Largest city                                                  Burlington
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                    Virginia  \
-    Country                                                    United States
-    Before statehood                                      Colony of Virginia
-    Admitted to the Union                               June 25, 1788 (10th)
-    Capital                                                         Richmond
-    Largest city                                              Virginia Beach
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                      Washington  \
-    Country                                                        United States
-    Before statehood                                        Washington Territory
-    Admitted to the Union                               November 11, 1889 (42nd)
-    Capital                                                              Olympia
-    Largest city                                                         Seattle
-    ...                                                                      ...
-    Largest cities (pop. over 50,000)                                        NaN
-    Smaller cities (pop. 15,000 to 50,000)                                   NaN
-    Largest villages (pop. over 15,000)                                      NaN
-    Highest elevation (Gannett Peak[2][3][4])                                NaN
-    Lowest elevation (Belle Fourche River at South ...                       NaN
-    
-                                                               West Virginia  \
-    Country                                                    United States
-    Before statehood                                        Part of Virginia
-    Admitted to the Union                               June 20, 1863 (35th)
-    Capital                                                              NaN
-    Largest city                                                         NaN
-    ...                                                                  ...
-    Largest cities (pop. over 50,000)                                    NaN
-    Smaller cities (pop. 15,000 to 50,000)                               NaN
-    Largest villages (pop. over 15,000)                                  NaN
-    Highest elevation (Gannett Peak[2][3][4])                            NaN
-    Lowest elevation (Belle Fourche River at South ...                   NaN
-    
-                                                                                                Wisconsin  \
-    Country                                                                                 United States
-    Before statehood                                                                  Wisconsin Territory
-    Admitted to the Union                                                             May 29, 1848 (30th)
-    Capital                                                                                       Madison
-    Largest city                                                                                Milwaukee
-    ...                                                                                               ...
-    Largest cities (pop. over 50,000)                   \nAppleton\nEau Claire\nGreen Bay\nJanesville\...
-    Smaller cities (pop. 15,000 to 50,000)              \nBeaver Dam\nBeloit\nBrookfield\nCudahy\nDe P...
-    Largest villages (pop. over 15,000)                 \nAshwaubenon\nBellevue\nCaledonia\nFox Crossi...
-    Highest elevation (Gannett Peak[2][3][4])                                                         NaN
-    Lowest elevation (Belle Fourche River at South ...                                                NaN
-    
-                                                                      Wyoming
-    Country                                                     United States
-    Before statehood                                        Wyoming Territory
-    Admitted to the Union                                July 10, 1890 (44th)
-    Capital                                                               NaN
-    Largest city                                                          NaN
-    ...                                                                   ...
-    Largest cities (pop. over 50,000)                                     NaN
-    Smaller cities (pop. 15,000 to 50,000)                                NaN
-    Largest villages (pop. over 15,000)                                   NaN
-    Highest elevation (Gannett Peak[2][3][4])           13,809 ft (4,209.1 m)
-    Lowest elevation (Belle Fourche River at South ...       3,101 ft (945 m)
-    
-    [327 rows x 50 columns]
-
-%% Cell type:code id: tags:
-
-``` python
-states_df.loc["Capital"]
-```
-
-%% Output
-
-    Alabama               Montgomery
-    Alaska                    Juneau
-    Arizona                      NaN
-    Arkansas                     NaN
-    California         Sacramento[1]
-    Colorado                     NaN
-    Connecticut          Hartford[1]
-    Delaware                   Dover
-    Florida           Tallahassee[1]
-    Georgia                      NaN
-    Hawaii                       NaN
-    Idaho                        NaN
-    Illinois                     NaN
-    Indiana                      NaN
-    Iowa                         NaN
-    Kansas                    Topeka
-    Kentucky               Frankfort
-    Louisiana            Baton Rouge
-    Maine                    Augusta
-    Maryland               Annapolis
-    Massachusetts                NaN
-    Michigan                 Lansing
-    Minnesota             Saint Paul
-    Mississippi                  NaN
-    Missouri          Jefferson City
-    Montana                   Helena
-    Nebraska                 Lincoln
-    Nevada               Carson City
-    New Hampshire            Concord
-    New Jersey               Trenton
-    New Mexico              Santa Fe
-    New York                  Albany
-    North Carolina           Raleigh
-    North Dakota            Bismarck
-    Ohio                         NaN
-    Oklahoma                     NaN
-    Oregon                     Salem
-    Pennsylvania          Harrisburg
-    Rhode Island                 NaN
-    South Carolina          Columbia
-    South Dakota              Pierre
-    Tennessee                    NaN
-    Texas                     Austin
-    Utah                         NaN
-    Vermont               Montpelier
-    Virginia                Richmond
-    Washington               Olympia
-    West Virginia                NaN
-    Wisconsin                Madison
-    Wyoming                      NaN
-    Name: Capital, dtype: object
-
-%% Cell type:code id: tags:
-
-``` python
-states_df.T.loc["Wisconsin"]
-```
-
-%% Output
-
-    Country                                                                                                    United States
-    Before statehood                                                                                     Wisconsin Territory
-    Admitted to the Union                                                                                May 29, 1848 (30th)
-    Capital                                                                                                          Madison
-    Largest city                                                                                                   Milwaukee
-                                                                                                 ...
-    Largest cities (pop. over 50,000)                                      \nAppleton\nEau Claire\nGreen Bay\nJanesville\...
-    Smaller cities (pop. 15,000 to 50,000)                                 \nBeaver Dam\nBeloit\nBrookfield\nCudahy\nDe P...
-    Largest villages (pop. over 15,000)                                    \nAshwaubenon\nBellevue\nCaledonia\nFox Crossi...
-    Highest elevation (Gannett Peak[2][3][4])                                                                            NaN
-    Lowest elevation (Belle Fourche River at South Dakota border[3][4])                                                  NaN
-    Name: Wisconsin, Length: 327, dtype: object
--- a/f22/meena_lec_notes/lec-33/lec_33_database2.ipynb
+++ b/f22/meena_lec_notes/lec-33/lec_33_database2.ipynb
-%% Cell type:markdown id: tags:
+%% Cell type:code id: tags:

-## ignore this cell (it's just to make certain text red later, but you don't need to understand it).
+``` python
+# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
 from IPython.core.display import HTML
 HTML('<style>em { color: red; }</style>')
+```
+
+%% Output
+
+    <IPython.core.display.HTML object>

 %% Cell type:code id: tags:

 ``` python
 # import statements
 import sqlite3
 import pandas as pd
 import os
 ```

 %% Cell type:markdown id: tags:

 ## Warmup: SQL query clauses
 **Mandatory SQL clauses**
 - SELECT: column, column, ...  or *
 - FROM: table

 **Optional SQL clauses**
 - WHERE:  boolean expression (if row has ....)
 - can use AND, OR, NOT
 - ORDER BY  column (ASC, DESC)
 - LIMIT: num rows
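
As a quick illustration of how the clauses above combine, here is a minimal sketch. It assumes a tiny in-memory table with made-up rows (not the real `movies.db`) and uses the standard-library `sqlite3` cursor directly so that it runs on its own.

```python
import sqlite3

# Hypothetical four-row table standing in for movies.db (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Title TEXT, Year INTEGER, Rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("A", 2015, 6.0), ("B", 2016, 8.5), ("C", 2016, 7.0), ("D", 2016, 9.0)],
)

# WHERE filters rows, ORDER BY sorts what is left, LIMIT keeps the first n rows.
rows = conn.execute("""
SELECT Title, Rating
FROM movies
WHERE Year = 2016 AND NOT Rating < 7.5
ORDER BY Rating DESC
LIMIT 1
""").fetchall()
print(rows)
```

Only the 2016 rows with a rating of at least 7.5 survive the `WHERE` clause; sorting and limiting then pick the single best-rated one.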

 %% Cell type:code id: tags:

 ``` python
 # open up the movies database
 movies_path = "movies.db"
 assert os.path.exists(movies_path)
 c = sqlite3.connect(movies_path)
 ```

 %% Cell type:code id: tags:

 ``` python
 # what are the table names?
 df = pd.read_sql("select * from sqlite_master where type='table'", c)
 df
 ```

 %% Output

        type    name tbl_name  rootpage  \
    0  table  movies   movies         2
    
                                                     sql
    0  CREATE TABLE "movies" (\n"Title" TEXT,\n  "Gen...

 %% Cell type:code id: tags:

 ``` python
 # what are the data types?
 print(df["sql"].iloc[0])
 ```

 %% Output

    CREATE TABLE "movies" (
    "Title" TEXT,
      "Genre" TEXT,
      "Director" TEXT,
      "Cast" TEXT,
      "Year" INTEGER,
      "Runtime" INTEGER,
      "Rating" REAL,
      "Revenue" REAL
    )

 %% Cell type:code id: tags:

 ``` python
 # what is all our data?
 pd.read_sql("select * from movies", c)
 ```

 %% Output

                                   Title                         Genre  \
    0            Guardians of the Galaxy       Action,Adventure,Sci-Fi
    1                         Prometheus      Adventure,Mystery,Sci-Fi
    2                              Split               Horror,Thriller
    3                               Sing       Animation,Comedy,Family
    4                      Suicide Squad      Action,Adventure,Fantasy
    ...                              ...                           ...
    1063  Guardians of the Galaxy Vol. 2     Action, Adventure, Comedy
    1064                     Baby Driver          Action, Crime, Drama
    1065                  Only the Brave      Action, Biography, Drama
    1066                   Incredibles 2  Animation, Action, Adventure
    1067                  A Star Is Born         Drama, Music, Romance
    
                      Director                                               Cast  \
    0               James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
    1             Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael   ...
    2       M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
    3     Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...
    4               David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...
    ...                    ...                                                ...
    1063            James Gunn  Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...
    1064          Edgar Wright  Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...
    1065       Joseph Kosinski  Josh Brolin, Miles Teller, Jeff Bridges, Jenni...
    1066             Brad Bird  Craig T. Nelson, Holly Hunter, Sarah Vowell, H...
    1067        Bradley Cooper  Lady Gaga, Bradley Cooper, Sam Elliott, Greg G...
    
          Year  Runtime  Rating  Revenue
    0     2014      121     8.1   333.13
    1     2012      124     7.0   126.46
    2     2016      117     7.3   138.12
    3     2016      108     7.2   270.32
    4     2016      123     6.2   325.02
    ...    ...      ...     ...      ...
    1063  2017      136     7.6   389.81
    1064  2017      113     7.6   107.83
    1065  2017      134     7.6    18.34
    1066  2018      118     7.6   608.58
    1067  2018      136     7.6   215.29
    
    [1068 rows x 8 columns]

 %% Cell type:code id: tags:

 ``` python
 # this helper function lets us type less for each query
 def qry(sql, conn = c):
    return pd.read_sql(sql, conn)
 ```

 %% Cell type:markdown id: tags:

 Sample query format:

 ```
 SELECT
 FROM movies
 WHERE
 ORDER BY
 LIMIT
 ```

 %% Cell type:code id: tags:

 ``` python
 # call qry ....copy/paste the query from above
 qry("""
 SELECT *
 FROM movies
 """)
 ```

 %% Output

                                   Title                         Genre  \
    0            Guardians of the Galaxy       Action,Adventure,Sci-Fi
    1                         Prometheus      Adventure,Mystery,Sci-Fi
    2                              Split               Horror,Thriller
    3                               Sing       Animation,Comedy,Family
    4                      Suicide Squad      Action,Adventure,Fantasy
    ...                              ...                           ...
    1063  Guardians of the Galaxy Vol. 2     Action, Adventure, Comedy
    1064                     Baby Driver          Action, Crime, Drama
    1065                  Only the Brave      Action, Biography, Drama
    1066                   Incredibles 2  Animation, Action, Adventure
    1067                  A Star Is Born         Drama, Music, Romance
    
                      Director                                               Cast  \
    0               James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
    1             Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael   ...
    2       M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
    3     Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...
    4               David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...
    ...                    ...                                                ...
    1063            James Gunn  Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...
    1064          Edgar Wright  Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...
    1065       Joseph Kosinski  Josh Brolin, Miles Teller, Jeff Bridges, Jenni...
    1066             Brad Bird  Craig T. Nelson, Holly Hunter, Sarah Vowell, H...
    1067        Bradley Cooper  Lady Gaga, Bradley Cooper, Sam Elliott, Greg G...
    
          Year  Runtime  Rating  Revenue
    0     2014      121     8.1   333.13
    1     2012      124     7.0   126.46
    2     2016      117     7.3   138.12
    3     2016      108     7.2   270.32
    4     2016      123     6.2   325.02
    ...    ...      ...     ...      ...
    1063  2017      136     7.6   389.81
    1064  2017      113     7.6   107.83
    1065  2017      134     7.6    18.34
    1066  2018      118     7.6   608.58
    1067  2018      136     7.6   215.29
    
    [1068 rows x 8 columns]

 %% Cell type:markdown id: tags:

 ### What's the *Title* of the movie with the highest *Rating*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT Title, Rating
 FROM movies
 ORDER BY Rating DESC
 LIMIT 1
 """)
 df
 ```

 %% Output

                 Title  Rating
    0  The Dark Knight     9.0

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0]["Title"]
 ```

 %% Output

    'The Dark Knight'

 %% Cell type:markdown id: tags:

 ### Which *Director* made the movie with the shortest *Runtime*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT Director, Runtime
 FROM movies
 ORDER BY Runtime
 LIMIT 1
 """)
 df
 ```

 %% Output

            Director  Runtime
    0  Claude Barras       66

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0]["Director"]
 ```

 %% Output

    'Claude Barras'

 %% Cell type:markdown id: tags:

 ### What was the *Director*  and *Title* of the movie with the largest *Revenue*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, revenue, title
 FROM movies
 ORDER BY revenue DESC
 LIMIT 1
 """)
 ```

 %% Output

          Director  Revenue                                       Title
    0  J.J. Abrams   936.63  Star Wars: Episode VII - The Force Awakens

 %% Cell type:markdown id: tags:

-### What is the *Title* of the movie with the highest *Revenue* in *Year* 2016?
+### What is the *Title* of the movie with the highest *Revenue* in *Year* 2019?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT title, revenue, year
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 ORDER BY revenue DESC
 LIMIT 1
 """)
 df
 ```

 %% Output

-           Title  Revenue  Year
-    0  Rogue One   532.17  2016
+                   Title  Revenue  Year
+    0  Avengers: Endgame   858.37  2019

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0]["Title"]
 ```

 %% Output

-    'Rogue One'
+    'Avengers: Endgame'

 %% Cell type:markdown id: tags:

-### Which *3 movies*  had the highest *Revenue* in the *Year* 2016?
+### Which *3 movies* had the highest *Revenue* in the *Year* 2019?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT title, revenue
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 ORDER BY revenue DESC
 LIMIT 3
 """)
 df
 ```

 %% Output

-                            Title  Revenue
-    0                   Rogue One   532.17
-    1                Finding Dory   486.29
-    2  Captain America: Civil War   408.08
+                   Title  Revenue
+    0  Avengers: Endgame   858.37
+    1        Toy Story 4   434.04
+    2              Joker   335.45

 %% Cell type:code id: tags:

 ``` python
-# Extract revenue column and convert to list
-list(df["Revenue"])
+# Extract title column and convert to list
+list(df["Title"])
 ```

 %% Output

-    [532.17, 486.29, 408.08]
+    ['Avengers: Endgame', 'Toy Story 4', 'Joker']

 %% Cell type:markdown id: tags:

 ## Lecture 33: Database 2
 Learning Objectives:
 - Use the AS command to rename a column or a calculation
 - Use SQL Aggregate functions to summarize database columns:
 - SUM, AVG, COUNT, MIN, MAX, DISTINCT
 - Use the GROUP BY command to place database rows into buckets.
 - Use the HAVING command to apply conditions to groups.

 %% Cell type:markdown id: tags:

 ### Which *3 movies* have the highest *rating-to-revenue ratios*?

 The `AS` clause lets us rename a column or a calculation

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT title, rating / revenue AS ratio
 FROM movies
 ORDER BY ratio DESC
 LIMIT 3
 """)
 ```

 %% Output

             Title  ratio
    0    Wakefield  750.0
    1  Love, Rosie  720.0
    2     Lovesong  640.0
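
The alias created by `AS` becomes the column name in the result, and it can also be reused in `ORDER BY`. The sketch below shows this on a hypothetical three-row table (made-up values), reading the column names from the cursor instead of going through the `qry` helper.

```python
import sqlite3

# Hypothetical data; only the shape mirrors the movies table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Title TEXT, Rating REAL, Revenue REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("A", 8.0, 4.0), ("B", 6.0, 0.5), ("C", 9.0, 300.0)],
)

cur = conn.execute("""
SELECT Title, Rating / Revenue AS ratio
FROM movies
ORDER BY ratio DESC
LIMIT 2
""")
col_names = [d[0] for d in cur.description]  # result column names
rows = cur.fetchall()
print(col_names)
print(rows)
```

Note that a very small revenue produces a very large ratio, so the top of this ranking tends to be dominated by low-revenue movies.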

 %% Cell type:markdown id: tags:

 ## Aggregate Queries

 ```
 SUM, AVG, COUNT, MIN, MAX, DISTINCT
 ```

 %% Cell type:markdown id: tags:

 ### How many *rows of movies* are there?
 Note: when we want to count the number of rows, we use COUNT(*)

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT COUNT(*)
 FROM movies
 """)
 ```

 %% Output

       COUNT(*)
    0      1068

 %% Cell type:markdown id: tags:

 ### How many *directors* are there?

 %% Cell type:code id: tags:

 ``` python
-# This doesn't feel correct - it counts duplicates for director names!
 qry("""
 SELECT COUNT(director)
 FROM movies
 """)
+# This doesn't feel correct - it counts duplicates for director names!
 ```

 %% Output

       COUNT(director)
    0             1068

 %% Cell type:markdown id: tags:

 To count each director only once, use `COUNT(DISTINCT column_name)`

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT COUNT(DISTINCT director)
 FROM movies
 """)
 ```

 %% Output

       COUNT(DISTINCT director)
    0                       679
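
To make the difference between the three counting forms concrete, here is a small sketch on a hypothetical table with one repeated director and one missing (`NULL`) director. `COUNT(*)` counts rows, `COUNT(column)` counts non-NULL values (duplicates included), and `COUNT(DISTINCT column)` counts unique non-NULL values.

```python
import sqlite3

# Hypothetical rows: director "X" appears twice, one director is missing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Title TEXT, Director TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", "X"), ("B", "X"), ("C", "Y"), ("D", None)],
)

counts = conn.execute("""
SELECT COUNT(*), COUNT(Director), COUNT(DISTINCT Director)
FROM movies
""").fetchone()
print(counts)
```

In the lecture data `COUNT(*)` and `COUNT(director)` both return 1068, which suggests that no director value is missing there.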

 %% Cell type:markdown id: tags:

+### What are the names of the *directors* (without duplicates)?
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("""
+SELECT DISTINCT director
+FROM movies
+""")
+df
+```
+
+%% Output
+
+                      Director
+    0               James Gunn
+    1             Ridley Scott
+    2       M. Night Shyamalan
+    3     Christophe Lourdelet
+    4               David Ayer
+    ..                     ...
+    674     Andrey Zvyagintsev
+    675             Sean Baker
+    676  Destin Daniel Cretton
+    677           Tyler Nilson
+    678         Bradley Cooper
+    
+    [679 rows x 1 columns]
+
+%% Cell type:code id: tags:
+
+``` python
+# Extract Director column and convert to list
+director_list = list(df["Director"])
+#director_list # uncomment to see the output
+```
+
+%% Cell type:markdown id: tags:
+
 ### What is the total *Revenue* of *all the movies*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT SUM(revenue)
 FROM movies
 """)
 ```

 %% Output

       SUM(revenue)
    0      80668.27

 %% Cell type:markdown id: tags:

 ### What is the *average rating* across *all movies*?

 * v1: with `SUM` and `COUNT`
 * v2: with `AVG`
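
The two versions agree only when the column has no missing values: `AVG` skips `NULL`s, while `COUNT(*)` counts every row. A minimal sketch on hypothetical data with one missing rating:

```python
import sqlite3

# Hypothetical ratings; the third movie has no rating (NULL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Title TEXT, Rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", 6.0), ("B", 8.0), ("C", None)],
)

v1, v1_non_null, v2 = conn.execute("""
SELECT SUM(Rating) / COUNT(*),
       SUM(Rating) / COUNT(Rating),
       AVG(Rating)
FROM movies
""").fetchone()
print(v1, v1_non_null, v2)
```

Dividing by `COUNT(Rating)` instead of `COUNT(*)` reproduces what `AVG` computes.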

 %% Cell type:code id: tags:

 ``` python
 # v1
 df = qry("""
 SELECT SUM(rating) / COUNT(*)
 FROM movies
 """)
 df
 ```

 %% Output

       SUM(rating) / COUNT(*)
    0                6.805431

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0][0]
 ```

 %% Output

    6.805430711610491

 %% Cell type:code id: tags:

 ``` python
 # v2
 qry("""
 SELECT AVG(rating)
 FROM movies
 """)
 ```

 %% Output

       AVG(rating)
    0     6.805431

 %% Cell type:markdown id: tags:

 ### What is the *average revenue* and *average runtime* of *all the movies*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT AVG(revenue), AVG(runtime)
 FROM movies
 """)
 ```

 %% Output

       AVG(revenue)  AVG(runtime)
    0     75.532088    114.093633

 %% Cell type:markdown id: tags:

 ### What is the *average revenue* for a *Ridley Scott* movie?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT AVG(revenue)
 FROM movies
 WHERE director = "Ridley Scott"
 """)
 df
 ```

 %% Output

       AVG(revenue)
    0       89.8825

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0][0]
 ```

 %% Output

    89.88250000000001

 %% Cell type:markdown id: tags:

-### *How many movies* were there in *2016*?
+### *How many movies* were there in *2019*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT COUNT(*)
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 """)
+df
 ```

+%% Output
+
+       COUNT(*)
+    0        23
+
 %% Cell type:code id: tags:

 ``` python
 df.iloc[0][0]
 ```

 %% Output

-    296
+    23

 %% Cell type:markdown id: tags:

 ### What *percentage* of the *total revenue* came from the *highest-revenue movie*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT title, MAX(revenue) / SUM(revenue) * 100 AS percentage
 FROM movies
 """)
 df
 ```

 %% Output

                                            Title  percentage
    0  Star Wars: Episode VII - The Force Awakens    1.161088

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0][0]
 ```

 %% Output

    'Star Wars: Episode VII - The Force Awakens'

 %% Cell type:markdown id: tags:

-### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2016*?
+### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2019*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT title, MAX(revenue) / SUM(revenue) * 100 AS percentage
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 """)
 ```

 %% Output

-           Title  percentage
-    0  Rogue One    4.746581
+                   Title  percentage
+    0  Avengers: Endgame    32.19777

 %% Cell type:markdown id: tags:

 # GROUP BY Queries

 ```sql
 SELECT ???, ??? FROM Movies
 GROUP BY ???
 ```

 Sample query format:

 ```
 SELECT
 FROM movies
 WHERE
 GROUP BY
 ORDER BY
 LIMIT
 ```
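
As a sketch of what `GROUP BY` does, the example below buckets a hypothetical three-row table by year and computes one aggregate row per bucket. The table contents are made up for illustration.

```python
import sqlite3

# Hypothetical revenues: two movies in 2018, one in 2019.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (Title TEXT, Year INTEGER, Revenue REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("A", 2018, 10.0), ("B", 2018, 20.0), ("C", 2019, 5.0)],
)

# One output row per distinct Year; aggregates are computed within each group.
rows = conn.execute("""
SELECT Year, SUM(Revenue) AS total, COUNT(*) AS n
FROM movies
GROUP BY Year
ORDER BY Year
""").fetchall()
print(rows)
```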

 %% Cell type:markdown id: tags:

 ### What is the *total revenue* for each *year*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT year, SUM(revenue)
 FROM movies
 GROUP BY year
 """)
 ```

 %% Output

        Year  SUM(revenue)
    0   2006       3624.46
    1   2007       4306.23
    2   2008       5053.22
    3   2009       5292.26
    4   2010       5989.65
    5   2011       5431.96
    6   2012       6910.29
    7   2013       7544.21
    8   2014       7997.40
    9   2015       8854.12
    10  2016      11211.65
    11  2017       2086.58
    12  2018       2675.12
    13  2019       2665.93
    14  2020       1025.19

 %% Cell type:markdown id: tags:

-### *How many movies* were by each *director*?
+### *How many movies* did each of the top-10 *directors* (by movie count) direct?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(*) AS mov_count
 FROM movies
 GROUP BY director
 ORDER BY mov_count DESC
 limit 10
 """)
 ```

 %% Output

                 Director  mov_count
    0        Ridley Scott          8
    1  Paul W.S. Anderson          6
    2         Michael Bay          6
    3     Martin Scorsese          6
    4  M. Night Shyamalan          6
    5    Denis Villeneuve          6
    6         David Yates          6
    7   Christopher Nolan          6
    8         Zack Snyder          5
    9         Woody Allen          5

 %% Cell type:markdown id: tags:

 ### What is the *average rating* for each *director*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, AVG(rating)
 FROM movies
 GROUP BY director
 """)
 ```

 %% Output

                    Director  AVG(rating)
    0             Aamir Khan         8.50
    1           Aaron Sorkin         7.80
    2    Abdellatif Kechiche         7.80
    3              Adam Leon         6.50
    4             Adam McKay         7.00
    ..                   ...          ...
    674          Yimou Zhang         6.10
    675     Yorgos Lanthimos         7.20
    676          Zack Snyder         7.04
    677        Zackary Adler         5.10
    678          Zoya Akhtar         8.00
    
    [679 rows x 2 columns]

 %% Cell type:markdown id: tags:

-### How many *unique directors* created a movie in each *year*
+### How many *unique directors* created a movie in each *year*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT year, COUNT(DISTINCT director) AS director_count
 FROM movies
 GROUP BY year
 """)
 ```

 %% Output

        Year  director_count
    0   2006              44
    1   2007              51
    2   2008              51
    3   2009              51
    4   2010              60
    5   2011              63
    6   2012              64
    7   2013              88
    8   2014              97
    9   2015             127
    10  2016             289
    11  2017              22
    12  2018              19
    13  2019              23
    14  2020               6

 %% Cell type:markdown id: tags:

 ## Combining GROUP BY with other CLAUSES

 ![Screen%20Shot%202022-04-21%20at%2011.37.27%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.37.27%20AM.png)

 %% Cell type:markdown id: tags:

-### What is the *total revenue* per *year*, in *recent* years?
+### What is the *total revenue* per *year*, in *recent* years (last 5 years)?

 %% Cell type:code id: tags:

 ``` python
-# recent means 5 years
 qry("""
 SELECT year, SUM(revenue) AS total_revenue
 FROM movies
 GROUP BY Year
 ORDER BY Year DESC
 LIMIT 5
 """)
 ```

 %% Output

       Year  total_revenue
    0  2020        1025.19
    1  2019        2665.93
    2  2018        2675.12
    3  2017        2086.58
    4  2016       11211.65

 %% Cell type:markdown id: tags:

 ### Which 5 *directors* have had the *greatest number of movies* earning *over 200M dollars*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(title) AS count
 FROM movies
 WHERE revenue > 200
 GROUP BY director
 ORDER BY count DESC
 limit 5
 """)
 ```

 %% Output

               Director  count
    0       David Yates      5
    1       Michael Bay      4
    2  Francis Lawrence      4
    3     Anthony Russo      4
    4       Zack Snyder      3

 %% Cell type:markdown id: tags:

-### Which *three* of the *directors* have the *greatest average rating*?
+### Which *three directors* have the *greatest average rating*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, AVG(rating) AS avg_rating
 FROM movies
 GROUP BY director
 ORDER BY avg_rating DESC
 LIMIT 3
 """)
 ```

 %% Output

              Director  avg_rating
    0      Thomas Kail         8.6
    1    Sudha Kongara         8.6
    2  Olivier Nakache         8.6

 %% Cell type:markdown id: tags:

 Why might the question above not be the best one to ask?

 %% Cell type:code id: tags:

 ``` python
 # These directors could have made just 1 good movie.
 # We would want to consider if the director has multiple great movies, instead of just one.
 ```

 %% Cell type:markdown id: tags:

-### Which *five* of the *directors* have the *greatest average rating* over at *least three movies*?
+### Which *five directors* have the *greatest average rating* over at *least three movies*?

 %% Cell type:markdown id: tags:

 Can you solve this question using just `GROUP BY` and `WHERE`?

 Answer: No. `WHERE` filters individual rows before they are grouped, so it cannot test an aggregate such as `COUNT(*)`; that value doesn't exist in the original table.
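
A minimal sketch of what goes wrong, assuming a made-up in-memory table rather than the real `movies.db` (the rows below are hypothetical). It uses `sqlite3` directly instead of the `qry` helper so that it runs on its own. SQLite rejects an aggregate inside `WHERE` before returning any rows:

```python
import sqlite3

# Hypothetical toy data, only for illustration (not the real movies.db)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (director TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", 8.0), ("A", 7.0), ("A", 9.0), ("B", 8.6)],
)

try:
    # An aggregate inside WHERE: the groups do not exist yet at this point
    conn.execute("""
        SELECT director, AVG(rating) AS avg_rating
        FROM movies
        WHERE COUNT(*) >= 3
        GROUP BY director
    """)
    where_failed = False
except sqlite3.OperationalError as err:
    where_failed = True
    print("query rejected:", err)

conn.close()
```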

 %% Cell type:code id: tags:

 ``` python
 # This query wouldn't work

 # qry("""
 # SELECT director, AVG(rating) AS avg_rating, COUNT(*) as count
 # FROM movies
 # WHERE count >= 3
 # GROUP BY director
 # ORDER BY avg_rating DESC
 # LIMIT 3
 # """)
 ```

 %% Cell type:markdown id: tags:

 We need filtering both BEFORE and AFTER the GROUP BY operation
 ![Screen%20Shot%202022-04-21%20at%2011.34.25%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.34.25%20AM.png)

 %% Cell type:markdown id: tags:

 # WHERE vs. HAVING

 * WHERE: filter rows in original table
 * HAVING: filter groups
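
To make the difference concrete, here is a small sketch on a made-up table (the titles, directors, and years are hypothetical, and `sqlite3` is used directly so the example is self-contained). The first query removes rows before grouping; the second removes whole groups after grouping:

```python
import sqlite3

# Hypothetical toy data, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, director TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("m1", "A", 2008), ("m2", "A", 2012), ("m3", "A", 2015),
        ("m4", "B", 2011), ("m5", "B", 2016), ("m6", "C", 2009),
    ],
)

# WHERE: drop individual rows (movies before 2010), then group what is left
where_rows = conn.execute("""
    SELECT director, COUNT(*) AS count
    FROM movies
    WHERE year >= 2010
    GROUP BY director
    ORDER BY director
""").fetchall()

# HAVING: group all rows first, then drop groups with fewer than 2 movies
having_rows = conn.execute("""
    SELECT director, COUNT(*) AS count
    FROM movies
    GROUP BY director
    HAVING COUNT(*) >= 2
    ORDER BY director
""").fetchall()

print(where_rows)   # director C disappears because its only row was filtered out
print(having_rows)  # director C disappears because its group is too small
conn.close()
```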

 %% Cell type:markdown id: tags:

-### Which *five* directors *having* at least 3 movies score the *greatest average rating* ?
+### Which *three* directors with *at least 3 movies* have the *greatest average rating*?

 %% Cell type:markdown id: tags:

 ![Screen%20Shot%202022-04-21%20at%2011.39.17%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.39.17%20AM.png)

+%% Cell type:markdown id: tags:
+
+### SQL query sample format (with all main clauses - both mandatory and optional)
+
+```
+SELECT
+FROM movies
+WHERE
+GROUP BY
+HAVING
+ORDER BY
+LIMIT
+```
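
As a rough illustration of the full clause order, the sketch below fills in every clause on a made-up table (all rows are hypothetical, and `sqlite3` is used directly so the example stands alone):

```python
import sqlite3

# Hypothetical toy data, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (director TEXT, rating REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("A", 8.0, 2011), ("A", 9.0, 2014), ("A", 7.0, 2005),
        ("B", 6.0, 2012), ("B", 7.0, 2013),
        ("C", 9.5, 2015),
        ("D", 5.0, 2016), ("D", 6.0, 2017),
    ],
)

rows = conn.execute("""
    SELECT director, AVG(rating) AS avg_rating, COUNT(*) AS count
    FROM movies
    WHERE year >= 2010          -- keep only rows from 2010 onwards
    GROUP BY director           -- one bucket per director
    HAVING COUNT(*) >= 2        -- keep only buckets with at least 2 rows
    ORDER BY avg_rating DESC    -- best average first
    LIMIT 2                     -- top 2 only
""").fetchall()

print(rows)
conn.close()
```

Director C has the highest rating but only one movie, so `HAVING` removes that group before the ordering and limit are applied.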
+
 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, AVG(rating) AS avg_rating, COUNT(*) as count
 FROM movies
 GROUP BY director
 HAVING count >= 3
 ORDER BY avg_rating DESC
 LIMIT 3
 """)
 ```

 %% Output

                Director  avg_rating  count
    0  Christopher Nolan    8.533333      6
    1        Pete Docter    8.200000      3
    2      Anthony Russo    8.125000      4

 %% Cell type:markdown id: tags:

-### Which *directors* have had *more than 3 movies* that have been *since 2010*?
+### Which *directors* have had *more than 3 movies* that have been released *since 2010*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(title) AS count
 FROM movies
 WHERE year >= 2010
 GROUP BY director
 HAVING count > 3
 """)
 ```

 %% Output

                  Director  count
    0        Anthony Russo      4
    1        Antoine Fuqua      4
    2    Christopher Nolan      4
    3     David O. Russell      4
    4          David Yates      4
    5     Denis Villeneuve      6
    6            James Wan      4
    7   M. Night Shyamalan      4
    8      Martin Scorsese      5
    9          Michael Bay      4
    10       Mike Flanagan      4
    11           Paul Feig      4
    12  Paul W.S. Anderson      5
    13          Peter Berg      4
    14        Ridley Scott      5
    15         Woody Allen      4

 %% Cell type:markdown id: tags:

-### Which *directors* have more than *two* movies with runtimes under *100* minutes
+### Which *directors* have more than *two* movies with runtimes under *100* minutes?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(title) AS count
 FROM movies
 WHERE runtime < 100
 GROUP BY director
 HAVING count > 2
 """)
 ```

 %% Output

               Director  count
    0     Mike Flanagan      3
    1  Nicholas Stoller      3
    2      Wes Anderson      3
    3       Woody Allen      4

 %% Cell type:code id: tags:

 ``` python
 # Don't forget to close the movies.db connection
 c.close()
 ```

-%% Cell type:markdown id: tags:
+%% Cell type:code id: tags:

-## ignore this cell (it's just to make certain text red later, but you don't need to understand it).
+``` python
+# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
 from IPython.core.display import HTML
 HTML('<style>em { color: red; }</style>')
+```
+
+%% Output
+
+    <IPython.core.display.HTML object>

 %% Cell type:code id: tags:

 ``` python
 # import statements
 import sqlite3
 import pandas as pd
 import os
 ```

 %% Cell type:markdown id: tags:

 ## Warmup: SQL query clauses
 **Mandatory SQL clauses**
 - SELECT: column, column, ...  or *
 - FROM: table

 **Optional SQL clauses**
 - WHERE: boolean expression (keep only the rows for which it is true)
   - can combine conditions with AND, OR, NOT
 - ORDER BY  column (ASC, DESC)
 - LIMIT: num rows
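
A minimal sketch that uses both mandatory clauses and all three optional ones, assuming a made-up in-memory table (the rows are hypothetical; the notebook itself queries `movies.db`):

```python
import sqlite3

# Hypothetical toy data, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("a", 2016, 7.0), ("b", 2019, 8.5), ("c", 2019, 6.0), ("d", 2019, 9.0)],
)

top_two = conn.execute("""
    SELECT title, rating
    FROM movies
    WHERE year = 2019 AND NOT rating < 7
    ORDER BY rating DESC
    LIMIT 2
""").fetchall()

print(top_two)
conn.close()
```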

 %% Cell type:code id: tags:

 ``` python
 # open up the movies database
 movies_path = "movies.db"
 assert os.path.exists(movies_path)
 c = sqlite3.connect(movies_path)
 ```

 %% Cell type:code id: tags:

 ``` python
 # what are the table names?
 df = pd.read_sql("select * from sqlite_master where type='table'", c)
 df
 ```

 %% Output

        type    name tbl_name  rootpage  \
    0  table  movies   movies         2
    
                                                     sql
    0  CREATE TABLE "movies" (\n"Title" TEXT,\n  "Gen...

 %% Cell type:code id: tags:

 ``` python
 # what are the data types?
 print(df["sql"].iloc[0])
 ```

 %% Output

    CREATE TABLE "movies" (
    "Title" TEXT,
      "Genre" TEXT,
      "Director" TEXT,
      "Cast" TEXT,
      "Year" INTEGER,
      "Runtime" INTEGER,
      "Rating" REAL,
      "Revenue" REAL
    )

 %% Cell type:code id: tags:

 ``` python
 # what is all our data?
 pd.read_sql("select * from movies", c)
 ```

 %% Output

                                   Title                         Genre  \
    0            Guardians of the Galaxy       Action,Adventure,Sci-Fi
    1                         Prometheus      Adventure,Mystery,Sci-Fi
    2                              Split               Horror,Thriller
    3                               Sing       Animation,Comedy,Family
    4                      Suicide Squad      Action,Adventure,Fantasy
    ...                              ...                           ...
    1063  Guardians of the Galaxy Vol. 2     Action, Adventure, Comedy
    1064                     Baby Driver          Action, Crime, Drama
    1065                  Only the Brave      Action, Biography, Drama
    1066                   Incredibles 2  Animation, Action, Adventure
    1067                  A Star Is Born         Drama, Music, Romance
    
                      Director                                               Cast  \
    0               James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
    1             Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael   ...
    2       M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
    3     Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...
    4               David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...
    ...                    ...                                                ...
    1063            James Gunn  Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...
    1064          Edgar Wright  Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...
    1065       Joseph Kosinski  Josh Brolin, Miles Teller, Jeff Bridges, Jenni...
    1066             Brad Bird  Craig T. Nelson, Holly Hunter, Sarah Vowell, H...
    1067        Bradley Cooper  Lady Gaga, Bradley Cooper, Sam Elliott, Greg G...
    
          Year  Runtime  Rating  Revenue
    0     2014      121     8.1   333.13
    1     2012      124     7.0   126.46
    2     2016      117     7.3   138.12
    3     2016      108     7.2   270.32
    4     2016      123     6.2   325.02
    ...    ...      ...     ...      ...
    1063  2017      136     7.6   389.81
    1064  2017      113     7.6   107.83
    1065  2017      134     7.6    18.34
    1066  2018      118     7.6   608.58
    1067  2018      136     7.6   215.29
    
    [1068 rows x 8 columns]

 %% Cell type:code id: tags:

 ``` python
 # this helper function lets us type less for each query
 def qry(sql, conn = c):
    return pd.read_sql(sql, conn)
 ```

 %% Cell type:markdown id: tags:

 Sample query format:

 ```
 SELECT
 FROM movies
 WHERE
 ORDER BY
 LIMIT
 ```

 %% Cell type:code id: tags:

 ``` python
 # call qry ....copy/paste the query from above
 qry("""
 SELECT *
 FROM movies
 """)
 ```

 %% Output

                                   Title                         Genre  \
    0            Guardians of the Galaxy       Action,Adventure,Sci-Fi
    1                         Prometheus      Adventure,Mystery,Sci-Fi
    2                              Split               Horror,Thriller
    3                               Sing       Animation,Comedy,Family
    4                      Suicide Squad      Action,Adventure,Fantasy
    ...                              ...                           ...
    1063  Guardians of the Galaxy Vol. 2     Action, Adventure, Comedy
    1064                     Baby Driver          Action, Crime, Drama
    1065                  Only the Brave      Action, Biography, Drama
    1066                   Incredibles 2  Animation, Action, Adventure
    1067                  A Star Is Born         Drama, Music, Romance
    
                      Director                                               Cast  \
    0               James Gunn  Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
    1             Ridley Scott  Noomi Rapace, Logan Marshall-Green, Michael   ...
    2       M. Night Shyamalan  James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
    3     Christophe Lourdelet  Matthew McConaughey,Reese Witherspoon, Seth Ma...
    4               David Ayer  Will Smith, Jared Leto, Margot Robbie, Viola D...
    ...                    ...                                                ...
    1063            James Gunn  Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...
    1064          Edgar Wright  Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...
    1065       Joseph Kosinski  Josh Brolin, Miles Teller, Jeff Bridges, Jenni...
    1066             Brad Bird  Craig T. Nelson, Holly Hunter, Sarah Vowell, H...
    1067        Bradley Cooper  Lady Gaga, Bradley Cooper, Sam Elliott, Greg G...
    
          Year  Runtime  Rating  Revenue
    0     2014      121     8.1   333.13
    1     2012      124     7.0   126.46
    2     2016      117     7.3   138.12
    3     2016      108     7.2   270.32
    4     2016      123     6.2   325.02
    ...    ...      ...     ...      ...
    1063  2017      136     7.6   389.81
    1064  2017      113     7.6   107.83
    1065  2017      134     7.6    18.34
    1066  2018      118     7.6   608.58
    1067  2018      136     7.6   215.29
    
    [1068 rows x 8 columns]

 %% Cell type:markdown id: tags:

 ### What's the *Title* of the movie with the highest *Rating*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT Title, Rating
 FROM movies
 ORDER BY Rating DESC
 LIMIT 1
 """)
 df
 ```

 %% Output

                 Title  Rating
    0  The Dark Knight     9.0

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0]["Title"]
 ```

 %% Output

    'The Dark Knight'

 %% Cell type:markdown id: tags:

 ### Which *Director* made the movie with the shortest *Runtime*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT Director, Runtime
 FROM movies
 ORDER BY Runtime
 LIMIT 1
 """)
 df
 ```

 %% Output

            Director  Runtime
    0  Claude Barras       66

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0]["Director"]
 ```

 %% Output

    'Claude Barras'

 %% Cell type:markdown id: tags:

 ### What were the *Director* and *Title* of the movie with the largest *Revenue*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, revenue, title
 FROM movies
 ORDER BY revenue DESC
 LIMIT 1
 """)
 ```

 %% Output

          Director  Revenue                                       Title
    0  J.J. Abrams   936.63  Star Wars: Episode VII - The Force Awakens

 %% Cell type:markdown id: tags:

-### What is the *Title* of the movie with the highest *Revenue* in *Year* 2016?
+### What is the *Title* of the movie with the highest *Revenue* in *Year* 2019?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT title, revenue, year
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 ORDER BY revenue DESC
 LIMIT 1
 """)
 df
 ```

 %% Output

-           Title  Revenue  Year
-    0  Rogue One   532.17  2016
+                   Title  Revenue  Year
+    0  Avengers: Endgame   858.37  2019

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0]["Title"]
 ```

 %% Output

-    'Rogue One'
+    'Avengers: Endgame'

 %% Cell type:markdown id: tags:

-### Which *3 movies*  had the highest *Revenue* in the *Year* 2016?
+### Which *3 movies* had the highest *Revenue* in the *Year* 2019?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT title, revenue
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 ORDER BY revenue DESC
 LIMIT 3
 """)
 df
 ```

 %% Output

-                            Title  Revenue
-    0                   Rogue One   532.17
-    1                Finding Dory   486.29
-    2  Captain America: Civil War   408.08
+                   Title  Revenue
+    0  Avengers: Endgame   858.37
+    1        Toy Story 4   434.04
+    2              Joker   335.45

 %% Cell type:code id: tags:

 ``` python
-# Extract revenue column and convert to list
-list(df["Revenue"])
+# Extract title column and convert to list
+list(df["Title"])
 ```

 %% Output

-    [532.17, 486.29, 408.08]
+    ['Avengers: Endgame', 'Toy Story 4', 'Joker']

 %% Cell type:markdown id: tags:

 ## Lecture 33: Database 2
 Learning Objectives:
 - Use the AS keyword to rename a column or a calculation
 - Use SQL aggregate functions to summarize database columns:
   - SUM, AVG, COUNT, MIN, MAX (optionally combined with DISTINCT)
 - Use the GROUP BY clause to place database rows into buckets.
 - Use the HAVING clause to apply conditions to groups.

 %% Cell type:markdown id: tags:

 ### Which *3 movies* have the highest *rating-to-revenue ratios*?

 The `AS` clause lets us rename a column or a calculation
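
As a small sketch on made-up data (the rows are hypothetical), the alias becomes the column name of the result and can be reused in `ORDER BY`:

```python
import sqlite3

# Hypothetical toy data, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("x", 8.0, 2.0), ("y", 6.0, 4.0), ("z", 9.0, 1.0)],
)

cursor = conn.execute("""
    SELECT title, rating / revenue AS ratio
    FROM movies
    ORDER BY ratio DESC
""")
# cursor.description holds one entry per result column; item 0 is its name
column_names = [col[0] for col in cursor.description]
ratio_rows = cursor.fetchall()

print(column_names)
print(ratio_rows)
conn.close()
```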

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT title, rating / revenue AS ratio
 FROM movies
 ORDER BY ratio DESC
 LIMIT 3
 """)
 ```

 %% Output

             Title  ratio
    0    Wakefield  750.0
    1  Love, Rosie  720.0
    2     Lovesong  640.0

 %% Cell type:markdown id: tags:

 ## Aggregate Queries

 ```
 SUM, AVG, COUNT, MIN, MAX, DISTINCT
 ```

 %% Cell type:markdown id: tags:

 ### How many *rows of movies* are there?
 Note: when we want to count the number of rows, we use COUNT(*)

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT COUNT(*)
 FROM movies
 """)
 ```

 %% Output

       COUNT(*)
    0      1068

 %% Cell type:markdown id: tags:

 ### How many *directors* are there?

 %% Cell type:code id: tags:

 ``` python
-# This doesn't feel correct - it counts duplicates for director names!
 qry("""
 SELECT COUNT(director)
 FROM movies
 """)
+# This isn't what we want: COUNT(director) counts every row, including repeated director names!
 ```

 %% Output

       COUNT(director)
    0             1068

 %% Cell type:markdown id: tags:

 Use `COUNT(DISTINCT column_name)` to count each distinct value only once
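
A quick sketch of the difference, assuming a made-up table in which director names repeat (the rows are hypothetical):

```python
import sqlite3

# Hypothetical toy data: 6 movies made by 3 different directors
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, director TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("m1", "A"), ("m2", "A"), ("m3", "B"), ("m4", "C"), ("m5", "C"), ("m6", "C")],
)

# Counts one per row, so repeated names are counted again and again
with_duplicates = conn.execute(
    "SELECT COUNT(director) FROM movies"
).fetchone()[0]

# Counts each distinct name once
without_duplicates = conn.execute(
    "SELECT COUNT(DISTINCT director) FROM movies"
).fetchone()[0]

print(with_duplicates, without_duplicates)
conn.close()
```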

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT COUNT(DISTINCT director)
 FROM movies
 """)
 ```

 %% Output

       COUNT(DISTINCT director)
    0                       679

 %% Cell type:markdown id: tags:

+### What are the names of the *directors* (without duplicates)?
+
+%% Cell type:code id: tags:
+
+``` python
+df = qry("""
+SELECT DISTINCT director
+FROM movies
+""")
+df
+```
+
+%% Output
+
+                      Director
+    0               James Gunn
+    1             Ridley Scott
+    2       M. Night Shyamalan
+    3     Christophe Lourdelet
+    4               David Ayer
+    ..                     ...
+    674     Andrey Zvyagintsev
+    675             Sean Baker
+    676  Destin Daniel Cretton
+    677           Tyler Nilson
+    678         Bradley Cooper
+    
+    [679 rows x 1 columns]
+
+%% Cell type:code id: tags:
+
+``` python
+# Extract Director column and convert to list
+director_list = list(df["Director"])
+#director_list # uncomment to see the output
+```
+
+%% Cell type:markdown id: tags:
+
 ### What is the total *Revenue* of *all the movies*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT SUM(revenue)
 FROM movies
 """)
 ```

 %% Output

       SUM(revenue)
    0      80668.27

 %% Cell type:markdown id: tags:

 ### What is the *average rating* across *all movies*?

 * v1: with `SUM` and `COUNT`
 * v2: with `AVG`
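
Before running both versions on the real table, here is a sketch on three made-up ratings (hypothetical values) showing that the two approaches agree when every row has a rating:

```python
import sqlite3

# Hypothetical toy data, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("m1", 6.0), ("m2", 7.0), ("m3", 8.0)],
)

# v1: compute the average by hand
v1 = conn.execute("SELECT SUM(rating) / COUNT(*) FROM movies").fetchone()[0]

# v2: let the AVG aggregate do the same work
v2 = conn.execute("SELECT AVG(rating) FROM movies").fetchone()[0]

print(v1, v2)
conn.close()
```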

 %% Cell type:code id: tags:

 ``` python
 # v1
 df = qry("""
 SELECT SUM(rating) / COUNT(*)
 FROM movies
 """)
 df
 ```

 %% Output

       SUM(rating) / COUNT(*)
    0                6.805431

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0, 0]
 ```

 %% Output

    6.805430711610491

 %% Cell type:code id: tags:

 ``` python
 # v2
 qry("""
 SELECT AVG(rating)
 FROM movies
 """)
 ```

 %% Output

       AVG(rating)
    0     6.805431

 %% Cell type:markdown id: tags:

 ### What is the *average revenue* and *average runtime* of *all the movies*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT AVG(revenue), AVG(runtime)
 FROM movies
 """)
 ```

 %% Output

       AVG(revenue)  AVG(runtime)
    0     75.532088    114.093633

 %% Cell type:markdown id: tags:

 ### What is the *average revenue* for a *Ridley Scott* movie?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT AVG(revenue)
 FROM movies
 WHERE director = "Ridley Scott"
 """)
 df
 ```

 %% Output

       AVG(revenue)
    0       89.8825

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0, 0]
 ```

 %% Output

    89.88250000000001

 %% Cell type:markdown id: tags:

-### *How many movies* were there in *2016*?
+### *How many movies* were there in *2019*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT COUNT(*)
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 """)
+df
 ```

+%% Output
+
+       COUNT(*)
+    0        23
+
 %% Cell type:code id: tags:

 ``` python
 df.iloc[0, 0]
 ```

 %% Output

-    296
+    23

 %% Cell type:markdown id: tags:

 ### What *percentage* of the *total revenue* came from the *highest-revenue movie*?

 %% Cell type:code id: tags:

 ``` python
 df = qry("""
 SELECT title, MAX(revenue) / SUM(revenue) * 100 AS percentage
 FROM movies
 """)
 df
 ```

 %% Output

                                            Title  percentage
    0  Star Wars: Episode VII - The Force Awakens    1.161088

 %% Cell type:code id: tags:

 ``` python
 df.iloc[0, 0]
 ```

 %% Output

    'Star Wars: Episode VII - The Force Awakens'

 %% Cell type:markdown id: tags:

-### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2016*?
+### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2019*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT title, MAX(revenue) / SUM(revenue) * 100 AS percentage
 FROM movies
-WHERE year = 2016
+WHERE year = 2019
 """)
 ```

 %% Output

-           Title  percentage
-    0  Rogue One    4.746581
+                   Title  percentage
+    0  Avengers: Endgame    32.19777

 %% Cell type:markdown id: tags:

 # GROUP BY Queries

 ```sql
 SELECT ???, ??? FROM Movies
 GROUP BY ???
 ```

 Sample query format:

 ```
 SELECT
 FROM movies
 WHERE
 GROUP BY
 ORDER BY
 LIMIT
 ```
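
A minimal sketch of a filled-in `GROUP BY` query, assuming a made-up table (the years and revenues are hypothetical). Each distinct `year` becomes one bucket, and `SUM` is computed once per bucket:

```python
import sqlite3

# Hypothetical toy data, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("m1", 2018, 100.0), ("m2", 2018, 50.0), ("m3", 2019, 200.0)],
)

per_year = conn.execute("""
    SELECT year, SUM(revenue) AS total_revenue
    FROM movies
    GROUP BY year
    ORDER BY year
""").fetchall()

print(per_year)
conn.close()
```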

 %% Cell type:markdown id: tags:

 ### What is the *total revenue* for each *year*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT year, SUM(revenue)
 FROM movies
 GROUP BY year
 """)
 ```

 %% Output

        Year  SUM(revenue)
    0   2006       3624.46
    1   2007       4306.23
    2   2008       5053.22
    3   2009       5292.26
    4   2010       5989.65
    5   2011       5431.96
    6   2012       6910.29
    7   2013       7544.21
    8   2014       7997.40
    9   2015       8854.12
    10  2016      11211.65
    11  2017       2086.58
    12  2018       2675.12
    13  2019       2665.93
    14  2020       1025.19

 %% Cell type:markdown id: tags:

-### *How many movies* were by each *director*?
+### *How many movies* did each of the top-10 most prolific *directors* make?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(*) AS mov_count
 FROM movies
 GROUP BY director
 ORDER BY mov_count DESC
 limit 10
 """)
 ```

 %% Output

                 Director  mov_count
    0        Ridley Scott          8
    1  Paul W.S. Anderson          6
    2         Michael Bay          6
    3     Martin Scorsese          6
    4  M. Night Shyamalan          6
    5    Denis Villeneuve          6
    6         David Yates          6
    7   Christopher Nolan          6
    8         Zack Snyder          5
    9         Woody Allen          5

 %% Cell type:markdown id: tags:

 ### What is the *average rating* for each *director*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, AVG(rating)
 FROM movies
 GROUP BY director
 """)
 ```

 %% Output

                    Director  AVG(rating)
    0             Aamir Khan         8.50
    1           Aaron Sorkin         7.80
    2    Abdellatif Kechiche         7.80
    3              Adam Leon         6.50
    4             Adam McKay         7.00
    ..                   ...          ...
    674          Yimou Zhang         6.10
    675     Yorgos Lanthimos         7.20
    676          Zack Snyder         7.04
    677        Zackary Adler         5.10
    678          Zoya Akhtar         8.00
    
    [679 rows x 2 columns]

 %% Cell type:markdown id: tags:

-### How many *unique directors* created a movie in each *year*
+### How many *unique directors* created a movie in each *year*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT year, COUNT(DISTINCT director) AS director_count
 FROM movies
 GROUP BY year
 """)
 ```

 %% Output

        Year  director_count
    0   2006              44
    1   2007              51
    2   2008              51
    3   2009              51
    4   2010              60
    5   2011              63
    6   2012              64
    7   2013              88
    8   2014              97
    9   2015             127
    10  2016             289
    11  2017              22
    12  2018              19
    13  2019              23
    14  2020               6

 %% Cell type:markdown id: tags:

 ## Combining GROUP BY with other CLAUSES

 ![Screen%20Shot%202022-04-21%20at%2011.37.27%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.37.27%20AM.png)

 %% Cell type:markdown id: tags:

-### What is the *total revenue* per *year*, in *recent* years?
+### What is the *total revenue* per *year*, in *recent* years (last 5 years)?

 %% Cell type:code id: tags:

 ``` python
-# recent means 5 years
 qry("""
 SELECT year, SUM(revenue) AS total_revenue
 FROM movies
 GROUP BY Year
 ORDER BY Year DESC
 LIMIT 5
 """)
 ```

 %% Output

       Year  total_revenue
    0  2020        1025.19
    1  2019        2665.93
    2  2018        2675.12
    3  2017        2086.58
    4  2016       11211.65

 %% Cell type:markdown id: tags:

 ### Which 5 *directors* have the *most movies* earning *over 200M dollars*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(title) AS count
 FROM movies
 WHERE revenue > 200
 GROUP BY director
 ORDER BY count DESC
 limit 5
 """)
 ```

 %% Output

               Director  count
    0       David Yates      5
    1       Michael Bay      4
    2  Francis Lawrence      4
    3     Anthony Russo      4
    4       Zack Snyder      3

 %% Cell type:markdown id: tags:

-### Which *three* of the *directors* have the *greatest average rating*?
+### Which *three directors* have the *greatest average rating*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, AVG(rating) AS avg_rating
 FROM movies
 GROUP BY director
 ORDER BY avg_rating DESC
 LIMIT 3
 """)
 ```

 %% Output

              Director  avg_rating
    0      Thomas Kail         8.6
    1    Sudha Kongara         8.6
    2  Olivier Nakache         8.6

 %% Cell type:markdown id: tags:

 Why might the above question not be the best one to ask?

 %% Cell type:code id: tags:

 ``` python
 # These directors could have made just 1 good movie.
 # We would want to consider if the director has multiple great movies, instead of just one.
 ```

 %% Cell type:markdown id: tags:

-### Which *five* of the *directors* have the *greatest average rating* over at *least three movies*?
+### Which *five directors* have the *greatest average rating* over at *least three movies*?

 %% Cell type:markdown id: tags:

 Can you solve this question using just `GROUP BY` and `WHERE`?

 Answer: No. `WHERE` filters individual rows before they are grouped, so it cannot test an aggregate such as `COUNT(*)`; that value doesn't exist in the original table.

 %% Cell type:code id: tags:

 ``` python
 # This query wouldn't work

 # qry("""
 # SELECT director, AVG(rating) AS avg_rating, COUNT(*) as count
 # FROM movies
 # WHERE count >= 3
 # GROUP BY director
 # ORDER BY avg_rating DESC
 # LIMIT 3
 # """)
 ```

 %% Cell type:markdown id: tags:

 We need filtering both BEFORE and AFTER the GROUP BY operation
 ![Screen%20Shot%202022-04-21%20at%2011.34.25%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.34.25%20AM.png)
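
One way to picture the two filtering stages is to write them out in plain Python. This is only an analogy on made-up tuples (the data and the thresholds are hypothetical), not how the database actually executes a query:

```python
from collections import defaultdict

# Hypothetical (director, year) rows, only for illustration
movies = [
    ("A", 2008), ("A", 2012), ("A", 2015),
    ("B", 2011),
    ("C", 2016), ("C", 2019),
]

# Stage 1, like WHERE: filter individual rows before grouping
kept_rows = [(director, year) for director, year in movies if year >= 2010]

# Stage 2, like GROUP BY: place the remaining rows into buckets
groups = defaultdict(list)
for director, year in kept_rows:
    groups[director].append(year)

# Stage 3, like HAVING: filter whole buckets after grouping
result = {
    director: len(years)
    for director, years in groups.items()
    if len(years) > 1
}

print(result)
```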

 %% Cell type:markdown id: tags:

 # WHERE vs. HAVING

 * WHERE: filter rows in original table
 * HAVING: filter groups

 %% Cell type:markdown id: tags:

-### Which *five* directors *having* at least 3 movies score the *greatest average rating* ?
+### Which *three* directors with *at least 3 movies* have the *greatest average rating*?

 %% Cell type:markdown id: tags:

 ![Screen%20Shot%202022-04-21%20at%2011.39.17%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.39.17%20AM.png)

+%% Cell type:markdown id: tags:
+
+### SQL query sample format (with all main clauses - both mandatory and optional)
+
+```
+SELECT
+FROM movies
+WHERE
+GROUP BY
+HAVING
+ORDER BY
+LIMIT
+```
+
 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, AVG(rating) AS avg_rating, COUNT(*) as count
 FROM movies
 GROUP BY director
 HAVING count >= 3
 ORDER BY avg_rating DESC
 LIMIT 3
 """)
 ```

 %% Output

                Director  avg_rating  count
    0  Christopher Nolan    8.533333      6
    1        Pete Docter    8.200000      3
    2      Anthony Russo    8.125000      4

 %% Cell type:markdown id: tags:

-### Which *directors* have had *more than 3 movies* that have been *since 2010*?
+### Which *directors* have had *more than 3 movies* that have been released *since 2010*?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(title) AS count
 FROM movies
 WHERE year >= 2010
 GROUP BY director
 HAVING count > 3
 """)
 ```

 %% Output

                  Director  count
    0        Anthony Russo      4
    1        Antoine Fuqua      4
    2    Christopher Nolan      4
    3     David O. Russell      4
    4          David Yates      4
    5     Denis Villeneuve      6
    6            James Wan      4
    7   M. Night Shyamalan      4
    8      Martin Scorsese      5
    9          Michael Bay      4
    10       Mike Flanagan      4
    11           Paul Feig      4
    12  Paul W.S. Anderson      5
    13          Peter Berg      4
    14        Ridley Scott      5
    15         Woody Allen      4

 %% Cell type:markdown id: tags:

-### Which *directors* have more than *two* movies with runtimes under *100* minutes
+### Which *directors* have more than *two* movies with runtimes under *100* minutes?

 %% Cell type:code id: tags:

 ``` python
 qry("""
 SELECT director, COUNT(title) AS count
 FROM movies
 WHERE runtime < 100
 GROUP BY director
 HAVING count > 2
 """)
 ```

 %% Output

               Director  count
    0     Mike Flanagan      3
    1  Nicholas Stoller      3
    2      Wes Anderson      3
    3       Woody Allen      4

 %% Cell type:code id: tags:

 ``` python
 # Don't forget to close the movies.db connection
 c.close()
 ```