Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Showing changes with 3703 additions and 1401 deletions
Source diff could not be displayed: it is too large. Options to address this: view the blob.
%% Cell type:code id: tags:
``` python
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import display, HTML
display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
```
%% Cell type:code id: tags:
``` python
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
# new import statement
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
#### Wrapping up bus dataset example
%% Cell type:markdown id: tags:
#### What are the top routes, and how many people ride them daily?
%% Cell type:code id: tags:
``` python
path = "bus.db"
# assert existence of path
assert os.path.exists(path)
# establish connection to bus.db
conn = sqlite3.connect(path)
```
%% Cell type:code id: tags:
``` python
df = pd.read_sql("""
SELECT Route, SUM(DailyBoardings) AS daily
FROM boarding
GROUP BY Route
ORDER BY daily DESC
""", conn)
df
```
%% Cell type:code id: tags:
``` python
# let's extract daily column from df
df["daily"]
```
%% Cell type:code id: tags:
``` python
# let's create a bar plot from daily column Series
df["daily"].plot.bar()
# Oops wrong x-axis labels!
```
%% Cell type:code id: tags:
``` python
df
```
%% Cell type:code id: tags:
``` python
df = ???
# let's plot for top 5 routes alone
???
```
%% Cell type:code id: tags:
``` python
# let's use slicing to aggregate the rest of the data
```
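%% Cell type:markdown id: tags:
One possible way to do the slicing step, as a sketch only: the route labels and ridership numbers below are made up, and `daily` stands in for the `daily` column (indexed by route) that the query above would return. The first five entries are kept and the remainder is summed into a single "other" entry.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical ridership totals, already sorted in descending order
daily = pd.Series(
    [9000, 7000, 6500, 4000, 3500, 1200, 800, 500],
    index=["80", "2", "6", "10", "3", "4", "15", "38"],
)

top_n = 5
s = daily.iloc[:top_n].copy()          # slice out the five busiest routes
s["other"] = daily.iloc[top_n:].sum()  # fold every remaining route into one entry
print(s)
```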
%% Cell type:code id: tags:
``` python
# let's plot the bars
ax = (s / 1000).plot.bar(color = "k")
ax.set_ylabel("Rides / Day (Thousands)")
None
```
%% Cell type:code id: tags:
``` python
conn.close()
```
%% Cell type:markdown id: tags:
### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This dataset is commonly used in introductory Machine Learning courses
- You can train an ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
%% Cell type:markdown id: tags:
#### Warmup 1: Downloading IRIS dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
%% Cell type:code id: tags:
``` python
# use requests to get this URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
response = ???
# check that the request was successful
???
# open a file called "iris.csv" for writing the data locally
file_obj = open("iris.csv", ???)
# write the text of response to the file object
file_obj.write(???)
# close the file object
file_obj.close()
# Look at the file you downloaded. What's wrong with it?
```
%% Cell type:markdown id: tags:
#### Warmup 2: Making a DataFrame
%% Cell type:code id: tags:
``` python
# read the "iris.csv" file into a Pandas dataframe
iris_df = ???
# display the head of the data frame
iris_df.head()
```
%% Cell type:markdown id: tags:
#### Warmup 3: Our CSV file has no header. Let's add column names.
- Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
%% Cell type:code id: tags:
``` python
# Attribute Information:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# These should be our headers
# ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
iris_df = pd.read_csv("iris.csv",
???)
iris_df.head()
```
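%% Cell type:markdown id: tags:
A small sketch of how `names` can supply headers for a header-less CSV. To stay self-contained it reads three hand-typed rows from an in-memory string instead of "iris.csv"; `raw` and `iris_sample` are illustrative names, not part of the lecture code.
%% Cell type:code id: tags:
```python
import io
import pandas as pd

# Three hand-typed rows in the same layout as iris.data (no header line)
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

headers = ["sep-length", "sep-width", "pet-length", "pet-width", "class"]

# Passing names= tells pandas that the first line is data, not a header
iris_sample = pd.read_csv(raw, names=headers)
print(iris_sample.head())
```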
%% Cell type:markdown id: tags:
#### Warmup 4: Connect to our database version of this data!
%% Cell type:code id: tags:
``` python
iris_conn = sqlite3.connect("iris-flowers.db")
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
```
%% Cell type:markdown id: tags:
#### Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
Break any ties by ordering by the shortest sepal width.
%% Cell type:code id: tags:
``` python
pd.read_sql("""
SELECT
FROM
WHERE
ORDER BY
LIMIT 10
""", iris_conn)
```
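%% Cell type:markdown id: tags:
A hedged sketch of the filter, two-key sort, and limit pattern this warmup asks for. It builds a tiny in-memory table rather than opening "iris-flowers.db". The table name `iris` matches the query used later in this lecture, but the column names here are an assumption borrowed from the CSV headers, so check the real schema before reusing the query.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real database (made-up rows, assumed column names)
demo_conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "sep-length": [5.1, 5.8, 5.8, 7.0],
    "sep-width": [3.5, 4.0, 2.7, 3.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})
sample.to_sql("iris", demo_conn, index=False)

# Column names containing a hyphen must be double-quoted in SQL
top = pd.read_sql("""
    SELECT *
    FROM iris
    WHERE class = 'Iris-setosa'
    ORDER BY "sep-length" DESC, "sep-width" ASC
    LIMIT 10
""", demo_conn)
demo_conn.close()
print(top)
```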
%% Cell type:markdown id: tags:
# Lecture 36: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about Trees.
The city of Madison maintains a database of all the trees they care for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot of a DataFrame using the following function...
`df_name.plot.scatter(x = "x_col_name", y = "y_col_name", color = "red", marker = "*", s = 50)`
%% Cell type:markdown id: tags:
Plot the trees data comparing a tree's age to its height...
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
%% Cell type:code id: tags:
``` python
# TODO: change y to diameter
```
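%% Cell type:markdown id: tags:
A minimal sketch of the basic call, assuming a small hand-made DataFrame (`sample_df` is a stand-in for `trees_df`) and the non-interactive `Agg` backend so that it also runs outside a notebook. Here `df_name` is the DataFrame, `x_col_name` is `"age"`, and `y_col_name` is `"height"` (or `"diameter"` for the TODO).
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside a notebook
import pandas as pd

# Made-up values in the same shape as the trees data
sample_df = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
    "diameter": [0.8, 1.5, 2.1, 2.8],
})

# age on the x-axis, height on the y-axis
ax = sample_df.plot.scatter(x="age", y="height")

# same plot with diameter on the y-axis instead
ax_diameter = sample_df.plot.scatter(x="age", y="diameter")
```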
%% Cell type:markdown id: tags:
Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
%% Cell type:code id: tags:
``` python
# Plot with some more beautification options.
trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
# D for diamond
```
%% Cell type:code id: tags:
``` python
# Add a title to your plot.
ax = trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
# D for diamond
ax.set_title("Tree Age vs Height")
```
%% Cell type:markdown id: tags:
#### Correlation
%% Cell type:code id: tags:
``` python
# What is the correlation between our DataFrame columns?
corr_df = trees_df.corr()
corr_df
```
%% Cell type:code id: tags:
``` python
# What is the correlation between age and height (don't use .iloc)
# Using index in this case isn't considered as hardcoding
corr_df['age']['height']
```
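%% Cell type:markdown id: tags:
To make the number returned by `.corr()` less of a black box, here is a sketch that computes the Pearson correlation by hand and compares it with the pandas result. The four data points are made up and `df`, `r_manual`, and `r_pandas` are illustrative names.
%% Cell type:code id: tags:
```python
import pandas as pd

df = pd.DataFrame({"age": [1, 2, 3, 4], "height": [1.5, 2.5, 3.0, 4.8]})

# Deviations of each value from its column mean
x_dev = df["age"] - df["age"].mean()
y_dev = df["height"] - df["height"].mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (x_dev * y_dev).sum() / ((x_dev ** 2).sum() * (y_dev ** 2).sum()) ** 0.5

# The same value looked up by column name in the correlation matrix
r_pandas = df.corr()["age"]["height"]
print(r_manual, r_pandas)
```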
%% Cell type:markdown id: tags:
### Varying Stylistic Parameters
%% Cell type:code id: tags:
``` python
# Option 1:
trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = "diameter")
```
%% Cell type:code id: tags:
``` python
# Option 2:
# this way allows you to make it bigger
trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = trees_df["diameter"] * 50)
```
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Iris Data
%% Cell type:code id: tags:
``` python
iris_df
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
%% Cell type:code id: tags:
``` python
# In Pandas
varietes = list(set(iris_df["class"]))
varietes
```
%% Cell type:code id: tags:
``` python
# In SQL
varietes = list(pd.read_sql("""
SELECT DISTINCT class
FROM iris
""", iris_conn)["class"])
varietes
```
%% Cell type:markdown id: tags:
In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
iris_conn.close()
```
%% Cell type:code id: tags:
``` python
# Change this scatter plot so that the data is only for class ='Iris-setosa'
```
%% Cell type:code id: tags:
``` python
# Write a for loop that iterates through each variety in varietes
# and makes a plot for only that class
# For each class add a color and a marker style
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
for i in range(len(varietes)):
???
```
%% Cell type:markdown id: tags:
Did you notice that it made 3 plots?!?! What's deceiving about this?
%% Cell type:markdown id: tags:
### We can draw several plots in one subplot, called an AxesSubplot, using the keyword ax
1. If an AxesSubplot is passed as ax, plot in that subplot
2. If ax is None, create a new AxesSubplot
3. Return the AxesSubplot that was used
%% Cell type:code id: tags:
``` python
# complete this code to make 3 plots in one
plot_area = None # don't change this...look at this variable in line 12
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
for i in range(len(varietes)):
???
```
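%% Cell type:markdown id: tags:
One way the three rules above can play out, sketched on a made-up six-row DataFrame (`flowers` and `class_names` are illustrative names). The first call receives `ax=None` and creates the subplot; every later call receives the returned AxesSubplot and draws into it, so all three classes share one set of axes. The `Agg` backend is an assumption so the sketch runs outside a notebook.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside a notebook
import pandas as pd

# Made-up rows, two per class
flowers = pd.DataFrame({
    "pet-width": [0.2, 0.3, 1.3, 1.5, 2.1, 2.3],
    "pet-length": [1.4, 1.5, 4.2, 4.7, 5.6, 6.0],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

class_names = sorted(set(flowers["class"]))
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]

plot_area = None  # None on the first pass, so pandas creates a new AxesSubplot
for i in range(len(class_names)):
    subset = flowers[flowers["class"] == class_names[i]]
    # Reuse the AxesSubplot returned by the previous call
    plot_area = subset.plot.scatter(x="pet-width", y="pet-length",
                                    color=colors[i], marker=markers[i],
                                    label=class_names[i], ax=plot_area)
```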
%% Cell type:markdown id: tags:
### Let's focus on "Iris-virginica" data
%% Cell type:code id: tags:
``` python
iris_virginica = ???
assert len(iris_virginica) == 50
iris_virginica.head()
```
%% Cell type:code id: tags:
``` python
iris_virginica.plot.scatter(x = "pet-width", y = "pet-length")
```
%% Cell type:markdown id: tags:
### Let's learn about *xlim* and *ylim*
- Allows us to set x-axis and y-axis limits
- Takes either a single value (LOWER-BOUND) or a tuple containing two values (LOWER-BOUND, UPPER-BOUND)
- You need to be careful about setting the UPPER-BOUND: if it is smaller than the largest data value, points are cropped out of the plot
%% Cell type:code id: tags:
``` python
iris_virginica.plot.scatter(x = "pet-width", y = "pet-length", xlim = ???, ylim = ???)
```
%% Cell type:code id: tags:
``` python
ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
                                 xlim = (0, 6), ylim = (0, 6),
                                 figsize = (3, 3))
# What is wrong with this plot?
```
%% Cell type:markdown id: tags:
What is the maximum pet-length?
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
ax.get_ylim()
```
%% Cell type:markdown id: tags:
Let's include assert statements to make sure we don't crop the plot!
%% Cell type:code id: tags:
``` python
ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
                                 xlim = (0, 6), ylim = (0, 6),
                                 figsize = (3, 3))
assert iris_virginica["pet-length"].max() <= ax.get_ylim()[1]
```
%% Cell type:markdown id: tags:
### Now let's try all 4 assert statements
```
assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
```
%% Cell type:code id: tags:
``` python
ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
                                 xlim = (0, 7), ylim = (0, 7),
                                 figsize = (3, 3))
assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
```
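%% Cell type:markdown id: tags:
The four asserts can be bundled into a reusable check. The sketch below is one way to do that; `assert_not_cropped` is a hypothetical helper (not part of pandas or matplotlib), `petals` holds made-up values, and the `Agg` backend is an assumption so it runs outside a notebook. It relies on pandas labeling the axes with the column names, as the asserts above do.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside a notebook
import pandas as pd

def assert_not_cropped(ax, df):
    """Hypothetical helper: fail if any plotted point lies outside the axis limits."""
    x_col, y_col = ax.get_xlabel(), ax.get_ylabel()
    assert df[x_col].min() >= ax.get_xlim()[0]
    assert df[x_col].max() <= ax.get_xlim()[1]
    assert df[y_col].min() >= ax.get_ylim()[0]
    assert df[y_col].max() <= ax.get_ylim()[1]

# Made-up petal measurements; the largest pet-length is above 6
petals = pd.DataFrame({"pet-width": [1.8, 2.1, 2.5],
                       "pet-length": [5.1, 6.0, 6.9]})

# Limits of (0, 7) contain every point, so the check passes silently
ok_ax = petals.plot.scatter(x="pet-width", y="pet-length", xlim=(0, 7), ylim=(0, 7))
assert_not_cropped(ok_ax, petals)

# Limits of (0, 6) cut off the 6.9 point, so the check raises AssertionError
cropped_ax = petals.plot.scatter(x="pet-width", y="pet-length", xlim=(0, 6), ylim=(0, 6))
try:
    assert_not_cropped(cropped_ax, petals)
    cropped = False
except AssertionError:
    cropped = True
print("cropped:", cropped)
```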
%% Cell type:markdown id: tags:
### Time-Permitting
Plot this data in an interesting/meaningful way & identify any correlations.
%% Cell type:code id: tags:
``` python
students = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
],
"height": [
68,
66,
60,
72
]
})
students
```
%% Cell type:code id: tags:
``` python
# Min, Max, and Overall Difference in Student Height
min_height = students["height"].min()
max_height = students["height"].max()
diff_height = max_height - min_height
# Normalize students heights on a scale of [0, 1] (black to white)
height_colors = (students["height"] - min_height) / diff_height
# Normalize students heights on a scale of [0, 0.5] (black to gray)
height_colors = height_colors / 2
# Color must be a string (e.g. c='0.34')
height_colors = height_colors.astype("string")
height_colors
```
%% Cell type:code id: tags:
``` python
students.plot.scatter(x="attendance", y="gpa", c=height_colors)
```
%% Cell type:code id: tags:
``` python
# name and grade are text columns, so restrict the correlation to numeric columns
students.corr(numeric_only=True)
```
%% Cell type:code id: tags:
``` python
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import display, HTML
display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
```
%% Output
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This dataset is commonly used in introductory Machine Learning courses
- You can train an ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
%% Cell type:code id: tags:
``` python
# Warmup 1: Requests and file writing
# use requests to get this file "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
# check that the request was successful
response.raise_for_status()
# open a file called "iris.csv" for writing the data locally to avoid spamming their server
file_obj = open("iris.csv", "w")
# write the text of response to the file object
file_obj.write(response.text)
# close the file object
file_obj.close()
# Look at the file you downloaded. What's wrong with it?
```
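%% Cell type:markdown id: tags:
A sketch of the file-writing half of this warmup using a `with` block, which closes the file even if the write fails. To avoid a network request, `sample_text` stands in for `response.text`; `save_text` and `demo_path` are hypothetical names, and the file is written to a temporary directory rather than next to the notebook.
%% Cell type:code id: tags:
```python
import os
import tempfile

def save_text(text, path):
    """Hypothetical helper: write text to path; the file is closed automatically."""
    with open(path, "w", encoding="utf-8") as file_obj:
        file_obj.write(text)

# Stand-in for response.text (one hand-typed row)
sample_text = "5.1,3.5,1.4,0.2,Iris-setosa\n"

demo_path = os.path.join(tempfile.mkdtemp(), "iris.csv")
save_text(sample_text, demo_path)

# Read the first line back to confirm what was written
with open(demo_path, encoding="utf-8") as file_obj:
    first_line = file_obj.readline()
print(first_line)
```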
%% Cell type:code id: tags:
``` python
# Warmup 2: Making a DataFrame
# read the "iris.csv" file into a Pandas dataframe
# display the head of the data frame
```
%% Cell type:code id: tags:
``` python
# Warmup 3: Our CSV file has no header....let's add column names.
# Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
# Attribute Information:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
```
%% Cell type:code id: tags:
``` python
# Warmup 4: Connect to our database version of this data
iris_conn = sqlite3.connect("iris-flowers.db")
# find out the name of the table
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
```
%% Cell type:code id: tags:
``` python
# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
# Break any ties by ordering by the shortest sepal width.
pd.read_sql("""
""", iris_conn)
```
%% Cell type:markdown id: tags:
# Lecture 36: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about Trees.
The city of Madison maintains a database of all the trees they care for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot of a DataFrame using the following function...
`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
Plot the trees data comparing a tree's age to its height...
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
%% Cell type:code id: tags:
``` python
# Plot with some more beautification options.
```
%% Cell type:code id: tags:
``` python
# Add a title to your plot.
```
%% Cell type:markdown id: tags:
#### Correlation
%% Cell type:code id: tags:
``` python
# What is the correlation between our DataFrame columns?
```
%% Cell type:code id: tags:
``` python
# What is the correlation between age and height (don't use .iloc)
```
%% Cell type:markdown id: tags:
### The Size can be based on a DataFrame value
%% Cell type:code id: tags:
``` python
# Option 1:
trees_df.plot.scatter(x="age", y="height", marker="H", s="diameter")
```
%% Cell type:code id: tags:
``` python
# Option 2:
trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
```
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Iris Data
%% Cell type:code id: tags:
``` python
iris_df
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
%% Cell type:code id: tags:
``` python
# In Pandas
varieties = ???
varieties
```
%% Cell type:code id: tags:
``` python
# In SQL
varietes = pd.read_sql("""
""", iris_conn)
varietes
```
%% Cell type:markdown id: tags:
In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
iris_conn.close()
```
%% Cell type:code id: tags:
``` python
# Change this scatter plot so that the data is only for class ='Iris-setosa'
iris_df.plot.scatter(x = "pet-width", y = "pet-length")
```
%% Cell type:code id: tags:
``` python
# Write a for loop that iterates through each variety in classes
# and makes a plot for only that class
for i in range(len(varietes)):
    variety = varietes[i]
    pass
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color
colors = ["blue", "green", "red"]
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color AND marker
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# Did you notice that it made 3 plots?!?! What's deceiving about this?
```
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# Have to be VERY careful to not crop out data.
# We'll talk about this next lecture.
```
%% Cell type:code id: tags:
``` python
# Better yet, we could combine these.
```
%% Cell type:markdown id: tags:
### We can draw several plots in one subplot, called an AxesSubplot, using the keyword ax
1. If an AxesSubplot is passed as ax, plot in that subplot
2. If ax is None, create a new AxesSubplot
3. Return the AxesSubplot that was used
%% Cell type:code id: tags:
``` python
# complete this code to make 3 plots in one
plot_area = None # don't change this...look at this variable in line 12
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:markdown id: tags:
### Time-Permitting
Plot this data in an interesting/meaningful way & identify any correlations.
%% Cell type:code id: tags:
``` python
students = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
],
"height": [
68,
66,
60,
72
]
})
students
```
%% Cell type:code id: tags:
``` python
# Min, Max, and Overall Difference in Student Height
min_height = students["height"].min()
max_height = students["height"].max()
diff_height = max_height - min_height
# Normalize students heights on a scale of [0, 1] (black to white)
height_colors = (students["height"] - min_height) / diff_height
# Normalize students heights on a scale of [0, 0.5] (black to gray)
height_colors = height_colors / 2
# Color must be a string (e.g. c='0.34')
height_colors = height_colors.astype("string")
height_colors
```
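%% Cell type:markdown id: tags:
A worked version of the arithmetic above, as a sketch on the same four heights. With a minimum of 60 and a range of 12, a height of 68 maps to 8 / 12 and is then halved. The values are rounded to two decimals only to make the output easy to read; `heights`, `scaled`, `gray`, and `gray_strings` are illustrative names.
%% Cell type:code id: tags:
```python
import pandas as pd

heights = pd.Series([68, 66, 60, 72])

# Step 1: rescale to [0, 1], where 0 is the shortest and 1 is the tallest student
scaled = (heights - heights.min()) / (heights.max() - heights.min())

# Step 2: halve the range to [0, 0.5] and round for readability
gray = (scaled / 2).round(2)

# Step 3: matplotlib reads a string such as "0.25" as a shade of gray (0 = black, 1 = white)
gray_strings = gray.astype(str)
print(list(gray_strings))
```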
%% Cell type:code id: tags:
``` python
# Plot!
```
%% Cell type:code id: tags:
``` python
# What are the correlations?
```
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
%% Cell type:markdown id: tags:
https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500
%% Cell type:code id: tags:
``` python
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Output
%% Cell type:code id: tags:
``` python
import csv
import os
```
%% Cell type:code id: tags:
``` python
# copied from https://automatetheboringstuff.com/2e/chapter16/
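def process_csv(filename):
    # read the whole CSV file into a list of rows (each row is a list of strings)
    exampleFile = open(filename, encoding="utf-8")
    exampleReader = csv.reader(exampleFile)
    exampleData = list(exampleReader)
    # close the file once all rows have been read
    exampleFile.close()
    return exampleData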
```
%% Cell type:markdown id: tags:
## Example 1: List Visualization
### Write a gen_html function
- Input: shopping_list and path to shopping.html
- Outcome: create shopping.html file
### Pseudocode
1. Open "shopping.html" in write mode.
2. Write \<ul\> tag into the html file
3. Iterate over each item in shopping list.
4. Write each item with \<li\> tag.
5. After you are done iterating, write \</ul\> tag.
6. Close the file object.
%% Cell type:code id: tags:
``` python
def gen_html(shopping_list, html_path):
    f = open(html_path, "w")
    # open the unordered list
    f.write("<ul>\n")
    for item in shopping_list:
        # one list item per shopping item
        f.write("<li>" + str(item) + "</li>\n")
    # close the list once every item has been written
    f.write("</ul>\n")
    f.close()

gen_html(["apples", "oranges", "milk", "banana"], "shopping.html")
```
%% Cell type:markdown id: tags:
## Example 2: Dictionary Visualization
### Write a csv_to_html function
- Input: path to review1.csv and path to reviews.html
- Outcome 1: create an html file for each review
- Outcome 2: create reviews.html file containing a link to the html file for each review
### Pseudocode
1. Create data_html folder using os.mkdir. Make sure to use try ... except blocks to catch FileExistsError
2. Use process_csv function to read csv data and split the header and the data
3. For each review, extract review id, review title, review text.
4. Generate the \<rid\>.html for each review inside data_html folder.
    - Open \<rid\>.html in write mode
    - Add review title using \<h1\> tag
    - Add review text inside \<p\> tag
    - Close \<rid\>.html file object
5. Generate a reviews.html file which has a link to each review html page \<rid\>.html
    - Open reviews.html file in write mode
    - Add each \<rid\>.html as hyperlink using \<a\> tag.
    - Close reviews.html file
%% Cell type:code id: tags:
``` python
def csv_to_html(csv_path, html_path):
    # STEP 1: create the data_html folder (do nothing if it already exists)
    try:
        os.mkdir("data_html")
    except FileExistsError:
        pass

    # STEP 2: read the csv data and split the header from the data rows
    reviews_data = process_csv(csv_path)
    reviews_header = reviews_data[0]
    reviews_data = reviews_data[1:]

    reviews_file = open(html_path, "w")
    reviews_file.write("<ul>\n")

    for row in reviews_data:
        # STEP 3: extract review id, review title, review text
        rid = row[reviews_header.index("review id")]
        title = row[reviews_header.index("review title")]
        text = row[reviews_header.index("review text")]

        # STEP 4: generate the <rid>.html for each review inside data_html folder
        review_path = os.path.join("data_html", str(rid) + ".html")
        html_file = open(review_path, "w")
        html_file.write("<h1>{}</h1><p>{}</p>".format(title, text))
        html_file.close()

        # STEP 5: generate a reviews.html file which has link to each review html page <rid>.html
        reviews_file.write('<li><a href = "{}">{}</a>'.format(review_path, str(rid) + ":" + str(title)) + "<br>\n")

    reviews_file.write("</ul>\n")
    reviews_file.close()

csv_to_html(os.path.join("data", "review1.csv"), "reviews.html")
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Cell type:code id: tags:
``` python
import csv
import os
```
%% Cell type:markdown id: tags:
## Example 1: List Visualization
### Write a gen_html function
- Input: shopping_list and path to shopping.html
- Outcome: create shopping.html file
### Pseudocode
1. Open "shopping.html" in write mode.
2. Write \<ul\> tag into the html file
3. Iterate over each item in shopping list.
4. Write each item with \<li\> tag.
5. After you are done iterating, write \</ul\> tag.
6. Close the file object.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Example 2: Dictionary Visualization
### Write a csv_to_html function
- Input: path to review1.csv and path to reviews.html
- Outcome 1: create an html file for each review
- Outcome 2: create reviews.html file containing a link to the html file for each review
### Pseudocode
1. Create data_html folder using os.mkdir. Make sure to use try ... except blocks to catch FileExistsError
2. Use process_csv function to read csv data and split the header and the data
3. For each review, extract review id, review title, review text.
4. Generate the \<rid\>.html for each review inside data_html folder.
    - Open \<rid\>.html in write mode
    - Add review title using \<h1\> tag
    - Add review text inside \<p\> tag
    - Close \<rid\>.html file object
5. Generate a reviews.html file which has a link to each review html page \<rid\>.html
    - Open reviews.html file in write mode
    - Add each \<rid\>.html as hyperlink using \<a\> tag.
    - Close reviews.html file
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Web 3
- HTML parsing using BeautifulSoup
%% Cell type:code id: tags:
``` python
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Output
%% Cell type:code id: tags:
``` python
import requests #For downloading the HTML content using HTTP GET request
from bs4 import BeautifulSoup #For parsing the HTML content and searching through the HTML
import os
import pandas as pd
```
%% Cell type:markdown id: tags:
# STAGE 1: extract all state URLs from the states page
## Stage 1 pseudocode
1. Use requests module to send a GET request to https://simple.wikipedia.org/wiki/List_of_U.S._states
2. Don't forget to raise_for_status to ensure you are getting 200 OK status code
3. Explore what r.text gives you
%% Cell type:code id: tags:
``` python
url = "https://simple.wikipedia.org/wiki/List_of_U.S._states"
r = requests.get(url)
r.raise_for_status()
#print(r.text) #Uncomment this line to see the output
```
%% Cell type:markdown id: tags:
## Stage 1 pseudocode continued...
4. Check out what type you are getting from r.text
%% Cell type:code id: tags:
``` python
print(type(r.text))
```
%% Output
<class 'str'>
%% Cell type:markdown id: tags:
## Stage 1 pseudocode continued...
5. Create BeautifulSoup object by passing r.text, "html.parser" as arguments and capture return value into a variable called doc
6. Try prettify() method call --- still not that pretty, right?
%% Cell type:code id: tags:
``` python
doc = BeautifulSoup(r.text, "html.parser")
#print(doc.prettify()) #Uncomment this line to see the output
```
%% Cell type:markdown id: tags:
## Stage 1 pseudocode continued...
7. (Not a code step) Open "https://simple.wikipedia.org/wiki/List_of_U.S._states" in Google Chrome.
    - Right click on one of the state pages
    - Click on "Inspect" --- this opens developer tools
    - This tool lets you explore the html source code
    - Explore the \<table\> and sub tags like \<th\>, \<tr\>, \<td\>
    - Let's go back to coding
%% Cell type:markdown id: tags:
## Stage 1 pseudocode continued...
7. Find all "table" elements in the document by using doc.find_all(...) function and capture return value into a variable "tables"
- explore the length of the value returned from find_all(...) function
- check out the type of the value returned from find_all(...) function
8. Add an assert to check that there is only one table. This guards against the HTML format of the website changing in the future
9. Extract the first table into tbl variable
- explore type of tbl
- try printing the content of tbl --- looks like just a string
%% Cell type:code id: tags:
``` python
tables = doc.find_all("table")
print(len(tables)) # only one table on the states page!
print(type(tables))
#Forward-looking assert: fails if the html format changes on the website
assert len(tables) == 1
tbl = tables[0]
print(type(tbl))
```
%% Output
1
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
%% Cell type:code id: tags:
``` python
#print(tbl) #Uncomment this line to see the output
```
%% Cell type:markdown id: tags:
## Stage 1 pseudocode continued...
10. Find all the tr elements by using the tbl.find_all(...) function and capture the return value into a variable trs.
- explore length of trs, type of trs
- Add an assert checking that length of trs is at least 50 (For 50 US states)
%% Cell type:code id: tags:
``` python
trs = tbl.find_all("tr")
print(len(trs))
print(type(trs))
assert len(trs) >= 50
```
%% Output
52
<class 'bs4.element.ResultSet'>
%% Cell type:markdown id: tags:
## Stage 1 pseudocode continued...
11. Iterate over each item in trs (going to be a lengthy step!)
- print each item (tr tag)
- call tr.find(..) to find "th" elements --- this finds th element for every tr element.
- capture return value into a variable called th
- print th and explore what you are getting.
- find all hyperlinks within each th element: call th.find_all("a") and capture return value into a variable called links
- explore length of links by printing it --- some of the states have 2 links; go back and explore why that is the case and figure out which link you want
- some have 0 links, skip over those entries!
- extract first of the hyperlinks into a variable called link
- print link to confirm you are able to extract the correct link
- explore type of link
- print link.get_text() and look at the attributes of link by saying link.attrs (a dict)
- capture link.get_text() into a variable state
- capture the full URL into a variable state_url --- link.attrs["href"] only holds a relative path. Define a prefix variable holding "https://simple.wikipedia.org" and concatenate prefix + link.attrs["href"]
- create a new dictionary called state_links --- we are going to use this dict to track each state and its URL. Think carefully about where you have to create this empty dict.
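The loop described above can be rehearsed offline on a tiny hand-written table before pointing it at the real page. This is only a sketch: the `sample_html` string and the `extract_links` helper are made up for the illustration, and it uses `urllib.parse.urljoin` from the standard library as one possible alternative to concatenating the prefix by hand.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for the states table (not the real Wikipedia markup)
sample_html = """
<table>
  <tr><th>Name</th><td>Capital</td></tr>
  <tr><th><a href="/wiki/Alabama">Alabama</a></th><td>Montgomery</td></tr>
  <tr><th><a href="/wiki/Georgia_(U.S._state)">Georgia</a> <a href="/wiki/Note">[a]</a></th><td>Atlanta</td></tr>
  <tr><td>row without a th cell</td></tr>
</table>
"""

def extract_links(html, base_url):
    """Map the text of the first link in each row's <th> to a full URL."""
    doc = BeautifulSoup(html, "html.parser")
    result = {}
    for tr in doc.find_all("tr"):
        th = tr.find("th")
        if th is None:          # find(...) returns None when nothing matches
            continue
        links = th.find_all("a")
        if len(links) == 0:     # header cells without hyperlinks
            continue
        link = links[0]         # keep only the first link in the cell
        result[link.get_text()] = urljoin(base_url, link.attrs["href"])
    return result

sample_links = extract_links(sample_html, "https://simple.wikipedia.org")
print(sample_links)
```

The `th is None` check is an extra guard that the sample needs because one of its rows has no th cell at all.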
#### Congrats :) stage 1 is done
%% Cell type:code id: tags:
``` python
prefix = "https://simple.wikipedia.org"
state_links = {} #KEY: state name; VALUE: link to state page
for tr in trs:
    th = tr.find("th")
    links = th.find_all("a")
    #print(len(links))
    #print(th.get_text())
    if len(links) == 0:
        continue
    link = links[0]
    #print(type(link), link)
    #print(link.get_text(), link.attrs) #link.attrs is a dict
    state = link.get_text()
    state_url = prefix + link.attrs["href"]
    state_links[state] = state_url
state_links
```
%% Output
{'postal abbs.': 'https://simple.wikipedia.org/wiki/List_of_U.S._state_abbreviations',
'Alabama': 'https://simple.wikipedia.org/wiki/Alabama',
'Alaska': 'https://simple.wikipedia.org/wiki/Alaska',
'Arizona': 'https://simple.wikipedia.org/wiki/Arizona',
'Arkansas': 'https://simple.wikipedia.org/wiki/Arkansas',
'California': 'https://simple.wikipedia.org/wiki/California',
'Colorado': 'https://simple.wikipedia.org/wiki/Colorado',
'Connecticut': 'https://simple.wikipedia.org/wiki/Connecticut',
'Delaware': 'https://simple.wikipedia.org/wiki/Delaware',
'Florida': 'https://simple.wikipedia.org/wiki/Florida',
'Georgia': 'https://simple.wikipedia.org/wiki/Georgia_(U.S._state)',
'Hawaii': 'https://simple.wikipedia.org/wiki/Hawaii',
'Idaho': 'https://simple.wikipedia.org/wiki/Idaho',
'Illinois': 'https://simple.wikipedia.org/wiki/Illinois',
'Indiana': 'https://simple.wikipedia.org/wiki/Indiana',
'Iowa': 'https://simple.wikipedia.org/wiki/Iowa',
'Kansas': 'https://simple.wikipedia.org/wiki/Kansas',
'Kentucky': 'https://simple.wikipedia.org/wiki/Kentucky',
'Louisiana': 'https://simple.wikipedia.org/wiki/Louisiana',
'Maine': 'https://simple.wikipedia.org/wiki/Maine',
'Maryland': 'https://simple.wikipedia.org/wiki/Maryland',
'Massachusetts': 'https://simple.wikipedia.org/wiki/Massachusetts',
'Michigan': 'https://simple.wikipedia.org/wiki/Michigan',
'Minnesota': 'https://simple.wikipedia.org/wiki/Minnesota',
'Mississippi': 'https://simple.wikipedia.org/wiki/Mississippi',
'Missouri': 'https://simple.wikipedia.org/wiki/Missouri',
'Montana': 'https://simple.wikipedia.org/wiki/Montana',
'Nebraska': 'https://simple.wikipedia.org/wiki/Nebraska',
'Nevada': 'https://simple.wikipedia.org/wiki/Nevada',
'New Hampshire': 'https://simple.wikipedia.org/wiki/New_Hampshire',
'New Jersey': 'https://simple.wikipedia.org/wiki/New_Jersey',
'New Mexico': 'https://simple.wikipedia.org/wiki/New_Mexico',
'New York': 'https://simple.wikipedia.org/wiki/New_York_(state)',
'North Carolina': 'https://simple.wikipedia.org/wiki/North_Carolina',
'North Dakota': 'https://simple.wikipedia.org/wiki/North_Dakota',
'Ohio': 'https://simple.wikipedia.org/wiki/Ohio',
'Oklahoma': 'https://simple.wikipedia.org/wiki/Oklahoma',
'Oregon': 'https://simple.wikipedia.org/wiki/Oregon',
'Pennsylvania': 'https://simple.wikipedia.org/wiki/Pennsylvania',
'Rhode Island': 'https://simple.wikipedia.org/wiki/Rhode_Island',
'South Carolina': 'https://simple.wikipedia.org/wiki/South_Carolina',
'South Dakota': 'https://simple.wikipedia.org/wiki/South_Dakota',
'Tennessee': 'https://simple.wikipedia.org/wiki/Tennessee',
'Texas': 'https://simple.wikipedia.org/wiki/Texas',
'Utah': 'https://simple.wikipedia.org/wiki/Utah',
'Vermont': 'https://simple.wikipedia.org/wiki/Vermont',
'Virginia': 'https://simple.wikipedia.org/wiki/Virginia',
'Washington': 'https://simple.wikipedia.org/wiki/Washington',
'West Virginia': 'https://simple.wikipedia.org/wiki/West_Virginia',
'Wisconsin': 'https://simple.wikipedia.org/wiki/Wisconsin',
'Wyoming': 'https://simple.wikipedia.org/wiki/Wyoming'}
%% Cell type:markdown id: tags:
# STAGE 2: download the html page for each state
## Stage 2 pseudocode
1. Create a directory called "html_files_for_states". Make sure to use a try/except block to catch the FileExistsError exception
2. Initially convert the keys of the state_links dict into a list and work with just the first 3 items in the list of keys
3. Iterate over each key (initially just use 3):
    1. If the key is "postal abbs.", skip processing. What keyword allows you to skip the current iteration of the loop?
    2. To create each state's html file name, join the directory name "html_files_for_states" with the current key and add ".html" to the end.
    3. Add the html file name into a new dictionary called "state_files". Think carefully about where you have to create this empty dict.
    4. Use the requests module's get(...) function to download the contents of the state URL page.
    5. Open the state html file in write mode and write r.text into the state html file.
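The file-handling part of these steps can be tried without touching the network. This is a minimal sketch, assuming a placeholder string in place of `r.text`; the `save_page` helper and the temporary directory are invented for the illustration. It uses `os.makedirs(..., exist_ok=True)`, which is an alternative to the try/except approach named in step 1.

```python
import os
import tempfile

def save_page(directory, state, html_text):
    """Write html_text to <directory>/<state>.html unless the file already exists.

    Returns (path, written), where written says whether a new file was created.
    """
    os.makedirs(directory, exist_ok=True)   # no error if the directory is already there
    path = os.path.join(directory, state + ".html")
    if os.path.exists(path):                # skip work done on an earlier run
        return path, False
    with open(path, "w", encoding="utf-8") as f:   # with closes the file automatically
        f.write(html_text)
    return path, True

# Hypothetical demo in a throwaway directory
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "html_files_for_states")
first = save_page(demo_dir, "Wisconsin", "<html>placeholder</html>")
second = save_page(demo_dir, "Wisconsin", "<html>different text</html>")
print(first[1], second[1])
```

The second call finds the existing file and leaves it untouched, which is the same idea as skipping a download for a state that was already saved.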
#### Congrats :) stage 2 is done
%% Cell type:code id: tags:
``` python
html_dir = "html_files_for_states"
state_files = {} #KEY: state; VALUE: state file
try:
    os.mkdir(html_dir)
except FileExistsError:
    pass
#for state in list(state_links.keys())[:3]: # Use this for initial testing
for state in state_links.keys():
    if state == "postal abbs.":
        continue
    state_url = state_links[state]
    #html file name
    state_file = os.path.join(html_dir, state + ".html")
    state_files[state] = state_file
    #Optimization: if state file already exists, skip downloading it again
    if os.path.exists(state_file):
        continue
    #Download
    r = requests.get(state_url)
    r.raise_for_status() #must be called with (); the bare attribute checks nothing
    print(state_file)
    #Save to a file; the with block closes the file automatically
    with open(state_file, "w", encoding = "utf-8") as f:
        f.write(r.text)
```
%% Output
html_files_for_states/Alabama.html
html_files_for_states/Alaska.html
html_files_for_states/Arizona.html
html_files_for_states/Arkansas.html
html_files_for_states/California.html
html_files_for_states/Colorado.html
html_files_for_states/Connecticut.html
html_files_for_states/Delaware.html
html_files_for_states/Florida.html
html_files_for_states/Georgia.html
html_files_for_states/Hawaii.html
html_files_for_states/Idaho.html
html_files_for_states/Illinois.html
html_files_for_states/Indiana.html
html_files_for_states/Iowa.html
html_files_for_states/Kansas.html
html_files_for_states/Kentucky.html
html_files_for_states/Louisiana.html
html_files_for_states/Maine.html
html_files_for_states/Maryland.html
html_files_for_states/Massachusetts.html
html_files_for_states/Michigan.html
html_files_for_states/Minnesota.html
html_files_for_states/Mississippi.html
html_files_for_states/Missouri.html
html_files_for_states/Montana.html
html_files_for_states/Nebraska.html
html_files_for_states/Nevada.html
html_files_for_states/New Hampshire.html
html_files_for_states/New Jersey.html
html_files_for_states/New Mexico.html
html_files_for_states/New York.html
html_files_for_states/North Carolina.html
html_files_for_states/North Dakota.html
html_files_for_states/Ohio.html
html_files_for_states/Oklahoma.html
html_files_for_states/Oregon.html
html_files_for_states/Pennsylvania.html
html_files_for_states/Rhode Island.html
html_files_for_states/South Carolina.html
html_files_for_states/South Dakota.html
html_files_for_states/Tennessee.html
html_files_for_states/Texas.html
html_files_for_states/Utah.html
html_files_for_states/Vermont.html
html_files_for_states/Virginia.html
html_files_for_states/Washington.html
html_files_for_states/West Virginia.html
html_files_for_states/Wisconsin.html
html_files_for_states/Wyoming.html
%% Cell type:markdown id: tags:
# STAGE 3: extract details from each state page
## Stage 3 pseudocode
1. Write a function state_stats. Input path to 1 state file. Output dict of stats for that state
2. Open state html file, read its content.
3. Create a BeautifulSoup object called doc.
4. doc.find_all("tr") - capture return value into a variable called trs
5. Iterate over each tr element
    1. You can retrieve a pair of elements by saying: cells = tr.find_all(["th", "td"])
    2. Explore the length of cells. Notice that some rows do not contain exactly 2 cells. Let's skip over those.
    3. Create a dict called stats, where the key is the first cell's text (the th element) and the value is the second cell's text (the td element)
6. Don't forget to return the stats dict
7. Call state_stats with state_files["Wisconsin"]
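To see why rows with exactly two cells make natural key/value pairs, the same idea can be run on a small hand-written snippet. The `sample_html` string and the `pairs_from_rows` name are hypothetical, and real state pages are much messier than this.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox-style markup, invented for this example
sample_html = """
<table>
  <tr><th colspan="2">Wisconsin</th></tr>
  <tr><th>Capital</th><td>Madison</td></tr>
  <tr><th>Beverage</th><td>Milk</td></tr>
  <tr><td>a</td><td>b</td><td>c</td></tr>
</table>
"""

def pairs_from_rows(html):
    """Collect {first cell text: second cell text} for rows with exactly two cells."""
    doc = BeautifulSoup(html, "html.parser")
    stats = {}
    for tr in doc.find_all("tr"):
        cells = tr.find_all(["th", "td"])   # a list argument matches either tag
        if len(cells) != 2:                 # title rows and wide rows are skipped
            continue
        stats[cells[0].get_text().strip()] = cells[1].get_text().strip()
    return stats

sample_stats = pairs_from_rows(sample_html)
print(sample_stats)
```

The `strip()` calls are an addition in this sketch; they trim stray whitespace and newlines, which scraped cell text often carries.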
%% Cell type:code id: tags:
``` python
def state_stats(path):
    stats = {}
    f = open(path, encoding = "utf-8")
    html_string = f.read()
    f.close()
    doc = BeautifulSoup(html_string, "html.parser")
    trs = doc.find_all("tr")
    for tr in trs:
        cells = tr.find_all(["th", "td"])
        if len(cells) == 2:
            key = cells[0].get_text()
            value = cells[1].get_text()
            stats[key] = value
    return stats
wi_stats = state_stats(state_files["Wisconsin"])
print("WI state drink:", wi_stats["Beverage"])
print("WI state dance:", wi_stats["Dance"])
```
%% Output
WI state drink: Milk
WI state dance: Polka
%% Cell type:markdown id: tags:
## Stage 3 pseudocode continued
- Iterate over all the state files, call state_stats function, and save the return value into a variable.
- Keep track of each state's stats in a dict called states_details
- Create a pandas DataFrame from the state_details dict
- Explore the DataFrame.
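A dict of dicts maps onto a DataFrame in a specific way: the outer keys become column labels and the inner keys become row labels, with NaN wherever an inner dict lacks a key. The toy data below is made up purely to show that shape.

```python
import pandas as pd

# Hypothetical miniature version of the nested dict
toy_details = {
    "Wisconsin": {"Capital": "Madison", "Beverage": "Milk"},
    "Wyoming": {"Capital": "Cheyenne"},   # no "Beverage" entry
}

toy_df = pd.DataFrame(toy_details)
print(toy_df)                             # one column per state, one row per stat

capitals = toy_df.loc["Capital"]          # a row: one value per state
wisconsin = toy_df.T.loc["Wisconsin"]     # transpose first to get a state as a row
print(pd.isna(toy_df.loc["Beverage", "Wyoming"]))   # missing entries show up as NaN
```

This is why a stat that only one state page defines produces a row that is NaN for every other state.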
%% Cell type:code id: tags:
``` python
states_details = {}
for state in state_files.keys():
    stats = state_stats(state_files[state])
    states_details[state] = stats
```
%% Cell type:code id: tags:
``` python
states_df = pd.DataFrame(states_details)
states_df
```
%% Output
Alabama \
Country United States
Before statehood Alabama Territory
Admitted to the Union December 14, 1819 (22nd)
Capital Montgomery
Largest city Birmingham
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Alaska \
Country United States
Before statehood Territory of Alaska
Admitted to the Union January 3, 1959 (49th)
Capital Juneau
Largest city Anchorage
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Arizona \
Country United States
Before statehood Arizona Territory
Admitted to the Union February 14, 1912 (48th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Arkansas \
Country United States
Before statehood Arkansas Territory
Admitted to the Union June 15, 1836 (25th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
California \
Country United States
Before statehood Mexican Cession unorganized territory
Admitted to the Union September 9, 1850 (31st)
Capital Sacramento[1]
Largest city Los Angeles
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Colorado \
Country United States
Before statehood NaN
Admitted to the Union August 1, 1876 (38th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Connecticut \
Country United States
Before statehood Connecticut Colony
Admitted to the Union January 9, 1788 (5th)
Capital Hartford[1]
Largest city Bridgeport
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Delaware \
Country United States
Before statehood Delaware Colony, New Netherland, New Sweden
Admitted to the Union December 7, 1787 (1st)
Capital Dover
Largest city Wilmington
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Florida \
Country United States
Before statehood Florida Territory
Admitted to the Union March 3, 1845 (27th)
Capital Tallahassee[1]
Largest city Jacksonville[5]
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Georgia \
Country United States
Before statehood Province of Georgia
Admitted to the Union January 2, 1788 (4th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
... \
Country ...
Before statehood ...
Admitted to the Union ...
Capital ...
Largest city ...
... ...
Largest cities (pop. over 50,000) ...
Smaller cities (pop. 15,000 to 50,000) ...
Largest villages (pop. over 15,000) ...
Highest elevation (Gannett Peak[2][3][4]) ...
Lowest elevation (Belle Fourche River at South ... ...
South Dakota \
Country United States
Before statehood Dakota Territory
Admitted to the Union November 2, 1889 (39th or 40th)
Capital Pierre
Largest city Sioux Falls
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Tennessee \
Country United States
Before statehood Southwest Territory
Admitted to the Union June 1, 1796 (16th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Texas \
Country United States
Before statehood Republic of Texas
Admitted to the Union December 29, 1845 (28th)
Capital Austin
Largest city Houston
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Utah \
Country United States
Before statehood Utah Territory
Admitted to the Union January 4, 1896 (45th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Vermont \
Country United States
Before statehood Vermont Republic
Admitted to the Union March 4, 1791 (14th)
Capital Montpelier
Largest city Burlington
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Virginia \
Country United States
Before statehood Colony of Virginia
Admitted to the Union June 25, 1788 (10th)
Capital Richmond
Largest city Virginia Beach
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Washington \
Country United States
Before statehood Washington Territory
Admitted to the Union November 11, 1889 (42nd)
Capital Olympia
Largest city Seattle
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
West Virginia \
Country United States
Before statehood Part of Virginia
Admitted to the Union June 20, 1863 (35th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Wisconsin \
Country United States
Before statehood Wisconsin Territory
Admitted to the Union May 29, 1848 (30th)
Capital Madison
Largest city Milwaukee
... ...
Largest cities (pop. over 50,000) \nAppleton\nEau Claire\nGreen Bay\nJanesville\...
Smaller cities (pop. 15,000 to 50,000) \nBeaver Dam\nBeloit\nBrookfield\nCudahy\nDe P...
Largest villages (pop. over 15,000) \nAshwaubenon\nBellevue\nCaledonia\nFox Crossi...
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South ... NaN
Wyoming
Country United States
Before statehood Wyoming Territory
Admitted to the Union July 10, 1890 (44th)
Capital NaN
Largest city NaN
... ...
Largest cities (pop. over 50,000) NaN
Smaller cities (pop. 15,000 to 50,000) NaN
Largest villages (pop. over 15,000) NaN
Highest elevation (Gannett Peak[2][3][4]) 13,809 ft (4,209.1 m)
Lowest elevation (Belle Fourche River at South ... 3,101 ft (945 m)
[327 rows x 50 columns]
%% Cell type:code id: tags:
``` python
states_df.loc["Capital"]
```
%% Output
Alabama Montgomery
Alaska Juneau
Arizona NaN
Arkansas NaN
California Sacramento[1]
Colorado NaN
Connecticut Hartford[1]
Delaware Dover
Florida Tallahassee[1]
Georgia NaN
Hawaii NaN
Idaho NaN
Illinois NaN
Indiana NaN
Iowa NaN
Kansas Topeka
Kentucky Frankfort
Louisiana Baton Rouge
Maine Augusta
Maryland Annapolis
Massachusetts NaN
Michigan Lansing
Minnesota Saint Paul
Mississippi NaN
Missouri Jefferson City
Montana Helena
Nebraska Lincoln
Nevada Carson City
New Hampshire Concord
New Jersey Trenton
New Mexico Santa Fe
New York Albany
North Carolina Raleigh
North Dakota Bismarck
Ohio NaN
Oklahoma NaN
Oregon Salem
Pennsylvania Harrisburg
Rhode Island NaN
South Carolina Columbia
South Dakota Pierre
Tennessee NaN
Texas Austin
Utah NaN
Vermont Montpelier
Virginia Richmond
Washington Olympia
West Virginia NaN
Wisconsin Madison
Wyoming NaN
Name: Capital, dtype: object
%% Cell type:code id: tags:
``` python
states_df.T.loc["Wisconsin"]
```
%% Output
Country United States
Before statehood Wisconsin Territory
Admitted to the Union May 29, 1848 (30th)
Capital Madison
Largest city Milwaukee
...
Largest cities (pop. over 50,000) \nAppleton\nEau Claire\nGreen Bay\nJanesville\...
Smaller cities (pop. 15,000 to 50,000) \nBeaver Dam\nBeloit\nBrookfield\nCudahy\nDe P...
Largest villages (pop. over 15,000) \nAshwaubenon\nBellevue\nCaledonia\nFox Crossi...
Highest elevation (Gannett Peak[2][3][4]) NaN
Lowest elevation (Belle Fourche River at South Dakota border[3][4]) NaN
Name: Wisconsin, Length: 327, dtype: object
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import HTML
HTML('<style>em { color: red; }</style>')
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:code id: tags:
``` python
# import statements
import sqlite3
import pandas as pd
import os
```
%% Cell type:markdown id: tags:
## Warmup: SQL query clauses
**Mandatory SQL clauses**
- SELECT: column, column, ... or *
- FROM: table
**Optional SQL clauses**
- WHERE: boolean expression (if row has ....)
- can use AND, OR, NOT
- ORDER BY column (ASC, DESC)
- LIMIT: num rows
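These clauses can be tried on a throwaway in-memory database, which needs no file on disk. The `films` table and its three rows are invented for the example (they are not the contents of `movies.db`), and the sketch uses plain `sqlite3` cursors.

```python
import sqlite3

# Hypothetical toy table held in memory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 2016, 7.5), ("B", 2016, 8.2), ("C", 2019, 6.9)],
)

# SELECT and FROM are required; WHERE, ORDER BY and LIMIT are optional
# and, when present, appear in this order.
rows = conn.execute("""
    SELECT title, rating
    FROM films
    WHERE year = 2016 AND NOT rating < 7
    ORDER BY rating DESC
    LIMIT 1
""").fetchall()
print(rows)
```

`WHERE` keeps the two 2016 rows, `ORDER BY ... DESC` puts the higher rating first, and `LIMIT 1` keeps only that row.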
%% Cell type:code id: tags:
``` python
# open up the movies database
movies_path = "movies.db"
assert os.path.exists(movies_path)
c = sqlite3.connect(movies_path)
```
%% Cell type:code id: tags:
``` python
# what are the table names?
df = pd.read_sql("select * from sqlite_master where type='table'", c)
df
```
%% Output
type name tbl_name rootpage \
0 table movies movies 2
sql
0 CREATE TABLE "movies" (\n"Title" TEXT,\n "Gen...
%% Cell type:code id: tags:
``` python
# what are the data types?
print(df["sql"].iloc[0])
```
%% Output
CREATE TABLE "movies" (
"Title" TEXT,
"Genre" TEXT,
"Director" TEXT,
"Cast" TEXT,
"Year" INTEGER,
"Runtime" INTEGER,
"Rating" REAL,
"Revenue" REAL
)
%% Cell type:code id: tags:
``` python
# what is all our data?
pd.read_sql("select * from movies", c)
```
%% Output
Title Genre \
0 Guardians of the Galaxy Action,Adventure,Sci-Fi
1 Prometheus Adventure,Mystery,Sci-Fi
2 Split Horror,Thriller
3 Sing Animation,Comedy,Family
4 Suicide Squad Action,Adventure,Fantasy
... ... ...
1063 Guardians of the Galaxy Vol. 2 Action, Adventure, Comedy
1064 Baby Driver Action, Crime, Drama
1065 Only the Brave Action, Biography, Drama
1066 Incredibles 2 Animation, Action, Adventure
1067 A Star Is Born Drama, Music, Romance
Director Cast \
0 James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
1 Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael ...
2 M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
3 Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma...
4 David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D...
... ... ...
1063 James Gunn Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...
1064 Edgar Wright Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...
1065 Joseph Kosinski Josh Brolin, Miles Teller, Jeff Bridges, Jenni...
1066 Brad Bird Craig T. Nelson, Holly Hunter, Sarah Vowell, H...
1067 Bradley Cooper Lady Gaga, Bradley Cooper, Sam Elliott, Greg G...
Year Runtime Rating Revenue
0 2014 121 8.1 333.13
1 2012 124 7.0 126.46
2 2016 117 7.3 138.12
3 2016 108 7.2 270.32
4 2016 123 6.2 325.02
... ... ... ... ...
1063 2017 136 7.6 389.81
1064 2017 113 7.6 107.83
1065 2017 134 7.6 18.34
1066 2018 118 7.6 608.58
1067 2018 136 7.6 215.29
[1068 rows x 8 columns]
%% Cell type:code id: tags:
``` python
# this helper function saves typing for each query
def qry(sql, conn = c):
    return pd.read_sql(sql, conn)
```
%% Cell type:markdown id: tags:
Sample query format:
```
SELECT
FROM movies
WHERE
ORDER BY
LIMIT
```
%% Cell type:code id: tags:
``` python
# call qry ....copy/paste the query from above
qry("""
SELECT *
FROM movies
""")
```
%% Output
Title Genre \
0 Guardians of the Galaxy Action,Adventure,Sci-Fi
1 Prometheus Adventure,Mystery,Sci-Fi
2 Split Horror,Thriller
3 Sing Animation,Comedy,Family
4 Suicide Squad Action,Adventure,Fantasy
... ... ...
1063 Guardians of the Galaxy Vol. 2 Action, Adventure, Comedy
1064 Baby Driver Action, Crime, Drama
1065 Only the Brave Action, Biography, Drama
1066 Incredibles 2 Animation, Action, Adventure
1067 A Star Is Born Drama, Music, Romance
Director Cast \
0 James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
1 Ridley Scott Noomi Rapace, Logan Marshall-Green, Michael ...
2 M. Night Shyamalan James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
3 Christophe Lourdelet Matthew McConaughey,Reese Witherspoon, Seth Ma...
4 David Ayer Will Smith, Jared Leto, Margot Robbie, Viola D...
... ... ...
1063 James Gunn Chris Pratt, Zoe Saldana, Dave Bautista, Vin D...
1064 Edgar Wright Ansel Elgort, Jon Bernthal, Jon Hamm, Eiza Gon...
1065 Joseph Kosinski Josh Brolin, Miles Teller, Jeff Bridges, Jenni...
1066 Brad Bird Craig T. Nelson, Holly Hunter, Sarah Vowell, H...
1067 Bradley Cooper Lady Gaga, Bradley Cooper, Sam Elliott, Greg G...
Year Runtime Rating Revenue
0 2014 121 8.1 333.13
1 2012 124 7.0 126.46
2 2016 117 7.3 138.12
3 2016 108 7.2 270.32
4 2016 123 6.2 325.02
... ... ... ... ...
1063 2017 136 7.6 389.81
1064 2017 113 7.6 107.83
1065 2017 134 7.6 18.34
1066 2018 118 7.6 608.58
1067 2018 136 7.6 215.29
[1068 rows x 8 columns]
%% Cell type:markdown id: tags:
### What's the *Title* of the movie with the highest *Rating*?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT Title, Rating
FROM movies
ORDER BY Rating DESC
LIMIT 1
""")
df
```
%% Output
Title Rating
0 The Dark Knight 9.0
%% Cell type:code id: tags:
``` python
df.iloc[0]["Title"]
```
%% Output
'The Dark Knight'
%% Cell type:markdown id: tags:
### Which *Director* made the movie with the shortest *Runtime*?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT Director, Runtime
FROM movies
ORDER BY Runtime
LIMIT 1
""")
df
```
%% Output
Director Runtime
0 Claude Barras 66
%% Cell type:code id: tags:
``` python
df.iloc[0]["Director"]
```
%% Output
'Claude Barras'
%% Cell type:markdown id: tags:
### What was the *Director* and *Title* of the movie with the largest *Revenue*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, revenue, title
FROM movies
ORDER BY revenue DESC
LIMIT 1
""")
```
%% Output
Director Revenue Title
0 J.J. Abrams 936.63 Star Wars: Episode VII - The Force Awakens
%% Cell type:markdown id: tags:
### What is the *Title* of the movie with the highest *Revenue* in *Year* 2019?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT title, revenue, year
FROM movies
WHERE year = 2019
ORDER BY revenue DESC
LIMIT 1
""")
df
```
%% Output
Title Revenue Year
0 Avengers: Endgame 858.37 2019
%% Cell type:code id: tags:
``` python
df.iloc[0]["Title"]
```
%% Output
'Avengers: Endgame'
%% Cell type:markdown id: tags:
### Which *3 movies* had the highest *Revenue* in the *Year* 2019?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT title, revenue
FROM movies
WHERE year = 2019
ORDER BY revenue DESC
LIMIT 3
""")
df
```
%% Output
Title Revenue
0 Avengers: Endgame 858.37
1 Toy Story 4 434.04
2 Joker 335.45
%% Cell type:code id: tags:
``` python
# Extract title column and convert to list
list(df["Title"])
```
%% Output
['Avengers: Endgame', 'Toy Story 4', 'Joker']
%% Cell type:markdown id: tags:
## Lecture 33: Database 2
Learning Objectives:
- Use the AS clause to rename a column or a calculation
- Use SQL Aggregate functions to summarize database columns:
- SUM, AVG, COUNT, MIN, MAX, DISTINCT
- Use the GROUP BY clause to place database rows into buckets.
- Use the HAVING clause to apply conditions to groups.
%% Cell type:markdown id: tags:
### Which *3 movies* have the highest *rating-to-revenue ratios*?
The `AS` clause lets us rename a column or a calculation
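A small illustration of how the alias becomes the name of the result column and can be reused in `ORDER BY`. The `films` table and its values are made up for the sketch.

```python
import sqlite3

# Hypothetical toy table held in memory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, rating REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 8.0, 4.0), ("B", 6.0, 1.0), ("C", 9.0, 9.0)],
)

cur = conn.execute("""
    SELECT title, rating / revenue AS ratio
    FROM films
    ORDER BY ratio DESC
""")
column_names = [d[0] for d in cur.description]   # the alias is the column name
ratios = cur.fetchall()
print(column_names, ratios)
```

Without the alias, the column would be labeled with the full expression text, which is harder to refer to later.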
%% Cell type:code id: tags:
``` python
qry("""
SELECT title, rating / revenue AS ratio
FROM movies
ORDER BY ratio DESC
LIMIT 3
""")
```
%% Output
Title ratio
0 Wakefield 750.0
1 Love, Rosie 720.0
2 Lovesong 640.0
%% Cell type:markdown id: tags:
## Aggregate Queries
```
SUM, AVG, COUNT, MIN, MAX, DISTINCT
```
%% Cell type:markdown id: tags:
### How many *rows of movies* are there?
Note: when we want to count the number of rows, we use COUNT(*)
%% Cell type:code id: tags:
``` python
qry("""
SELECT COUNT(*)
FROM movies
""")
```
%% Output
COUNT(*)
0 1068
%% Cell type:markdown id: tags:
### How many *directors* are there?
%% Cell type:code id: tags:
``` python
# This doesn't feel correct - it counts duplicates for director names!
qry("""
SELECT COUNT(director)
FROM movies
""")
```
%% Output
COUNT(director)
0 1068
%% Cell type:markdown id: tags:
Use COUNT(DISTINCT column_name) to count each distinct value only once
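The three flavors of `COUNT` can be compared side by side on a toy table. The data is invented; one row deliberately has no director so that the difference between `COUNT(*)` and `COUNT(column)` is visible.

```python
import sqlite3

# Hypothetical toy table held in memory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, director TEXT)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", "X"), ("B", "X"), ("C", "Y"), ("D", None)],   # None is stored as NULL
)

counts = conn.execute("""
    SELECT COUNT(*), COUNT(director), COUNT(DISTINCT director)
    FROM films
""").fetchone()
print(counts)
```

`COUNT(*)` counts rows, `COUNT(director)` counts non-NULL values including repeats, and `COUNT(DISTINCT director)` counts each non-NULL value once.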
%% Cell type:code id: tags:
``` python
qry("""
SELECT COUNT(DISTINCT director)
FROM movies
""")
```
%% Output
COUNT(DISTINCT director)
0 679
%% Cell type:markdown id: tags:
### What are the names of the *directors* (without duplicates)?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT DISTINCT director
FROM movies
""")
df
```
%% Output
Director
0 James Gunn
1 Ridley Scott
2 M. Night Shyamalan
3 Christophe Lourdelet
4 David Ayer
.. ...
674 Andrey Zvyagintsev
675 Sean Baker
676 Destin Daniel Cretton
677 Tyler Nilson
678 Bradley Cooper
[679 rows x 1 columns]
%% Cell type:code id: tags:
``` python
# Extract Director column and convert to list
director_list = list(df["Director"])
#director_list # uncomment to see the output
```
%% Cell type:markdown id: tags:
### What is the total *Revenue* of *all the movies*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT SUM(revenue)
FROM movies
""")
```
%% Output
SUM(revenue)
0 80668.27
%% Cell type:markdown id: tags:
### What is the *average rating* across *all movies*?
* v1: with `SUM` and `COUNT`
* v2: with `AVG`
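The two versions give the same answer only when no rating is missing: `SUM` and `AVG` skip NULL values, while `COUNT(*)` counts every row. The sketch below shows the difference on made-up data with one NULL rating.

```python
import sqlite3

# Hypothetical toy table held in memory
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 6.0), ("B", 8.0), ("C", None)],   # C has no rating
)

v1, v2 = conn.execute("""
    SELECT SUM(rating) / COUNT(*), AVG(rating)
    FROM films
""").fetchone()
print(v1, v2)
```

Here v1 divides the sum of two ratings by three rows, while v2 averages only the two ratings that exist. Writing `COUNT(rating)` instead of `COUNT(*)` would make v1 match v2.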
%% Cell type:code id: tags:
``` python
# v1
df = qry("""
SELECT SUM(rating) / COUNT(*)
FROM movies
""")
df
```
%% Output
SUM(rating) / COUNT(*)
0 6.805431
%% Cell type:code id: tags:
``` python
df.iloc[0, 0]
```
%% Output
6.805430711610491
%% Cell type:code id: tags:
``` python
# v2
qry("""
SELECT AVG(rating)
FROM movies
""")
```
%% Output
AVG(rating)
0 6.805431
%% Cell type:markdown id: tags:
### What is the *average revenue* and *average runtime* of *all the movies*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT AVG(revenue), AVG(runtime)
FROM movies
""")
```
%% Output
AVG(revenue) AVG(runtime)
0 75.532088 114.093633
%% Cell type:markdown id: tags:
### What is the *average revenue* for a *Ridley Scott* movie?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT AVG(revenue)
FROM movies
WHERE director = "Ridley Scott"
""")
df
```
%% Output
AVG(revenue)
0 89.8825
%% Cell type:code id: tags:
``` python
df.iloc[0, 0]  # first row, first column
```
%% Output
89.88250000000001
%% Cell type:markdown id: tags:
### *How many movies* were there in *2019*?
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT COUNT(*)
FROM movies
WHERE year = 2019
""")
df
```
%% Output
COUNT(*)
0 23
%% Cell type:code id: tags:
``` python
df.iloc[0, 0]  # first row, first column
```
%% Output
23
%% Cell type:markdown id: tags:
### What *percentage* of the *total revenue* came from the *highest-revenue movie*?
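The query in the next cell selects `title` next to two aggregates. That works in SQLite, which documents that a bare column alongside a single `MAX()` takes its value from the row holding the maximum, but other database engines may reject it or return an arbitrary title. The sketch below compares it with a more portable formulation on a hypothetical three-row table (the data is an assumption for illustration).

```python
import sqlite3

# Hypothetical data, for illustration only (not the real movies.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("A", 25.0), ("B", 25.0), ("C", 50.0)],
)

# SQLite-specific: the bare column `title` comes from the MAX(revenue) row.
bare = conn.execute(
    "SELECT title, MAX(revenue) / SUM(revenue) * 100 FROM movies"
).fetchone()

# More portable: pick the top row explicitly, compute the total in a subquery.
portable = conn.execute(
    """
    SELECT title, revenue * 100.0 / (SELECT SUM(revenue) FROM movies) AS percentage
    FROM movies
    ORDER BY revenue DESC
    LIMIT 1
    """
).fetchone()
print(bare, portable)
conn.close()
```

Both queries report movie "C" with 50 percent of the total revenue in this toy table.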
%% Cell type:code id: tags:
``` python
df = qry("""
SELECT title, MAX(revenue) / SUM(revenue) * 100 AS percentage
FROM movies
""")
df
```
%% Output
Title percentage
0 Star Wars: Episode VII - The Force Awakens 1.161088
%% Cell type:code id: tags:
``` python
df.iloc[0, 0]  # first row, first column (the title)
```
%% Output
'Star Wars: Episode VII - The Force Awakens'
%% Cell type:markdown id: tags:
### What *percentage* of the *revenue* came from the *highest-revenue movie* in *2019*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT title, MAX(revenue) / SUM(revenue) * 100 AS percentage
FROM movies
WHERE year = 2019
""")
```
%% Output
Title percentage
0 Avengers: Endgame 32.19777
%% Cell type:markdown id: tags:
# GROUP BY Queries
```sql
SELECT ???, ??? FROM Movies
GROUP BY ???
```
Sample query format:
```
SELECT
FROM movies
WHERE
GROUP BY
ORDER BY
LIMIT
```
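As a rough sketch of how these clauses fit together, the example below runs one query that uses all of them against a hypothetical five-row table (the rows and the `revenue > 3` threshold are assumptions chosen for illustration).

```python
import sqlite3

# Hypothetical data, for illustration only (not the real movies.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("A", 2018, 10.0),
        ("B", 2018, 20.0),
        ("C", 2019, 5.0),
        ("D", 2020, 40.0),
        ("E", 2020, 2.0),
    ],
)

rows = conn.execute(
    """
    SELECT year, SUM(revenue) AS total_revenue  -- one output row per group
    FROM movies
    WHERE revenue > 3                           -- drops movie "E" before grouping
    GROUP BY year                               -- one group per distinct year
    ORDER BY total_revenue DESC                 -- sort the grouped rows
    LIMIT 2                                     -- keep only the top two years
    """
).fetchall()
print(rows)
conn.close()
```

The result contains 2020 (total 40.0) followed by 2018 (total 30.0); 2019 is cut off by `LIMIT`.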
%% Cell type:markdown id: tags:
### What is the *total revenue* for each *year*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT year, SUM(revenue)
FROM movies
GROUP BY year
""")
```
%% Output
Year SUM(revenue)
0 2006 3624.46
1 2007 4306.23
2 2008 5053.22
3 2009 5292.26
4 2010 5989.65
5 2011 5431.96
6 2012 6910.29
7 2013 7544.21
8 2014 7997.40
9 2015 8854.12
10 2016 11211.65
11 2017 2086.58
12 2018 2675.12
13 2019 2665.93
14 2020 1025.19
%% Cell type:markdown id: tags:
### *How many movies* were directed by each of the top-10 *directors*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, COUNT(*) AS mov_count
FROM movies
GROUP BY director
ORDER BY mov_count DESC
LIMIT 10
""")
```
%% Output
Director mov_count
0 Ridley Scott 8
1 Paul W.S. Anderson 6
2 Michael Bay 6
3 Martin Scorsese 6
4 M. Night Shyamalan 6
5 Denis Villeneuve 6
6 David Yates 6
7 Christopher Nolan 6
8 Zack Snyder 5
9 Woody Allen 5
%% Cell type:markdown id: tags:
### What is the *average rating* for each *director*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, AVG(rating)
FROM movies
GROUP BY director
""")
```
%% Output
Director AVG(rating)
0 Aamir Khan 8.50
1 Aaron Sorkin 7.80
2 Abdellatif Kechiche 7.80
3 Adam Leon 6.50
4 Adam McKay 7.00
.. ... ...
674 Yimou Zhang 6.10
675 Yorgos Lanthimos 7.20
676 Zack Snyder 7.04
677 Zackary Adler 5.10
678 Zoya Akhtar 8.00
[679 rows x 2 columns]
%% Cell type:markdown id: tags:
### How many *unique directors* created a movie in each *year*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT year, COUNT(DISTINCT director) AS director_count
FROM movies
GROUP BY year
""")
```
%% Output
Year director_count
0 2006 44
1 2007 51
2 2008 51
3 2009 51
4 2010 60
5 2011 63
6 2012 64
7 2013 88
8 2014 97
9 2015 127
10 2016 289
11 2017 22
12 2018 19
13 2019 23
14 2020 6
%% Cell type:markdown id: tags:
## Combining GROUP BY with other CLAUSES
![Screen%20Shot%202022-04-21%20at%2011.37.27%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.37.27%20AM.png)
%% Cell type:markdown id: tags:
### What is the *total revenue* per *year*, in *recent* years (last 5 years)?
%% Cell type:code id: tags:
``` python
# "recent" here means the 5 most recent years in the data
qry("""
SELECT year, SUM(revenue) AS total_revenue
FROM movies
GROUP BY Year
ORDER BY Year DESC
LIMIT 5
""")
```
%% Output
Year total_revenue
0 2020 1025.19
1 2019 2665.93
2 2018 2675.12
3 2017 2086.58
4 2016 11211.65
%% Cell type:markdown id: tags:
### Which 5 *directors* have had the *most movies* earning *over 200M dollars*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, COUNT(title) AS count
FROM movies
WHERE revenue > 200
GROUP BY director
ORDER BY count DESC
LIMIT 5
""")
```
%% Output
Director count
0 David Yates 5
1 Michael Bay 4
2 Francis Lawrence 4
3 Anthony Russo 4
4 Zack Snyder 3
%% Cell type:markdown id: tags:
### Which *three directors* have the *greatest average rating*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, AVG(rating) AS avg_rating
FROM movies
GROUP BY director
ORDER BY avg_rating DESC
LIMIT 3
""")
```
%% Output
Director avg_rating
0 Thomas Kail 8.6
1 Sudha Kongara 8.6
2 Olivier Nakache 8.6
%% Cell type:markdown id: tags:
Why might the question above not be the best one to ask?
%% Cell type:code id: tags:
``` python
# These directors could have made just 1 good movie.
# We would want to consider if the director has multiple great movies, instead of just one.
```
%% Cell type:markdown id: tags:
### Which *five directors* have the *greatest average rating* over *at least three movies*?
%% Cell type:markdown id: tags:
Can you solve this question using just `GROUP BY` and `WHERE`?
Answer: No. `WHERE` filters rows of the original table before grouping, so it cannot refer to aggregates such as `COUNT(*)`, which exist only after the groups have been formed.
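A quick way to see this is to run such a query and catch the error. The sketch below assumes a hypothetical one-row table and uses `sqlite3` directly; the exact error message depends on the SQLite version, so it is printed rather than asserted.

```python
import sqlite3

# Hypothetical one-row table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (director TEXT, rating REAL)")
conn.execute("INSERT INTO movies VALUES ('X', 7.0)")

try:
    # An aggregate inside WHERE is rejected when the query is prepared.
    conn.execute(
        "SELECT director FROM movies WHERE COUNT(*) >= 3 GROUP BY director"
    )
    failed = False
except sqlite3.Error as err:
    failed = True
    print("query rejected:", err)
finally:
    conn.close()
```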
%% Cell type:code id: tags:
``` python
# This query wouldn't work: WHERE is applied before GROUP BY, so it cannot filter on COUNT(*)
# qry("""
# SELECT director, AVG(rating) AS avg_rating, COUNT(*) as count
# FROM movies
# WHERE count >= 3
# GROUP BY director
# ORDER BY avg_rating DESC
# LIMIT 3
# """)
```
%% Cell type:markdown id: tags:
We need to be able to filter both BEFORE and AFTER the GROUP BY operation.
![Screen%20Shot%202022-04-21%20at%2011.34.25%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.34.25%20AM.png)
%% Cell type:markdown id: tags:
# WHERE vs. HAVING
* WHERE: filter rows in original table
* HAVING: filter groups
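The sketch below contrasts the two filters on a hypothetical table of seven ratings (the directors "X", "Y", "Z" and their ratings are assumptions for illustration): `WHERE` removes individual movies before counting, while `HAVING` removes whole directors after counting.

```python
import sqlite3

# Hypothetical data, for illustration only (not the real movies.db).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (director TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("X", 9.0), ("X", 5.0), ("X", 8.0), ("Y", 9.5),
     ("Z", 7.0), ("Z", 7.0), ("Z", 7.0)],
)

# WHERE: keep only rows (movies) rated at least 8, then count per director.
where_rows = conn.execute(
    """
    SELECT director, COUNT(*) AS count
    FROM movies
    WHERE rating >= 8
    GROUP BY director
    ORDER BY director
    """
).fetchall()

# HAVING: count all movies per director, then keep groups with at least 3.
having_rows = conn.execute(
    """
    SELECT director, COUNT(*) AS count
    FROM movies
    GROUP BY director
    HAVING COUNT(*) >= 3
    ORDER BY director
    """
).fetchall()
print(where_rows, having_rows)
conn.close()
```

With `WHERE`, director "Z" disappears because none of their movies pass the row filter; with `HAVING`, director "Y" disappears because the group has only one movie.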
%% Cell type:markdown id: tags:
### Which *five* directors with *at least 3 movies* have the *greatest average rating*?
%% Cell type:markdown id: tags:
![Screen%20Shot%202022-04-21%20at%2011.39.17%20AM.png](attachment:Screen%20Shot%202022-04-21%20at%2011.39.17%20AM.png)
%% Cell type:markdown id: tags:
### SQL query sample format (with all main clauses - both mandatory and optional)
```
SELECT
FROM movies
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
```
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, AVG(rating) AS avg_rating, COUNT(*) AS count
FROM movies
GROUP BY director
HAVING count >= 3
ORDER BY avg_rating DESC
LIMIT 3
""")
```
%% Output
Director avg_rating count
0 Christopher Nolan 8.533333 6
1 Pete Docter 8.200000 3
2 Anthony Russo 8.125000 4
%% Cell type:markdown id: tags:
### Which *directors* have had *more than 3 movies* that have been released *since 2010*?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, COUNT(title) AS count
FROM movies
WHERE year >= 2010
GROUP BY director
HAVING count > 3
""")
```
%% Output
Director count
0 Anthony Russo 4
1 Antoine Fuqua 4
2 Christopher Nolan 4
3 David O. Russell 4
4 David Yates 4
5 Denis Villeneuve 6
6 James Wan 4
7 M. Night Shyamalan 4
8 Martin Scorsese 5
9 Michael Bay 4
10 Mike Flanagan 4
11 Paul Feig 4
12 Paul W.S. Anderson 5
13 Peter Berg 4
14 Ridley Scott 5
15 Woody Allen 4
%% Cell type:markdown id: tags:
### Which *directors* have more than *two* movies with runtimes under *100* minutes?
%% Cell type:code id: tags:
``` python
qry("""
SELECT director, COUNT(title) AS count
FROM movies
WHERE runtime < 100
GROUP BY director
HAVING count > 2
""")
```
%% Output
Director count
0 Mike Flanagan 3
1 Nicholas Stoller 3
2 Wes Anderson 3
3 Woody Allen 4
%% Cell type:code id: tags:
``` python
# Don't forget to close the movies.db connection
c.close()
```
......