%% Cell type:code id: tags:
``` python
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import display, HTML
display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
```
%% Cell type:code id: tags:
``` python
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
# new import statement
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
#### Wrapping up bus dataset example
%% Cell type:markdown id: tags:
#### What are the top routes, and how many people ride them daily?
%% Cell type:code id: tags:
``` python
path = "bus.db"
# assert existence of path
assert os.path.exists(path)
# establish connection to bus.db
conn = sqlite3.connect(path)
```
%% Cell type:code id: tags:
``` python
df = pd.read_sql("""
SELECT Route, SUM(DailyBoardings) AS daily
FROM boarding
GROUP BY Route
ORDER BY daily DESC
""", conn)
df
```
%% Cell type:code id: tags:
``` python
# let's extract daily column from df
df["daily"]
```
%% Cell type:code id: tags:
``` python
# let's create a bar plot from daily column Series
df["daily"].plot.bar()
# Oops wrong x-axis labels!
```
%% Cell type:code id: tags:
``` python
df
```
%% Cell type:code id: tags:
``` python
df = ???
# let's plot for top 5 routes alone
???
```
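%% Cell type:markdown id: tags:
One possible way to fill in the blanks above, sketched on a small made-up DataFrame rather than the real query result (the route names and counts are hypothetical). Making `Route` the index gives the bar plot meaningful x-axis labels, and `.iloc[:5]` keeps the top five rows.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
import pandas as pd

# Hypothetical stand-in for the query result, already sorted by daily DESC
df = pd.DataFrame({
    "Route": ["80", "2", "6", "10", "3", "4", "5"],
    "daily": [10211, 4808, 4537, 4425, 2708, 2656, 2404],
})

# Use the route names as the index so they become the x-axis labels
df = df.set_index("Route")

# Keep only the top 5 routes and plot them
top5 = df["daily"].iloc[:5]
ax = top5.plot.bar()
ax.set_ylabel("Rides / Day")
```

Because the index is now `Route`, each bar is labeled with its route name instead of a row number.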
%% Cell type:code id: tags:
``` python
# let's use slicing to aggregate the rest of the data
```
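%% Cell type:markdown id: tags:
The plotting cell that follows expects a Series named `s`. A minimal sketch of how slicing could build it, assuming a made-up `daily` Series (hypothetical values) indexed by route: keep the top five rows and add one extra entry holding the sum of everything else.

```python
import pandas as pd

# Hypothetical daily boardings, sorted from most to least popular route
daily = pd.Series(
    [10211, 4808, 4537, 4425, 2708, 2656, 2404],
    index=["80", "2", "6", "10", "3", "4", "5"],
    name="daily",
)

# Top 5 routes (copy so that adding a row does not touch the original)
s = daily.iloc[:5].copy()

# Everything after the top 5 is summed into a single "other" entry
s["other"] = daily.iloc[5:].sum()
```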
%% Cell type:code id: tags:
``` python
# let's plot the bars
ax = (s / 1000).plot.bar(color = "k")
ax.set_ylabel("Rides / Day (Thousands)")
None
```
%% Cell type:code id: tags:
``` python
conn.close()
```
%% Cell type:markdown id: tags:
### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This set of data is used in beginning Machine Learning Courses
- You can train a ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
%% Cell type:markdown id: tags:
#### Warmup 1: Downloading IRIS dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
%% Cell type:code id: tags:
``` python
# use requests to get this URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
response = ???
# check that the request was successful
???
# open a file called "iris.csv" for writing the data locally
file_obj = open("iris.csv", ???)
# write the text of response to the file object
file_obj.write(???)
# close the file object
file_obj.close()
# Look at the file you downloaded. What's wrong with it?
```
%% Cell type:markdown id: tags:
#### Warmup 2: Making a DataFrame
%% Cell type:code id: tags:
``` python
# read the "iris.csv" file into a Pandas dataframe
iris_df = ???
# display the head of the data frame
iris_df.head()
```
%% Cell type:markdown id: tags:
#### Warmup 3: Our CSV file has no header. Let's add column names.
- Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
%% Cell type:code id: tags:
``` python
# Attribute Information:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# These should be our headers
# ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
iris_df = pd.read_csv("iris.csv",
???)
iris_df.head()
```
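%% Cell type:markdown id: tags:
A small sketch of what the `names` parameter of `pd.read_csv` does, using an in-memory string in place of `iris.csv` (the three rows are illustrative). Without `names`, the first data row is consumed as the header; with `names`, every row is kept as data.

```python
import io
import pandas as pd

# Three header-less rows in the same shape as iris.csv
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Default behaviour: the first line is treated as column names
without_names = pd.read_csv(io.StringIO(raw))

# Passing names= supplies the header, so the first line stays a data row
headers = ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
with_names = pd.read_csv(io.StringIO(raw), names=headers)
```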
%% Cell type:markdown id: tags:
#### Warmup 4: Connect to our database version of this data!
%% Cell type:code id: tags:
``` python
iris_conn = sqlite3.connect("iris-flowers.db")
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
```
%% Cell type:markdown id: tags:
#### Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
Break any ties by ordering by the shortest sepal width.
%% Cell type:code id: tags:
``` python
pd.read_sql("""
SELECT
FROM
WHERE
ORDER BY
LIMIT 10
""", iris_conn)
```
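%% Cell type:markdown id: tags:
A sketch of the query shape on a tiny in-memory database. The table name `iris` and the `class` column appear later in this notebook; the column names `"sep-length"` and `"sep-width"` are assumptions borrowed from the CSV headers, so check the `sqlite_master` output for the real names. Column names containing a hyphen must be double-quoted in SQL.

```python
import sqlite3

# Hypothetical miniature version of the iris table
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE iris ("sep-length" REAL, "sep-width" REAL, class TEXT)')
rows = [
    (5.8, 4.0, "Iris-setosa"),
    (5.7, 4.4, "Iris-setosa"),
    (5.7, 3.8, "Iris-setosa"),
    (5.0, 3.6, "Iris-setosa"),
    (7.0, 3.2, "Iris-versicolor"),
]
conn.executemany("INSERT INTO iris VALUES (?, ?, ?)", rows)

# Longest sepal length first; ties broken by the shortest sepal width
query = """
SELECT *
FROM iris
WHERE class = 'Iris-setosa'
ORDER BY "sep-length" DESC, "sep-width" ASC
LIMIT 10
"""
result = conn.execute(query).fetchall()
conn.close()
```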
%% Cell type:markdown id: tags:
# Lecture 36: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about Trees.
The city of Madison maintains a database of all the trees they care for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot of a DataFrame using the following function...
`df_name.plot.scatter(x = "x_col_name", y = "y_col_name", \
color = "red", marker = "*", s = 50)`
%% Cell type:markdown id: tags:
Plot the trees data comparing a tree's age to its height...
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
%% Cell type:code id: tags:
``` python
# TODO: change y to diameter
```
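%% Cell type:markdown id: tags:
A minimal sketch of the basic call, using the first few rows of the trees data: here `df_name` is `trees_df`, `x_col_name` is `"age"`, and `y_col_name` is `"height"`. The second call swaps the y column to `"diameter"`.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
from pandas import DataFrame

trees_df = DataFrame([
    {"age": 1, "height": 1.5, "diameter": 0.8},
    {"age": 1, "height": 1.9, "diameter": 1.2},
    {"age": 2, "height": 2.5, "diameter": 1.5},
    {"age": 3, "height": 3.2, "diameter": 2.1},
    {"age": 4, "height": 4.9, "diameter": 2.8},
])

# Age on the x-axis, height on the y-axis
ax_height = trees_df.plot.scatter(x="age", y="height")

# Same x-axis, but compare against diameter instead
ax_diameter = trees_df.plot.scatter(x="age", y="diameter")
```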
%% Cell type:markdown id: tags:
Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
%% Cell type:code id: tags:
``` python
# Plot with some more beautification options.
trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
# D for diamond
```
%% Cell type:code id: tags:
``` python
# Add a title to your plot.
ax = trees_df.plot.scatter(x = "age", y = "height", color = "r", marker = "D", s = 50)
# D for diamond
ax.set_title("Tree Age vs Height")
```
%% Cell type:markdown id: tags:
#### Correlation
%% Cell type:code id: tags:
``` python
# What is the correlation between our DataFrame columns?
corr_df = trees_df.corr()
corr_df
```
%% Cell type:code id: tags:
``` python
# What is the correlation between age and height (don't use .iloc)
# Using index in this case isn't considered as hardcoding
corr_df['age']['height']
```
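%% Cell type:markdown id: tags:
To see what the number returned by `.corr()` means, here is a sketch that computes the Pearson correlation coefficient by hand on a few rows of the trees data and compares it with the pandas result. By default, `.corr()` uses the Pearson method.

```python
import pandas as pd

trees_df = pd.DataFrame({
    "age": [1, 1, 2, 2, 3, 4, 4],
    "height": [1.5, 1.9, 1.8, 2.5, 3.2, 3.9, 4.9],
})

x = trees_df["age"]
y = trees_df["height"]

# Deviations of each value from its column mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: sum of products of deviations, scaled by both spreads
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

# The same value looked up in the correlation table
r_pandas = trees_df.corr()["age"]["height"]
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.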
%% Cell type:markdown id: tags:
### Varying Stylistic Parameters
%% Cell type:code id: tags:
``` python
# Option 1:
trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = "diameter")
```
%% Cell type:code id: tags:
``` python
# Option 2:
# this way allows you to make it bigger
trees_df.plot.scatter(x = "age", y = "height", marker = "H", s = trees_df["diameter"] * 50)
```
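%% Cell type:markdown id: tags:
A sketch of how Option 2 controls marker size, on a few rows of the trees data. In matplotlib, `s` is the marker area in points squared, so multiplying the column by a constant scales every marker while keeping the relative differences.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
import pandas as pd

trees_df = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
    "diameter": [0.8, 1.5, 2.1, 2.8],
})

# One size per row: wider trees get larger markers
sizes = trees_df["diameter"] * 50
ax = trees_df.plot.scatter(x="age", y="height", marker="H", s=sizes)

# The sizes that matplotlib stored for the drawn markers
drawn_sizes = ax.collections[0].get_sizes()
```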
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Iris Data
%% Cell type:code id: tags:
``` python
iris_df
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
%% Cell type:code id: tags:
``` python
# In Pandas
varietes = list(set(iris_df["class"]))
varietes
```
%% Cell type:code id: tags:
``` python
# In SQL
varietes = list(pd.read_sql("""
SELECT DISTINCT class
FROM iris
""", iris_conn)["class"])
varietes
```
%% Cell type:markdown id: tags:
In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
iris_conn.close()
```
%% Cell type:code id: tags:
``` python
# Change this scatter plot so that the data is only for class ='Iris-setosa'
```
%% Cell type:code id: tags:
``` python
# Write a for loop that iterates through each variety in classes
# and makes a plot for only that class
# For each class add a color and a marker style
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
for i in range(len(varietes)):
    ???
```
%% Cell type:markdown id: tags:
Did you notice that it made 3 plots?!?! What's deceiving about this?
%% Cell type:markdown id: tags:
### We can draw several plots in one subplot, called an AxesSubplot, using the keyword `ax`
1. If an AxesSubplot is passed as `ax`, then plot in that subplot
2. If `ax` is None, create a new AxesSubplot
3. Return the AxesSubplot that was used
%% Cell type:code id: tags:
``` python
# complete this code to make 3 plots in one
plot_area = None # don't change this...look at this variable in line 12
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
for i in range(len(varietes)):
    ???
```
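%% Cell type:markdown id: tags:
A sketch of the three rules above in action, on a made-up six-row stand-in for the iris data (the values are illustrative). The first call receives `ax=None` and creates a new AxesSubplot; each later call receives the AxesSubplot returned by the previous one, so all three classes land in the same plot.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
import pandas as pd

# Hypothetical miniature iris data: two flowers per class
iris_df = pd.DataFrame({
    "pet-width": [0.2, 0.3, 1.4, 1.5, 2.5, 1.9],
    "pet-length": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

varieties = sorted(set(iris_df["class"]))
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]

plot_area = None  # first iteration: no subplot exists yet
for i in range(len(varieties)):
    variety = varieties[i]
    variety_df = iris_df[iris_df["class"] == variety]
    # Passing ax=plot_area reuses the subplot from the previous iteration
    plot_area = variety_df.plot.scatter(
        x="pet-width", y="pet-length",
        color=colors[i], marker=markers[i],
        label=variety, ax=plot_area,
    )
```

Since every class shares one pair of axes, the classes can be compared directly, which separate plots with different axis ranges do not allow.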
%% Cell type:markdown id: tags:
### Let's focus on "Iris-virginica" data
%% Cell type:code id: tags:
``` python
iris_virginica = ???
assert(len(iris_virginica) == 50)
iris_virginica.head()
```
%% Cell type:code id: tags:
``` python
iris_virginica.plot.scatter(x = "pet-width", y = "pet-length")
```
%% Cell type:markdown id: tags:
### Let's learn about *xlim* and *ylim*
- Allows us to set x-axis and y-axis limits
- Takes either a single value (LOWER-BOUND) or a tuple containing two values (LOWER-BOUND, UPPER-BOUND)
- You need to be careful about setting the UPPER-BOUND
%% Cell type:code id: tags:
``` python
iris_virginica.plot.scatter(x = "pet-width", y = "pet-length", xlim = ???, ylim = ???)
```
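%% Cell type:markdown id: tags:
One way to avoid guessing the UPPER-BOUND is to derive it from the data. This sketch uses four made-up rows in place of the real `iris_virginica` DataFrame and pads each maximum by 1 (the padding amount is an arbitrary choice).

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
import pandas as pd

# Hypothetical stand-in for the Iris-virginica rows
iris_virginica = pd.DataFrame({
    "pet-width": [2.5, 1.9, 2.1, 1.8],
    "pet-length": [6.0, 5.1, 5.9, 6.9],
})

# Upper bounds computed from the data, so no point can be cropped
x_upper = iris_virginica["pet-width"].max() + 1
y_upper = iris_virginica["pet-length"].max() + 1

ax = iris_virginica.plot.scatter(
    x="pet-width", y="pet-length",
    xlim=(0, x_upper), ylim=(0, y_upper),
)
```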
%% Cell type:code id: tags:
``` python
ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
xlim = (0, 6), ylim = (0, 6),
figsize = (3, 3))
# What is wrong with this plot?
```
%% Cell type:markdown id: tags:
What is the maximum pet-length?
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
ax.get_ylim()
```
%% Cell type:markdown id: tags:
Let's include assert statements to make sure we don't crop the plot!
%% Cell type:code id: tags:
``` python
ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
xlim = (0, 6), ylim = (0, 6),
figsize = (3, 3))
assert iris_virginica["pet-length"].max() <= ax.get_ylim()[1]
```
%% Cell type:markdown id: tags:
### Now let's try all 4 assert statements
```
assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
```
%% Cell type:code id: tags:
``` python
ax = iris_virginica.plot.scatter(x = "pet-width", y = "pet-length",
xlim = (0, 7), ylim = (0, 7),
figsize = (3, 3))
assert iris_virginica[ax.get_xlabel()].min() >= ax.get_xlim()[0]
assert iris_virginica[ax.get_xlabel()].max() <= ax.get_xlim()[1]
assert iris_virginica[ax.get_ylabel()].min() >= ax.get_ylim()[0]
assert iris_virginica[ax.get_ylabel()].max() <= ax.get_ylim()[1]
```
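%% Cell type:markdown id: tags:
The four assert statements can be wrapped in a helper so that every plot is checked the same way. This is only a sketch: the function name `assert_not_cropped` is made up for illustration, and it relies on the axis labels being the column names, which is how `plot.scatter` labels its axes.

```python
import matplotlib
matplotlib.use("Agg")  # only needed outside a notebook; omit when using %matplotlib inline
import pandas as pd

def assert_not_cropped(ax, df):
    """Fail if any point of df falls outside the axis limits of ax."""
    x_col = ax.get_xlabel()  # column plotted on the x-axis
    y_col = ax.get_ylabel()  # column plotted on the y-axis
    assert df[x_col].min() >= ax.get_xlim()[0]
    assert df[x_col].max() <= ax.get_xlim()[1]
    assert df[y_col].min() >= ax.get_ylim()[0]
    assert df[y_col].max() <= ax.get_ylim()[1]

# Hypothetical stand-in for the Iris-virginica rows
iris_virginica = pd.DataFrame({
    "pet-width": [2.5, 1.9, 2.1, 1.8],
    "pet-length": [6.0, 5.1, 5.9, 6.9],
})

ax = iris_virginica.plot.scatter(x="pet-width", y="pet-length",
                                 xlim=(0, 7), ylim=(0, 7),
                                 figsize=(3, 3))
assert_not_cropped(ax, iris_virginica)  # passes: all points are inside (0, 7)
```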
%% Cell type:markdown id: tags:
### Time-Permitting
Plot this data in an interesting/meaningful way & identify any correlations.
%% Cell type:code id: tags:
``` python
students = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
],
"height": [
68,
66,
60,
72
]
})
students
```
%% Cell type:code id: tags:
``` python
# Min, Max, and Overall Difference in Student Height
min_height = students["height"].min()
max_height = students["height"].max()
diff_height = max_height - min_height
# Normalize students heights on a scale of [0, 1] (black to white)
height_colors = (students["height"] - min_height) / diff_height
# Normalize students heights on a scale of [0, 0.5] (black to gray)
height_colors = height_colors / 2
# Color must be a string (e.g. c='0.34')
height_colors = height_colors.astype("string")
height_colors
```
%% Cell type:code id: tags:
``` python
students.plot.scatter(x="attendance", y="gpa", c=height_colors)
```
%% Cell type:code id: tags:
``` python
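# numeric_only=True skips the text columns ("name", "grade"),
# which recent pandas versions refuse to correlate
students.corr(numeric_only=True)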
```
%% Cell type:code id: tags:
``` python
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import display, HTML
display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
```
%% Output
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This set of data is used in beginning Machine Learning Courses
- You can train a ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
%% Cell type:code id: tags:
``` python
# Warmup 1: Requests and file writing
# use requests to get this file "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
# check that the request was successful
response.raise_for_status()
# open a file called "iris.csv" for writing the data locally to avoid spamming their server
file_obj = open("iris.csv", "w")
# write the text of response to the file object
file_obj.write(response.text)
# close the file object
file_obj.close()
# Look at the file you downloaded. What's wrong with it?
```
%% Cell type:code id: tags:
``` python
# Warmup 2: Making a DataFrame
# read the "iris.csv" file into a Pandas dataframe
# display the head of the data frame
```
%% Cell type:code id: tags:
``` python
# Warmup 3: Our CSV file has no header....let's add column names.
# Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
# Attribute Information:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
```
%% Cell type:code id: tags:
``` python
# Warmup 4: Connect to our database version of this data
iris_conn = sqlite3.connect("iris-flowers.db")
# find out the name of the table
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
```
%% Cell type:code id: tags:
``` python
# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
# Break any ties by ordering by the shortest sepal width.
pd.read_sql("""
""", iris_conn)
```
%% Cell type:markdown id: tags:
# Lecture 36: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about Trees.
The city of Madison maintains a database of all the trees they care for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot of a DataFrame using the following function...
`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
Plot the trees data comparing a tree's age to its height...
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
%% Cell type:code id: tags:
``` python
# Plot with some more beautification options.
```
%% Cell type:code id: tags:
``` python
# Add a title to your plot.
```
%% Cell type:markdown id: tags:
#### Correlation
%% Cell type:code id: tags:
``` python
# What is the correlation between our DataFrame columns?
```
%% Cell type:code id: tags:
``` python
# What is the correlation between age and height (don't use .iloc)
```
%% Cell type:markdown id: tags:
### The Size can be based on a DataFrame value
%% Cell type:code id: tags:
``` python
# Option 1:
trees_df.plot.scatter(x="age", y="height", marker="H", s="diameter")
```
%% Cell type:code id: tags:
``` python
# Option 2:
trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
```
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Iris Data
%% Cell type:code id: tags:
``` python
iris_df
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
%% Cell type:code id: tags:
``` python
# In Pandas
varieties = ???
varieties
```
%% Cell type:code id: tags:
``` python
# In SQL
varietes = pd.read_sql("""
""", iris_conn)
varietes
```
%% Cell type:markdown id: tags:
In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
iris_conn.close()
```
%% Cell type:code id: tags:
``` python
# Change this scatter plot so that the data is only for class ='Iris-setosa'
iris_df.plot.scatter(x = "pet-width", y = "pet-length")
```
%% Cell type:code id: tags:
``` python
# Write a for loop that iterates through each variety in classes
# and makes a plot for only that class
for i in range(len(varietes)):
    variety = varietes[i]
    pass
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color
colors = ["blue", "green", "red"]
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color AND marker
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# Did you notice that it made 3 plots?!?! What's deceiving about this?
```
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# Have to be VERY careful to not crop out data.
# We'll talk about this next lecture.
```
%% Cell type:code id: tags:
``` python
# Better yet, we could combine these.
```
%% Cell type:markdown id: tags:
### We can draw several plots in one subplot, called an AxesSubplot, using the keyword `ax`
1. If an AxesSubplot is passed as `ax`, then plot in that subplot
2. If `ax` is None, create a new AxesSubplot
3. Return the AxesSubplot that was used
%% Cell type:code id: tags:
``` python
# complete this code to make 3 plots in one
plot_area = None # don't change this...look at this variable in line 12
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:markdown id: tags:
### Time-Permitting
Plot this data in an interesting/meaningful way & identify any correlations.
%% Cell type:code id: tags:
``` python
students = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
],
"height": [
68,
66,
60,
72
]
})
students
```
%% Cell type:code id: tags:
``` python
# Min, Max, and Overall Difference in Student Height
min_height = students["height"].min()
max_height = students["height"].max()
diff_height = max_height - min_height
# Normalize students heights on a scale of [0, 1] (black to white)
height_colors = (students["height"] - min_height) / diff_height
# Normalize students heights on a scale of [0, 0.5] (black to gray)
height_colors = height_colors / 2
# Color must be a string (e.g. c='0.34')
height_colors = height_colors.astype("string")
height_colors
```
%% Cell type:code id: tags:
``` python
# Plot!
```
%% Cell type:code id: tags:
``` python
# What are the correlations?
```
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
%% Cell type:markdown id: tags:
https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500
%% Cell type:code id: tags:
``` python
# known import statements
import pandas as pd
import sqlite3 as sql # note that we are renaming to sql
import os
# new import statement
import numpy as np
```
%% Cell type:markdown id: tags:
# Lecture 35 Pandas 3: Data Transformation
* Data transformation is the process of changing the format, structure, or values of data.
* Often needed during data cleaning and sometimes during data analysis
%% Cell type:markdown id: tags:
# Today's Learning Objectives:
* Setting a column as the index of a pandas `DataFrame`
* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
* Applying transformations to a `DataFrame`:
    * Use `apply` on a pandas `Series` to apply a transformation function
    * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
* Convert `.groupby` examples to SQL
* Solving the same question using SQL and pandas `DataFrame` manipulations:
    * filtering, grouping, and aggregation / summarization
%% Cell type:markdown id: tags:
# The dataset: Spotify songs
Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
%% Cell type:markdown id: tags:
### WARMUP 1: Establish a connection to the spotify.db database
%% Cell type:code id: tags:
``` python
# open up the spotify database
db_pathname = "spotify.db"
assert os.path.exists(db_pathname)
conn = sql.connect(db_pathname)
```
%% Cell type:code id: tags:
``` python
def qry(query):
    # the parameter is named query so it does not shadow the sqlite3 alias sql
    return pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
### WARMUP 2: Identify the table name(s) inside the database
%% Cell type:code id: tags:
``` python
df = qry("SELECT * from sqlite_master")
df
```
%% Output
type name tbl_name rootpage \
0 table spotify spotify 1527
1 index sqlite_autoindex_spotify_1 spotify 1528
sql
0 CREATE TABLE spotify(\nid TEXT PRIMARY KEY,\nt...
1 None
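%% Cell type:markdown id: tags:
To see where the two rows in this output come from, here is a sketch on a throwaway in-memory database (the table `songs` and its columns are made up). Declaring a `TEXT PRIMARY KEY` makes SQLite create an automatic index, which shows up in `sqlite_master` as a second row of type `index` with no `sql` text.

```python
import sqlite3
import pandas as pd

# Hypothetical database with a single table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (id TEXT PRIMARY KEY, genre TEXT)")

def qry(query):
    # Run a SQL query and return the result as a DataFrame
    return pd.read_sql(query, conn)

# Everything SQLite knows about the database schema
master = qry("SELECT * FROM sqlite_master")

# Only the tables, which is usually what we are looking for
tables = qry("SELECT name, sql FROM sqlite_master WHERE type = 'table'")
```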
%% Cell type:markdown id: tags:
### WARMUP 3: Use a pandas lookup expression to extract the "sql" column, then display the full query using an `.iloc` lookup
%% Cell type:code id: tags:
``` python
print(df["sql"].iloc[0])
```
%% Output
CREATE TABLE spotify(
id TEXT PRIMARY KEY,
title BLOB,
song_name BLOB,
genre TEXT,
duration_ms INTEGER,
key INTEGER,
mode INTEGER,
time_signature INTEGER,
tempo REAL,
acousticness REAL,
danceability REAL,
energy REAL,
instrumentalness REAL,
liveness REAL,
loudness REAL,
speechiness REAL,
valence REAL)
%% Cell type:markdown id: tags:
### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
%% Cell type:code id: tags:
``` python
df = qry("SELECT * FROM spotify")
df
```
%% Output
id title song_name \
0 7pgJBLVz5VmnL7uGHmRj6p Pathology
1 0vSWgAlfpye0WCGeNmuNhy Symbiote
2 7EL7ifncK2PWFYThJjzR25 BRAINFOOD
3 1umsRbM7L4ju7rn9aU8Ju6 Sacrifice
4 4SKqOHKYU5pgHr5UiVKiQN Backpack
... ... ... ...
35872 46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle
35873 0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist
35874 72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020
35875 6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle
35876 6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020
genre duration_ms key mode time_signature tempo \
0 Dark Trap 224427 8 1 4 115.080
1 Dark Trap 98821 5 1 4 218.050
2 Dark Trap 101172 8 1 4 189.938
3 Dark Trap 96062 10 0 4 139.990
4 Dark Trap 135079 5 1 4 128.014
... ... ... ... ... ... ...
35872 hardstyle 269208 4 1 4 150.013
35873 hardstyle 210112 0 0 4 149.928
35874 hardstyle 234823 8 1 4 154.935
35875 hardstyle 323200 6 0 4 150.042
35876 hardstyle 162161 9 1 4 155.047
acousticness danceability energy instrumentalness liveness \
0 0.401000 0.719 0.493 0.000000 0.1180
1 0.013800 0.850 0.893 0.000004 0.3720
2 0.187000 0.864 0.365 0.000000 0.1160
3 0.145000 0.767 0.576 0.000003 0.0968
4 0.007700 0.765 0.726 0.000000 0.6190
... ... ... ... ... ...
35872 0.031500 0.528 0.693 0.000345 0.1210
35873 0.022500 0.517 0.768 0.000018 0.2050
35874 0.026000 0.361 0.821 0.000242 0.3850
35875 0.000551 0.477 0.921 0.029600 0.0575
35876 0.001890 0.529 0.945 0.000055 0.4140
loudness speechiness valence
0 -7.230 0.0794 0.1240
1 -4.783 0.0623 0.0391
2 -10.219 0.0655 0.0478
3 -9.683 0.2560 0.1870
4 -5.580 0.1910 0.2700
... ... ... ...
35872 -5.148 0.0304 0.3940
35873 -7.922 0.0479 0.3830
35874 -3.102 0.0505 0.1240
35875 -4.777 0.0392 0.4880
35876 -5.862 0.0615 0.1340
[35877 rows x 17 columns]
%% Cell type:markdown id: tags:
### Setting a column as row indices for the `DataFrame`
- Syntax: `df.set_index("<COLUMN>")`
- Returns a new `DataFrame`; the original is not modified, so assign the result back to a variable.
- WARNING: executing this twice raises a `KeyError`. Once a column becomes the row index, it is no longer a column of the `DataFrame`. If that happens, re-run the cell above to reload `df`, then run the cell below exactly once.
%% Cell type:code id: tags:
``` python
# Set the id column as row indices
df = df.set_index("id")
df
```
%% Output
title song_name genre \
id
7pgJBLVz5VmnL7uGHmRj6p Pathology Dark Trap
0vSWgAlfpye0WCGeNmuNhy Symbiote Dark Trap
7EL7ifncK2PWFYThJjzR25 BRAINFOOD Dark Trap
1umsRbM7L4ju7rn9aU8Ju6 Sacrifice Dark Trap
4SKqOHKYU5pgHr5UiVKiQN Backpack Dark Trap
... ... ... ...
46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle hardstyle
0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist hardstyle
72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 hardstyle
6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle hardstyle
6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 hardstyle
duration_ms key mode time_signature tempo \
id
7pgJBLVz5VmnL7uGHmRj6p 224427 8 1 4 115.080
0vSWgAlfpye0WCGeNmuNhy 98821 5 1 4 218.050
7EL7ifncK2PWFYThJjzR25 101172 8 1 4 189.938
1umsRbM7L4ju7rn9aU8Ju6 96062 10 0 4 139.990
4SKqOHKYU5pgHr5UiVKiQN 135079 5 1 4 128.014
... ... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013
0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928
72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935
6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042
6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047
acousticness danceability energy instrumentalness \
id
7pgJBLVz5VmnL7uGHmRj6p 0.401000 0.719 0.493 0.000000
0vSWgAlfpye0WCGeNmuNhy 0.013800 0.850 0.893 0.000004
7EL7ifncK2PWFYThJjzR25 0.187000 0.864 0.365 0.000000
1umsRbM7L4ju7rn9aU8Ju6 0.145000 0.767 0.576 0.000003
4SKqOHKYU5pgHr5UiVKiQN 0.007700 0.765 0.726 0.000000
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345
0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018
72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242
6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600
6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055
liveness loudness speechiness valence
id
7pgJBLVz5VmnL7uGHmRj6p 0.1180 -7.230 0.0794 0.1240
0vSWgAlfpye0WCGeNmuNhy 0.3720 -4.783 0.0623 0.0391
7EL7ifncK2PWFYThJjzR25 0.1160 -10.219 0.0655 0.0478
1umsRbM7L4ju7rn9aU8Ju6 0.0968 -9.683 0.2560 0.1870
4SKqOHKYU5pgHr5UiVKiQN 0.6190 -5.580 0.1910 0.2700
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.3940
0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.3830
72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.1240
6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.4880
6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.1340
[35877 rows x 16 columns]
%% Cell type:markdown id: tags:
### Not a Number
- `np.nan` is the floating-point representation of Not a Number (`np.NaN` is an older alias that NumPy 2.0 removed)
- You do not need to know / learn the details about the `numpy` package
### Replacing / modifying values within the `DataFrame`
Syntax: `df.replace(<TARGET>, <REPLACE>)`
- Your target can be `str`, `int`, `float`, or `None` (there are other possibilities, but those are too advanced for this course)
- Returns a new DataFrame object instance reference.
Let's now replace the missing values (empty strings) with `np.nan`
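As a small illustration of the `replace` syntax above, here is a sketch on a made-up two-row table (the `toy` frame and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: empty strings mark missing values
toy = pd.DataFrame({
    "title": ["", "Best of Hardstyle 2020"],
    "song_name": ["Backpack", ""],
})

cleaned = toy.replace("", np.nan)   # returns a new DataFrame
print(cleaned)

# The original DataFrame still holds the empty strings
print(toy.loc[0, "title"] == "")
```

Since `replace` returns a new object, the result has to be assigned (for example back to `df`) to keep the change.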
%% Cell type:code id: tags:
``` python
df = df.replace("", np.nan)  # np.nan also works on NumPy 2.x, where the np.NaN alias was removed
df.head(10) # title is the album name
```
%% Output
title song_name genre duration_ms \
id
7pgJBLVz5VmnL7uGHmRj6p NaN Pathology Dark Trap 224427
0vSWgAlfpye0WCGeNmuNhy NaN Symbiote Dark Trap 98821
7EL7ifncK2PWFYThJjzR25 NaN BRAINFOOD Dark Trap 101172
1umsRbM7L4ju7rn9aU8Ju6 NaN Sacrifice Dark Trap 96062
4SKqOHKYU5pgHr5UiVKiQN NaN Backpack Dark Trap 135079
3uE1swbcRp5BrO64UNy6Ex NaN TakingOutTheTrash Dark Trap 192833
3KJrwOuqiEwHq6QTreZT61 NaN Io sono qui Dark Trap 180880
4QhUXx4ON40TIBrZIlnIke NaN Murder Dark Trap 186261
09320vyX4qHd4GjHIpy5w0 NaN High 'N Mighty Dark Trap 124676
6xEnbXM1us9fDJy2LC0lru NaN Bang Ya Fucking Head Dark Trap 154929
key mode time_signature tempo acousticness \
id
7pgJBLVz5VmnL7uGHmRj6p 8 1 4 115.080 0.4010
0vSWgAlfpye0WCGeNmuNhy 5 1 4 218.050 0.0138
7EL7ifncK2PWFYThJjzR25 8 1 4 189.938 0.1870
1umsRbM7L4ju7rn9aU8Ju6 10 0 4 139.990 0.1450
4SKqOHKYU5pgHr5UiVKiQN 5 1 4 128.014 0.0077
3uE1swbcRp5BrO64UNy6Ex 11 1 4 120.004 0.1720
3KJrwOuqiEwHq6QTreZT61 10 0 4 128.066 0.0987
4QhUXx4ON40TIBrZIlnIke 0 1 4 114.956 0.0343
09320vyX4qHd4GjHIpy5w0 7 1 5 111.958 0.1120
6xEnbXM1us9fDJy2LC0lru 1 1 4 125.013 0.0525
danceability energy instrumentalness liveness \
id
7pgJBLVz5VmnL7uGHmRj6p 0.719 0.493 0.000000 0.1180
0vSWgAlfpye0WCGeNmuNhy 0.850 0.893 0.000004 0.3720
7EL7ifncK2PWFYThJjzR25 0.864 0.365 0.000000 0.1160
1umsRbM7L4ju7rn9aU8Ju6 0.767 0.576 0.000003 0.0968
4SKqOHKYU5pgHr5UiVKiQN 0.765 0.726 0.000000 0.6190
3uE1swbcRp5BrO64UNy6Ex 0.814 0.575 0.000291 0.1090
3KJrwOuqiEwHq6QTreZT61 0.812 0.813 0.000150 0.0758
4QhUXx4ON40TIBrZIlnIke 0.602 0.578 0.000000 0.1640
09320vyX4qHd4GjHIpy5w0 0.876 0.768 0.000012 0.2830
6xEnbXM1us9fDJy2LC0lru 0.690 0.760 0.000000 0.1340
loudness speechiness valence
id
7pgJBLVz5VmnL7uGHmRj6p -7.230 0.0794 0.1240
0vSWgAlfpye0WCGeNmuNhy -4.783 0.0623 0.0391
7EL7ifncK2PWFYThJjzR25 -10.219 0.0655 0.0478
1umsRbM7L4ju7rn9aU8Ju6 -9.683 0.2560 0.1870
4SKqOHKYU5pgHr5UiVKiQN -5.580 0.1910 0.2700
3uE1swbcRp5BrO64UNy6Ex -9.635 0.0635 0.2880
3KJrwOuqiEwHq6QTreZT61 -5.583 0.0984 0.3480
4QhUXx4ON40TIBrZIlnIke -5.610 0.0283 0.1560
09320vyX4qHd4GjHIpy5w0 -6.606 0.2010 0.7200
6xEnbXM1us9fDJy2LC0lru -5.431 0.0895 0.0797
%% Cell type:markdown id: tags:
### Checking for missing values
Syntax: `Series.isna()`
- Returns a boolean Series
Let's check whether any of the `song_name` values are missing
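A boolean `Series` from `isna()` can also be summarized directly. The sketch below uses a made-up `names` Series; `sum()` counts the `True` values and `any()` answers the yes/no question.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
names = pd.Series(["Pathology", np.nan, "Symbiote", np.nan], name="song_name")

missing = names.isna()          # boolean Series, True where the value is missing
print(missing.tolist())
print(int(missing.sum()))       # number of missing values
print(bool(missing.any()))      # is at least one value missing?
```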
%% Cell type:code id: tags:
``` python
df["song_name"].isna()
```
%% Output
id
7pgJBLVz5VmnL7uGHmRj6p False
0vSWgAlfpye0WCGeNmuNhy False
7EL7ifncK2PWFYThJjzR25 False
1umsRbM7L4ju7rn9aU8Ju6 False
4SKqOHKYU5pgHr5UiVKiQN False
...
46bXU7Sgj7104ZoXxzz9tM True
0he2ViGMUO3ajKTxLOfWVT True
72DAt9Lbpy9EUS29OzQLob True
6HXgExFVuE1c3cq9QjFCcU True
6MAAMZImxcvYhRnxDLTufD True
Name: song_name, Length: 35877, dtype: bool
%% Cell type:markdown id: tags:
### Review: `Pandas.Series.value_counts()`
- Returns a new `Series` with unique values from the original `Series` as keys and the count of those unique values as values.
- Return value `Series` is ordered using descending order of counts
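The two properties listed above (unique values as keys, counts in descending order) can be seen on a short made-up Series:

```python
import pandas as pd

# Hypothetical genre column
genres = pd.Series(["Rap", "Pop", "Rap", "Emo", "Rap", "Pop"])

counts = genres.value_counts()
print(counts)

# Look up a single count by its key
print(counts["Rap"])
```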
%% Cell type:code id: tags:
``` python
# count the number of missing values for song name
df["song_name"].isna().value_counts()
```
%% Output
False 18342
True 17535
Name: song_name, dtype: int64
%% Cell type:markdown id: tags:
### Missing value manipulation
Syntax: `df.fillna(<REPLACE>)`
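- Returns a new object instance reference (a `DataFrame` when called on a `DataFrame`, a `Series` when called on a single column); the original is not modified.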
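The following sketch, on a made-up one-column table, shows why the result of `fillna` has to be assigned back: calling it alone leaves the original column untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing song name
toy = pd.DataFrame({"song_name": ["Killer", np.nan]})

filled = toy["song_name"].fillna("No Song Name")    # new Series
missing_before = int(toy["song_name"].isna().sum()) # original still has the NaN

toy["song_name"] = filled                           # explicit update
missing_after = int(toy["song_name"].isna().sum())

print(missing_before, missing_after)
```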
%% Cell type:code id: tags:
``` python
# use .fillna to replace missing values
df["song_name"].fillna("No Song Name")
# to replace the original DataFrame's column, you need to explicitly update that object instance
df["song_name"] = df["song_name"].fillna("No Song Name")
df
```
%% Output
title song_name genre \
id
7pgJBLVz5VmnL7uGHmRj6p NaN Pathology Dark Trap
0vSWgAlfpye0WCGeNmuNhy NaN Symbiote Dark Trap
7EL7ifncK2PWFYThJjzR25 NaN BRAINFOOD Dark Trap
1umsRbM7L4ju7rn9aU8Ju6 NaN Sacrifice Dark Trap
4SKqOHKYU5pgHr5UiVKiQN NaN Backpack Dark Trap
... ... ... ...
46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle No Song Name hardstyle
0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist No Song Name hardstyle
72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 No Song Name hardstyle
6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle No Song Name hardstyle
6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 No Song Name hardstyle
duration_ms key mode time_signature tempo \
id
7pgJBLVz5VmnL7uGHmRj6p 224427 8 1 4 115.080
0vSWgAlfpye0WCGeNmuNhy 98821 5 1 4 218.050
7EL7ifncK2PWFYThJjzR25 101172 8 1 4 189.938
1umsRbM7L4ju7rn9aU8Ju6 96062 10 0 4 139.990
4SKqOHKYU5pgHr5UiVKiQN 135079 5 1 4 128.014
... ... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013
0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928
72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935
6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042
6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047
acousticness danceability energy instrumentalness \
id
7pgJBLVz5VmnL7uGHmRj6p 0.401000 0.719 0.493 0.000000
0vSWgAlfpye0WCGeNmuNhy 0.013800 0.850 0.893 0.000004
7EL7ifncK2PWFYThJjzR25 0.187000 0.864 0.365 0.000000
1umsRbM7L4ju7rn9aU8Ju6 0.145000 0.767 0.576 0.000003
4SKqOHKYU5pgHr5UiVKiQN 0.007700 0.765 0.726 0.000000
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345
0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018
72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242
6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600
6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055
liveness loudness speechiness valence
id
7pgJBLVz5VmnL7uGHmRj6p 0.1180 -7.230 0.0794 0.1240
0vSWgAlfpye0WCGeNmuNhy 0.3720 -4.783 0.0623 0.0391
7EL7ifncK2PWFYThJjzR25 0.1160 -10.219 0.0655 0.0478
1umsRbM7L4ju7rn9aU8Ju6 0.0968 -9.683 0.2560 0.1870
4SKqOHKYU5pgHr5UiVKiQN 0.6190 -5.580 0.1910 0.2700
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.3940
0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.3830
72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.1240
6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.4880
6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.1340
[35877 rows x 16 columns]
%% Cell type:markdown id: tags:
### Dropping missing values
Syntax: `df.dropna()`
- Returns a new DataFrame object instance reference.
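By default `dropna()` removes a row if any of its columns is missing. When only one column matters, the `subset` parameter of `dropna` restricts the check to the listed columns. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: some rows lack a title, some lack a song name
toy = pd.DataFrame({
    "title": [np.nan, "tech house", np.nan, "Hits"],
    "song_name": ["Backpack", np.nan, np.nan, "Killer"],
})

fully_complete = toy.dropna()                   # drop rows with a NaN anywhere
has_song_name = toy.dropna(subset=["song_name"])  # only require song_name

print(list(fully_complete.index))
print(list(has_song_name.index))
```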
%% Cell type:code id: tags:
``` python
# .dropna() drops every row that contains at least one NaN
df.dropna()
```
%% Output
title song_name genre \
id
5LzAV6KfjN8VhWCedeygfY Dirtybird Players No Song Name techhouse
3TsCb6ueD678XBJDiRrvhr tech house No Song Name techhouse
6Y0Fy2buEis7bEOlG0QET1 Tech House Bangerz No Song Name techhouse
4EJI2XGViSQp6WscLKgYDD tech house No Song Name techhouse
4x6VzOQTLIrkkCWcDPh5Y0 blanc | Tech House No Song Name techhouse
... ... ... ...
46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle No Song Name hardstyle
0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist No Song Name hardstyle
72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 No Song Name hardstyle
6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle No Song Name hardstyle
6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 No Song Name hardstyle
duration_ms key mode time_signature tempo \
id
5LzAV6KfjN8VhWCedeygfY 197499 7 1 4 127.997
3TsCb6ueD678XBJDiRrvhr 206000 10 1 4 124.994
6Y0Fy2buEis7bEOlG0QET1 199839 4 0 4 124.006
4EJI2XGViSQp6WscLKgYDD 173861 8 1 4 125.031
4x6VzOQTLIrkkCWcDPh5Y0 394960 8 0 4 127.029
... ... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013
0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928
72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935
6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042
6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047
acousticness danceability energy instrumentalness \
id
5LzAV6KfjN8VhWCedeygfY 0.000957 0.806 0.950 0.920000
3TsCb6ueD678XBJDiRrvhr 0.062300 0.729 0.978 0.908000
6Y0Fy2buEis7bEOlG0QET1 0.019100 0.724 0.792 0.812000
4EJI2XGViSQp6WscLKgYDD 0.053000 0.700 0.898 0.418000
4x6VzOQTLIrkkCWcDPh5Y0 0.000301 0.803 0.919 0.926000
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345
0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018
72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242
6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600
6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055
liveness loudness speechiness valence
id
5LzAV6KfjN8VhWCedeygfY 0.1130 -6.782 0.0811 0.580
3TsCb6ueD678XBJDiRrvhr 0.0353 -6.645 0.0420 0.778
6Y0Fy2buEis7bEOlG0QET1 0.1080 -8.555 0.0405 0.346
4EJI2XGViSQp6WscLKgYDD 0.5740 -6.099 0.2570 0.791
4x6VzOQTLIrkkCWcDPh5Y0 0.1020 -8.667 0.0702 0.754
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.394
0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.383
72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.124
6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.488
6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.134
[17529 rows x 16 columns]
%% Cell type:markdown id: tags:
### Review: `Pandas.Series.apply(...)`
Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
- applies input function to every element of the Series.
- Returns a new `Series` object instance reference.
Let's apply a transformation function to the `mode` column `Series`:
- mode = 1 means major modality (sounds happy)
- mode = 0 means minor modality (sounds sad)
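A small sketch of `apply` on a made-up Series follows. The helper name `to_label` is hypothetical; note that the function object itself is passed, without parentheses.

```python
import pandas as pd

# Hypothetical mode values
mode = pd.Series([1, 0, 1], name="mode")

def to_label(m):
    """Translate the numeric mode into a readable label."""
    return "major" if m == 1 else "minor"

labels = mode.apply(to_label)   # pass the function, do not call it
print(labels.tolist())
print(mode.tolist())            # the original Series is unchanged
```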
%% Cell type:code id: tags:
``` python
def replace_mode(m):
    if m == 1:
        return "major"
    else:
        return "minor"
```
%% Cell type:code id: tags:
``` python
df["mode"].apply(replace_mode)
```
%% Output
id
7pgJBLVz5VmnL7uGHmRj6p major
0vSWgAlfpye0WCGeNmuNhy major
7EL7ifncK2PWFYThJjzR25 major
1umsRbM7L4ju7rn9aU8Ju6 minor
4SKqOHKYU5pgHr5UiVKiQN major
...
46bXU7Sgj7104ZoXxzz9tM major
0he2ViGMUO3ajKTxLOfWVT minor
72DAt9Lbpy9EUS29OzQLob major
6HXgExFVuE1c3cq9QjFCcU minor
6MAAMZImxcvYhRnxDLTufD major
Name: mode, Length: 35877, dtype: object
%% Cell type:markdown id: tags:
### `lambda`
Let's write a `lambda` function instead of the `replace_mode` function
%% Cell type:code id: tags:
``` python
df["mode"].apply(lambda m: "major" if m == 1 else "minor")
```
%% Output
id
7pgJBLVz5VmnL7uGHmRj6p major
0vSWgAlfpye0WCGeNmuNhy major
7EL7ifncK2PWFYThJjzR25 major
1umsRbM7L4ju7rn9aU8Ju6 minor
4SKqOHKYU5pgHr5UiVKiQN major
...
46bXU7Sgj7104ZoXxzz9tM major
0he2ViGMUO3ajKTxLOfWVT minor
72DAt9Lbpy9EUS29OzQLob major
6HXgExFVuE1c3cq9QjFCcU minor
6MAAMZImxcvYhRnxDLTufD major
Name: mode, Length: 35877, dtype: object
%% Cell type:markdown id: tags:
Typically transformed columns are added as new columns within the DataFrame.
Let's add a new `modified_mode` column.
%% Cell type:code id: tags:
``` python
df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
df
```
%% Output
title song_name genre \
id
7pgJBLVz5VmnL7uGHmRj6p NaN Pathology Dark Trap
0vSWgAlfpye0WCGeNmuNhy NaN Symbiote Dark Trap
7EL7ifncK2PWFYThJjzR25 NaN BRAINFOOD Dark Trap
1umsRbM7L4ju7rn9aU8Ju6 NaN Sacrifice Dark Trap
4SKqOHKYU5pgHr5UiVKiQN NaN Backpack Dark Trap
... ... ... ...
46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle No Song Name hardstyle
0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist No Song Name hardstyle
72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 No Song Name hardstyle
6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle No Song Name hardstyle
6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 No Song Name hardstyle
duration_ms key mode time_signature tempo \
id
7pgJBLVz5VmnL7uGHmRj6p 224427 8 1 4 115.080
0vSWgAlfpye0WCGeNmuNhy 98821 5 1 4 218.050
7EL7ifncK2PWFYThJjzR25 101172 8 1 4 189.938
1umsRbM7L4ju7rn9aU8Ju6 96062 10 0 4 139.990
4SKqOHKYU5pgHr5UiVKiQN 135079 5 1 4 128.014
... ... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013
0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928
72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935
6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042
6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047
acousticness danceability energy instrumentalness \
id
7pgJBLVz5VmnL7uGHmRj6p 0.401000 0.719 0.493 0.000000
0vSWgAlfpye0WCGeNmuNhy 0.013800 0.850 0.893 0.000004
7EL7ifncK2PWFYThJjzR25 0.187000 0.864 0.365 0.000000
1umsRbM7L4ju7rn9aU8Ju6 0.145000 0.767 0.576 0.000003
4SKqOHKYU5pgHr5UiVKiQN 0.007700 0.765 0.726 0.000000
... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345
0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018
72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242
6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600
6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055
liveness loudness speechiness valence modified_mode
id
7pgJBLVz5VmnL7uGHmRj6p 0.1180 -7.230 0.0794 0.1240 major
0vSWgAlfpye0WCGeNmuNhy 0.3720 -4.783 0.0623 0.0391 major
7EL7ifncK2PWFYThJjzR25 0.1160 -10.219 0.0655 0.0478 major
1umsRbM7L4ju7rn9aU8Ju6 0.0968 -9.683 0.2560 0.1870 minor
4SKqOHKYU5pgHr5UiVKiQN 0.6190 -5.580 0.1910 0.2700 major
... ... ... ... ... ...
46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.3940 major
0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.3830 minor
72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.1240 major
6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.4880 minor
6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.1340 major
[35877 rows x 17 columns]
%% Cell type:markdown id: tags:
#### Let's go back to the original table from the SQL database
%% Cell type:code id: tags:
``` python
df = qry("SELECT * FROM spotify")
df
```
%% Output
id title song_name \
0 7pgJBLVz5VmnL7uGHmRj6p Pathology
1 0vSWgAlfpye0WCGeNmuNhy Symbiote
2 7EL7ifncK2PWFYThJjzR25 BRAINFOOD
3 1umsRbM7L4ju7rn9aU8Ju6 Sacrifice
4 4SKqOHKYU5pgHr5UiVKiQN Backpack
... ... ... ...
35872 46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle
35873 0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist
35874 72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020
35875 6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle
35876 6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020
genre duration_ms key mode time_signature tempo \
0 Dark Trap 224427 8 1 4 115.080
1 Dark Trap 98821 5 1 4 218.050
2 Dark Trap 101172 8 1 4 189.938
3 Dark Trap 96062 10 0 4 139.990
4 Dark Trap 135079 5 1 4 128.014
... ... ... ... ... ... ...
35872 hardstyle 269208 4 1 4 150.013
35873 hardstyle 210112 0 0 4 149.928
35874 hardstyle 234823 8 1 4 154.935
35875 hardstyle 323200 6 0 4 150.042
35876 hardstyle 162161 9 1 4 155.047
acousticness danceability energy instrumentalness liveness \
0 0.401000 0.719 0.493 0.000000 0.1180
1 0.013800 0.850 0.893 0.000004 0.3720
2 0.187000 0.864 0.365 0.000000 0.1160
3 0.145000 0.767 0.576 0.000003 0.0968
4 0.007700 0.765 0.726 0.000000 0.6190
... ... ... ... ... ...
35872 0.031500 0.528 0.693 0.000345 0.1210
35873 0.022500 0.517 0.768 0.000018 0.2050
35874 0.026000 0.361 0.821 0.000242 0.3850
35875 0.000551 0.477 0.921 0.029600 0.0575
35876 0.001890 0.529 0.945 0.000055 0.4140
loudness speechiness valence
0 -7.230 0.0794 0.1240
1 -4.783 0.0623 0.0391
2 -10.219 0.0655 0.0478
3 -9.683 0.2560 0.1870
4 -5.580 0.1910 0.2700
... ... ... ...
35872 -5.148 0.0304 0.3940
35873 -7.922 0.0479 0.3830
35874 -3.102 0.0505 0.1240
35875 -4.777 0.0392 0.4880
35876 -5.862 0.0615 0.1340
[35877 rows x 17 columns]
%% Cell type:markdown id: tags:
Extract just the "genre" and "duration_ms" columns from `df`.
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]]
```
%% Output
genre duration_ms
0 Dark Trap 224427
1 Dark Trap 98821
2 Dark Trap 101172
3 Dark Trap 96062
4 Dark Trap 135079
... ... ...
35872 hardstyle 269208
35873 hardstyle 210112
35874 hardstyle 234823
35875 hardstyle 323200
35876 hardstyle 162161
[35877 rows x 2 columns]
%% Cell type:markdown id: tags:
### `Pandas.DataFrame.groupby(...)`
Syntax: `DataFrame.groupby(<COLUMN>)`
- Returns a `DataFrameGroupBy` object instance reference
- The returned object holds the groups but no results yet; apply an aggregation method (for example `mean`, `max`, or `count`) to get a `DataFrame`
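To make the two points above concrete, here is a sketch on a made-up three-row table: the `groupby` call alone only builds the groups, and the aggregation produces the actual numbers.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "genre": ["Pop", "Rap", "Pop"],
    "duration_ms": [200000, 100000, 300000],
})

groups = toy.groupby("genre")
print(type(groups).__name__)    # a groupby object, not a table of results

means = groups.mean()           # aggregation: one row per genre
print(means)
```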
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]].groupby("genre")
```
%% Output
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbc472bad90>
%% Cell type:markdown id: tags:
### What is the average duration for each genre, in decreasing order of average?
#### v1: using `df` (`pandas`) to answer the question
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]].groupby("genre").mean()
```
%% Output
duration_ms
genre
Dark Trap 196059.938997
Emo 218370.989519
Hiphop 227885.028411
Pop 211558.052980
Rap 200816.798836
RnB 225628.556955
Trap Metal 145940.519467
Underground Rap 175506.191224
dnb 288860.138811
hardstyle 232828.626542
psytrance 445770.492075
techhouse 298395.587596
techno 399123.187453
trance 288729.366262
trap 225149.277731
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]].groupby("genre").mean().sort_values(by = "duration_ms", ascending = False)
```
%% Output
duration_ms
genre
psytrance 445770.492075
techno 399123.187453
techhouse 298395.587596
dnb 288860.138811
trance 288729.366262
hardstyle 232828.626542
Hiphop 227885.028411
RnB 225628.556955
trap 225149.277731
Emo 218370.989519
Pop 211558.052980
Rap 200816.798836
Dark Trap 196059.938997
Underground Rap 175506.191224
Trap Metal 145940.519467
%% Cell type:markdown id: tags:
One way to sanity-check the `groupby` result is to call `value_counts` on the same column `Series`, which shows how many rows fall into each group.
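As a sketch of that check on made-up data, the group sizes reported by `groupby(...).size()` should agree with `value_counts()` on the grouping column:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "genre": ["Pop", "Rap", "Pop", "Emo"],
    "duration_ms": [200000, 100000, 300000, 150000],
})

by_value_counts = toy["genre"].value_counts()
by_groupby = toy.groupby("genre").size()

# Same counts per genre, even though the row order may differ
same_counts = by_value_counts.to_dict() == by_groupby.to_dict()
print(same_counts)
```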
%% Cell type:code id: tags:
``` python
df["genre"].value_counts()
```
%% Output
Underground Rap 4330
Dark Trap 3590
Hiphop 3027
trance 2804
psytrance 2650
techno 2646
dnb 2507
trap 2362
hardstyle 2351
techhouse 2209
RnB 1905
Trap Metal 1875
Emo 1622
Rap 1546
Pop 453
Name: genre, dtype: int64
%% Cell type:markdown id: tags:
### What is the average duration for each genre, in decreasing order of average?
#### v2: using SQL query to answer the question
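The correspondence between the pandas and SQL versions can be tried without `spotify.db`. The sketch below assumes an in-memory SQLite database and a made-up table named `songs`; neither is part of the lecture data.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data loaded into an in-memory database
toy = pd.DataFrame({
    "genre": ["Pop", "Rap", "Pop", "Emo"],
    "duration_ms": [200000, 100000, 300000, 400000],
})
conn = sqlite3.connect(":memory:")
toy.to_sql("songs", conn, index=False)

# pandas: group, aggregate, then sort
from_pandas = (
    toy.groupby("genre").mean().sort_values(by="duration_ms", ascending=False)
)

# SQL: GROUP BY + AVG + ORDER BY, then move genre into the index
from_sql = pd.read_sql("""
    SELECT genre, AVG(duration_ms) AS avg_duration
    FROM songs
    GROUP BY genre
    ORDER BY avg_duration DESC
""", conn).set_index("genre")
conn.close()

print(from_pandas)
print(from_sql)
```

The only remaining difference is the column name (`duration_ms` versus the alias `avg_duration`).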
%% Cell type:code id: tags:
``` python
# SQL equivalent query of the above Pandas query
avg_duration_per_genre = qry("""
SELECT genre, AVG(duration_ms) as avg_duration
FROM spotify
GROUP BY genre
ORDER BY avg_duration DESC
""")
# How can we make the SQL query output exactly match the df.groupby output?
avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
avg_duration_per_genre
```
%% Output
avg_duration
genre
psytrance 445770.492075
techno 399123.187453
techhouse 298395.587596
dnb 288860.138811
trance 288729.366262
hardstyle 232828.626542
Hiphop 227885.028411
RnB 225628.556955
trap 225149.277731
Emo 218370.989519
Pop 211558.052980
Rap 200816.798836
Dark Trap 196059.938997
Underground Rap 175506.191224
Trap Metal 145940.519467
%% Cell type:markdown id: tags:
### What is the average speechiness for each mode, time signature pair?
#### v1: pandas
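Grouping by a list of columns gives a result whose row index has one level per grouping column. The sketch below, on made-up data, also shows `reset_index()`, which turns those index levels back into ordinary columns, similar to the flat layout a SQL `GROUP BY` returns.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "mode": [0, 0, 1, 1],
    "time_signature": [4, 4, 3, 4],
    "speechiness": [0.25, 0.75, 0.5, 0.125],
})

grouped = toy.groupby(["mode", "time_signature"]).mean()
print(list(grouped.index.names))    # two index levels

flat = grouped.reset_index()        # index levels become regular columns
print(list(flat.columns))
```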
%% Cell type:code id: tags:
``` python
# use a list to name all the columns you want to group by
df[["mode", "time_signature", "speechiness"]].groupby(["mode", "time_signature"]).mean()
```
%% Output
speechiness
mode time_signature
0 1 0.181224
3 0.121837
4 0.126688
5 0.204890
1 1 0.173138
3 0.129512
4 0.139170
5 0.220177
%% Cell type:code id: tags:
``` python
# SQL equivalent query of the above Pandas query
qry("""
SELECT mode, time_signature, AVG(speechiness) as avg_speechiness
FROM spotify
GROUP BY mode, time_signature
""")
```
%% Output
mode time_signature avg_speechiness
0 0 1 0.181224
1 0 3 0.121837
2 0 4 0.126688
3 0 5 0.204890
4 1 1 0.173138
5 1 3 0.129512
6 1 4 0.139170
7 1 5 0.220177
%% Cell type:markdown id: tags:
### Self-practice
%% Cell type:markdown id: tags:
### Which songs have a tempo greater than 150 and what are their genres?
%% Cell type:code id: tags:
``` python
# v1: pandas
fast_songs = df[df["tempo"] > 150]
fast_songs[["song_name", "genre"]]
```
%% Output
song_name genre
1 Symbiote Dark Trap
2 BRAINFOOD Dark Trap
18 FunnyToSeeYouHere Dark Trap
19 Killer Dark Trap
20 608 Dark Trap
... ... ...
35871 hardstyle
35872 hardstyle
35874 hardstyle
35875 hardstyle
35876 hardstyle
[13753 rows x 2 columns]
%% Cell type:code id: tags:
``` python
# v2: SQL
qry("""
SELECT song_name, genre
FROM spotify
WHERE tempo > 150
""")
```
%% Output
song_name genre
0 Symbiote Dark Trap
1 BRAINFOOD Dark Trap
2 FunnyToSeeYouHere Dark Trap
3 Killer Dark Trap
4 608 Dark Trap
... ... ...
13748 hardstyle
13749 hardstyle
13750 hardstyle
13751 hardstyle
13752 hardstyle
[13753 rows x 2 columns]
%% Cell type:markdown id: tags:
### For each "Hiphop" genre song, what is the sum of its danceability and liveness?
%% Cell type:code id: tags:
``` python
# v1: pandas
hiphop_songs = df[df["genre"] == "Hiphop"]
hiphop_songs["danceability"] + hiphop_songs["liveness"]
```
%% Output
15321 0.8416
15322 0.9201
15323 0.8580
15324 0.8240
15325 0.9348
...
18343 0.6690
18344 0.5370
18345 0.8850
18346 0.8770
18347 0.8703
Length: 3027, dtype: float64
%% Cell type:code id: tags:
``` python
# v2: SQL
hiphop_songs = qry("""
SELECT danceability + liveness as song_score
FROM spotify
WHERE genre = 'Hiphop'
""")
hiphop_songs["song_score"]
```
%% Output
0 0.8416
1 0.9201
2 0.8580
3 0.8240
4 0.9348
...
3022 0.6690
3023 0.5370
3024 0.8850
3025 0.8770
3026 0.8703
Name: song_score, Length: 3027, dtype: float64
%% Cell type:markdown id: tags:
### Find all `song_name` values in ascending order of `duration_ms`. Exclude songs that don't have a `song_name`
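One possible way to exclude the unnamed songs is a boolean mask applied before sorting. This is a sketch on made-up data in which, as in the raw table, a missing name is an empty string.

```python
import pandas as pd

# Hypothetical toy data: the second song has no name
toy = pd.DataFrame({
    "song_name": ["Killer", "", "608"],
    "duration_ms": [300, 100, 200],
})

named = toy[toy["song_name"] != ""]     # keep only rows with a name
ordered = list(named.sort_values(by="duration_ms")["song_name"])
print(ordered)
```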
%% Cell type:code id: tags:
``` python
# v1: pandas
songs_by_duration = list(df.sort_values(by = "duration_ms")["song_name"])
# [song for song in songs_by_duration if song != ""] # uncomment to see the output
```
%% Cell type:code id: tags:
``` python
# v2: SQL
songs_by_duration = qry("""
SELECT song_name
FROM spotify
ORDER BY duration_ms
""")
songs_by_duration = list(songs_by_duration["song_name"])
# [song for song in songs_by_duration if song != ""] # uncomment to see the output
```
%% Cell type:markdown id: tags:
### How many distinct "genre"s are there in the dataset?
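To get the count itself rather than the list of values, `Series.nunique()` can be used. A sketch on a made-up Series, alongside two equivalent counts:

```python
import pandas as pd

# Hypothetical genre column
genres = pd.Series(["Pop", "Rap", "Pop", "Emo"])

n_distinct = genres.nunique()       # number of distinct values
print(n_distinct)
print(len(genres.unique()))         # same count via the array of unique values
print(len(set(genres)))             # same count via a Python set
```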
%% Cell type:code id: tags:
``` python
# v1: pandas
list(set(list(df["genre"])))
```
%% Output
['trance',
'techno',
'dnb',
'Trap Metal',
'RnB',
'Pop',
'psytrance',
'techhouse',
'trap',
'Dark Trap',
'Emo',
'Underground Rap',
'Rap',
'Hiphop',
'hardstyle']
%% Cell type:code id: tags:
``` python
# v2: SQL
genres = qry("""
SELECT DISTINCT genre
FROM spotify
""")
list(genres["genre"])
```
%% Output
['Dark Trap',
'Underground Rap',
'Trap Metal',
'Emo',
'Rap',
'RnB',
'Pop',
'Hiphop',
'techhouse',
'techno',
'trance',
'psytrance',
'trap',
'dnb',
'hardstyle']
%% Cell type:markdown id: tags:
### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
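This question combines a row filter with a group filter. In SQL terms, `WHERE` removes rows before grouping and `HAVING` removes groups after aggregation. The sketch below assumes an in-memory SQLite database, a made-up `songs` table, and a threshold of 1 instead of 2000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (genre TEXT, energy REAL)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?)",
    [("Pop", 0.9), ("Pop", 0.6), ("Pop", 0.2), ("Rap", 0.8), ("Rap", 0.3)],
)

rows = conn.execute("""
    SELECT genre, COUNT(*) AS song_count, MAX(energy) AS energy_max
    FROM songs
    WHERE energy > 0.5          -- row filter: applied before grouping
    GROUP BY genre
    HAVING COUNT(*) > 1         -- group filter: applied after aggregation
""").fetchall()
conn.close()

print(rows)
```

Here "Rap" keeps only one high-energy row after the `WHERE` step, so the `HAVING` step drops that group.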
%% Cell type:code id: tags:
``` python
# v1: pandas
high_energy_songs = df[df["energy"] > 0.5]
genre_groups = high_energy_songs[["genre", "energy"]].groupby("genre")
max_energy = genre_groups.max()
max_energy["energy"]
```
%% Output
genre
Dark Trap 0.998
Emo 0.995
Hiphop 0.978
Pop 0.977
Rap 0.980
RnB 0.974
Trap Metal 0.999
Underground Rap 0.997
dnb 0.999
hardstyle 0.999
psytrance 0.999
techhouse 0.999
techno 1.000
trance 1.000
trap 1.000
Name: energy, dtype: float64
%% Cell type:code id: tags:
``` python
genre_counts = genre_groups.count()
genre_counts["energy_max"] = max_energy["energy"]
filtered_genre_counts = genre_counts[genre_counts["energy"] > 2000]
filtered_genre_counts
```
%% Output
energy energy_max
genre
Dark Trap 2757 0.998
Hiphop 2497 0.978
Underground Rap 3420 0.997
dnb 2496 0.999
hardstyle 2345 0.999
psytrance 2642 0.999
techhouse 2164 0.999
techno 2534 1.000
trance 2786 1.000
trap 2346 1.000
%% Cell type:code id: tags:
``` python
# v2: SQL
qry("""
SELECT genre, COUNT(*) as song_count, MAX(energy) as energy_max
FROM spotify
WHERE energy > 0.5
GROUP BY genre
HAVING song_count > 2000
""")
```
%% Output
genre song_count energy_max
0 Dark Trap 2757 0.998
1 Hiphop 2497 0.978
2 Underground Rap 3420 0.997
3 dnb 2496 0.999
4 hardstyle 2345 0.999
5 psytrance 2642 0.999
6 techhouse 2164 0.999
7 techno 2534 1.000
8 trance 2786 1.000
9 trap 2346 1.000
%% Cell type:code id: tags:
``` python
# Close the database connection here
conn.close()
```
%% Cell type:code id: tags:
``` python
# known import statements
import pandas as pd
import sqlite3 as sql # note that we are renaming to sql
import os
# new import statement
import numpy as np
```
%% Cell type:markdown id: tags:
# Lecture 35 Pandas 3: Data Transformation
* Data transformation is the process of changing the format, structure, or values of data.
* Often needed during data cleaning and sometimes during data analysis
%% Cell type:markdown id: tags:
# Today's Learning Objectives:
* Set a column as the index of a pandas `DataFrame`
* Identify, drop, or fill missing values (`np.nan`) using pandas `isna`, `dropna`, and `fillna`
* Apply transformations to a `DataFrame`:
  * Use `apply` on a pandas `Series` to apply a transformation function
  * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
* Convert `.groupby` examples to SQL
* Solve the same question using SQL and pandas `DataFrame` manipulations:
  * filtering, grouping, and aggregation / summarization
%% Cell type:markdown id: tags:
# The dataset: Spotify songs
Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
%% Cell type:markdown id: tags:
### WARMUP 1: Establish a connection to the spotify.db database
%% Cell type:code id: tags:
``` python
# open up the spotify database
db_pathname = "spotify.db"
assert ???
conn = sql.connect(db_pathname)
```
%% Cell type:code id: tags:
``` python
def qry(sql):
    return pd.read_sql(sql, conn)
```
%% Cell type:markdown id: tags:
### WARMUP 2: Identify the table name(s) inside the database
%% Cell type:code id: tags:
``` python
df = qry("")
df
```
%% Cell type:markdown id: tags:
### WARMUP 3: Use a pandas lookup expression to extract the "sql" column and display the full query using an `.iloc` lookup
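The kind of lookup this warmup describes can be tried on a throwaway database. The sketch below assumes an in-memory SQLite database with a made-up `songs` table. SQLite keeps one row per table in its built-in `sqlite_master` table, including the original `CREATE TABLE` text in the `sql` column.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (id TEXT, genre TEXT)")

# One row per table, with "name" and "sql" among the columns
tables = pd.read_sql("SELECT * FROM sqlite_master WHERE type = 'table'", conn)
conn.close()

table_names = list(tables["name"])
create_stmt = tables["sql"].iloc[0]     # positional lookup of the first row
print(table_names)
print(create_stmt)
```

Printing the string, rather than displaying the `DataFrame`, avoids the column being truncated in the notebook output.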
%% Cell type:code id: tags:
``` python
print()
```
%% Cell type:markdown id: tags:
### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
%% Cell type:code id: tags:
``` python
df = qry("")
df
```
%% Cell type:markdown id: tags:
### Setting a column as row indices for the `DataFrame`
- Syntax: `df.set_index("<COLUMN>")`
- Returns a new DataFrame object instance reference.
- WARNING: executing this cell twice raises a `KeyError`. Once you set a column as the row index, it is no longer a column within the `DataFrame`. If that happens, re-run the cell above to reload `df`, then execute the cell below exactly once.
%% Cell type:code id: tags:
``` python
# Set the id column as row indices
df =
df
```
%% Cell type:markdown id: tags:
### Not a Number
- `np.nan` is the floating-point representation of Not a Number (`np.NaN` is an older alias that NumPy 2.0 removed)
- You do not need to know / learn the details about the `numpy` package
### Replacing / modifying values within the `DataFrame`
Syntax: `df.replace(<TARGET>, <REPLACE>)`
- Your target can be `str`, `int`, `float`, or `None` (there are other possibilities, but those are too advanced for this course)
- Returns a new DataFrame object instance reference.
Let's now replace the missing values (empty strings) with `np.nan`
%% Cell type:code id: tags:
``` python
df =
df.head(10) # title is the album name
```
%% Cell type:markdown id: tags:
### Checking for missing values
Syntax: `Series.isna()`
- Returns a boolean Series
Let's check if any of the "song_name"(s) are missing
%% Cell type:code id: tags:
``` python
df["song_name"]
```
%% Cell type:markdown id: tags:
### Review: `Pandas.Series.value_counts()`
- Returns a new `Series` with unique values from the original `Series` as keys and the count of those unique values as values.
- Return value `Series` is ordered using descending order of counts
%% Cell type:code id: tags:
``` python
# count the number of missing values for song name
df["song_name"]
```
%% Cell type:markdown id: tags:
### Missing value manipulation
Syntax: `df.fillna(<REPLACE>)`
- Returns a new DataFrame object instance reference.
%% Cell type:code id: tags:
``` python
# use .fillna to replace missing values
df["song_name"]
# to replace the original DataFrame's column, you need to explicitly update that object instance
# TODO: uncomment the below lines and update the code
#df["song_name"] = ???
#df
```
%% Cell type:markdown id: tags:
### Dropping missing values
Syntax: `df.dropna()`
- Returns a new DataFrame object instance reference.
%% Cell type:code id: tags:
``` python
# .dropna will drop all rows that contain NaN in them
df.dropna()
```
%% Cell type:markdown id: tags:
### Review: `Pandas.Series.apply(...)`
Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
- applies input function to every element of the Series.
- Returns a new `Series` object instance reference.
Let's apply a transformation function to the `mode` column `Series`:
- mode = 1 means major modality (sounds happy)
- mode = 0 means minor modality (sounds sad)
%% Cell type:code id: tags:
``` python
def replace_mode(m):
    if m == 1:
        return "major"
    else:
        return "minor"
```
%% Cell type:code id: tags:
``` python
df["mode"]
```
%% Cell type:markdown id: tags:
### `lambda`
Let's write a `lambda` function instead of the `replace_mode` function
%% Cell type:code id: tags:
``` python
df["mode"].apply(???)
```
%% Cell type:markdown id: tags:
Typically transformed columns are added as new columns within the DataFrame.
Let's add a new `modified_mode` column.
%% Cell type:code id: tags:
``` python
df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
df
```
%% Cell type:markdown id: tags:
#### Let's go back to the original table from the SQL database
%% Cell type:code id: tags:
``` python
df = qry("SELECT * FROM spotify")
df
```
%% Cell type:markdown id: tags:
Extract just the "genre" and "duration_ms" columns from `df`.
%% Cell type:code id: tags:
``` python
df[???]
```
%% Cell type:markdown id: tags:
### `Pandas.DataFrame.groupby(...)`
Syntax: `DataFrame.groupby(<COLUMN>)`
- Returns a `groupby` object instance reference
- Need to apply aggregation methods to use the return value of `groupby`
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]]
```
%% Cell type:markdown id: tags:
### What is the average duration for each genre, in decreasing order of average?
#### v1: using `df` (`pandas`) to answer the question
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]]
```
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]]
```
%% Cell type:markdown id: tags:
One way to check whether `groupby` works would be to use `value_counts` on the same column `Series`.
%% Cell type:code id: tags:
``` python
df["genre"].value_counts()
```
%% Cell type:markdown id: tags:
### What is the average duration for each genre, in decreasing order of average?
#### v2: using SQL query to answer the question
%% Cell type:code id: tags:
``` python
# SQL equivalent query of the above Pandas query
avg_duration_per_genre = qry("""
""")
# How can we make the SQL query output exactly match the df.groupby output?
avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
avg_duration_per_genre
```
%% Cell type:markdown id: tags:
### What is the average speechiness for each (mode, time signature) pair?
#### v1: pandas
%% Cell type:code id: tags:
``` python
# use a list to indicate all the columns you want to groupby
```
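%% Cell type:markdown id: tags:
A minimal sketch of grouping by more than one column, again on hypothetical toy data. Passing a list of column names to `groupby` creates one group per distinct combination of values, and the result is indexed by both columns.
%% Cell type:code id: tags:
``` python
import pandas as pd

# Toy data (hypothetical values), same column names as the lecture table.
toy = pd.DataFrame({
    "mode": [1, 1, 0, 0, 1],
    "time_signature": [4, 4, 4, 3, 3],
    "speechiness": [0.10, 0.30, 0.20, 0.40, 0.50],
})

# A list of columns means one group per (mode, time_signature) combination.
avg_speechiness = toy.groupby(["mode", "time_signature"])["speechiness"].mean()
print(avg_speechiness)

# The result has a two-level index, so a single group is looked up with a tuple.
print(avg_speechiness.loc[(1, 4)])
```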
%% Cell type:code id: tags:
``` python
# SQL equivalent query of the above Pandas query
qry("""
""")
```
%% Cell type:markdown id: tags:
### Self-practice
%% Cell type:markdown id: tags:
### Which songs have a tempo greater than 150 and what are their genres?
%% Cell type:code id: tags:
``` python
# v1: pandas
fast_songs =
```
%% Cell type:code id: tags:
``` python
# v2: SQL
qry("""
""")
```
%% Cell type:markdown id: tags:
### What is the sum of danceability and liveness for "Hiphop" genre songs?
%% Cell type:code id: tags:
``` python
# v1: pandas
hiphop_songs =
```
%% Cell type:code id: tags:
``` python
# v2: SQL
hiphop_songs = qry("""
""")
hiphop_songs
```
%% Cell type:markdown id: tags:
### Find all `song_name` values in ascending order of `duration_ms`. Exclude songs that don't have a `song_name`.
%% Cell type:code id: tags:
``` python
# v1: pandas
songs_by_duration =
```
%% Cell type:code id: tags:
``` python
# v2: SQL
songs_by_duration = qry("""
""")
songs_by_duration
```
%% Cell type:markdown id: tags:
### How many distinct "genre"s are there in the dataset?
%% Cell type:code id: tags:
``` python
# v1: pandas
```
%% Cell type:code id: tags:
``` python
# v2: SQL
genres = qry("""
""")
```
%% Cell type:markdown id: tags:
### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
%% Cell type:code id: tags:
``` python
genre_groups =
```
%% Cell type:code id: tags:
``` python
# v1: pandas
high_energy_songs = ???
genre_groups = ???
max_energy = ???
max_energy["energy"]
```
%% Cell type:code id: tags:
``` python
genre_counts = ???
genre_counts["energy_max"] = max_energy["energy"]
filtered_genre_counts = ???
filtered_genre_counts
```
%% Cell type:code id: tags:
``` python
# v2: SQL
qry("""
""")
```
%% Cell type:code id: tags:
``` python
# Close the database connection here
```