Commit 45a19866 authored by Andy Kuemmel
Upload New File

parent 2c76af22
%% Cell type:code id: tags:
``` python
# known import statements
import pandas as pd
import sqlite3 as sql # note that we are renaming to sql
import os
# new import statement
import numpy as np
```
%% Cell type:markdown id: tags:
# Lecture 35 Pandas 3: Data Transformation
* Data transformation is the process of changing the format, structure, or values of data.
* Often needed during data cleaning and sometimes during data analysis
Possible data transformations:
* Parsing/Extraction
    * Parse CSV to Pandas DataFrame
* Missing Value Manipulations
    * Dropping
    * Imputation: replace missing values with substitute values
* Typecasting, Formatting, Renaming
    * Typecasting: convert a column from int to float
    * Formatting: convert the time column to datetime objects
    * Renaming: rename column and index names
* Applying/Mapping
* Filtering, Aggregation, Grouping, and Summarization
    * Covered in Pandas 1 & 2 lectures
%% Cell type:markdown id: tags:
# Today's Learning Objectives:
* Identify, drop, or fill missing values with Pandas .isna, .dropna, and .fillna
* Apply a function to Pandas Series and DataFrame rows/columns
* Replace all target values in Pandas Series and DataFrame rows/columns
* Filter, Aggregate, Group, and Summarize information in a DataFrame with .groupby
* Convert .groupby examples to SQL
%% Cell type:markdown id: tags:
# The dataset: Spotify songs
Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.
If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail.
%% Cell type:markdown id: tags:
### WARMUP 1: Establish a connection to the spotify.db database
%% Cell type:code id: tags:
``` python
# open up the spotify database
db_pathname = "spotify.db"
assert os.path.exists(db_pathname)
conn = sql.connect(db_pathname)
```
%% Cell type:code id: tags:
``` python
def qry(query):
    # the parameter is named query so it does not shadow the sqlite3 alias `sql`
    return pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
### WARMUP 2: Identify the table name(s) inside the database
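As a rough illustration of what `sqlite_master` holds, here is a minimal sketch that uses a throwaway in-memory database instead of `spotify.db`. The table name `songs` and its columns are made up for this example.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for spotify.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE songs (id TEXT PRIMARY KEY, genre TEXT, tempo REAL)")

# sqlite_master has one row per table or index;
# its "sql" column stores the CREATE statement for each table
master = pd.read_sql("SELECT type, name, sql FROM sqlite_master", demo_conn)

# Keep only the rows that describe tables (the PRIMARY KEY also adds an index row)
tables = master[master["type"] == "table"]
table_names = tables["name"].tolist()
create_statement = tables["sql"].iloc[0]
print(table_names)
print(create_statement)
```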
%% Cell type:code id: tags:
``` python
qry("select * from sqlite_master")
```
%% Output
type name tbl_name rootpage \
0 table spotify spotify 1527
1 index sqlite_autoindex_spotify_1 spotify 1528
sql
0 CREATE TABLE spotify(\nid TEXT PRIMARY KEY,\nt...
1 None
%% Cell type:markdown id: tags:
### WARMUP 3: Use pandas lookup expression to identify the column names and the types: use .iloc
%% Cell type:code id: tags:
``` python
print(qry("select * from sqlite_master")["sql"].iloc[0])
```
%% Output
CREATE TABLE spotify(
id TEXT PRIMARY KEY,
title BLOB,
song_name BLOB,
genre TEXT,
duration_ms INTEGER,
key INTEGER,
mode INTEGER,
time_signature INTEGER,
tempo REAL,
acousticness REAL,
danceability REAL,
energy REAL,
instrumentalness REAL,
liveness REAL,
loudness REAL,
speechiness REAL,
valence REAL)
%% Cell type:markdown id: tags:
### WARMUP 4: Store the data from the `spotify` table in a variable called `df`
%% Cell type:code id: tags:
``` python
df = qry("select * from spotify")
df
```
%% Output
id title song_name \
0 7pgJBLVz5VmnL7uGHmRj6p Pathology
1 0vSWgAlfpye0WCGeNmuNhy Symbiote
2 7EL7ifncK2PWFYThJjzR25 BRAINFOOD
3 1umsRbM7L4ju7rn9aU8Ju6 Sacrifice
4 4SKqOHKYU5pgHr5UiVKiQN Backpack
... ... ... ...
35872 46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle
35873 0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist
35874 72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020
35875 6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle
35876 6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020
genre duration_ms key mode time_signature tempo \
0 Dark Trap 224427 8 1 4 115.080
1 Dark Trap 98821 5 1 4 218.050
2 Dark Trap 101172 8 1 4 189.938
3 Dark Trap 96062 10 0 4 139.990
4 Dark Trap 135079 5 1 4 128.014
... ... ... ... ... ... ...
35872 hardstyle 269208 4 1 4 150.013
35873 hardstyle 210112 0 0 4 149.928
35874 hardstyle 234823 8 1 4 154.935
35875 hardstyle 323200 6 0 4 150.042
35876 hardstyle 162161 9 1 4 155.047
acousticness danceability energy instrumentalness liveness \
0 0.401000 0.719 0.493 0.000000 0.1180
1 0.013800 0.850 0.893 0.000004 0.3720
2 0.187000 0.864 0.365 0.000000 0.1160
3 0.145000 0.767 0.576 0.000003 0.0968
4 0.007700 0.765 0.726 0.000000 0.6190
... ... ... ... ... ...
35872 0.031500 0.528 0.693 0.000345 0.1210
35873 0.022500 0.517 0.768 0.000018 0.2050
35874 0.026000 0.361 0.821 0.000242 0.3850
35875 0.000551 0.477 0.921 0.029600 0.0575
35876 0.001890 0.529 0.945 0.000055 0.4140
loudness speechiness valence
0 -7.230 0.0794 0.1240
1 -4.783 0.0623 0.0391
2 -10.219 0.0655 0.0478
3 -9.683 0.2560 0.1870
4 -5.580 0.1910 0.2700
... ... ... ...
35872 -5.148 0.0304 0.3940
35873 -7.922 0.0479 0.3830
35874 -3.102 0.0505 0.1240
35875 -4.777 0.0392 0.4880
35876 -5.862 0.0615 0.1340
[35877 rows x 17 columns]
%% Cell type:markdown id: tags:
### Setting a column as row indices for the `DataFrame`
- Syntax: `df.set_index("<COLUMN>")`
- Returns a new DataFrame object instance reference.
- WARNING: executing this twice raises a `KeyError`. Once a column becomes the row index, it is no longer a column in the `DataFrame`. If that happens, re-run the cell above to reload `df`, then run the cell below exactly once.
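The behavior described above can be sketched on a tiny DataFrame. The ids and song names here are invented for illustration only.

```python
import pandas as pd

# Toy data; ids and names are made up
toy = pd.DataFrame({"id": ["a1", "b2", "c3"], "song_name": ["X", "Y", "Z"]})

indexed = toy.set_index("id")      # returns a new DataFrame; toy itself is unchanged
remaining_columns = list(indexed.columns)
looked_up = indexed.loc["b2", "song_name"]   # rows can now be looked up by id

# A second set_index("id") fails: "id" is the index now, not a column
try:
    indexed.set_index("id")
    second_call_failed = False
except KeyError:
    second_call_failed = True

print(remaining_columns, looked_up, second_call_failed)
```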
%% Cell type:code id: tags:
``` python
# Set the id column as row indices
df = df.set_index("id")
df
```
%% Cell type:markdown id: tags:
### Not a Number
- `np.nan` is the floating-point representation of Not a Number (the older alias `np.NaN` was removed in NumPy 2.0)
- You do not need to know / learn the details about the `numpy` package
### Replacing / modifying values within the `DataFrame`
Syntax: `df.replace(<TARGET>, <REPLACE>)`
- Your target can be a `str`, `int`, `float`, or `None` (there are other possibilities, but those are too advanced for this course)
- Returns a new DataFrame object instance reference.
Let's now replace the missing values (empty strings) with `np.nan`
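Here is a small sketch of that replacement on invented data, assuming (as in the lecture dataset) that missing text is stored as an empty string.

```python
import numpy as np
import pandas as pd

# Toy data; empty strings stand in for missing values
toy = pd.DataFrame({
    "title": ["", "Album A", ""],
    "song_name": ["Song 1", "", "Song 3"],
})

cleaned = toy.replace("", np.nan)          # new DataFrame; toy keeps its empty strings
missing_per_column = cleaned.isna().sum()  # count of NaN values in each column
print(missing_per_column)
```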
%% Cell type:code id: tags:
``` python
df = df.replace("", np.nan)
df.head(10) # title is the album name
```
%% Cell type:markdown id: tags:
### Checking for missing values
Syntax: `Series.isna()`
- Returns a boolean Series
Let's check if any of the "song_name"(s) are missing
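A minimal sketch of `isna` on a hand-made Series (the values are invented). Because `True` counts as 1, summing the boolean Series gives the number of missing entries.

```python
import numpy as np
import pandas as pd

# Toy Series with two missing entries
names = pd.Series(["Song 1", np.nan, "Song 3", None])

mask = names.isna()          # boolean Series: True where the value is missing
mask_list = mask.tolist()
missing_count = int(mask.sum())
print(mask_list, missing_count)
```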
%% Cell type:code id: tags:
``` python
df["song_name"].isna()
```
%% Cell type:markdown id: tags:
### Review: `Pandas.Series.value_counts()`
- Returns a new `Series` whose index holds the unique values of the original `Series` and whose values are the counts of those unique values.
- The returned `Series` is sorted by count in descending order.
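As a quick reminder of the shape of the result, here is a sketch with made-up genre labels.

```python
import pandas as pd

# Toy Series; genre labels are invented
genres = pd.Series(["trap", "techno", "trap", "trap", "techno", "dnb"])

counts = genres.value_counts()
unique_values = counts.index.tolist()  # the unique values become the index
count_values = counts.tolist()         # the counts are the values, largest first
print(counts)
```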
%% Cell type:code id: tags:
``` python
# count the number of missing values for song name
df["song_name"].isna().value_counts()
```
%% Cell type:markdown id: tags:
### Missing value manipulation
Syntax: `df.fillna(<REPLACE>)`
- Returns a new DataFrame object instance reference.
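A minimal sketch of `fillna` on a single column. The placeholder text `"unknown"` is an arbitrary choice for this example, not a value taken from the lecture.

```python
import numpy as np
import pandas as pd

# Toy data with one missing song name
toy = pd.DataFrame({"song_name": ["Song 1", np.nan], "tempo": [120.0, 95.0]})

filled_names = toy["song_name"].fillna("unknown")   # returns a new Series
missing_before = int(toy["song_name"].isna().sum()) # toy still has the NaN here

# The DataFrame only changes once the result is assigned back
toy["song_name"] = filled_names
missing_after = int(toy["song_name"].isna().sum())
print(missing_before, missing_after)
```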
%% Cell type:code id: tags:
``` python
# use .fillna to replace missing values
df["song_name"]
# to replace the original DataFrame's column, you need to explicitly update that object instance
# TODO: uncomment the below lines and update the code
#df["song_name"] = ???
#df
```
%% Cell type:markdown id: tags:
### Dropping missing values
Syntax: `df.dropna()`
- Returns a new DataFrame object instance reference.
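The sketch below, on invented data, contrasts plain `dropna()` with the `subset` parameter, which restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Toy data; row 1 lacks a title, row 2 lacks a song name
toy = pd.DataFrame({
    "title": ["A", np.nan, "C"],
    "song_name": ["s1", "s2", np.nan],
    "tempo": [100.0, 110.0, 120.0],
})

all_complete = toy.dropna()                   # drops any row with at least one NaN
has_name = toy.dropna(subset=["song_name"])   # only checks the song_name column
print(all_complete.index.tolist(), has_name.index.tolist())
```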
%% Cell type:code id: tags:
``` python
# .dropna will drop all rows that contain NaN in them
df.dropna()
```
%% Cell type:markdown id: tags:
### Review: `Pandas.Series.apply(...)`
Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
- applies input function to every element of the Series.
- Returns a new `Series` object instance reference.
Let's apply a transformation function to the `mode` column `Series`:
- mode = 1 means major modality (sounds happy)
- mode = 0 means minor modality (sounds sad)
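The same pattern works for any one-argument function. As a separate sketch, the hypothetical helper `ms_to_minutes` below (not part of the lecture) converts a few `duration_ms` values to minutes.

```python
import pandas as pd

# First three duration_ms values from the table shown earlier
durations_ms = pd.Series([224427, 98821, 101172])

def ms_to_minutes(ms):
    # hypothetical helper: milliseconds to minutes, rounded to 2 decimals
    return round(ms / 60000, 2)

minutes = durations_ms.apply(ms_to_minutes)   # new Series; durations_ms is unchanged
print(minutes.tolist())
```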
%% Cell type:code id: tags:
``` python
def replace_mode(m):
    if m == 1:
        return "major"
    else:
        return "minor"
```
%% Cell type:code id: tags:
``` python
df["mode"].apply(replace_mode)
```
%% Cell type:markdown id: tags:
### `lambda` recap
Let's write a `lambda` function instead of the `replace_mode` function
%% Cell type:code id: tags:
``` python
df["mode"].apply(lambda m: "major" if m == 1 else "minor")
```
%% Cell type:markdown id: tags:
Typically transformed columns are added as new columns within the DataFrame.
Let's add a new `modified_mode` column.
%% Cell type:code id: tags:
``` python
df["modified_mode"] = df["mode"].apply(lambda m: "major" if m == 1 else "minor")
df
```
%% Cell type:markdown id: tags:
#### Let's go back to the original table from the SQL database
%% Cell type:code id: tags:
``` python
df = qry("SELECT * FROM spotify")
df
```
%% Cell type:markdown id: tags:
Extract just the "genre" and "duration_ms" columns from `df`.
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]]
```
%% Cell type:markdown id: tags:
### `Pandas.DataFrame.groupby(...)`
Syntax: `DataFrame.groupby(<COLUMN>)`
- Returns a `DataFrameGroupBy` object instance reference
- The groupby object holds no results by itself; apply an aggregation method (such as `.mean()`, `.sum()`, or `.count()`) to get a `DataFrame` back
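A minimal sketch of the two steps, grouping and then aggregating, on invented data.

```python
import pandas as pd

# Toy data; genres and durations are made up
toy = pd.DataFrame({
    "genre": ["trap", "techno", "trap", "techno"],
    "duration_ms": [100, 300, 200, 500],
})

groups = toy.groupby("genre")     # a DataFrameGroupBy object, nothing computed yet
group_type = type(groups).__name__

avg = groups.mean()               # aggregation returns a DataFrame indexed by genre
sorted_avg = avg.sort_values(by="duration_ms", ascending=False)
print(sorted_avg)
```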
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]].groupby("genre")
```
%% Cell type:markdown id: tags:
### What is the average duration for each genre ordered based on decreasing order of averages?
#### v1: using `df` (`pandas`) to answer the question
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]].groupby("genre").mean()
```
%% Cell type:code id: tags:
``` python
df[["genre", "duration_ms"]].groupby("genre").mean().sort_values(by="duration_ms", ascending=False)
```
%% Cell type:markdown id: tags:
One way to check whether `groupby` works would be to use `value_counts` on the same column `Series`.
%% Cell type:code id: tags:
``` python
df["genre"].value_counts()
```
%% Cell type:markdown id: tags:
### What is the average duration for each genre ordered based on decreasing order of averages?
#### v2: using SQL query to answer the question
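To see how a `groupby` lines up with `GROUP BY`, here is a sketch that loads invented data into an in-memory SQLite table (named `songs` for this example only) and computes the same averages both ways.

```python
import sqlite3
import pandas as pd

# Toy data; not the real spotify table
toy_songs = pd.DataFrame({
    "genre": ["trap", "techno", "trap", "techno"],
    "duration_ms": [100, 300, 200, 500],
})
demo_conn = sqlite3.connect(":memory:")
toy_songs.to_sql("songs", demo_conn, index=False)

# pandas version: group, average, sort descending
from_pandas = (toy_songs.groupby("genre")[["duration_ms"]]
               .mean()
               .sort_values(by="duration_ms", ascending=False))

# SQL version: GROUP BY + AVG + ORDER BY, then genre becomes the index
from_sql = pd.read_sql("""
    SELECT genre, AVG(duration_ms) AS duration_ms
    FROM songs
    GROUP BY genre
    ORDER BY AVG(duration_ms) DESC
""", demo_conn).set_index("genre")
print(from_sql)
```

Setting `genre` as the index is what makes the SQL result line up with the `groupby` result, since `groupby` moves the grouping column into the index.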
%% Cell type:code id: tags:
``` python
# SQL equivalent query of the above Pandas query
avg_duration_per_genre = qry("""
SELECT genre, AVG(duration_ms) AS duration_ms
FROM spotify
GROUP BY genre
ORDER BY AVG(duration_ms) DESC
""")
# How can we make the SQL query output exactly the same as df.groupby?
avg_duration_per_genre = avg_duration_per_genre.set_index("genre")
avg_duration_per_genre
```
%% Cell type:markdown id: tags:
### What is the average speechiness for each mode, time signature pair?
#### v1: pandas
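Grouping by more than one column can be sketched as follows, on invented values. Passing a list to `groupby` produces one group per distinct pair, and the result carries a two-level index.

```python
import pandas as pd

# Toy data; values are made up
toy = pd.DataFrame({
    "mode": [1, 1, 0, 0, 1],
    "time_signature": [4, 4, 4, 3, 3],
    "speechiness": [0.1, 0.3, 0.2, 0.4, 0.5],
})

# One group per (mode, time_signature) pair
avg = toy.groupby(["mode", "time_signature"])["speechiness"].mean()
index_names = list(avg.index.names)

# reset_index turns the two index levels back into ordinary columns
flat = avg.reset_index()
print(flat)
```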
%% Cell type:code id: tags:
``` python
# use a list to indicate all the columns you want to groupby
df[["mode", "time_signature", "speechiness"]].groupby(["mode", "time_signature"]).mean()
```
%% Cell type:code id: tags:
``` python
# SQL equivalent query of the above Pandas query
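qry("""
SELECT mode, time_signature, AVG(speechiness) AS speechiness
FROM spotify
GROUP BY mode, time_signature
""")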
```
%% Cell type:markdown id: tags:
### Self-practice
%% Cell type:markdown id: tags:
### Which songs have a tempo greater than 150 and what are their genres?
%% Cell type:code id: tags:
``` python
# v1: pandas
fast_songs =
```
%% Cell type:code id: tags:
``` python
# v2: SQL
qry("""
""")
```
%% Cell type:markdown id: tags:
### What is the sum of danceability and liveness for "Hiphop" genre songs?
%% Cell type:code id: tags:
``` python
# v1: pandas
hiphop_songs =
```
%% Cell type:code id: tags:
``` python
# v2: SQL
hiphop_songs = qry("""
""")
hiphop_songs
```
%% Cell type:markdown id: tags:
### Find all song_name ordered by ascending order of duration_ms. Eliminate songs which don't have a song_name
%% Cell type:code id: tags:
``` python
# v1: pandas
songs_by_duration =
```
%% Cell type:code id: tags:
``` python
# v2
songs_by_duration = qry("""
""")
songs_by_duration
```
%% Cell type:markdown id: tags:
### How many distinct "genre"s are there in the dataset?
%% Cell type:code id: tags:
``` python
# v1: pandas
```
%% Cell type:code id: tags:
``` python
# v2: SQL
genres = qry("""
""")
```
%% Cell type:markdown id: tags:
### Considering only songs with energy greater than 0.5, what is the maximum energy for each "genre" with song count greater than 2000?
%% Cell type:code id: tags:
``` python
genre_groups =
```
%% Cell type:code id: tags:
``` python
# v1: pandas
high_energy_songs = ???
genre_groups = ???
max_energy = ???
max_energy["energy"]
```
%% Cell type:code id: tags:
``` python
genre_counts = ???
genre_counts["energy_max"] = max_energy["energy"]
filtered_genre_counts = ???
filtered_genre_counts
```
%% Cell type:code id: tags:
``` python
# v2: SQL
qry("""
""")
```