Commit ed64bcea authored by LOUIS TYRRELL OLIPHANT

finished lec 37 adv pandas

parent a5d56bc4
%% Cell type:code id: tags:
``` python
# known import statements
import pandas as pd
import sqlite3
import os
# new import statement
import numpy as np
```
%% Cell type:code id: tags:
``` python
# Get the Piazza data from 'piazza.db'
db_name = "piazza.db"
assert os.path.exists(db_name)
conn = sqlite3.connect(db_name)
def qry(sql):
    return pd.read_sql(sql, conn)
df = qry("""
SELECT *
FROM sqlite_master
WHERE type='table'
""")
print(df.iloc[0]['sql'])
```
%% Cell type:code id: tags:
``` python
piazza_df = pd.read_sql("""
SELECT *
FROM piazza
""", conn)
piazza_df.head(5)
```
%% Cell type:markdown id: tags:
## Warmup
%% Cell type:code id: tags:
``` python
# Warmup 1: Set the student id column as the index
piazza_df = piazza_df.set_index("student_id")
piazza_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2a: Which 10 students post the most?
```
%% Cell type:code id: tags:
``` python
# Warmup 2b: Can you plot their number of posts as a bar graph? Be sure to label your axes!
```
%% Cell type:code id: tags:
``` python
# Warmup 2c: How about with their name rather than their student id?
```
%% Cell type:code id: tags:
``` python
# Warmup 3a: Which people had more than 10 answers? Include all roles. Store the results in a dataframe named top_answers
```
%% Cell type:code id: tags:
``` python
# Warmup 3b: Plot this as a bar graph.
```
%% Cell type:code id: tags:
``` python
# Warmup 3c: Plot the contributions of the various roles as a bar graph.
top_answers["role"].value_counts().plot.bar()
```
%% Cell type:code id: tags:
``` python
# Warmup 3d: Can you get this same data using SQL?
qry("""
""")
```
%% Cell type:code id: tags:
``` python
# Warmup 3e: What about their average # of days online as well?
qry("""
""")
```
%% Cell type:code id: tags:
``` python
# Warmup 3f: Can we do that in Pandas as well?
# TODAY'S TOPIC
```
%% Cell type:markdown id: tags:
# Advanced Pandas
## Learning Objectives:
* Setting a column as the index of a pandas `DataFrame`
* Identify, drop, or fill missing values (`np.nan`) using the pandas methods `isna`, `dropna`, and `fillna`
* Applying transformations to a `DataFrame`:
    * Use `apply` on a pandas `Series` to apply a transformation function
    * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns
* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`
* Convert `groupby` examples to SQL
* Solving the same question using SQL and pandas `DataFrame` manipulations:
    * filtering, grouping, and aggregation / summarization
%% Cell type:code id: tags:
``` python
# Sort piazza_df by name column ... What do we notice?
```
%% Cell type:markdown id: tags:
### Not a Number
- `np.nan` is the floating-point representation of Not a Number (older notebooks spell it `np.NaN`, an alias that was removed in NumPy 2.0)
- You do not need to know / learn the details about the `numpy` package
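
A small sketch of how this value behaves may help before using it. The variable names below are illustrative and not part of the lecture code.

```python
import numpy as np
import pandas as pd

missing = np.nan

# NaN is an ordinary Python float, not a string or None.
is_float = isinstance(missing, float)

# NaN never compares equal to anything, including itself,
# so `==` cannot be used to detect it.
equals_itself = (missing == missing)

# pd.isna is the reliable way to test for a missing value.
detected = pd.isna(missing)

print(is_float, equals_itself, detected)
```

This is why the rest of the lecture relies on `isna` rather than on comparisons with `==`.
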
### Replacing / modifying values within the `DataFrame`
Syntax: `df.replace(<TARGET>, <REPLACE>)`
Let's now replace the missing values (empty strings) with `np.nan`.
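
As a minimal sketch of that syntax, the example below runs `replace` on a tiny hypothetical table. `toy_df` and its contents are made up for illustration and only stand in for `piazza_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for piazza_df: empty strings mark missing values.
toy_df = pd.DataFrame({
    "name": ["calm star", "", "bold river"],
    "email": ["calm_star@wisc.edu", "quiet_moon@wisc.edu", ""],
})

# replace returns a new DataFrame; toy_df itself is left unchanged.
cleaned_df = toy_df.replace("", np.nan)

print(cleaned_df)
```

Because `replace` returns a new object, the result has to be assigned back to a variable for the change to persist.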
%% Cell type:code id: tags:
``` python
# Let's replace these empty strings with this special value.
piazza_df = piazza_df.replace("", np.nan)
piazza_df
```
%% Cell type:code id: tags:
``` python
# Sort by name again... What do we notice?
```
%% Cell type:markdown id: tags:
### Checking for missing values
Syntax: `Series.isna()`
- Returns a boolean Series
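
One way to use the boolean Series is sketched below: summing it counts the missing entries, and two masks can be combined with `&` and `|`. The data is hypothetical, not taken from the real Piazza table.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the real Piazza table.
toy_df = pd.DataFrame({
    "name": ["calm star", np.nan, np.nan, "bold river"],
    "email": ["calm_star@wisc.edu", "quiet_moon@wisc.edu", np.nan, np.nan],
})

missing_name = toy_df["name"].isna()    # boolean Series
missing_email = toy_df["email"].isna()  # boolean Series

# True counts as 1, so sum() counts the missing entries.
n_missing_name = missing_name.sum()

# Combine masks element-wise with & (and) and | (or).
n_missing_both = (missing_name & missing_email).sum()
n_missing_either = (missing_name | missing_email).sum()

print(n_missing_name, n_missing_both, n_missing_either)
```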
%% Cell type:code id: tags:
``` python
# Run isna() on the name column
```
%% Cell type:code id: tags:
``` python
# How many people are missing a name?
```
%% Cell type:code id: tags:
``` python
# How many people are missing an email?
```
%% Cell type:code id: tags:
``` python
# How many people are missing both a name and email?
```
%% Cell type:code id: tags:
``` python
# How many people are missing either a name or email?
```
%% Cell type:code id: tags:
``` python
# So... What do we do?
# 1. Drop those rows
# 2. Interpolate / Best Guess
```
%% Cell type:code id: tags:
``` python
# Option 1: Drop those rows.
```
%% Cell type:code id: tags:
``` python
# Option 2a: Interpolate / Best Guess
```
%% Cell type:code id: tags:
``` python
# Create a function to take an email (e.g. "calm_star@wisc.edu")
# and return the name (e.g. "calm star")
def parse_name_from_email(email):
    if pd.isna(email):
        return np.nan
    else:
        # Keep the part before "@" and turn underscores into spaces.
        return email.split("@")[0].replace("_", " ")
# Test your function!
parse_name_from_email("calm_star@wisc.edu")
```
%% Cell type:markdown id: tags:
### Review: `pandas.Series.apply(...)`
Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`
- Applies the input function to every element of the `Series`
- Returns a new `Series`
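
A short sketch of the syntax, using a hypothetical helper `shout` that is not part of the lecture code:

```python
import pandas as pd

def shout(text):
    """Hypothetical helper: upper-case one string."""
    return text.upper()

roles = pd.Series(["student", "ta", "instructor"])

# Pass the function itself (no parentheses); pandas calls it once per element.
shouted = roles.apply(shout)

print(shouted.tolist())
```

Note that `roles` is not modified; `apply` builds and returns a separate `Series`.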
%% Cell type:code id: tags:
``` python
# Now, apply that function to each value in email!
piazza_df["guessed_name"] = piazza_df["email"].apply(parse_name_from_email)
piazza_df
```
%% Cell type:code id: tags:
``` python
# Create a function to take a name (e.g. "calm star")
# and return the email (e.g. "calm_star@wisc.edu")
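def parse_email_from_name(name):
    if pd.isna(name):
        return np.nan
    # Turn spaces into underscores and append the domain from the example.
    return name.replace(" ", "_") + "@wisc.edu"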
# Test your function!
parse_email_from_name("calm star")
```
%% Cell type:code id: tags:
``` python
# Now, apply that function to each value in name!
piazza_df["guessed_email"] = piazza_df["name"].apply(parse_email_from_name)
piazza_df
```
%% Cell type:markdown id: tags:
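### `pandas.DataFrame.apply(...)`
Syntax: `DataFrame.apply(<FUNCTION OBJECT REFERENCE>, axis=1)`
- `axis=1` means the function is applied to each row (the default, `axis=0`, applies it to each column)
- Returns a new `Series` when the function returns a single value per row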
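
The row-wise form can be sketched with a named function in place of a lambda. `toy_df` and `pick_name` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "name": ["calm star", np.nan],
    "guessed_name": ["calm star", "quiet moon"],
})

def pick_name(row):
    """Hypothetical helper: prefer the real name, fall back to the guess."""
    if pd.isna(row["name"]):
        return row["guessed_name"]
    return row["name"]

# With axis=1 each call receives one row as a Series indexed by column name.
filled = toy_df.apply(pick_name, axis=1)

print(filled.tolist())
```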
%% Cell type:code id: tags:
``` python
# If the name has a value, use it, otherwise use our best guess!
piazza_df["name"] = piazza_df.apply(lambda r : r["guessed_name"] if pd.isna(r["name"]) else r["name"], axis=1)
```
%% Cell type:code id: tags:
``` python
# Same thing for email!
piazza_df["email"] = piazza_df.apply(lambda r : r["guessed_email"] if pd.isna(r["email"]) else r["email"], axis=1)
```
%% Cell type:code id: tags:
``` python
help(piazza_df.drop)
```
%% Cell type:code id: tags:
``` python
# Drop the guessing columns
piazza_df = piazza_df.drop(columns=["guessed_name", "guessed_email"])
```
%% Cell type:code id: tags:
``` python
help(piazza_df.dropna)
```
%% Cell type:code id: tags:
``` python
# How many rows are missing data now?
# dropna() keeps only the complete rows, so subtract from the total.
len(piazza_df) - len(piazza_df.dropna())
```
%% Cell type:code id: tags:
``` python
help(piazza_df.fillna)
```
%% Cell type:code id: tags:
``` python
# Give a name of "anonymous" and email of "anonymous@wisc.edu"
# to anyone left with missing data.
piazza_df['name'] = piazza_df['name'].fillna('anonymous')
# Same for the email column.
piazza_df['email'] = piazza_df['email'].fillna('anonymous@wisc.edu')
```
%% Cell type:markdown id: tags:
### `pandas.DataFrame.groupby(...)`
Syntax: `DataFrame.groupby(<COLUMN>)`
- Returns a `DataFrameGroupBy` object
- Apply an aggregation function (e.g. `mean`, `sum`, `count`) to the `DataFrameGroupBy` object to get results
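
The sketch below shows one aggregation on a grouped table and a SQL query that should give the same numbers. Everything in it is hypothetical: the table `toy`, the column `answers`, and the in-memory SQLite connection `toy_conn` are assumptions made for illustration and do not reflect the real `piazza.db` schema.

```python
import sqlite3
import pandas as pd

# Hypothetical data; the column names are assumptions for illustration only.
toy_df = pd.DataFrame({
    "role": ["student", "student", "ta", "instructor"],
    "answers": [2, 4, 10, 7],
})

# pandas: split rows by role, then average the answers in each group.
pandas_means = toy_df.groupby("role")["answers"].mean()

# SQL: the same question against an in-memory SQLite copy of the table.
toy_conn = sqlite3.connect(":memory:")
toy_df.to_sql("toy", toy_conn, index=False)
sql_means = pd.read_sql("""
    SELECT role, AVG(answers) AS answers
    FROM toy
    GROUP BY role
    ORDER BY role
""", toy_conn).set_index("role")["answers"]
toy_conn.close()

print(pandas_means)
print(sql_means)
```

In this pairing, `groupby("role")` plays the part of `GROUP BY role`, and the aggregation method `mean()` plays the part of `AVG(...)`.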
%% Cell type:code id: tags:
``` python
# What does this return?
piazza_df.groupby("role")
```
%% Cell type:code id: tags:
``` python
# Try getting the "mean" of this groupby object.
piazza_df.groupby("role").mean(numeric_only=True)
```
%% Cell type:code id: tags:
``` python
# How many answers does the average instructor, student, and TA give?
```
%% Cell type:code id: tags:
``` python
# How would we write this in SQL?
qry("""
""")
```
%% Cell type:code id: tags:
``` python
# What is the total number of days spent online for instructors, students, and TAs?
# Order your answer from lowest to highest
```
%% Cell type:code id: tags:
``` python
# How would we write this in SQL?
qry("""
""")
```
%% Cell type:code id: tags:
``` python
# Of those individuals who spend less than 100 days online,
# how does their average number of posts compare to those that
# spend 100 days or more online? Do your analysis by role as well.
```
%% Cell type:code id: tags:
``` python
# How would we write this in SQL?
qry("""
""")
```
%% Cell type:code id: tags:
``` python
# What percentage of instructors, students, and TAs did not write a single answer,
# followup, or reply to a followup?
```
%% Cell type:code id: tags:
``` python
# How would we write this in SQL?
qry("""
""")
```
%% Cell type:code id: tags:
``` python
conn.close()
```