Commit 96be45cd authored by Cole Nelson

cole lec38

parent df91f45c
%% Cell type:code id: tags:
``` python
# Warmup 0
import sqlite3
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
from matplotlib import pyplot as plt
matplotlib.rcParams["font.size"] = 16
```
%% Cell type:code id: tags:
``` python
# Warmup 1: Write a function that converts any Fahrenheit temp to Celsius
# Note: The final exam will include selected material from earlier in the course
# C = (5/9) * (f-32)
def f_to_c(f):
    return (5/9) * (f-32)
# test it by making several calls
print(f_to_c(212))
print(f_to_c(32))
print(f_to_c(67))
```
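%% Cell type:markdown id: tags:
A quick way to test a conversion function is to compare it against reference points you already know. Here is a minimal sketch: it restates `f_to_c` and adds a hypothetical inverse, `c_to_f` (an assumption, not part of the lecture), so that a round trip can be checked. `math.isclose` is used because floating-point results are rarely exact.
%% Cell type:code id: tags:
```python
import math

def f_to_c(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (5 / 9) * (f - 32)

def c_to_f(c):
    """Hypothetical inverse helper (an assumption, not from the lecture)."""
    return c * 9 / 5 + 32

# known reference points: water boils at 212 F and freezes at 32 F
boiling_c = f_to_c(212)
freezing_c = f_to_c(32)

# converting to Celsius and back should return (almost exactly) the input
round_trip = c_to_f(f_to_c(67))

print(boiling_c, freezing_c, round_trip)
```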
%% Cell type:code id: tags:
``` python
# Warmup 2a: Save all the data from the "piazza" table to "piazza_df"
piazza_conn = sqlite3.connect("piazza.db")
piazza_df = pd.read_sql("SELECT * FROM piazza", piazza_conn)
piazza_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2b: Set the index of piazza_df to be student_id
piazza_df = piazza_df.set_index("student_id")
piazza_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2c: Add a column "total" to "piazza_df". This should be the sum of
# the number of posts, answers, edits, followups, and replies_to_followups
piazza_df["total"] = (piazza_df["posts"] + piazza_df["answers"] + piazza_df["edits"] + piazza_df["followups"] + piazza_df["replies_to_followups"])
# piazza_df["total"] = piazza_df.loc[:, "posts":"replies_to_followups"].sum(axis=1) # advanced way!
piazza_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2d: Create a new dataframe "contributors_df" which contains those
# that had more than 0 total contributions, and sort by this
# value from highest to lowest. Break ties by name in alphabetical order.
contributors_df = piazza_df[piazza_df["total"] > 0]
contributors_df = contributors_df.sort_values(["total", "name"], ascending=[False, True])
contributors_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2e: How would we have done this in sql?
pd.read_sql("""
SELECT *, posts + answers + edits + followups + replies_to_followups AS total
FROM piazza
WHERE total > 0
ORDER BY total DESC, name ASC
""", piazza_conn).set_index("student_id")
```
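%% Cell type:markdown id: tags:
SQLite accepts the `total` alias inside `WHERE`, but many other databases do not, so a more portable habit is to repeat the expression there. The sketch below checks that the SQL and pandas versions agree. It uses an in-memory database with made-up rows and only two of the count columns (both are simplifications, not the real `piazza.db`).
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# hypothetical miniature stand-in for the "piazza" table (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE piazza (student_id INTEGER, name TEXT, posts INTEGER, answers INTEGER)"
)
conn.executemany(
    "INSERT INTO piazza VALUES (?, ?, ?, ?)",
    [(1, "Cy", 2, 1), (2, "Ada", 0, 0), (3, "Bo", 3, 0), (4, "Al", 1, 2)],
)

# portable form: repeat the expression in WHERE instead of using the alias
sql_df = pd.read_sql("""
    SELECT *, posts + answers AS total
    FROM piazza
    WHERE posts + answers > 0
    ORDER BY total DESC, name ASC
""", conn).set_index("student_id")

# the same result built step by step with pandas
df = pd.read_sql("SELECT * FROM piazza", conn).set_index("student_id")
df["total"] = df["posts"] + df["answers"]
pandas_df = df[df["total"] > 0].sort_values(["total", "name"], ascending=[False, True])

conn.close()
print(sql_df)
```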
%% Cell type:code id: tags:
``` python
# Warmup 3a: Of those that contributed, what was their average number of contributions?
# Do your analysis by role (e.g. by ta, instructor, and student)
contributors_df.groupby("role")["total"].mean()
```
%% Cell type:code id: tags:
``` python
# Warmup 3b: How would we have done this in sql?
pd.read_sql("""
SELECT
role,
posts + answers + edits + followups + replies_to_followups AS total,
AVG(posts + answers + edits + followups + replies_to_followups) as avg_total
FROM piazza
WHERE total > 0
GROUP BY role
""", piazza_conn).set_index("role")["avg_total"]
```
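%% Cell type:markdown id: tags:
The key idea is that `WHERE` filters rows *before* `GROUP BY` forms the groups, just like boolean indexing happens before `groupby` in pandas. Here is a small sketch with made-up rows. To keep it short, `total` is stored as a real column in this toy table, which is an assumption that differs from the lecture's database.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# made-up rows standing in for the lecture's "piazza" table
rows = [("student", 4), ("student", 0), ("student", 2), ("ta", 10), ("ta", 6)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE piazza (role TEXT, total INTEGER)")
conn.executemany("INSERT INTO piazza VALUES (?, ?)", rows)

# SQL: filter rows, then group, then average within each group
sql_avg = pd.read_sql("""
    SELECT role, AVG(total) AS avg_total
    FROM piazza
    WHERE total > 0
    GROUP BY role
""", conn).set_index("role")["avg_total"]

# pandas: boolean indexing first, then groupby + mean
df = pd.DataFrame(rows, columns=["role", "total"])
pandas_avg = df[df["total"] > 0].groupby("role")["total"].mean()

conn.close()
print(sql_avg)
print(pandas_avg)
```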
%% Cell type:code id: tags:
``` python
# Warmup 4: What is the correlation between all of the columns?
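# numeric_only=True skips text columns such as name and role (required in pandas 2.0+)
contributors_df.corr(numeric_only=True)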
```
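%% Cell type:markdown id: tags:
To see what a correlation table contains, here is a tiny sketch with made-up numbers, assuming pandas 1.5 or newer (where `numeric_only` is available). `answers` always rises with `posts`, and `edits` always falls, so the correlations come out at the two extremes.
%% Cell type:code id: tags:
```python
import pandas as pd

# made-up data: answers is exactly 2 * posts, edits is exactly 5 - posts
df = pd.DataFrame({
    "name": ["Al", "Bo", "Cy", "Di"],
    "posts": [1, 2, 3, 4],
    "answers": [2, 4, 6, 8],
    "edits": [4, 3, 2, 1],
})

# text columns such as "name" are left out of the result
corr = df.corr(numeric_only=True)
print(corr)
```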
%% Cell type:code id: tags:
``` python
# Warmup 5: Close the connection.
piazza_conn.close()
```
%% Cell type:markdown id: tags:
# Plotting Applications
**Learning Objectives**
- Make a line plot on a series or on a DataFrame
- Apply features of line plots and bar plots to visualize results of data investigations
- Clean Series data by dropping NaN values and by converting to int
- Make a stacked bar plot
%% Cell type:markdown id: tags:
## Line plots
- `SERIES.plot.line()`
- `DATAFRAME.plot.line()` each column in the data frame becomes a line in the plot
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.line.html
%% Cell type:code id: tags:
``` python
# when you make a series from a list, the default indices are 0, 1, 2, ...
s = Series([1758, 2002, 2408, 2898, 3814, 4803, 5713, 6661, 7618, 8391, 8764]) # y values
s.plot.line()
```
%% Cell type:code id: tags:
``` python
# You can make a series from a list and add indices
s = Series([1758, 2002, 2408, 2898, 3814, 4803, 5713, 6661, 7618, 8391, 8764], \
index=[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
s.plot.line()
```
%% Cell type:code id: tags:
``` python
# We can save the AxesSubplot and "beautify" it like the other plots...
ax = s.plot.line()
ax.set_title("Number of Craft Breweries in the USA")
ax.set_xlabel("Year")
ax.set_ylabel("# Craft Breweries")
```
%% Cell type:code id: tags:
``` python
# Be careful! If the indices are out of order you get a mess
# pandas plots each (index, value) in the order given
s = Series([1758, 2408, 2898, 3814, 4803, 5713, 6661, 7618, 8391, 8764, 2002], \
index=[2010, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2011])
s.plot.line()
s
```
%% Cell type:code id: tags:
``` python
# you can fix this by calling sort_index()
s.sort_index().plot.line()
s.sort_index()
```
%% Cell type:markdown id: tags:
### Plotting lines from a DataFrame
%% Cell type:code id: tags:
``` python
# This DataFrame is made using a dict of lists
# City of Madison normal high and low (degrees F) by month
temp_df = DataFrame(
{
"high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
"low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16] }
)
temp_df
```
%% Cell type:markdown id: tags:
### A Line Plot made from a DataFrame automatically plots all columns
The same is true for bar plots; we'll see this later.
%% Cell type:code id: tags:
``` python
# You can also add ticks and ticklabels to a line plot
ax = temp_df.plot.line(figsize=(12, 4))
ax.set_title("Average Temperatures in Madison, WI")
ax.set_xlabel("Month")
ax.set_ylabel("Temp (Fahrenheit)")
ax.set_xticks(range(12)) # makes a range from 0 to 11
ax.set_xticklabels(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
# This gets rid of the weird output
None
```
%% Cell type:code id: tags:
``` python
# ... Or explicitly pass the "x" and "y" parameters...
temp_df_with_month = DataFrame(
{
"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
"high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
"low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16] }
)
ax = temp_df_with_month.plot.line(x="month", y=["high", "low"], figsize=(12, 4))
ax.set_title("Average Temperatures in Madison, WI")
ax.set_xlabel("Month")
ax.set_ylabel("Temp (Fahrenheit)")
```
%% Cell type:markdown id: tags:
### We can perform a calculation on an entire DataFrame
Let's change the entire DataFrame to Celsius
%% Cell type:code id: tags:
``` python
# call the function on the dataframe
celcius_df = f_to_c(temp_df)
celcius_df
```
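%% Cell type:markdown id: tags:
This works because arithmetic on a DataFrame is applied to every cell, so a function built only from arithmetic can take a whole DataFrame as its argument. The sketch below, using made-up temperatures, compares that against converting one value at a time.
%% Cell type:code id: tags:
```python
import pandas as pd

def f_to_c(f):
    return (5 / 9) * (f - 32)

# made-up temperatures chosen so the answers are easy to recognize
temp_df = pd.DataFrame({"high": [32, 212], "low": [-40, 50]})

# the subtraction and multiplication are applied to every cell at once
celsius = f_to_c(temp_df)

# the same result, converting one scalar at a time
by_hand = temp_df.apply(lambda col: col.map(f_to_c))

print(celsius)
```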
%% Cell type:code id: tags:
``` python
# here is one way to add a horizontal line to our line plots
celcius_df["freezing"] = 0
celcius_df
```
%% Cell type:code id: tags:
``` python
# this plots each column as lines
# with rotation for the tick labels
ax = celcius_df.plot.line(y=["high", "low", "freezing"], figsize = (12,4))
ax.set_xlabel("Month")
ax.set_ylabel("Temp (Celsius)")
ax.set_xticks(range(12))
ax.set_xticklabels(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], rotation=45)
None
```
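%% Cell type:markdown id: tags:
Adding a constant column is one way to get a horizontal line. Another option, sketched here with made-up values, is matplotlib's `Axes.axhline`, which draws the line directly and leaves the DataFrame unchanged. The `matplotlib.use("Agg")` line is only there so the sketch runs as a plain script; leave it out inside a notebook.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # for running as a plain script; omit this line in a notebook
import pandas as pd

# made-up Celsius values, just enough to draw two lines
temps = pd.DataFrame({"high": [-3.3, 20.0, 27.8], "low": [-11.7, 7.8, 16.1]})

ax = temps.plot.line(figsize=(12, 4))

# draw the reference line on the Axes instead of adding a "freezing" column
freezing_line = ax.axhline(y=0, color="gray", linestyle="--", label="freezing")
ax.legend()  # rebuild the legend so it includes the new line
ax.set_ylabel("Temp (Celsius)")
```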
%% Cell type:markdown id: tags:
### Bar Plot Example w/ Fire Hydrants
- General review of Pandas
- Some new Bar Plot options
%% Cell type:code id: tags:
``` python
hdf = pd.read_csv("Fire_Hydrants.csv")
hdf.tail()
```
%% Cell type:code id: tags:
``` python
# grab just the column names
hdf.columns
```
%% Cell type:markdown id: tags:
### Let's create a *bar plot* to visualize *colors* of fire hydrants.
%% Cell type:code id: tags:
``` python
# make a series called color_counts which stores the value counts of the "nozzle_color"
color_counts = hdf["nozzle_color"].value_counts()
color_counts # what type is this?
```
%% Cell type:code id: tags:
``` python
# Clean the data: normalize capitalization with str.upper()
color_counts = hdf["nozzle_color"].str.upper().value_counts()
color_counts
```
%% Cell type:code id: tags:
``` python
# make a horizontal bar plot of counts of colors and have the colors match
# use color list: ["b", "g", "darkorange", "r", "c", "0.5"]
ax = color_counts.plot.barh(color=["b", "g", "darkorange", "r", "c", "0.5"])
# in a horizontal bar plot the counts run along the x-axis
ax.set_xlabel("Fire hydrant count")
```
%% Cell type:markdown id: tags:
### Let's create a *bar plot* to visualize *style* of fire hydrants.
%% Cell type:code id: tags:
``` python
# Do the same thing as we did for the colors but this time for the "Style"
style_counts = hdf["Style"].str.upper().value_counts()
style_counts
```
%% Cell type:code id: tags:
``` python
style_counts.plot.bar()
```
%% Cell type:code id: tags:
``` python
# Grab the top 12
top12 = style_counts.iloc[:12].copy()
# and then add an entry to our Series holding the sum of all the other styles
top12["other"] = style_counts.iloc[12:].sum()
```
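%% Cell type:markdown id: tags:
The "top N plus other" pattern is easier to follow on a small example. This sketch uses made-up counts and a hypothetical cutoff of 2 instead of 12. The Series is already sorted from largest to smallest, as `value_counts()` output would be.
%% Cell type:code id: tags:
```python
import pandas as pd

# made-up counts with made-up labels, sorted like value_counts() output
counts = pd.Series({"PACER": 50, "M-3": 30, "K-11": 10, "WB-59": 5, "CLOW": 3})

n = 2  # hypothetical cutoff; the lecture uses 12

# copy the slice so that adding a label does not touch the original Series
top = counts.iloc[:n].copy()

# everything after the first n entries is collapsed into one bucket
top["other"] = counts.iloc[n:].sum()

print(top)
```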
%% Cell type:code id: tags:
``` python
# Plot the results
ax = top12.plot.bar(color="firebrick")
ax.set_ylabel("Hydrant Count")
ax.set_xlabel("Hydrant Type")
```
%% Cell type:markdown id: tags:
### Plot the year manufactured for the Pacer Style as opposed to other styles
%% Cell type:code id: tags:
``` python
# Let's get the year manufactured for all of the "Pacer" hydrants.
pacer_years = hdf[hdf["Style"] == "Pacer"]["year_manufactured"]
# Note: We can do this either way
# pacer_years = hdf["year_manufactured"][hdf["Style"] == "Pacer"]
pacer_years
```
%% Cell type:code id: tags:
``` python
# then do the same for all the other data
other_years = hdf["year_manufactured"][hdf["Style"] != "Pacer"]
other_years
```
%% Cell type:code id: tags:
``` python
# Round each year down to the start of the decade.
# e.g. 1987 --> 1980, 2003 --> 2000
pacer_decades = (pacer_years // 10 * 10)
pacer_decades
```
%% Cell type:code id: tags:
``` python
# Drop the NaN values, convert to int, and do value counts
pacer_decades = pacer_decades.dropna()
pacer_decades = pacer_decades.astype(int).value_counts()
pacer_decades
```
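%% Cell type:markdown id: tags:
Here is the whole cleaning pipeline on a handful of made-up years. Integer division keeps missing values as NaN, so the result stays a float Series. The NaN values have to be dropped before `astype(int)`, because NaN cannot be converted to an integer.
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# made-up years, with one missing value
years = pd.Series([1987.0, 2003.0, np.nan, 1989.0, 2010.0])

# round down to the decade: 1987 -> 1980.0, 2003 -> 2000.0; NaN stays NaN
decades = years // 10 * 10

# drop the missing value, switch from float to int, then count each decade
counts = decades.dropna().astype(int).value_counts()

print(counts)
```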
%% Cell type:code id: tags:
``` python
# Do the same thing for other_years. Save to a variable called "other_decades"
other_decades = (other_years // 10 * 10).dropna()
other_decades = other_decades.astype(int).value_counts()
other_decades
```
%% Cell type:code id: tags:
``` python
# Build a DataFrame from a dictionary that maps column names to Series
plot_df = DataFrame({
"pacer": pacer_decades,
"other": other_decades,
})
plot_df
```
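%% Cell type:markdown id: tags:
When a DataFrame is built from several Series, pandas lines them up by index label. A decade that appears in only one Series becomes NaN in the other column. The sketch below shows this with made-up counts, plus one possible follow-up (an assumption, not something the lecture does): replacing the NaN values with 0.
%% Cell type:code id: tags:
```python
import pandas as pd

# made-up counts; the two Series only share the 1980 label
pacer = pd.Series({1980: 3, 1990: 5})
other = pd.Series({1970: 2, 1980: 4})

# rows are the union of both indexes; missing combinations become NaN
plot_df = pd.DataFrame({"pacer": pacer, "other": other})

# optional: treat "no hydrants recorded" as zero and go back to ints
filled = plot_df.fillna(0).astype(int)

print(plot_df)
print(filled)
```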
%% Cell type:code id: tags:
``` python
# make a bar plot
ax = plot_df.plot.bar()
ax.set_xlabel("Decade")
ax.set_ylabel("Hydrant Count")
```
%% Cell type:code id: tags:
``` python
# Ignore data from before 1950 using boolean indexing.
ax = plot_df[plot_df.index >= 1950].plot.bar()
ax.set_xlabel("Decade")
ax.set_ylabel("Hydrant Count")
```
%% Cell type:code id: tags:
``` python
# Make a Stacked Bar Chart!
ax = plot_df[plot_df.index >= 1950].plot.bar(stacked=True)
ax.set_xlabel("Decade")
ax.set_ylabel("Hydrant Count")
None
```
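%% Cell type:markdown id: tags:
With `stacked=True`, each column's bars start where the previous column's bars end. The sketch below checks this on made-up counts by reading the bottom edge of the second column's bars. The `matplotlib.use("Agg")` line is only there so the sketch runs as a plain script; leave it out inside a notebook.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # for running as a plain script; omit this line in a notebook
import pandas as pd

# made-up counts for two decades
plot_df = pd.DataFrame({"pacer": [5, 8], "other": [2, 1]}, index=[1990, 2000])

ax = plot_df.plot.bar(stacked=True)
ax.set_xlabel("Decade")
ax.set_ylabel("Hydrant Count")

# one container of bars per column, in column order
pacer_bars, other_bars = ax.containers

# the "other" bars should sit on top of the "pacer" bars
bottoms = [bar.get_y() for bar in other_bars]
print(bottoms)
```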
%% Cell type:code id: tags:
``` python
# Warmup 0
import sqlite3
import pandas as pd
from pandas import DataFrame, Series
import matplotlib
from matplotlib import pyplot as plt
matplotlib.rcParams["font.size"] = 15
```
%% Cell type:code id: tags:
``` python
# Warmup 1: Write a function that converts any Fahrenheit temp to Celsius
# Note: The final exam will include selected material from earlier in the course
# C = (5/9) * (f-32)
def f_to_c(f):
    pass
# test it by making several calls
```
%% Cell type:code id: tags:
``` python
# Warmup 2a: Save all the data from the "piazza" table to "piazza_df"
piazza_conn = sqlite3.connect("piazza.db")
piazza_df = ???
piazza_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2b: Set the index of piazza_df to be student_id
piazza_df = ???
piazza_df
```
%% Cell type:code id: tags:
``` python
# Warmup 2c: Add a column "total" to "piazza_df". This should be the sum of
# the number of posts, answers, edits, followups, and replies_to_followups
```
%% Cell type:code id: tags:
``` python
# Warmup 2d: Create a new dataframe "contributors_df" which contains those
# that had more than 0 total contributions, and sort by this
# value from highest to lowest. Break ties by name in alphabetical order.
contributors_df = ???
```
%% Cell type:code id: tags:
``` python
# Warmup 2e: How would we have done this in sql?
pd.read_sql("""
""", piazza_conn)
```
%% Cell type:code id: tags:
``` python
# Warmup 3a: Of those that contributed, what was their average number of contributions?
# Do your analysis by role (e.g. by ta, instructor, and student)
```
%% Cell type:code id: tags:
``` python
# Warmup 3b: How would we have done this in sql?
pd.read_sql("""
""", piazza_conn)
```
%% Cell type:code id: tags:
``` python
# Warmup 4: What is the correlation between all of the columns?
```
%% Cell type:code id: tags:
``` python
# Warmup 5: Close the connection.
piazza_conn.close()
```
%% Cell type:markdown id: tags:
# Plotting Applications
**Learning Objectives**
- Make a line plot on a series or on a DataFrame
- Apply features of line plots and bar plots to visualize results of data investigations
- Clean Series data by dropping NaN values and by converting to int
- Make a stacked bar plot
%% Cell type:markdown id: tags:
## Line plots
- `SERIES.plot.line()`
- `DATAFRAME.plot.line()` each column in the data frame becomes a line in the plot
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.line.html
%% Cell type:code id: tags:
``` python
# when you make a series from a list, the default indices are 0, 1, 2, ...
s = Series([1758, 2002, 2408, 2898, 3814, 4803, 5713, 6661, 7618, 8391, 8764]) # y values
s.plot.line()
```
%% Cell type:code id: tags:
``` python
# You can make a series from a list and add indices
s = Series([1758, 2002, 2408, 2898, 3814, 4803, 5713, 6661, 7618, 8391, 8764], \
index=[2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
s.plot.line()
```
%% Cell type:code id: tags:
``` python
# We can save the AxesSubplot and "beautify" it like the other plots...
s.plot.line()
```
%% Cell type:code id: tags:
``` python
# Be careful! If the indices are out of order you get a mess
# pandas plots each (index, value) in the order given
s = Series([1758, 2408, 2898, 3814, 4803, 5713, 6661, 7618, 8391, 8764, 2002], \
index=[2010, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2011])
s.plot.line()
s
```
%% Cell type:code id: tags:
``` python
# you can fix this by calling sort_index()
```
%% Cell type:markdown id: tags:
### Plotting lines from a DataFrame
%% Cell type:code id: tags:
``` python
# This DataFrame is made using a dict of lists
# City of Madison normal high and low (degrees F) by month
temp_df = DataFrame(
{
"high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
"low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16] }
)
temp_df
```
%% Cell type:markdown id: tags:
### A Line Plot made from a DataFrame automatically plots all columns
The same is true for bar plots; we'll see this later.
%% Cell type:code id: tags:
``` python
# You can also add ticks and ticklabels to a line plot
ax = temp_df.plot.line(figsize=(12, 4))
ax.set_title("Average Temperatures in Madison, WI")
ax.set_xlabel("Month")
ax.set_ylabel("Temp (Fahrenheit)")
ax.set_xticks(range(12)) # makes a range from 0 to 11
ax.set_xticklabels(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
# This gets rid of the weird output
None
```
%% Cell type:code id: tags:
``` python
# ... Or explicitly pass the "x" and "y" parameters...
temp_df_with_month = DataFrame(
{
"month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
"high": [26, 31, 43, 57, 68, 78, 82, 79, 72, 59, 44, 30],
"low": [11, 15, 25, 36, 46, 56, 61, 59, 50, 39, 28, 16] }
)
ax = temp_df_with_month.plot.line(figsize=(12, 4))
ax.set_title("Average Temperatures in Madison, WI")
ax.set_xlabel("Month")
ax.set_ylabel("Temp (Fahrenheit)")
```
%% Cell type:markdown id: tags:
### We can perform a calculation on an entire DataFrame
Let's change the entire DataFrame to Celsius
%% Cell type:code id: tags:
``` python
# call the function on the dataframe
celcius_df = ???
celcius_df
```
%% Cell type:code id: tags:
``` python
# here is one way to add a horizontal line to our line plots
celcius_df["freezing"] = 0
celcius_df
```
%% Cell type:code id: tags:
``` python
# this plots each column as lines
# with rotation for the tick labels
ax = celcius_df.plot.line(y=["high", "low", "freezing"], figsize = (12,4))
ax.set_xlabel("Month")
ax.set_ylabel("Temp (Celsius)")
ax.set_xticks(range(12))
ax.set_xticklabels(["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], rotation=45)
None
```
%% Cell type:markdown id: tags:
### Bar Plot Example w/ Fire Hydrants
- General review of Pandas
- Some new Bar Plot options
%% Cell type:code id: tags:
``` python
hdf = ???
hdf.tail()
```
%% Cell type:code id: tags:
``` python
# grab just the column names
???
```
%% Cell type:markdown id: tags:
### Let's create a *bar plot* to visualize *colors* of fire hydrants.
%% Cell type:code id: tags:
``` python
# make a series called color_counts which stores the value counts of the "nozzle_color"
color_counts = ???
color_counts # what type is this?
```
%% Cell type:code id: tags:
``` python
# TODO: Clean the data (use str.upper())
color_counts = ???
color_counts
```
%% Cell type:code id: tags:
``` python
# make a horizontal bar plot of counts of colors and have the colors match
# use color list: ["b", "g", "darkorange", "r", "c", "0.5"]
```
%% Cell type:markdown id: tags:
### Let's create a *bar plot* to visualize *style* of fire hydrants.
%% Cell type:code id: tags:
``` python
# Do the same thing as we did for the colors but this time for the "Style"
style_counts = ???
style_counts
```
%% Cell type:code id: tags:
``` python
style_counts.plot.bar()
```
%% Cell type:code id: tags:
``` python
# Grab the top 12
# and then add an entry to our Series holding the sum of all the other styles
```
%% Cell type:code id: tags:
``` python
# Plot the results
```
%% Cell type:markdown id: tags:
### Plot the year manufactured for the Pacer Style as opposed to other styles
%% Cell type:code id: tags:
``` python
# Let's get the year manufactured for all of the "Pacer" hydrants.
pacer_years = ???
pacer_years
```
%% Cell type:code id: tags:
``` python
# then do the same for all the other data
other_years = ???
other_years
```
%% Cell type:code id: tags:
``` python
# Round each year down to the start of the decade.
# e.g. 1987 --> 1980, 2003 --> 2000
pacer_decades = ???
pacer_decades
```
%% Cell type:code id: tags:
``` python
# Drop the NaN values, convert to int, and do value counts
```
%% Cell type:code id: tags:
``` python
# Do the same thing for other_years. Save to a variable called "other_decades"
```
%% Cell type:code id: tags:
``` python
# Build a DataFrame from a dictionary that maps column names to Series
plot_df = DataFrame({
"pacer": pacer_decades,
"other": other_decades,
})
plot_df
```
%% Cell type:code id: tags:
``` python
# Make a bar plot
ax = plot_df.plot.bar()
ax.set_xlabel("Decade")
ax.set_ylabel("Hydrant Count")
```
%% Cell type:code id: tags:
``` python
# Ignore data from before 1950 using boolean indexing.
```
%% Cell type:code id: tags:
``` python
# Make a Stacked Bar Chart!
```