Commit 6671cbaf authored by Ashwin Maran

add lecture 35

parent 76a6e4fd
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
### Titanic dataset: https://www.kaggle.com/datasets/yasserh/titanic-dataset
A **copy** can be found at: `https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/s24/AmFam_Ashwin/35_Plotting2/Lecture%20Code/titanic.csv`
%% Cell type:markdown id: tags:
## Warmup 1: Requests and file writing
Download this file and save it locally in the file `titanic.csv`
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Warmup 2: Making a DataFrame
Read the `"titanic.csv"` file into a Pandas DataFrame
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Warmup 3: Some of our column names are not very clear, let's change them.
These should be our headers: `"ID", "Survived", "Passenger Class", "Name", "Sex", "Age", "No. of Siblings/Spouses aboard", "No. of Parents/Children aboard", "Ticket Number", "Fare", "Cabin", "Location Embarked"`
Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
%% Cell type:code id: tags:
``` python
# write your code here
```
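%% Cell type:markdown id: tags:
One possible approach, sketched on a tiny made-up CSV rather than the real file: `pd.read_csv` accepts `names=` for replacement column labels, and `header=0` tells it that the first line is the old header and should be discarded. The three raw column names and the variable names `raw_csv`, `new_headers`, and `sample_df` are assumptions for illustration only.
%% Cell type:code id: tags:
```python
import io
import pandas as pd

# Made-up two-row sample; the raw header names here are assumed
raw_csv = io.StringIO(
    "PassengerId,Survived,Pclass\n"
    "1,0,3\n"
    "2,1,1\n"
)

# Shortened version of the header list from the warmup
new_headers = ["ID", "Survived", "Passenger Class"]

# header=0 marks the first line as the old header so it is dropped,
# and names= supplies the replacement column labels
sample_df = pd.read_csv(raw_csv, header=0, names=new_headers)
print(sample_df)
```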
%% Cell type:markdown id: tags:
## Warmup 4: Connect to our database version of this data!
%% Cell type:markdown id: tags:
#### The following code will create a `titanic.db` file and write the contents of `titanic_df` into this database
%% Cell type:code id: tags:
``` python
titanic_conn = sqlite3.connect("titanic.db")
titanic_df.to_sql("titanic", titanic_conn, if_exists="replace", index=False)
```
%% Cell type:code id: tags:
``` python
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", titanic_conn)
```
%% Cell type:code id: tags:
``` python
pd.read_sql("SELECT * FROM titanic LIMIT 5", titanic_conn)
```
%% Cell type:markdown id: tags:
## Warmup 5: Using SQL, get the 10 oldest male Titanic passengers
%% Cell type:code id: tags:
``` python
# write your code here
```
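%% Cell type:markdown id: tags:
A minimal sketch of the query shape, run against a made-up in-memory table instead of `titanic.db`: filter with `WHERE`, sort with `ORDER BY ... DESC`, and cap the result with `LIMIT`. The sample rows and the names `sample_df`, `conn`, and `oldest_males` are hypothetical, and `LIMIT 2` stands in for the `LIMIT 10` the warmup asks for.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# Made-up rows; the real table has many more columns and passengers
sample_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "male", "male"],
    "Age": [30.0, 62.0, 71.0, None],
})

conn = sqlite3.connect(":memory:")  # temporary database, nothing written to disk
sample_df.to_sql("titanic", conn, index=False)

# Keep only male passengers, oldest first; SQLite sorts missing ages last
# when the order is descending
oldest_males = pd.read_sql("""
    SELECT Name, Age
    FROM titanic
    WHERE Sex = 'male'
    ORDER BY Age DESC
    LIMIT 2
""", conn)
conn.close()
print(oldest_males)
```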
%% Cell type:markdown id: tags:
## Warmup 6: Using SQL, get the average Fare for each Passenger Class.
%% Cell type:code id: tags:
``` python
# write your code here
```
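%% Cell type:markdown id: tags:
One way this could look, again on a made-up in-memory table: `GROUP BY` collects the rows of each class and `AVG` summarizes each group. Column names containing spaces need quoting, done here with backticks. The sample values and the alias `avg_fare` are assumptions for illustration.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# Made-up fares for two passenger classes
sample_df = pd.DataFrame({
    "Passenger Class": [1, 1, 3, 3],
    "Fare": [80.0, 100.0, 7.0, 9.0],
})

conn = sqlite3.connect(":memory:")
sample_df.to_sql("titanic", conn, index=False)

# One output row per class, holding the mean fare of that class
avg_fares = pd.read_sql("""
    SELECT `Passenger Class`, AVG(Fare) AS avg_fare
    FROM titanic
    GROUP BY `Passenger Class`
    ORDER BY `Passenger Class`
""", conn)
conn.close()
print(avg_fares)
```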
%% Cell type:markdown id: tags:
# Lecture 35: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about trees.
The city of Madison maintains a database of all the trees they care for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot of a DataFrame using the following function...
`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
## Example 1: Plot the trees data comparing a tree's age to its height
<pre>
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
</pre>
%% Cell type:code id: tags:
``` python
trees_df.plot.scatter(x="age", y="height", color="g")
```
%% Cell type:markdown id: tags:
#### Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
%% Cell type:code id: tags:
``` python
trees_df.plot.scatter(x="age", y="height", color="r", marker="D", s=50) # D for diamond
```
%% Cell type:markdown id: tags:
#### And we can add a Title to our plot...
%% Cell type:code id: tags:
``` python
ax = trees_df.plot.scatter(x="age", y="height", color="r", marker="D", s=50)
ax.set_title("Tree Age vs Height")
```
%% Cell type:markdown id: tags:
# Correlation
## Example 2: What is the correlation between our DataFrame columns?
%% Cell type:code id: tags:
``` python
corr_df = trees_df.corr()
corr_df
```
%% Cell type:markdown id: tags:
## Exercise 1: What is the correlation between age and height?
%% Cell type:code id: tags:
``` python
# write your code here
```
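%% Cell type:markdown id: tags:
A sketch of two ways to pull out a single correlation value, shown on a small made-up DataFrame so it does not give away the answer for `trees_df`: look the pair up in the matrix returned by `corr()`, or call `Series.corr` on one column with the other as the argument. The names `sample_df`, `from_matrix`, and `direct` are hypothetical.
%% Cell type:code id: tags:
```python
import pandas as pd

# Made-up data where height is exactly twice the age
sample_df = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [2.0, 4.0, 6.0, 8.0],
})

# Option 1: index into the full correlation matrix by row and column label
corr_df = sample_df.corr()
from_matrix = corr_df.loc["age", "height"]

# Option 2: correlate the two columns directly
direct = sample_df["age"].corr(sample_df["height"])

print(from_matrix, direct)
```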
%% Cell type:markdown id: tags:
### Varying Stylistic Parameters
%% Cell type:code id: tags:
``` python
trees_df.plot.scatter(x="age", y="height", marker="H", s="diameter")
```
%% Cell type:markdown id: tags:
#### We should scale up the sizes to make them more easily visible
%% Cell type:code id: tags:
``` python
trees_df.plot.scatter(x="age", y="height", marker="H", s=trees_df["diameter"] * 20) # passing a Series lets us scale the marker sizes
```
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Titanic Data
%% Cell type:code id: tags:
``` python
titanic_df.head()
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
%% Cell type:markdown id: tags:
#### In Pandas...
%% Cell type:code id: tags:
``` python
classes = list(set(titanic_df["Passenger Class"]))
classes
```
%% Cell type:markdown id: tags:
#### In SQL...
%% Cell type:code id: tags:
``` python
classes = sorted(list(pd.read_sql("""
SELECT DISTINCT `Passenger Class`
FROM titanic
""", titanic_conn)["Passenger Class"]))
classes
```
%% Cell type:markdown id: tags:
#### In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
titanic_conn.close()
```
%% Cell type:markdown id: tags:
## Exercise 2: Change this scatter plot so that the data is only for `Passenger Class` = 3
%% Cell type:code id: tags:
``` python
titanic_df.plot.scatter(x="Age", y="Fare")
```
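%% Cell type:markdown id: tags:
One possible approach, shown on made-up rows: build a boolean mask on the `Passenger Class` column, use it to select the matching rows, and plot the filtered DataFrame. The sample values and the name `third_class_df` are assumptions. The `matplotlib.use("Agg")` line is only there so the sketch also runs outside a notebook.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Made-up passengers from three classes
sample_df = pd.DataFrame({
    "Passenger Class": [1, 3, 3, 2],
    "Age": [38.0, 22.0, 26.0, 35.0],
    "Fare": [71.3, 7.25, 7.9, 13.0],
})

# Boolean indexing keeps only the rows where the condition is True
third_class_df = sample_df[sample_df["Passenger Class"] == 3]

# The scatter plot now contains third-class passengers only
ax = third_class_df.plot.scatter(x="Age", y="Fare")
ax.set_title("Age vs Fare (Passenger Class 3)")
```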
%% Cell type:markdown id: tags:
## Exercise 3: Write a for loop that iterates through each Passenger Class and makes a plot for only that class
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
#### Make the same series of plots, but this time make each plot a different color
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
# write your code here
```
%% Cell type:markdown id: tags:
#### Make the same series of plots, but this time make each plot a different color AND marker
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
# write your code here
```
%% Cell type:markdown id: tags:
**Food for thought:** Did you notice that it made 3 plots? What's deceptive about this?
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
min_x = titanic_df["Age"].min()
max_x = titanic_df["Age"].max()
min_y = titanic_df["Fare"].min()
max_y = titanic_df["Fare"].max()
for i in range(len(classes)):
    pass_class = classes[i]
    # make a df of just the data for this passenger class
    pass_class_df = titanic_df[titanic_df["Passenger Class"] == pass_class]
    # make a scatter plot for this passenger class, using the same axis limits for every plot
    pass_class_df.plot.scatter(x="Age", y="Fare", label=pass_class, color=colors[i], marker=markers[i],
                               xlim=(min_x, max_x), ylim=(min_y, max_y))
```
%% Cell type:markdown id: tags:
#### We have to be VERY careful to not crop out data. We'll talk about this next lecture...
%% Cell type:markdown id: tags:
### We can also draw several plots in the same plot area, called an AxesSubplot, using the keyword `ax`
<pre>
1. if an AxesSubplot is passed as ax, then plot in that subplot
2. if ax is None, create a new AxesSubplot
3. return the AxesSubplot that was used
</pre>
%% Cell type:code id: tags:
``` python
plot_area = None # don't change this...look at this variable in the last line
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
for i in range(len(classes)):
    pass_class = classes[i]
    # make a df of just the data for this passenger class
    pass_class_df = titanic_df[titanic_df["Passenger Class"] == pass_class]
    # make a scatter plot for this passenger class, reusing the plot area from the previous iteration
    plot_area = pass_class_df.plot.scatter(x="Age", y="Fare", label=pass_class, color=colors[i], marker=markers[i], ax=plot_area)
```
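%% Cell type:markdown id: tags:
To make the three rules for `ax` concrete, here is a small sketch on made-up data that records the object returned by each call. The first call receives `ax=None` and creates a new plot area; every later call receives that same object back and draws into it. The sample values and the list `axes_seen` are hypothetical, and `matplotlib.use("Agg")` is only needed outside a notebook.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd

# Made-up passengers, two per class
sample_df = pd.DataFrame({
    "Passenger Class": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 35.0, 27.0, 22.0, 26.0],
    "Fare": [71.3, 51.9, 13.0, 21.0, 7.25, 7.9],
})
colors = ["blue", "green", "red"]
classes = sorted(sample_df["Passenger Class"].unique())

plot_area = None  # None on the first pass, so pandas creates the plot area
axes_seen = []    # remember what each call returned
for color, pass_class in zip(colors, classes):
    subset = sample_df[sample_df["Passenger Class"] == pass_class]
    plot_area = subset.plot.scatter(x="Age", y="Fare", color=color,
                                    label=str(pass_class), ax=plot_area)
    axes_seen.append(plot_area)

# Every call returned the same object, which now holds one point set per class
print(all(a is axes_seen[0] for a in axes_seen), len(plot_area.collections))
```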