Commit a3c78028 authored by Andy Kuemmel

Update...

Update f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots_template.ipynb, f22/andy_lec_notes/lec37_Dec07_Plotting2/lec_37_plotting2_scatter_plots.ipynb
Deleted f22/andy_lec_notes/lec37_Dec07_Plotting2/lec36_plotting2_850.ipynb, f22/andy_lec_notes/lec37_Dec07_Plotting2/lec36_plotting2_complete.ipynb
parent 575dec36
%% Cell type:code id: tags:
``` python
# ignore this cell (it's just to make certain text red later, but you don't need to understand it).
from IPython.display import display, HTML
display(HTML('<style>em { color: red; }</style> <style>.container { width:100% !important; }</style>'))
```
%% Output
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This dataset is commonly used in introductory machine learning courses
- You can train an ML algorithm to use the values to predict the class of iris
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
%% Cell type:code id: tags:
``` python
# Warmup 1: Requests and file writing
# use requests to get this file "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
# check that the request was successful
response.raise_for_status()
# open a file called "iris.csv" for writing the data locally to avoid spamming their server
file_obj = open("iris.csv", "w")
# write the text of response to the file object
file_obj.write(response.text)
# close the file object
file_obj.close()
# Look at the file you downloaded. What's wrong with it?
```
%% Cell type:code id: tags:
``` python
# Warmup 2: Making a DataFrame
# read the "iris.csv" file into a Pandas dataframe
# display the head of the data frame
```
%% Cell type:code id: tags:
``` python
# Warmup 3: Our CSV file has no header....let's add column names.
# Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
# Attribute Information:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
```
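%% Cell type:markdown id: tags:
One possible way to handle the missing header, sketched on a few in-memory rows instead of the downloaded file. Passing `names=` to `pd.read_csv` labels the columns and keeps the first row as data. The name `iris_sample_df` and the `io.StringIO` stand-in for `iris.csv` are assumptions made for illustration.
%% Cell type:code id: tags:
```python
import io
import pandas as pd

# Stand-in for iris.csv: three rows in the same header-less layout (assumption for illustration).
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Because names= is given, read_csv treats the file as having no header row,
# so the first line stays in the data instead of becoming column labels.
iris_sample_df = pd.read_csv(
    sample_csv,
    names=["sep-length", "sep-width", "pet-length", "pet-width", "class"],
)
print(iris_sample_df.head())
```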
%% Cell type:code id: tags:
``` python
# Warmup 4: Connect to our database version of this data
iris_conn = sqlite3.connect("iris-flowers.db")
# find out the name of the table
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
```
%% Cell type:code id: tags:
``` python
# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
# Break any ties by ordering by the shortest sepal width.
pd.read_sql("""
""", iris_conn)
```
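%% Cell type:markdown id: tags:
A rough sketch of the shape such a query could take. The table name `iris` and the column names below are hypothetical, since the real schema of `iris-flowers.db` is whatever the `sqlite_master` query reports. An in-memory database with a few invented rows stands in for the real file.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# Hypothetical schema and rows; the real table and column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE iris ("sep-length" REAL, "sep-width" REAL, "class" TEXT)')
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [
        (5.8, 4.0, "Iris-setosa"),
        (5.7, 4.4, "Iris-setosa"),
        (5.7, 3.8, "Iris-setosa"),
        (7.0, 3.2, "Iris-versicolor"),
    ],
)

# Longest sepal length first; ties go to the shorter sepal width.
# Column names containing a hyphen must be double-quoted in SQL.
top_setosa = pd.read_sql(
    """
    SELECT *
    FROM iris
    WHERE class = 'Iris-setosa'
    ORDER BY "sep-length" DESC, "sep-width" ASC
    LIMIT 10
    """,
    conn,
)
conn.close()
print(top_setosa)
```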
%% Cell type:markdown id: tags:
# Lecture 37: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about trees.
The city of Madison maintains a database of all the trees it cares for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot of a DataFrame using the following function...
`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
Plot the trees data comparing a tree's age to its height...
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
%% Cell type:code id: tags:
``` python
```
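%% Cell type:markdown id: tags:
A minimal sketch of one way to fill in the three blanks, using a four-row sample in the same shape as the trees data. The name `trees_sample_df` is an assumption, and the `matplotlib.use("Agg")` line is only there so the sketch runs outside a notebook.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import pandas as pd

# A small sample with the same columns as the trees data.
trees_sample_df = pd.DataFrame([
    {"age": 1, "height": 1.5, "diameter": 0.8},
    {"age": 2, "height": 2.5, "diameter": 1.5},
    {"age": 3, "height": 3.2, "diameter": 2.1},
    {"age": 4, "height": 4.9, "diameter": 2.8},
])

# df_name is the DataFrame, x_col_name is "age", y_col_name is "height".
ax = trees_sample_df.plot.scatter(x="age", y="height", color="peachpuff")
```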
%% Cell type:markdown id: tags:
Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
%% Cell type:code id: tags:
``` python
# Plot with some more beautification options.
```
%% Cell type:code id: tags:
``` python
# Add a title to your plot.
```
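%% Cell type:markdown id: tags:
One way the beautification options and the title might be combined, again on a small sample. The specific color, marker, size, and title text are arbitrary choices for illustration. `plot.scatter` returns the subplot it drew on, so the title can be set on that returned object.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import pandas as pd

trees_sample_df = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
})

# color, marker, and s (marker size) are passed through to matplotlib.
ax = trees_sample_df.plot.scatter(
    x="age", y="height", color="forestgreen", marker="D", s=60
)

# The returned subplot object lets us add a title afterwards.
ax.set_title("Tree height by age")
```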
%% Cell type:markdown id: tags:
#### Correlation
%% Cell type:code id: tags:
``` python
# What is the correlation between our DataFrame columns?
```
%% Cell type:code id: tags:
``` python
# What is the correlation between age and height (don't use .iloc)
```
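%% Cell type:markdown id: tags:
A short sketch of two ways to get at the correlations, assuming a small sample DataFrame named `trees_sample_df`. `DataFrame.corr()` gives the full table of pairwise correlations, and a single pair can be read out by label with `.loc` or computed directly with `Series.corr`.
%% Cell type:code id: tags:
```python
import pandas as pd

trees_sample_df = pd.DataFrame({
    "age": [1, 2, 3, 4],
    "height": [1.5, 2.5, 3.2, 4.9],
    "diameter": [0.8, 1.5, 2.1, 2.8],
})

# Pairwise (Pearson) correlation between every pair of numeric columns.
corr_df = trees_sample_df.corr()
print(corr_df)

# Look up one pair by row and column label instead of by position.
age_height = corr_df.loc["age", "height"]

# The same number, computed directly from the two columns.
age_height_direct = trees_sample_df["age"].corr(trees_sample_df["height"])
```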
%% Cell type:markdown id: tags:
### The Size can be based on a DataFrame value
%% Cell type:code id: tags:
``` python
# Option 1:
trees_df.plot.scatter(x="age", y="height", marker="H", s="diameter")
```
%% Cell type:code id: tags:
``` python
# Option 2:
trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
```
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Iris Data
%% Cell type:code id: tags:
``` python
iris_df
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
%% Cell type:code id: tags:
``` python
# In Pandas
varieties = iris_df["class"].unique()  # one possible answer; assumes iris_df from Warmup 3
varieties
```
%% Cell type:code id: tags:
``` python
# In SQL
varietes = pd.read_sql("""
""", iris_conn)
varietes
```
%% Cell type:markdown id: tags:
In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
iris_conn.close()
```
%% Cell type:code id: tags:
``` python
# Change this scatter plot so that the data is only for class ='Iris-setosa'
iris_df.plot.scatter(x = "pet-width", y = "pet-length")
```
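%% Cell type:markdown id: tags:
One possible approach is to filter the rows with a boolean mask before plotting. The sketch below uses a six-row stand-in named `iris_sample_df` (an assumption) with the same column names as the headers chosen in Warmup 3.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import pandas as pd

iris_sample_df = pd.DataFrame({
    "pet-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "pet-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Keep only the rows whose class is 'Iris-setosa', then plot that subset.
setosa_df = iris_sample_df[iris_sample_df["class"] == "Iris-setosa"]
ax = setosa_df.plot.scatter(x="pet-width", y="pet-length")
```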
%% Cell type:code id: tags:
``` python
# Write a for loop that iterates through each variety in varieties
# and makes a plot for only that class
for i in range(len(varieties)):
    variety = varieties[i]
    pass
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color
colors = ["blue", "green", "red"]
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color AND marker
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
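%% Cell type:markdown id: tags:
A sketch of how the loop might look once color and marker are indexed alongside the variety. The sample data and the `axes_used` list are assumptions added for illustration. Each call to `plot.scatter` without an `ax` argument opens its own figure, so this produces three separate plots.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import pandas as pd

iris_sample_df = pd.DataFrame({
    "pet-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "pet-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

varieties = iris_sample_df["class"].unique()
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]

axes_used = []  # collected only so the separate plots can be inspected afterwards
for i in range(len(varieties)):
    variety = varieties[i]
    variety_df = iris_sample_df[iris_sample_df["class"] == variety]
    # The same index i picks the matching color and marker for this variety.
    ax = variety_df.plot.scatter(
        x="pet-width", y="pet-length",
        color=colors[i], marker=markers[i], title=variety,
    )
    axes_used.append(ax)
```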
%% Cell type:code id: tags:
``` python
# Did you notice that it made 3 plots?!?! What's deceiving about this?
```
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# Have to be VERY careful to not crop out data.
# We'll talk about this next lecture.
```
%% Cell type:code id: tags:
``` python
# Better yet, we could combine these.
```
%% Cell type:markdown id: tags:
### We can draw several plots on one subplot (an AxesSubplot), passed with the keyword `ax`
1. If an AxesSubplot is passed as `ax`, the plot is drawn in that subplot
2. If `ax` is None, a new AxesSubplot is created
3. The AxesSubplot that was used is returned
%% Cell type:code id: tags:
``` python
# complete this code to make 3 plots in one
plot_area = None # don't change this...look at this variable in line 12
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
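%% Cell type:markdown id: tags:
One way the three rules above could be put to work, sketched on stand-in data (`iris_sample_df` is an assumption). `plot_area` starts as `None`, so the first call creates a subplot. Every later call receives that same subplot through `ax=` and draws on top of it.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import pandas as pd

iris_sample_df = pd.DataFrame({
    "pet-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "pet-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

varieties = iris_sample_df["class"].unique()
plot_area = None  # no subplot yet; the first scatter call creates one
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]

for i in range(len(varieties)):
    variety = varieties[i]
    variety_df = iris_sample_df[iris_sample_df["class"] == variety]
    # Pass the previous subplot back in and keep whatever subplot is returned.
    plot_area = variety_df.plot.scatter(
        x="pet-width", y="pet-length",
        color=colors[i], marker=markers[i],
        label=variety, ax=plot_area,
    )
```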
%% Cell type:markdown id: tags:
### Time-Permitting
Plot this data in an interesting/meaningful way & identify any correlations.
%% Cell type:code id: tags:
``` python
students = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
],
"height": [
68,
66,
60,
72
]
})
students
```
%% Cell type:code id: tags:
``` python
# Min, Max, and Overall Difference in Student Height
min_height = students["height"].min()
max_height = students["height"].max()
diff_height = max_height - min_height
# Normalize students' heights on a scale of [0, 1] (black to white)
height_colors = (students["height"] - min_height) / diff_height
# Rescale students' heights to [0, 0.5] (black to gray)
height_colors = height_colors / 2
# Color must be a string (e.g. c='0.34')
height_colors = height_colors.astype("string")
height_colors
```
%% Cell type:code id: tags:
``` python
# Plot!
```
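%% Cell type:markdown id: tags:
One of many possible plots, offered as a sketch and not as the intended answer: attendance against GPA, with each point shaded by height using grayscale strings. The choice of axes and the marker size are assumptions. The colors are passed as a plain list of strings so that matplotlib reads each one as a gray level.
%% Cell type:code id: tags:
```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter
import pandas as pd

# Same values as the students DataFrame above, limited to the columns used here.
students = pd.DataFrame({
    "gpa": [2.0, 3.5, 3.0, 2.5],
    "attendance": [4, 11, 10, 6],
    "height": [68, 66, 60, 72],
})

# Rescale height to [0, 0.5]: the shortest student is black, the tallest is mid-gray.
min_height = students["height"].min()
diff_height = students["height"].max() - min_height
height_colors = ((students["height"] - min_height) / diff_height / 2).astype("string")

# A string such as '0.25' is interpreted by matplotlib as a grayscale level.
ax = students.plot.scatter(
    x="attendance", y="gpa", c=height_colors.tolist(), s=100
)
ax.set_title("GPA vs. attendance (darker = shorter)")
```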
%% Cell type:code id: tags:
``` python
# What are the correlations?
```
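%% Cell type:markdown id: tags:
A sketch of one way to compute the correlations. Selecting the numeric columns first is an assumption made here so that the text columns (`name`, `grade`) are never handed to `corr()`. With only four students, any correlation found this way should be read with caution.
%% Cell type:code id: tags:
```python
import pandas as pd

# Same values as the students DataFrame above.
students = pd.DataFrame({
    "name": ["Cole", "Cynthia", "Alice", "Seth"],
    "grade": ["C", "AB", "B", "BC"],
    "gpa": [2.0, 3.5, 3.0, 2.5],
    "attendance": [4, 11, 10, 6],
    "height": [68, 66, 60, 72],
})

# Correlation is only defined for numeric data, so pick those columns explicitly.
students_corr = students[["gpa", "attendance", "height"]].corr()
print(students_corr)
```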
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
%% Cell type:markdown id: tags:
https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500