Commit 92ac16fc authored by Cole Nelson

lec35

parent 882e3dc8
%% Cell type:code id: tags:
``` python
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
import matplotlib
from matplotlib import pyplot as plt
import requests
matplotlib.rcParams["font.size"] = 12
```
%% Cell type:markdown id: tags:
### IRIS dataset: http://archive.ics.uci.edu/ml/datasets/iris
- This dataset is commonly used in introductory machine learning courses
- You can train an ML algorithm to predict the class of iris from the measured values
- Dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
%% Cell type:code id: tags:
``` python
# Warmup 1: Requests and file writing
# use requests to get this file "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
response = requests.get("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
# check that the request was successful
# open a file called "iris.csv" for writing the data locally to avoid spamming their server
# write the text of response to the file object
# close the file object
# Look at the file you downloaded. What's wrong with it?
```
%% Cell type:code id: tags:
``` python
# Warmup 2: Making a DataFrame
# read the "iris.csv" file into a Pandas dataframe
# display the head of the data frame
```
%% Cell type:code id: tags:
``` python
# Warmup 3: Our CSV file has no header....let's add column names.
# Refer to the documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
# Attribute Information:
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
# 5. class: Iris Setosa, Iris Versicolour, Iris Virginica
# These should be our headers ["sep-length", "sep-width", "pet-length", "pet-width", "class"]
```
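%% Cell type:markdown id: tags:
One possible approach to Warmup 3, sketched on a tiny in-memory sample rather than the downloaded file: the `names` parameter of `pd.read_csv` supplies column names for a file that has no header line. The name `sample_df` and the four hand-typed rows are assumptions made for illustration; with the real file you would pass `"iris.csv"` instead of the `StringIO` object.

```python
import io

import pandas as pd

# A few rows in the same layout as iris.data (no header line),
# typed by hand here purely for illustration.
raw_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

column_names = ["sep-length", "sep-width", "pet-length", "pet-width", "class"]

# Passing names= tells read_csv to treat the first line as data
# and to label the columns with the given list.
sample_df = pd.read_csv(io.StringIO(raw_text), names=column_names)
print(sample_df.head())
```

Without `names=`, the first data row would be consumed as the header, which is the problem Warmup 1 asks you to spot.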
%% Cell type:code id: tags:
``` python
# Warmup 4: Connect to our database version of this data!
iris_conn = sqlite3.connect("iris-flowers.db")
pd.read_sql("SELECT * FROM sqlite_master WHERE type='table'", iris_conn)
```
%% Cell type:code id: tags:
``` python
# Warmup 5: Using SQL, get the 10 'Iris-setosa' flowers with the longest sepal length.
# Break any ties by ordering by the shortest sepal width.
pd.read_sql("""
""", iris_conn)
```
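%% Cell type:markdown id: tags:
A minimal sketch of the kind of query Warmup 5 asks for. The table name `iris`, its column names, and the rows are all assumptions: the sketch builds a made-up in-memory database so it can run on its own, and the real `iris-flowers.db` may use different names (the `sqlite_master` query in Warmup 4 shows the actual ones).

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for iris-flowers.db: an in-memory database
# holding a handful of made-up rows in a table named "iris".
conn = sqlite3.connect(":memory:")
sample = pd.DataFrame({
    "sep-length": [5.1, 5.8, 5.8, 7.0, 5.4],
    "sep-width": [3.5, 4.0, 3.9, 3.2, 3.7],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-setosa"],
})
sample.to_sql("iris", conn, index=False)

# Column names containing a hyphen must be double-quoted in SQL.
query = """
SELECT *
FROM iris
WHERE class = 'Iris-setosa'
ORDER BY "sep-length" DESC, "sep-width" ASC
LIMIT 10
"""
top_setosa = pd.read_sql(query, conn)
conn.close()
print(top_setosa)
```

The second `ORDER BY` key only matters when two rows share the same sepal length, which is how the tie between the two 5.8 rows is resolved here.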
%% Cell type:markdown id: tags:
# Lecture 36: Scatter Plots
**Learning Objectives**
- Set the marker, color, and size of scatter plot data
- Calculate correlation between DataFrame columns
- Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
## Set the marker, color, and size of scatter plot data
To start, let's look at some made-up data about Trees.
The city of Madison maintains a database of all the trees they care for.
%% Cell type:code id: tags:
``` python
trees = [
{"age": 1, "height": 1.5, "diameter": 0.8},
{"age": 1, "height": 1.9, "diameter": 1.2},
{"age": 1, "height": 1.8, "diameter": 1.4},
{"age": 2, "height": 1.8, "diameter": 0.9},
{"age": 2, "height": 2.5, "diameter": 1.5},
{"age": 2, "height": 3, "diameter": 1.8},
{"age": 2, "height": 2.9, "diameter": 1.7},
{"age": 3, "height": 3.2, "diameter": 2.1},
{"age": 3, "height": 3, "diameter": 2},
{"age": 3, "height": 2.4, "diameter": 2.2},
{"age": 2, "height": 3.1, "diameter": 2.9},
{"age": 4, "height": 2.5, "diameter": 3.1},
{"age": 4, "height": 3.9, "diameter": 3.1},
{"age": 4, "height": 4.9, "diameter": 2.8},
{"age": 4, "height": 5.2, "diameter": 3.5},
{"age": 4, "height": 4.8, "diameter": 4},
]
trees_df = DataFrame(trees)
trees_df.head()
```
%% Cell type:markdown id: tags:
### Scatter Plots
We can make a scatter plot from a DataFrame using the following method...
`df_name.plot.scatter(x="x_col_name", y="y_col_name", color="peachpuff")`
Plot the trees data comparing a tree's age to its height...
- What is `df_name`?
- What is `x_col_name`?
- What is `y_col_name`?
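
As a sketch of how the three blanks might be filled in, the example here rebuilds a shortened copy of the trees data so it can run on its own. The six rows are taken from the list above; the variable `ax` is just a name chosen here for the returned plot area.

```python
from pandas import DataFrame

# A shortened copy of the made-up trees data from the cell above.
trees_df = DataFrame([
    {"age": 1, "height": 1.5, "diameter": 0.8},
    {"age": 2, "height": 2.5, "diameter": 1.5},
    {"age": 2, "height": 3, "diameter": 1.8},
    {"age": 3, "height": 3.2, "diameter": 2.1},
    {"age": 4, "height": 3.9, "diameter": 3.1},
    {"age": 4, "height": 5.2, "diameter": 3.5},
])

# df_name is trees_df, x_col_name is "age", y_col_name is "height".
ax = trees_df.plot.scatter(x="age", y="height", color="peachpuff")
```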
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Now plot with a little more beautification...
- Use a new [color](https://matplotlib.org/3.5.0/_images/sphx_glr_named_colors_003.png)
- Use a type of [marker](https://matplotlib.org/stable/api/markers_api.html)
- Change the size (any int)
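
One possible combination of these options, offered only as an example: the color, marker, size, and title text are arbitrary choices, and the data is a shortened copy of the trees list. The title is set on the object returned by `plot.scatter`.

```python
from pandas import DataFrame

# A shortened copy of the made-up trees data.
trees_df = DataFrame([
    {"age": 1, "height": 1.5, "diameter": 0.8},
    {"age": 2, "height": 2.5, "diameter": 1.5},
    {"age": 3, "height": 3.2, "diameter": 2.1},
    {"age": 4, "height": 5.2, "diameter": 3.5},
])

# color takes a named color, marker a marker code, and s a size
# (the marker area in points squared).
ax = trees_df.plot.scatter(
    x="age", y="height", color="forestgreen", marker="^", s=80
)

# plot.scatter returns the plot area, so the title can be set on it.
ax.set_title("Tree height by age (made-up data)")
```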
%% Cell type:code id: tags:
``` python
# Plot with some more beautification options.
```
%% Cell type:code id: tags:
``` python
# Add a title to your plot.
```
%% Cell type:markdown id: tags:
#### Correlation
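A minimal sketch of two ways to compute correlation in pandas, using a shortened copy of the trees data. `DataFrame.corr()` returns a table of pairwise correlations (Pearson by default), and a single value can be read out of it by label with `.loc`, or computed directly with `Series.corr`.

```python
from pandas import DataFrame

# A shortened copy of the made-up trees data.
trees_df = DataFrame([
    {"age": 1, "height": 1.5, "diameter": 0.8},
    {"age": 2, "height": 2.5, "diameter": 1.5},
    {"age": 2, "height": 3, "diameter": 1.8},
    {"age": 3, "height": 3.2, "diameter": 2.1},
    {"age": 4, "height": 3.9, "diameter": 3.1},
    {"age": 4, "height": 5.2, "diameter": 3.5},
])

# Correlation between every pair of numeric columns.
corr_matrix = trees_df.corr()
print(corr_matrix)

# A single pair, looked up by row and column label (no .iloc needed).
age_height = corr_matrix.loc["age", "height"]

# The same value computed directly from the two columns.
age_height_direct = trees_df["age"].corr(trees_df["height"])
print(age_height, age_height_direct)
```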
%% Cell type:code id: tags:
``` python
# What is the correlation between our DataFrame columns?
```
%% Cell type:code id: tags:
``` python
# What is the correlation between age and height (don't use .iloc)
```
%% Cell type:markdown id: tags:
### Varying Stylistic Parameters
%% Cell type:code id: tags:
``` python
# Option 1:
trees_df.plot.scatter(x="age", y="height", marker="H", s="diameter")
```
%% Cell type:code id: tags:
``` python
# Option 2:
trees_df.plot.scatter(x="age", y="height", marker = "H", s=trees_df["diameter"] * 50) # this way allows you to make it bigger
```
%% Cell type:markdown id: tags:
## Use subplots to group scatterplot data
%% Cell type:markdown id: tags:
### Re-visit the Iris Data
%% Cell type:code id: tags:
``` python
iris_df
```
%% Cell type:markdown id: tags:
### How do we create a *scatter plot* for various *class types*?
First, gather all the class types.
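
One way to do this in Pandas, sketched on a small hand-built sample: `Series.unique()` returns each distinct value once, in order of first appearance. The names `iris_sample` and `varieties` and the rows themselves are assumptions made for illustration.

```python
import pandas as pd

# A tiny stand-in for the iris DataFrame (made-up rows, real class names).
iris_sample = pd.DataFrame({
    "pet-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "pet-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

# unique() returns an array; list() makes it easy to index and loop over.
varieties = list(iris_sample["class"].unique())
print(varieties)
```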
%% Cell type:code id: tags:
``` python
# In Pandas
varietes = ???
varietes
```
%% Cell type:code id: tags:
``` python
# In SQL
varietes = pd.read_sql("""
""", iris_conn)
varietes
```
%% Cell type:markdown id: tags:
In reality, you can choose to write Pandas or SQL queries (or a mix of both!). For the rest of this lecture, we'll use Pandas.
%% Cell type:code id: tags:
``` python
# If you want to continue using SQL instead, don't close the connection!
iris_conn.close()
```
%% Cell type:code id: tags:
``` python
# Change this scatter plot so that the data is only for class ='Iris-setosa'
iris_df.plot.scatter(x = "pet-width", y = "pet-length")
```
%% Cell type:code id: tags:
``` python
# Write a for loop that iterates through each variety in varietes
# and makes a plot for only that class
for i in range(len(varietes)):
    variety = varietes[i]
    pass
```
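%% Cell type:markdown id: tags:
A sketch of the filter-then-plot pattern, assuming a small hand-built sample in place of `iris_df`. A boolean mask keeps only the rows of one class, and the loop index picks a matching color. Each call to `plot.scatter` without an `ax` argument draws in its own new plot area, so this produces three separate plots.

```python
import pandas as pd

# A tiny stand-in for the iris DataFrame (made-up rows, real class names).
iris_sample = pd.DataFrame({
    "pet-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "pet-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

varieties = list(iris_sample["class"].unique())
colors = ["blue", "green", "red"]

axes = []
for i in range(len(varieties)):
    variety = varieties[i]
    # Keep only the rows whose class matches this variety.
    one_class = iris_sample[iris_sample["class"] == variety]
    ax = one_class.plot.scatter(
        x="pet-width", y="pet-length", color=colors[i], title=variety
    )
    axes.append(ax)
```

Because every plot chooses its own axis limits, the three plots can look alike even when the classes occupy very different ranges.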
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color
colors = ["blue", "green", "red"]
```
%% Cell type:code id: tags:
``` python
# copy/paste the code above, but this time make each plot a different color AND marker
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# Did you notice that it made 3 plots?!?! What's deceiving about this?
```
%% Cell type:code id: tags:
``` python
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:code id: tags:
``` python
# We have to be VERY careful not to crop out data.
# We'll talk about this next lecture.
```
%% Cell type:code id: tags:
``` python
# Better yet, we could combine these.
```
%% Cell type:markdown id: tags:
### We can draw several plots in one plot area, called an AxesSubplot, using the keyword ax
1. If an AxesSubplot is passed as ax, the plot is drawn in that subplot
2. If ax is None, a new AxesSubplot is created
3. The AxesSubplot that was used is returned
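
The three rules above can be sketched as a loop that feeds each returned plot area back in as `ax`. The sample rows and the names `iris_sample` and `varieties` are assumptions made for illustration; `plot_area` starts as `None`, so the first call creates the subplot and later calls reuse it.

```python
import pandas as pd

# A tiny stand-in for the iris DataFrame (made-up rows, real class names).
iris_sample = pd.DataFrame({
    "pet-length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "pet-width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

varieties = list(iris_sample["class"].unique())
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]

plot_area = None  # first pass: ax=None, so a new subplot is created
for i in range(len(varieties)):
    variety = varieties[i]
    one_class = iris_sample[iris_sample["class"] == variety]
    # The subplot that was used is returned and passed back in next time.
    plot_area = one_class.plot.scatter(
        x="pet-width", y="pet-length",
        color=colors[i], marker=markers[i],
        label=variety, ax=plot_area,
    )
```

Since all three classes share one set of axes, their positions can be compared directly.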
%% Cell type:code id: tags:
``` python
# complete this code to make 3 plots in one
plot_area = None # don't change this...look at this variable in line 12
colors = ["blue", "green", "red"]
markers = ["o", "^", "v"]
```
%% Cell type:markdown id: tags:
### Time-Permitting
Plot this data in an interesting/meaningful way & identify any correlations.
%% Cell type:code id: tags:
``` python
students = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
],
"height": [
68,
66,
60,
72
]
})
students
```
%% Cell type:code id: tags:
``` python
# Min, Max, and Overall Difference in Student Height
min_height = students["height"].min()
max_height = students["height"].max()
diff_height = max_height - min_height
# Normalize students' heights to the range [0, 1] (black to white)
height_colors = (students["height"] - min_height) / diff_height
# Rescale to the range [0, 0.5] (black to gray)
height_colors = height_colors / 2
# Color must be a string (e.g. c='0.34')
height_colors = height_colors.astype("string")
height_colors
```
%% Cell type:code id: tags:
``` python
# Plot!
```
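%% Cell type:markdown id: tags:
One possible way to plot the students data, offered as a sketch rather than the intended answer: attendance on the x-axis, GPA on the y-axis, and each point shaded by height using the grayscale strings computed above. The choice of axes, the marker size, and the variable names `gray_levels` and `attendance_gpa_corr` are assumptions.

```python
import pandas as pd

# Copy of the students data from the cell above (name and grade omitted).
students = pd.DataFrame({
    "gpa": [2.0, 3.5, 3.0, 2.5],
    "attendance": [4, 11, 10, 6],
    "height": [68, 66, 60, 72],
})

# Scale heights to [0, 0.5] and convert to strings; matplotlib reads a
# string such as "0.25" as a shade of gray (0 is black, 1 is white).
min_height = students["height"].min()
diff_height = students["height"].max() - min_height
gray_levels = [str(v) for v in (students["height"] - min_height) / diff_height / 2]

# One color string per point, so taller students appear lighter.
ax = students.plot.scatter(x="attendance", y="gpa", c=gray_levels, s=100)
ax.set_title("GPA by attendance (shade = height)")

# Correlation between the two plotted columns.
attendance_gpa_corr = students["attendance"].corr(students["gpa"])
print(attendance_gpa_corr)
```

With only four made-up students, any correlation found here is an illustration of the method, not evidence of a real relationship.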
%% Cell type:code id: tags:
``` python
# What are the correlations?
```
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
%% Cell type:markdown id: tags:
https://www.researchgate.net/publication/247907373_Stupid_Data_Miner_Tricks_Overfitting_the_SP_500