Commit ad343639 authored by msyamkumar's avatar msyamkumar

Removing checkpoints

parent b96aa05d
%% Cell type:code id: tags:
``` python
# this allows the full screen to be used
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Cell type:code id: tags:
``` python
# import statements
import pandas as pd
from pandas import DataFrame, Series
import sqlite3
import os
```
%% Cell type:markdown id: tags:
#### Warmup 1: Create a database called student_grades.db with a single table called grades
%% Cell type:code id: tags:
``` python
# establish connection to a new database
grades_conn = sqlite3.connect("student_grades.db")
# Q: When outer data structure is a dictionary, are inner data structures
# rows or columns of the DataFrame table?
# A:
df = pd.DataFrame({
"name": [
"Cole",
"Cynthia",
"Alice",
"Seth"
],
"grade": [
"C",
"AB",
"B",
"BC"
],
"gpa": [
2.0,
3.5,
3.0,
2.5
],
"attendance": [
4,
11,
10,
6
]
})
# write the DataFrame to a table called "grades" in the SQL database
df.to_sql("grades", con = grades_conn, if_exists = "replace", index = False)
```
%% Cell type:markdown id: tags:
#### Warmup 2: What are the columns of our table? What are their datatypes?
%% Cell type:code id: tags:
``` python
df = pd.read_sql("???", grades_conn)
print(df['sql'].iloc[0])
# name is TEXT (in Python, str)
# grade is TEXT (in Python, str)
# gpa is REAL (in Python, float)
# attendance is INTEGER (in Python, int)
```
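%% Cell type:markdown id: tags:
One possible way to look up a table's columns and datatypes is to query `sqlite_master`, as the bus database cells later in this notebook do. The sketch below is only an illustration: it uses an in-memory database and a shortened copy of the grades data (both assumptions), so it does not touch `student_grades.db`.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# assumption: an in-memory database stands in for student_grades.db
demo_conn = sqlite3.connect(":memory:")
demo = pd.DataFrame({
    "name": ["Cole", "Cynthia"],
    "gpa": [2.0, 3.5],
    "attendance": [4, 11],
})
demo.to_sql("grades", con=demo_conn, if_exists="replace", index=False)

# sqlite_master holds one row per table; its 'sql' column stores the CREATE TABLE text
schema = pd.read_sql("SELECT * FROM sqlite_master WHERE type = 'table'", demo_conn)
create_stmt = schema["sql"].iloc[0]
print(create_stmt)
demo_conn.close()
```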
%% Cell type:markdown id: tags:
#### Warmup 4: What is the data in "grades" table?
Save this to a variable called "student_grades" and display it.
%% Cell type:code id: tags:
``` python
student_grades = pd.read_sql("???", grades_conn)
student_grades
```
%% Cell type:markdown id: tags:
#### Warmup 5: Make a scatter plot where the attendance of a student is on the x-axis and their gpa on the y-axis
Preview of an upcoming topic
%% Cell type:code id: tags:
``` python
student_grades.plot.scatter(x = "attendance", y = "gpa")
```
%% Cell type:markdown id: tags:
#### Warmup 6: What is the correlation between gpa and attendance?
%% Cell type:code id: tags:
``` python
```
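%% Cell type:markdown id: tags:
A minimal sketch of one way to compute the correlation, assuming the same four students as in Warmup 1. `DataFrame.corr()` gives the full correlation matrix, while `Series.corr(other)` gives a single number; the variable names here are illustrative only.
%% Cell type:code id: tags:
```python
import pandas as pd

# same gpa / attendance values as the warmup table
grades_demo = pd.DataFrame({
    "gpa": [2.0, 3.5, 3.0, 2.5],
    "attendance": [4, 11, 10, 6],
})

# full matrix of pairwise (Pearson) correlations
corr_matrix = grades_demo.corr()

# a single correlation value between two columns
r = grades_demo["attendance"].corr(grades_demo["gpa"])
print(corr_matrix)
print(r)
```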
%% Cell type:markdown id: tags:
#### Warmup 7: Close the connection.
%% Cell type:code id: tags:
``` python
```
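%% Cell type:markdown id: tags:
As a small illustration of what closing does, the sketch below closes a throwaway in-memory connection (an assumption, not the `grades_conn` above) and shows that further queries on it fail.
%% Cell type:code id: tags:
```python
import sqlite3

# assumption: a throwaway in-memory connection used only for this demonstration
tmp_conn = sqlite3.connect(":memory:")
tmp_conn.close()

# after close(), any attempt to use the connection raises ProgrammingError
try:
    tmp_conn.execute("SELECT 1")
    still_usable = True
except sqlite3.ProgrammingError:
    still_usable = False
print(still_usable)
```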
%% Cell type:markdown id: tags:
# Lecture 35: Bar Plots
Learning Objectives:
- Make a bar plot from a Pandas Series
- Add features: x-label, y-label, title, gridlines, color to plot
- Set the index of a DataFrame to a certain column
- Create an 'other' column in a DataFrame
%% Cell type:code id: tags:
``` python
# Without this, Jupyter Notebook cannot display the "first" plot in older versions
# of Python / matplotlib / Jupyter
%matplotlib inline
```
%% Cell type:markdown id: tags:
### Helpful documentation and an overview of how matplotlib works
https://matplotlib.org/stable/tutorials/introductory/usage.html
%% Cell type:markdown id: tags:
***Just for today's lecture, let's place import statements in the middle of the notebook rather than at the top. You should never do this when you write project code.***
%% Cell type:code id: tags:
``` python
# matplotlib is a plotting module whose interface resembles MATLAB's plotting
import matplotlib
# matplotlib is highly configurable; rcParams acts like a style sheet for plots
# rc stands for runtime config; the syntax is like that of a dictionary
matplotlib.rcParams # show all parameters
#matplotlib.rcParams["???"] # show current font size setting
#matplotlib.rcParams["???"] = ??? # change current font size setting
```
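%% Cell type:markdown id: tags:
A short sketch of reading and changing one setting through `rcParams`. The key `"font.size"` is a real matplotlib setting; the value 16 is just an arbitrary example.
%% Cell type:code id: tags:
```python
import matplotlib

# rcParams behaves like a dictionary keyed by setting name
default_size = matplotlib.rcParams["font.size"]  # read the current font size
matplotlib.rcParams["font.size"] = 16            # 16 is an arbitrary example value
new_size = matplotlib.rcParams["font.size"]
print(default_size, new_size)
```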
%% Cell type:markdown id: tags:
## Plots from pandas Series
- matplotlib integrates with pandas, just like sqlite3 integrates with pandas
- Syntax: ```Series.plot.<PLOT_FUNCTION>(...)```
## Bar plots: From a Series
- Series indices are the x-axis labels
- Series values are the height of each bar
https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.bar.html
%% Cell type:code id: tags:
``` python
s = Series({"Police": 5000000, "Fire": 3000000, "Schools": 2000000})
# What are the two terms associated with a pandas Series?
# A:
# make a bar plot...notice the type
```
%% Cell type:code id: tags:
``` python
# if we store the returned object in a variable, we can configure the AxesSubplot
# typically the variable name used is 'ax'
```
%% Cell type:markdown id: tags:
### How can we set the x-axis, y-axis labels, and title?
- plotting functions return an object called an AxesSubplot
- store into a variable and use the AxesSubplot object
- Syntax:
```
ax.set_ylabel("...")
ax.set_xlabel("...")
ax.set_title("...")
```
%% Cell type:code id: tags:
``` python
# What is this 1e6? Can we get rid of it?
# Instead of 1e6, divide all values in s by 1000000 (1 million)
# better plot:
# set the y label to "Dollars (Millions)"
# set the x label to "City Agency"
# this is self-explanatory, so we will skip this for other example plots
# set the title to "Annual City Spending"
```
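%% Cell type:markdown id: tags:
One way the steps above might come together, offered as a sketch rather than the official solution: divide the Series by one million so the axis no longer shows `1e6`, keep the returned axes object in `ax`, and use it to set both labels and the title.
%% Cell type:code id: tags:
```python
import pandas as pd

s = pd.Series({"Police": 5000000, "Fire": 3000000, "Schools": 2000000})

# dividing by 1 million removes the 1e6 multiplier from the y-axis
ax = (s / 1000000).plot.bar()

# the returned axes object lets us label the plot
ax.set_ylabel("Dollars (Millions)")
ax.set_xlabel("City Agency")
ax.set_title("Annual City Spending")
```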
%% Cell type:markdown id: tags:
### How can we rotate the x-axis labels so that they are more readable?
%% Cell type:code id: tags:
``` python
s
```
%% Cell type:markdown id: tags:
Which aspect of the Series is the x-axis label coming from?
%% Cell type:code id: tags:
``` python
# Answer:
```
%% Cell type:markdown id: tags:
How can we extract the indices from a Series?
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Now let's use that to set x-axis tick label formatting.
- Syntax:
```
ax.set_xticklabels(<list of x-axis labels>, rotation = ???)
```
%% Cell type:code id: tags:
``` python
ax = (s / 1000000).plot.bar()
ax.set_ylabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
# give the x ticklabels a rotation of 45 degrees
```
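%% Cell type:markdown id: tags:
A possible completion of the rotation step, as a sketch: the tick labels come from the Series index, so `list(s.index)` can be passed to `set_xticklabels` together with `rotation = 45`.
%% Cell type:code id: tags:
```python
import pandas as pd

s = pd.Series({"Police": 5000000, "Fire": 3000000, "Schools": 2000000})

ax = (s / 1000000).plot.bar()
ax.set_ylabel("Dollars (Millions)")
ax.set_title("Annual City Spending")

# the x-axis tick labels are the Series index; rotate them by 45 degrees
ax.set_xticklabels(list(s.index), rotation=45)
tick_texts = [t.get_text() for t in ax.get_xticklabels()]
print(tick_texts)
```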
%% Cell type:markdown id: tags:
### How can we change the figure size?
- figsize keyword argument
- should be a tuple with two values: width and height (in inches)
%% Cell type:code id: tags:
``` python
ax = (s / 1000000).plot.bar(???)
ax.set_ylabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
```
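%% Cell type:markdown id: tags:
A sketch of the `figsize` keyword in use. The size `(3, 4)` is an arbitrary choice for illustration, not a value taken from the lecture.
%% Cell type:code id: tags:
```python
import pandas as pd

s = pd.Series({"Police": 5000000, "Fire": 3000000, "Schools": 2000000})

# figsize is (width, height) in inches; (3, 4) is an arbitrary example
ax = (s / 1000000).plot.bar(figsize=(3, 4))
ax.set_ylabel("Dollars (Millions)")
ax.set_title("Annual City Spending")

# the figure that owns the axes reports its size in inches
width, height = ax.get_figure().get_size_inches()
print(width, height)
```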
%% Cell type:markdown id: tags:
### How can we make the bars horizontal?
https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.barh.html
- switch figsize arguments
- change y-label to x-label
%% Cell type:code id: tags:
``` python
# paste the previous code cell here and modify
```
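%% Cell type:markdown id: tags:
A sketch of the horizontal version, following the two hints above: the width and height in `figsize` trade places (here the arbitrary `(4, 3)`), and the dollar label moves from the y-axis to the x-axis.
%% Cell type:code id: tags:
```python
import pandas as pd

s = pd.Series({"Police": 5000000, "Fire": 3000000, "Schools": 2000000})

# barh draws horizontal bars; the figure is now wider than it is tall
ax = (s / 1000000).plot.barh(figsize=(4, 3))

# the values now run along the x-axis, so the label goes there
ax.set_xlabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
```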
%% Cell type:markdown id: tags:
### Change bar color by using the argument color = ' '
- Syntax: ``` plot.bar(figsize = (width, height), color = ???)```
- 8 standard colors: r, g, b, c, m, y, k, w (for example: ```color = 'k'```, which is black)
- you can also spell out the color name as a string (for example: ```color = 'red'```)
- you can use a grey level between 0 and 1, written as a string (for example: ```color = '0.6'```)
- you can use a tuple (r, g, b) with each value between 0 and 1 (for example: ```color = (0, .3, .4)```)
%% Cell type:code id: tags:
``` python
# color as a single char
ax = (s / 1000000).plot.barh(figsize = (4, 1.5), color = ???) # black color
ax.set_xlabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
```
%% Cell type:code id: tags:
``` python
# color as a str
ax = (s / 1000000).plot.barh(figsize = (4, 1.5), color = ???) # red color
ax.set_xlabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
```
%% Cell type:code id: tags:
``` python
# color as tuple of (r, g, b)
ax = (s / 1000000).plot.barh(figsize = (4, 1.5), color = (.2, .5, 0))
ax.set_xlabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
```
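%% Cell type:markdown id: tags:
To see that the four ways of writing a color all end up as the same kind of value, the sketch below runs each form through `matplotlib.colors.to_rgb`, which converts a color specification into an (r, g, b) tuple. This helper is not used elsewhere in the lecture; it is included only for illustration.
%% Cell type:code id: tags:
```python
from matplotlib import colors

# each specification style is converted to an (r, g, b) tuple of floats
single_char = colors.to_rgb("k")            # one-letter code for black
named = colors.to_rgb("red")                # full color name
grey = colors.to_rgb("0.6")                 # grey level given as a string
rgb_tuple = colors.to_rgb((0.2, 0.5, 0))    # explicit (r, g, b) tuple

print(single_char, named, grey, rgb_tuple)
```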
%% Cell type:markdown id: tags:
### How can we mark gridlines?
- use ax.grid()
%% Cell type:code id: tags:
``` python
# copy the previous code and add grid lines
ax = (s / 1000000).plot.barh(figsize = (4, 1.5), color = 'b')
ax.set_xlabel("Dollars (Millions)")
ax.set_title("Annual City Spending")
```
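%% Cell type:markdown id: tags:
A sketch of the gridline step. Calling `ax.grid()` with no arguments toggles the grid, so this version passes `True` to switch it on explicitly; that choice is an assumption about what is wanted here.
%% Cell type:code id: tags:
```python
import pandas as pd

s = pd.Series({"Police": 5000000, "Fire": 3000000, "Schools": 2000000})

ax = (s / 1000000).plot.barh(figsize=(4, 1.5), color="b")
ax.set_xlabel("Dollars (Millions)")
ax.set_title("Annual City Spending")

# turn the gridlines on for both axes
ax.grid(True)
```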
%% Cell type:markdown id: tags:
## Examples with the Bus Route Database
%% Cell type:code id: tags:
``` python
path = "bus.db"
# assert existence of path
# establish connection to bus.db
```
%% Cell type:markdown id: tags:
### Find the tables in `bus.db`
%% Cell type:code id: tags:
``` python
pd.read_sql("""
SELECT *
FROM sqlite_master
WHERE type = 'table'
""", conn)
```
%% Cell type:code id: tags:
``` python
pd.read_sql("""
SELECT *
FROM boarding
""", conn)
```
%% Cell type:markdown id: tags:
#### What are the top routes, and how many people ride them daily?
%% Cell type:code id: tags:
``` python
df = pd.read_sql("""
""", conn)
df
```
%% Cell type:markdown id: tags:
#### Let's take the daily column out as a Series ...
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Oops, too much data. Let's filter down to top 5 routes. How can we do that in SQL?
%% Cell type:code id: tags:
``` python
# TODO: add the appropriate SQL clause
df = pd.read_sql("""
SELECT Route, SUM(DailyBoardings) AS daily
FROM boarding
GROUP BY Route
ORDER BY daily DESC
""", conn)
df
```
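%% Cell type:markdown id: tags:
One way to keep only the top 5 routes is a `LIMIT` clause at the end of the query. Since `bus.db` is not available here, the sketch below uses an in-memory table with made-up numbers; only the column names `Route` and `DailyBoardings` come from the query above.
%% Cell type:code id: tags:
```python
import sqlite3
import pandas as pd

# assumption: made-up rows standing in for the real boarding table
toy_conn = sqlite3.connect(":memory:")
toy = pd.DataFrame({
    "Route": [1, 1, 2, 3, 4, 5, 6, 7],
    "DailyBoardings": [100, 50, 400, 30, 250, 20, 300, 10],
})
toy.to_sql("boarding", con=toy_conn, index=False)

# LIMIT 5 keeps only the first 5 rows after sorting
top_routes = pd.read_sql("""
    SELECT Route, SUM(DailyBoardings) AS daily
    FROM boarding
    GROUP BY Route
    ORDER BY daily DESC
    LIMIT 5
""", toy_conn)
toy_conn.close()
print(top_routes)
```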
%% Cell type:markdown id: tags:
Now, plot this!
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### Huh, what exactly is route 0? Where is that coming from?
Q: Can you guess where it is coming from?
A: It is coming from the DataFrame's row index!
%% Cell type:code id: tags:
``` python
df
```
%% Cell type:markdown id: tags:
#### Let's fix that: we can use df.set_index(...)
- set_index returns a brand new DataFrame object instance
%% Cell type:code id: tags:
``` python
df
```
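%% Cell type:markdown id: tags:
A small sketch of `set_index` on a made-up DataFrame (the route numbers and totals are invented). Because `set_index` returns a new DataFrame, the result has to be stored in a variable; the original stays unchanged.
%% Cell type:code id: tags:
```python
import pandas as pd

# assumption: invented route totals, just to show the mechanics
routes = pd.DataFrame({"Route": [2, 6, 4], "daily": [400, 300, 250]})

# set_index returns a new DataFrame whose row index is the Route column
indexed = routes.set_index("Route")

print(routes.columns.tolist())   # the original still has a Route column
print(indexed.index.tolist())    # the new one uses Route as its index
```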
%% Cell type:markdown id: tags:
And now plot this...
%% Cell type:code id: tags:
``` python
s = df["daily"]
s.plot.bar()
```
%% Cell type:markdown id: tags:
### Wouldn't it be nice to have an "other" bar to represent other routes?
- we now have to remove the LIMIT clause, so that the query returns every route
- we then have to combine the remaining routes into one "other" value using pandas
%% Cell type:code id: tags:
``` python
df = pd.read_sql("""
SELECT Route, SUM(DailyBoardings) AS daily
FROM boarding
GROUP BY Route
ORDER BY daily DESC
""", conn)
df = df.set_index("Route")
s = df["daily"]
df.head()
```
%% Cell type:markdown id: tags:
#### We are back to plotting all route bars ...
%% Cell type:code id: tags:
``` python
s.plot.bar()
```
%% Cell type:markdown id: tags:
### How can we slice a pandas dataframe?
- Recall that .iloc allows us to do slicing.
- To reproduce the previous 5-route plot, we just need to take the first 5 routes and store them in a Series s.
- For the "other" part, we want all the rows after the first 5 summed together.
%% Cell type:code id: tags:
``` python
```
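%% Cell type:markdown id: tags:
A sketch of the slicing idea on a made-up Series of route totals, already sorted in descending order: `.iloc[:5]` keeps the top 5 routes, and the sum of `.iloc[5:]` becomes a new entry labelled `"other"`. The numbers and the name `top` are assumptions for illustration.
%% Cell type:code id: tags:
```python
import pandas as pd

# assumption: invented daily totals indexed by route number, sorted descending
all_routes = pd.Series(
    [400, 300, 250, 150, 30, 20, 10],
    index=[2, 6, 4, 1, 3, 5, 7],
    name="daily",
)

# first 5 rows by position; copy so the original Series is left alone
top = all_routes.iloc[:5].copy()

# everything after the first 5 rows is summed into a single "other" entry
top["other"] = all_routes.iloc[5:].sum()
print(top)
```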
%% Cell type:code id: tags:
``` python
s.plot.bar()
# Q: Where did the xlabel come from?
# A:
```
%% Cell type:markdown id: tags:
Let's fix the plot aesthetics.
%% Cell type:code id: tags:
``` python
```