Commit a4737065 authored by msyamkumar

Lec 32 materials

parent 38593c46
%% Cell type:code id: tags:
``` python
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Output
%% Cell type:code id: tags:
``` python
import csv
import os
```
%% Cell type:code id: tags:
``` python
# copied from https://automatetheboringstuff.com/2e/chapter16/
def process_csv(filename):
    exampleFile = open(filename)
    exampleReader = csv.reader(exampleFile)
    exampleData = list(exampleReader)
    return exampleData
```
%% Cell type:markdown id: tags:
## Example 1: List Visualization
### Write a gen_html function
- Input: shopping_list and path to shopping.html
- Outcome: create shopping.html file
### Pseudocode
1. Open "shopping.html" in write mode.
2. Write \<ul\> tag into the html file
3. Iterate over each item in shopping list.
4. Write each item with \<li\> tag.
5. After you are done iterating, write \</ul\> tag.
6. Close the file object.
%% Cell type:code id: tags:
``` python
def gen_html(shopping_list, html_path):
    f = open(html_path, "w")
    f.write("<ul>\n")
    for item in shopping_list:
        f.write("<li>" + str(item) + "\n")
    f.write("</ul>\n")
    f.close()

gen_html(["apples", "oranges", "milk", "banana"], "shopping.html")
```
%% Cell type:markdown id: tags:
## Example 2: Dictionary Visualization
### Write a csv_to_html function
- Input: path to review1.csv and path to reviews.html
- Outcome 1: create an HTML file for each review
- Outcome 2: create a reviews.html file containing a link to the HTML file for each review
### Pseudocode
1. Create data_html folder using os.mkdir. Make sure to use try ... except blocks to catch FileExistsError
2. Use process_csv function to read csv data and split the header and the data
3. For each review, extract review id, review title, review text.
4. Generate the \<rid\>.html for each review inside data_html folder.
- Open \<rid\>.html in write mode
- Add review title using \<h1\> tag
- Add review text inside \<p\> tag
- Close \<rid\>.html file object
5. Generate a reviews.html file that links to each review HTML page \<rid\>.html
- Open reviews.html file in write mode
- Add each \<rid\>.html as hyperlink using \<a\> tag.
- Close reviews.html file
%% Cell type:code id: tags:
``` python
def csv_to_html(csv_path, html_path):
    try:
        os.mkdir("data_html")
    except FileExistsError:
        pass

    reviews_data = process_csv(csv_path)
    reviews_header = reviews_data[0]
    reviews_data = reviews_data[1:]

    reviews_file = open(html_path, "w")
    reviews_file.write("<ul>\n")

    for row in reviews_data:
        rid = row[reviews_header.index("review id")]
        title = row[reviews_header.index("review title")]
        text = row[reviews_header.index("review text")]

        # STEP 4: generate the <rid>.html for each review inside data_html folder
        review_path = os.path.join("data_html", str(rid) + ".html")
        html_file = open(review_path, "w")
        html_file.write("<h1>{}</h1><p>{}</p>".format(title, text))
        html_file.close()

        # STEP 5: generate a reviews.html file which has link to each review html page <rid>.html
        reviews_file.write('<li><a href = "{}">{}</a>'.format(review_path, str(rid) + ":" + str(title)) + "<br>\n")

    reviews_file.write("</ul>\n")
    reviews_file.close()

csv_to_html(os.path.join("data", "review1.csv"), "reviews.html")
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
```
%% Cell type:code id: tags:
``` python
import csv
import os
```
%% Cell type:markdown id: tags:
## Example 1: List Visualization
### Write a gen_html function
- Input: shopping_list and path to shopping.html
- Outcome: create shopping.html file
### Pseudocode
1. Open "shopping.html" in write mode.
2. Write \<ul\> tag into the html file
3. Iterate over each item in shopping list.
4. Write each item with \<li\> tag.
5. After you are done iterating, write \</ul\> tag.
6. Close the file object.
%% Cell type:code id: tags:
``` python
```
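One possible way to fill in the cell above is sketched here. It follows the six pseudocode steps, with two additions that are assumptions of this sketch rather than part of the exercise: a `with` block replaces the explicit close, and `html.escape` is applied to each item. The name `gen_html_sketch` and the sample list are hypothetical.

```python
import html
import os
import tempfile

def gen_html_sketch(shopping_list, html_path):
    # The with-statement closes the file even if one of the writes fails.
    with open(html_path, "w") as f:
        f.write("<ul>\n")
        for item in shopping_list:
            # html.escape keeps characters such as & and < from being read as markup.
            f.write("<li>" + html.escape(str(item)) + "</li>\n")
        f.write("</ul>\n")

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "shopping.html")
    gen_html_sketch(["apples", "milk & honey"], demo_path)
    with open(demo_path) as f:
        demo_output = f.read()

print(demo_output)
```

Closing each `<li>` is optional in HTML, but it keeps the generated file easier to read and to parse later.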
%% Cell type:markdown id: tags:
## Example 2: Dictionary Visualization
### Write a csv_to_html function
- Input: path to review1.csv and path to reviews.html
- Outcome 1: create an HTML file for each review
- Outcome 2: create a reviews.html file containing a link to the HTML file for each review
### Pseudocode
1. Create data_html folder using os.mkdir. Make sure to use try ... except blocks to catch FileExistsError
2. Use process_csv function to read csv data and split the header and the data
3. For each review, extract review id, review title, review text.
4. Generate the \<rid\>.html for each review inside data_html folder.
- Open \<rid\>.html in write mode
- Add review title using \<h1\> tag
- Add review text inside \<p\> tag
- Close \<rid\>.html file object
5. Generate a reviews.html file that links to each review HTML page \<rid\>.html
- Open reviews.html file in write mode
- Add each \<rid\>.html as hyperlink using \<a\> tag.
- Close reviews.html file
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
# known import statements
from bs4 import BeautifulSoup
import os
import pandas as pd
# let's import sqlite3 module
```
%% Cell type:markdown id: tags:
### Warmup 1: Explore this HTML table of volunteer hours
%% Cell type:markdown id: tags:
<table>
<tr>
<th>Name</th>
<th>Week 1</th>
<th>Week 2</th>
<th>Week 3</th>
</tr>
<tr>
<td>Therese</td>
<td>13</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>Carl</td>
<td>5</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>Marie</td>
<td>2</td>
<td>9</td>
<td>11</td>
</tr>
</table>
%% Cell type:markdown id: tags:
### Warmup 2a: Parse "hours.html" using BeautifulSoup
#### Step 1: Read contents from "hours.html" file
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### Step 2: Create a BeautifulSoup object instance
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### Step 3: Parse the table
%% Cell type:code id: tags:
``` python
# Use find method to find the table
# works only if there is 1 table
# Q: what method do you need if the HTML has more than 1 table?
# A:
```
%% Cell type:markdown id: tags:
#### Step 4: Parse the header
- Bonus: Use list comprehension
%% Cell type:code id: tags:
``` python
```
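Steps 2 through 4 might look roughly like the sketch below. To stay self-contained it parses an inline string (a shortened copy of the warmup table) instead of reading "hours.html", and it assumes the built-in "html.parser" backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the volunteer-hours table, inlined so no file is needed.
hours_html = """
<table>
  <tr><th>Name</th><th>Week 1</th><th>Week 2</th><th>Week 3</th></tr>
  <tr><td>Therese</td><td>13</td><td>4</td><td>5</td></tr>
</table>
"""

# Step 2: create the BeautifulSoup object.
soup = BeautifulSoup(hours_html, "html.parser")

# Step 3: find returns the first matching element (here, the only table).
table = soup.find("table")

# Step 4: a list comprehension collects the text of every th element.
header = [th.get_text() for th in table.find_all("th")]
print(header)
```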
%% Cell type:markdown id: tags:
#### Step 5: Parse the data rows and store data into a list of dict
- Remember that you need to skip over the first tr (which contains the header)
%% Cell type:code id: tags:
``` python
# Find all tr elements
tr_elements = ???
# Skip first tr row (header row)
tr_elements = ???
# Initialize empty list
work_hours = []
# Iterate through the tr elements
# Find all "td" elements in this row
# Create row dictionary
row_dict = {} # Key: column name (header); Value: cell's value
# Iterate over indices of td elements
# Assumes that td_elements and header have same length
# Extract the td text
# Make appropriate type conversions
# Use header instead of hardcoding index
# Insert key-value pairs
# Append row dictionary into list
work_hours
```
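The commented outline above could be completed along these lines. This is a sketch under a few assumptions: the table is parsed from an inline string rather than from "hours.html", every column except "Name" is treated as an integer, and the result is stored in a hypothetical `work_hours_sketch` list so it does not clash with the exercise variable.

```python
from bs4 import BeautifulSoup

# Inline copy of the warmup table so the sketch runs without hours.html.
hours_html = """
<table>
  <tr><th>Name</th><th>Week 1</th><th>Week 2</th><th>Week 3</th></tr>
  <tr><td>Therese</td><td>13</td><td>4</td><td>5</td></tr>
  <tr><td>Carl</td><td>5</td><td>7</td><td>8</td></tr>
  <tr><td>Marie</td><td>2</td><td>9</td><td>11</td></tr>
</table>
"""

table = BeautifulSoup(hours_html, "html.parser").find("table")
header = [th.get_text() for th in table.find_all("th")]

work_hours_sketch = []
# Slice off the first tr, which holds the header cells.
for tr in table.find_all("tr")[1:]:
    td_elements = tr.find_all("td")
    row_dict = {}  # Key: column name (header); Value: cell's value
    # Assumes td_elements and header have the same length.
    for idx in range(len(td_elements)):
        cell_text = td_elements[idx].get_text()
        if header[idx] == "Name":
            row_dict[header[idx]] = cell_text       # names stay strings
        else:
            row_dict[header[idx]] = int(cell_text)  # weekly hours become ints
    work_hours_sketch.append(row_dict)

print(work_hours_sketch)
```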
%% Cell type:markdown id: tags:
### Warmup 3: Use the appropriate os module function to assert that bus.db is in this directory
%% Cell type:code id: tags:
``` python
```
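A minimal sketch of this check uses `os.path.exists`. So that it runs anywhere, it works on a placeholder file in a temporary directory rather than on the real bus.db; in the notebook the path would simply be "bus.db".

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "bus.db")

    # Before the file is created, os.path.exists reports False.
    exists_before = os.path.exists(db_path)

    # Create an empty placeholder file standing in for the database.
    open(db_path, "w").close()
    exists_after = os.path.exists(db_path)

    # The assert passes silently when the file is present.
    assert os.path.exists(db_path)

print(exists_before, exists_after)
```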
%% Cell type:markdown id: tags:
## April 20: Database 1
### Learning Objectives:
- Explain how a database is different from a CSV file or a JSON file
- Use SQLite to connect to a database and pandas to query the database
- Write basic queries on a database using SELECT, FROM, WHERE, ORDER BY, and LIMIT
We will get started with slides.
%% Cell type:code id: tags:
``` python
# Get the Bus data from 'bus.db'
db_name = "bus.db"
assert os.path.exists(db_name)
# Why do we have to assert that database exists?
# If the database file does not exist, connect function creates a brand new one!
# open a connection object to our database file
# Important note: we need to close 'conn' when we are done, at the end of the notebook file
type(conn)
```
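The reason for the assert can be seen in a small experiment. The sketch below calls `sqlite3.connect` on a path that does not exist yet (a hypothetical "typo_bus.db" inside a temporary directory) and shows that no error is raised: the result is a brand new database with no tables.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    missing_db = os.path.join(tmp_dir, "typo_bus.db")  # hypothetical misspelled name
    existed_before = os.path.exists(missing_db)

    # connect does not complain about the missing file.
    conn_demo = sqlite3.connect(missing_db)
    tables = conn_demo.execute("SELECT name FROM sqlite_master").fetchall()
    existed_after = os.path.exists(missing_db)

    # Close before the temporary directory is cleaned up.
    conn_demo.close()

print(existed_before, existed_after, tables)
```

In the notebook itself the connection would presumably be opened with `conn = sqlite3.connect(db_name)` after the assert has passed.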
%% Cell type:markdown id: tags:
### Pandas has a .read_sql function `pd.read_sql(query, connection)`
- Allows us to run a SQL `query` on a SQL `connection`
- Stores the result in a pandas DataFrame
- First SQL query to always run on a database:
```
select * from sqlite_master
```
%% Cell type:code id: tags:
``` python
# This SQL query helps us know the table names, we don't use the other info
# Key observation: there are two tables: boarding and routes
```
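To illustrate what this first query returns without needing bus.db, the sketch below builds an in-memory database with two empty tables named like the ones in the lecture. The column definitions are assumptions made for this example and may not match the real database.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for bus.db; the column lists are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE boarding (StopID INTEGER, Route INTEGER, DailyBoardings REAL)")
conn_demo.execute("CREATE TABLE routes (Route INTEGER, Name TEXT)")

# sqlite_master holds one row per table (and per index, if any exist).
master_df = pd.read_sql("SELECT * FROM sqlite_master", conn_demo)
master_columns = list(master_df.columns)
table_names = list(master_df["name"])
conn_demo.close()

print(master_columns)
print(table_names)
```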
%% Cell type:markdown id: tags:
### Databases are more structured than CSV and JSON files:
- all data contained inside one or more tables
- all tables must be named, all columns must be named
- all values in a column must be the same type
%% Cell type:markdown id: tags:
### Extract the "sql" column from df
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
# The SQL queries in sql column of the returned DataFrame show
# how database was set up (not part of CS220).
# Let's focus on the table names and column names
# Key observation: SQL has its own types (pandas takes care of the type conversions).
# Most database systems enforce these types strictly; SQLite is more lenient and uses type affinity.
```
%% Cell type:markdown id: tags:
### Most basic SQL query
```
SELECT <Column(s)>
FROM <Table name>
```
- `SELECT` and `FROM` are mandatory clauses in a SQL query
- Can use * to mean "all columns"
%% Cell type:code id: tags:
``` python
# pandas continues to be an awesome tool
# pandas allows us to write a SQL query and create a DataFrame
```
%% Cell type:code id: tags:
``` python
# TODO: Now write a SQL query for displaying all columns from boarding table
```
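As a hedged illustration of the basic `SELECT ... FROM` form, the sketch below runs two queries against a tiny in-memory table. The table is named boarding to match the lecture, but its columns and rows are invented for this example.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the boarding table.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE boarding (StopID INTEGER, Route INTEGER, Lat REAL, Lon REAL, DailyBoardings REAL)"
)
conn_demo.executemany(
    "INSERT INTO boarding VALUES (?, ?, ?, ?, ?)",
    [
        (1, 80, 43.07, -89.40, 120.5),
        (2, 80, 43.08, -89.41, 30.0),
        (3, 3, 43.06, -89.52, 75.25),
    ],
)

# * selects every column.
all_cols = pd.read_sql("SELECT * FROM boarding", conn_demo)

# Naming columns selects only those, in the order listed.
two_cols = pd.read_sql("SELECT Route, DailyBoardings FROM boarding", conn_demo)
conn_demo.close()

print(all_cols)
print(two_cols)
```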
%% Cell type:markdown id: tags:
### Optional SQL clauses
- WHERE: filters rows based on a column condition
- ORDER BY: sorts rows (`ASC` or `DESC` after the column name specifies the order; `ASC` is the default)
- LIMIT: simplistic filter (similar to slicing / head/tail functions in pandas DataFrames)
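
The three optional clauses can be combined in one query, always in the order WHERE, ORDER BY, LIMIT. The sketch below shows this on a small in-memory table; the column names and values are made up for illustration and are not taken from bus.db.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the boarding table.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE boarding (StopID INTEGER, Route INTEGER, DailyBoardings REAL)")
conn_demo.executemany(
    "INSERT INTO boarding VALUES (?, ?, ?)",
    [(1, 80, 120.5), (2, 80, 30.0), (3, 80, 64.0), (4, 3, 75.25)],
)

query = """
SELECT *
FROM boarding
WHERE Route = 80
ORDER BY DailyBoardings DESC
LIMIT 2
"""
# WHERE keeps route 80 rows, ORDER BY sorts them, LIMIT keeps the first two.
top_two = pd.read_sql(query, conn_demo)
conn_demo.close()

print(top_two)
```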
%% Cell type:markdown id: tags:
![Screen%20Shot%202021-11-23%20at%201.43.54%20PM.png](attachment:Screen%20Shot%202021-11-23%20at%201.43.54%20PM.png)
%% Cell type:markdown id: tags:
### What are all the details of route 80 bus stops?
%% Cell type:code id: tags:
``` python
query = """
"""
pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
#### Sort the route 80 rows based on ascending order of DailyBoardings column.
%% Cell type:code id: tags:
``` python
query = """
"""
pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
#### Sort the route 80 rows based on descending order of DailyBoardings column.
%% Cell type:code id: tags:
``` python
query = """
"""
pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
### Which 10 bus stops have the lowest DailyBoardings, and on which bus routes?
%% Cell type:code id: tags:
``` python
query = """
"""
pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
### What are the top 3 stops (based on DailyBoardings) of route 3?
%% Cell type:code id: tags:
``` python
query = """
"""
pd.read_sql(query, conn)
```
%% Cell type:markdown id: tags:
### Go West - which bus should I take to go as far west as possible?
- Smallest Longitude
%% Cell type:code id: tags:
``` python
qry = """
"""
pd.read_sql(qry, conn)
```
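One way to approach the "go west" question is to sort by longitude in ascending order and keep a single row, since longitudes west of the prime meridian are negative and the westernmost stop has the smallest value. The sketch below assumes hypothetical `Lat` and `Lon` column names and invented coordinates.

```python
import sqlite3
import pandas as pd

# Hypothetical rows; the real column names in bus.db may differ.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE boarding (StopID INTEGER, Route INTEGER, Lat REAL, Lon REAL)")
conn_demo.executemany(
    "INSERT INTO boarding VALUES (?, ?, ?, ?)",
    [(1, 80, 43.07, -89.40), (2, 3, 43.06, -89.52), (3, 6, 43.10, -89.31)],
)

qry = """
SELECT Route, Lat, Lon
FROM boarding
ORDER BY Lon ASC
LIMIT 1
"""
west_df = pd.read_sql(qry, conn_demo)
conn_demo.close()

# Build a (latitude, longitude) tuple that can be pasted into a map search.
lat_long = (float(west_df.loc[0, "Lat"]), float(west_df.loc[0, "Lon"]))
print(west_df)
print(lat_long)
```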
%% Cell type:code id: tags:
``` python
# TODO: make a tuple out of this lat-long and enter that tuple into Google Maps
# TODO: Where is this location?
```
%% Cell type:markdown id: tags:
### How many people get on a bus in Madison every day?
- we are interested in boarding table to answer this question
%% Cell type:code id: tags:
``` python
#Answer using pandas
qry = """
"""
df = pd.read_sql(qry, conn)
```
%% Cell type:code id: tags:
``` python
# Next lecture, we'll learn all about SQL summarization
#Using SQL summarization
qry = """
"""
pd.read_sql(qry, conn)
```
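The two approaches in the cells above can be compared side by side on toy data. In this sketch the total is computed once by summing a DataFrame column in pandas and once with the SQL `SUM` aggregate; the table contents and the `total` alias are assumptions made for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical miniature of the boarding table.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE boarding (StopID INTEGER, Route INTEGER, DailyBoardings REAL)")
conn_demo.executemany(
    "INSERT INTO boarding VALUES (?, ?, ?)",
    [(1, 80, 120.5), (2, 80, 30.0), (3, 3, 75.25)],
)

# Approach 1: fetch the column, then let pandas add it up.
df = pd.read_sql("SELECT DailyBoardings FROM boarding", conn_demo)
total_pandas = float(df["DailyBoardings"].sum())

# Approach 2: let the database compute the sum and return a single row.
sum_df = pd.read_sql("SELECT SUM(DailyBoardings) AS total FROM boarding", conn_demo)
total_sql = float(sum_df.loc[0, "total"])
conn_demo.close()

print(total_pandas, total_sql)
```

With the second approach only one number travels from the database to Python, which matters more as tables grow.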
%% Cell type:markdown id: tags:
![Screen%20Shot%202021-11-23%20at%201.47.20%20PM.png](attachment:Screen%20Shot%202021-11-23%20at%201.47.20%20PM.png)
%% Cell type:code id: tags:
``` python
# don't forget to close your connection!
```
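Closing is a single call to the connection's `close` method. The sketch below uses a throwaway in-memory connection (named `conn_demo` here, not the notebook's `conn`) to show that queries on a closed connection raise `sqlite3.ProgrammingError`.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
first_result = conn_demo.execute("SELECT 1").fetchone()

# Release the connection once all queries are done.
conn_demo.close()

# Any further use of the closed connection raises an error.
try:
    conn_demo.execute("SELECT 1")
    closed_error = None
except sqlite3.ProgrammingError as e:
    closed_error = type(e).__name__

print(first_result, closed_error)
```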