%% Cell type:markdown id: tags:
# Web3: Scraping Web Data
%% Cell type:code id: tags:
``` python
# import statements
import requests
from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
```
%% Cell type:markdown id: tags:
### Warmup 1: HTML table and hyperlinks
In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
TODO: Add another row or two to the table below
%% Cell type:markdown id: tags:
<table>
<tr>
<th>University</th>
<th>Department</th>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr>
<tr>
<td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
%% Cell type:markdown id: tags:
### Warmup 2: Scraping data from syllabus page
%% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://cs220.cs.wisc.edu/f22/syllabus.html"
r = requests.get(url, verify=False)
# make sure there is no error
r.raise_for_status()
# read the entire contents of the page into a single string variable
html_str = r.text
# split the contents into a list of strings using the newline separator
html_lines = html_str.split("\n")
html_lines[:10]
```
%% Output
/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
['<!doctype html>',
'<html lang="en">',
' <head>',
' <meta charset="utf-8">',
' <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">',
' <meta name="description" content="">',
' <meta name="author" content="">',
'',
' <!-- Google Auth stuff -->',
' <meta name="google-signin-scope" content="profile email">']
%% Cell type:markdown id: tags:
#### Warmup 2: find all lines with 'Kuemmel'
%% Cell type:code id: tags:
``` python
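# one way to do it by hand: loop over html_lines (from the warmup above)
for line in html_lines:
    if "Kuemmel" in line:
        print(line.strip())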
# Takeaway: It would be nice if there were a module that could make finding easy!
```
%% Output
<li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>
%% Cell type:markdown id: tags:
### Learning Objectives:
- Using the Document Object Model of web pages
- describe the 3 things a DOM element may contain, and give examples of each
- given an HTML string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an HTML string and use prettify to display them
- Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
- Use BeautifulSoup to scrape a live website.
%% Cell type:markdown id: tags:
### Document Object Model
In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the page as a hierarchy of elements.
<div>
<img src="attachment:image.png" width="600"/>
</div>
%% Cell type:markdown id: tags:
### Take a look at the HTML in the cell below.
%% Cell type:markdown id: tags:
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
%% Cell type:markdown id: tags:
### BeautifulSoup constructor
- takes HTML, as a string, as its argument and parses it
- Syntax: `BeautifulSoup(<html_string>, "html.parser")`
- The second argument specifies what kind of parsing we want done
%% Cell type:code id: tags:
``` python
html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")
type(bs_obj)
```
%% Output
bs4.BeautifulSoup
%% Cell type:markdown id: tags:
## BeautifulSoup operations
- `prettify()` returns a formatted representation of the raw HTML
### A BeautifulSoup object can be searched for elements using:
- `find("")` returns the first element matching the tag string, None otherwise
- `find_all("")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
### Beautiful Soup Elements can be inspected by using:
- `text` returns the text associated with this element, if applicable; does not return the child elements associated with that element
- `.children` all children of this element (can be converted into a list)
- `.attrs` the atribute associated with that element / tag.
%% Cell type:markdown id: tags:
`prettify()` returns a formatted representation of the raw HTML
%% Cell type:code id: tags:
``` python
# bs_obj.prettify()
```
%% Output
'<b>\n To Do List\n</b>\n<ul>\n <li>\n Eat Healthy\n </li>\n <li>\n Sleep\n <b>\n More\n </b>\n </li>\n <li>\n Exercise\n </li>\n</ul>'
%% Cell type:markdown id: tags:
`find` returns the first HTML 'tag' matching the string "b"
%% Cell type:code id: tags:
``` python
# bs_obj.find("b")
```
%% Output
<b>To Do List</b>
%% Cell type:markdown id: tags:
What is the type of find's return value?
%% Cell type:code id: tags:
``` python
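# find returns a Tag object
type(bs_obj.find("b"))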
```
%% Output
bs4.element.Tag
%% Cell type:markdown id: tags:
How do we extract the text of the "b" element and what is its type?
%% Cell type:code id: tags:
``` python
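# .text gives just the element's text, as a regular str
bs_obj.find("b").text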
```
%% Output
'To Do List'
%% Cell type:markdown id: tags:
`find` returns None if it cannot find that element.
%% Cell type:code id: tags:
``` python
# assert that this html string has a <ul> tag
assert bs_obj.find("ul") ...
# assert that this does not have an <a> tag
assert bs_obj.find("a") ...
```
%% Cell type:markdown id: tags:
`find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
%% Cell type:code id: tags:
``` python
bold_elements = bs_obj.find_all("b")
bold_elements
```
%% Output
[<b>To Do List</b>, <b>More</b>]
%% Cell type:markdown id: tags:
What is the type of the return value of `find_all`?
%% Cell type:code id: tags:
``` python
type(bold_elements)
```
%% Output
bs4.element.ResultSet
%% Cell type:code id: tags:
``` python
type(bold_elements[0])
```
%% Output
bs4.element.Tag
%% Cell type:markdown id: tags:
Use a for loop to print the text of each "b" element.
%% Cell type:code id: tags:
``` python
for element in bold_elements:
    print(element.text)
```
%% Output
To Do List
More
%% Cell type:markdown id: tags:
Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
%% Cell type:code id: tags:
``` python
# find_all only searches for tags, not text, so searching for "Sleep" returns []
# print(bs_obj.find_all("Sleep"))
# find, on the other hand, returns None when the tag is not present
# print(bs_obj.find("Sleep"))
```
%% Output
[]
None
%% Cell type:markdown id: tags:
You can also invoke `find` and `find_all` on the element objects returned by previous searches.
Find all `li` elements, then find the `b` element inside the second `li` element.
%% Cell type:code id: tags:
``` python
li_elements = bs_obj.find_all("li")
li_elements
```
%% Output
[<li>Eat Healthy</li>, <li>Sleep <b>More</b></li>, <li>Exercise</li>]
%% Cell type:code id: tags:
``` python
li_elements[1].find("b")
```
%% Output
<b>More</b>
%% Cell type:code id: tags:
``` python
li_elements[1].find("b").text
```
%% Output
'More'
%% Cell type:markdown id: tags:
### DOM trees are hierarchical. You can use `.children` on any element to get its children.
%% Cell type:markdown id: tags:
Find all the children of the "ul" element.
%% Cell type:code id: tags:
``` python
ul_element = bs_obj.find("ul")
list(ul_element.children)
```
%% Output
[<li>Eat Healthy</li>, <li>Sleep <b>More</b></li>, <li>Exercise</li>]
%% Cell type:markdown id: tags:
Find the text of every child element.
%% Cell type:code id: tags:
``` python
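# collect the text of each child of the <ul> element
[child.get_text() for child in ul_element.children]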
```
%% Output
['Eat Healthy', 'Sleep More', 'Exercise']
%% Cell type:markdown id: tags:
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`.
%% Cell type:markdown id: tags:
To understand attributes, let's go back to the table from Warmup 1.
New syntax: you can use `"""some really long string"""` (triple quotes) to split a string across multiple lines.
%% Cell type:code id: tags:
``` python
html_string = """
<table>
<tr>
<th>University</th>
<th>Department</th>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr>
<tr>
<td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
"""
```
%% Cell type:markdown id: tags:
Find the table headers.
%% Cell type:code id: tags:
``` python
bs_obj = BeautifulSoup(html_string, "html.parser")
th_elements = bs_obj.find_all("th")
th_elements
```
%% Output
[<th>University</th>, <th>Department</th>]
%% Cell type:markdown id: tags:
Find the first anchor element, extract its text.
%% Cell type:code id: tags:
``` python
anchor_element = bs_obj.find("a")
anchor_element.text
```
%% Output
'Computer Sciences'
%% Cell type:markdown id: tags:
You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.
Now, let's get the attributes of the anchor element.
%% Cell type:code id: tags:
``` python
anchor_element.attrs
```
%% Output
{'href': 'https://www.cs.wisc.edu/'}
%% Cell type:markdown id: tags:
What is the return value type of `.attrs`?
%% Cell type:code id: tags:
``` python
type(anchor_element.attrs)
```
%% Output
dict
%% Cell type:markdown id: tags:
Extract the hyperlink.
%% Cell type:code id: tags:
``` python
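# index into the attrs dict with the attribute name
anchor_element.attrs["href"]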
```
%% Output
'https://www.cs.wisc.edu/'
%% Cell type:markdown id: tags:
Extract the hyperlink for each department and populate a `dict` mapping each department name to its link.
%% Cell type:code id: tags:
``` python
department_urls = {} # Key: department name; Value: website URL
anchor_elements = bs_obj.find_all("a")
for element in anchor_elements:
    print(element.text, element.attrs["href"])
    department_urls[element.text] = element.attrs["href"]
department_urls
```
%% Output
Computer Sciences https://www.cs.wisc.edu/
Statistics https://stat.wisc.edu/
CDIS https://cdis.wisc.edu/
Electrical Engineering and Computer Sciences https://eecs.berkeley.edu/
{'Computer Sciences': 'https://www.cs.wisc.edu/',
'Statistics': 'https://stat.wisc.edu/',
'CDIS': 'https://cdis.wisc.edu/',
'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}
%% Cell type:markdown id: tags:
#### Self-practice: Find all anchor links on the CS 220 syllabus page that include piazza
%% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://cs220.cs.wisc.edu/f22/syllabus.html"
r = requests.get(url, verify=False)
# make sure there is no error
r.raise_for_status()
# read the entire contents of the page into a single string variable
html_data = r.text
# create a BeautifulSoup object
bs_obj = BeautifulSoup(html_data, "html.parser")
# find all anchor elements
anchor_elements = bs_obj.find_all("a")
# collect all URLs that link to piazza
[a.attrs["href"] for a in anchor_elements if "piazza" in a.attrs.get("href", "")]
```
%% Output
/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
['https://piazza.com/wisc/fall2022/cs220/home',
'https://piazza.com/wisc/fall2022/cs220/home']
%% Cell type:markdown id: tags:
### Scraping Tables
### Parsing the `small_movies.html` table to recreate the data in `small_movies.json`
%% Cell type:markdown id: tags:
### Step 1: Read `small_movies.html` content into a variable
%% Cell type:code id: tags:
``` python
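# one possible approach, assuming small_movies.html sits in the current directory
with open("small_movies.html", encoding="utf-8") as f:
    movies_html = f.read()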
```
%% Cell type:markdown id: tags:
### Step 2: Initialize BeautifulSoup object instance
%% Cell type:code id: tags:
``` python
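# sketch: parse the string read in step 1
movies_obj = BeautifulSoup(movies_html, "html.parser")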
```
%% Cell type:markdown id: tags:
### Step 3: Find table element
%% Cell type:code id: tags:
``` python
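# sketch: assume the page contains a single table element
movies_table = movies_obj.find("table")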
```
%% Cell type:markdown id: tags:
### Step 4: Find all `th` tags to parse the table header
%% Cell type:code id: tags:
``` python
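# sketch: the th cells give the column names (assumed to match the keys in small_movies.json)
header_cells = movies_table.find_all("th")
column_names = [cell.get_text().strip() for cell in header_cells]
column_names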
```
%% Cell type:markdown id: tags:
### Step 5: Scrape the second row, convert data to appropriate types, and populate the data into a row dictionary
- "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion
%% Cell type:code id: tags:
``` python
def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else: # otherwise, assume millions
        return float(revenue) * 1e6
```
%% Cell type:code id: tags:
``` python
# Why second row? Because first row has the header information.
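# sketch, assuming each data row is a <tr> of <td> cells in the same order as the headers
tr_elements = movies_table.find_all("tr")
second_row = tr_elements[1]                    # index 0 is the header row
cells = [td.get_text().strip() for td in second_row.find_all("td")]
movie_dict = dict(zip(column_names, cells))
movie_dict["Year"] = int(movie_dict["Year"])
movie_dict["Runtime"] = int(movie_dict["Runtime"])
movie_dict["Revenue"] = format_revenue(movie_dict["Revenue"])
movie_dict["Rating"] = float(movie_dict["Rating"])
movie_dict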
```
%% Cell type:markdown id: tags:
### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a row dictionary, and append the row dictionaries to a list
- "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion
You can compare your parsing output to the contents of `small_movies.json` to confirm your result.
%% Cell type:code id: tags:
``` python
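# sketch: same conversions as step 5, applied to every data row
movie_rows = []
for row in movies_table.find_all("tr")[1:]:    # skip the header row
    cells = [td.get_text().strip() for td in row.find_all("td")]
    movie = dict(zip(column_names, cells))
    movie["Year"] = int(movie["Year"])
    movie["Runtime"] = int(movie["Runtime"])
    movie["Revenue"] = format_revenue(movie["Revenue"])
    movie["Rating"] = float(movie["Rating"])
    movie_rows.append(movie)
movie_rows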
```
%% Cell type:markdown id: tags:
### Final step: convert steps 1 through 6 into a function and use that function to parse the `full_movies.html` file.
%% Cell type:code id: tags:
``` python
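# sketch of parse_html, bundling steps 1 through 6 (assumes the same table layout as above)
def parse_html(filename):
    with open(filename, encoding="utf-8") as f:
        page_obj = BeautifulSoup(f.read(), "html.parser")
    table = page_obj.find("table")
    names = [th.get_text().strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:        # skip the header row
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        movie = dict(zip(names, cells))
        movie["Year"] = int(movie["Year"])
        movie["Runtime"] = int(movie["Runtime"])
        movie["Revenue"] = format_revenue(movie["Revenue"])
        movie["Rating"] = float(movie["Rating"])
        rows.append(movie)
    return rows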
```
%% Cell type:code id: tags:
``` python
full_movies_data = parse_html("full_movies.html")
# full_movies_data
```