Commit 8f41adfc authored by gsingh58

Lec31 updated

parent 83e67fce
%% Cell type:markdown id: tags:
# Web 3: Scraping Web Data
%% Cell type:code id: tags:
``` python
# import statements
import requests
from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
```
%% Cell type:markdown id: tags:
### Warmup 1: HTML table and hyperlinks
In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.

TODO: Add another row or two to the table below
%% Cell type:markdown id: tags:
<table>
<tr>
<th>University</th>
<th>Department</th>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr>
<tr>
<td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
%% Cell type:markdown id: tags:
### Warmup 2: Scraping data from syllabus page
URL: https://cs220.cs.wisc.edu/s23/syllabus.html
%% Cell type:code id: tags:
``` python
# Get this page using requests.
# (the previous revision of this line used https://cs220.cs.wisc.edu/s23/syllabus.html)
url = "https://cs220.cs.wisc.edu/f23/syllabus.html"
r = requests.get(url)
# make sure there is no error
r.raise_for_status()
# read the entire contents of the page into a single string variable
contents = r.text
# split the contents into list of strings using newline separator
content_list = contents.split("\n")
```
%% Cell type:markdown id: tags:
#### Warmup 2a: Find all sentences that contain "CS220"
%% Cell type:code id: tags:
``` python
cs220_sentences = [sentence for sentence in content_list if "CS220" in sentence]
```
%% Cell type:markdown id: tags:
#### Warmup 2b: Extract title tag's value
%% Cell type:code id: tags:
``` python
title_tag = cs220_sentences[0]
print(title_tag)
title_tag = title_tag.strip()
print(title_tag)
title_tag_parts = title_tag.split(">")
print(title_tag_parts)
title_details = title_tag_parts[1]
title_detail_parts = title_details.split("<")
title_detail_parts[0] # finally, we are able to extract the title tag's data
# Takeaway: It would be nice if there were a module that could make finding easy!
```
%% Output
<title>CS220</title>
<title>CS220</title>
['<title', 'CS220</title', '']
'CS220'
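%% Cell type:markdown id: tags:
The manual approach above can be wrapped into a small helper to see where it breaks down. This is only a sketch: `extract_tag_text` and `sample_lines` are hypothetical names made up for illustration, and the sample lines stand in for the downloaded page so the cell runs offline. Note that it misses a tag that carries attributes (or that spans several lines), which is one reason a real HTML parser is nicer.
%% Cell type:code id: tags:
```python
def extract_tag_text(line, tag):
    """Return the text between <tag> and </tag> on a single line, or None."""
    open_tag = "<" + tag + ">"
    close_tag = "</" + tag + ">"
    start = line.find(open_tag)
    end = line.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return None  # tag missing, or written in a form this helper cannot handle
    return line[start + len(open_tag):end].strip()

# hypothetical stand-in for the lines of a downloaded page
sample_lines = [
    "<html>",
    "  <head><title>CS220</title></head>",
    "  <body><p>Welcome</p></body>",
    "</html>",
]
titles = [extract_tag_text(line, "title") for line in sample_lines]
title = [t for t in titles if t is not None][0]
print(title)

# the helper only matches the exact text "<title>", so attributes defeat it
print(extract_tag_text('<title lang="en">CS220</title>', "title"))
```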
%% Cell type:markdown id: tags:
### Learning Objectives:
- Using the Document Object Model of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an html string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an html string and use prettify to display
- Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
- Use BeautifulSoup to scrape a live web site.
%% Cell type:markdown id: tags:
### Document Object Model
In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.
<div>
<img src="attachment:image.png" width="600"/>
</div>
%% Cell type:markdown id: tags:
### Take a look at the HTML in the cell below.
%% Cell type:markdown id: tags:
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
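%% Cell type:markdown id: tags:
To make the tree idea concrete, here is a rough sketch that prints the To Do List HTML as an indented outline, one level of indentation per level of nesting. It uses Python's built-in `html.parser` module rather than BeautifulSoup, purely so the tree-walking is visible; `OutlinePrinter` is a hypothetical class written for this illustration, and it assumes every opening tag has a matching closing tag.
%% Cell type:code id: tags:
```python
from html.parser import HTMLParser

class OutlinePrinter(HTMLParser):
    """Collect an indented outline of the elements and text in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how deeply nested we currently are
        self.lines = []  # one outline line per element or piece of text

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.lines.append("  " * self.depth + repr(text))

todo_html = """
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
"""
printer = OutlinePrinter()
printer.feed(todo_html)
printer.close()
outline = printer.lines
print("\n".join(outline))
```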
%% Cell type:markdown id: tags:
### BeautifulSoup constructor
- takes HTML, as a string, as its argument and parses it
- Syntax: `BeautifulSoup(<html_string>, "html.parser")`
- The second argument specifies what kind of parsing we want done

New syntax: you can use `"""some really long string"""` to write a string that spans multiple lines.
%% Cell type:code id: tags:
``` python
html_string = """
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")
type(bs_obj)
```
%% Output
bs4.BeautifulSoup
%% Cell type:markdown id: tags:
## BeautifulSoup operations
- `prettify()` returns a formatted representation of the raw HTML
### A BeautifulSoup object can be searched for elements using:
- `find("")` returns the first element matching the tag string, None otherwise
- `find_all("")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
### Beautiful Soup Elements can be inspected by using:
- `get_text()` returns the text of this element, including the text nested inside its child elements, but not the child tags themselves
- `.children` all children of this element (can be converted into a list)
- `.attrs` the attributes of that element / tag, as a `dict`.
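%% Cell type:markdown id: tags:
As a quick preview of the operations listed above, here is a small sketch that runs each of them once on a one-line HTML string. The HTML and the `example.com` / `example.org` links are made up for illustration; the cells that follow go through each operation more slowly.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

# hypothetical HTML, made up for this illustration
demo_html = '<p>See <a href="https://example.com/">the docs</a> and <a href="https://example.org/">the FAQ</a>.</p>'
demo = BeautifulSoup(demo_html, "html.parser")

first_link = demo.find("a")                  # first matching element
all_links = demo.find_all("a")               # every matching element
missing = demo.find("table")                 # None: there is no <table>
paragraph_text = demo.find("p").get_text()   # text only, tags removed
p_children = list(demo.find("p").children)   # strings and elements directly inside <p>
first_attrs = first_link.attrs               # dict of attribute name -> value

print(paragraph_text)
print(len(all_links), len(p_children))
print(first_attrs)
```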
%% Cell type:markdown id: tags:
`prettify()` returns a formatted representation of the raw HTML
%% Cell type:code id: tags:
``` python
print(bs_obj.prettify())
```
%% Output
<b>
 To Do List
</b>
<ul>
 <li>
  Eat Healthy
 </li>
 <li>
  Sleep
  <b>
   More
  </b>
 </li>
 <li>
  Exercise
 </li>
</ul>
%% Cell type:markdown id: tags:
`find` returns the first HTML 'tag' matching the string "b"
%% Cell type:code id: tags:
``` python
element = bs_obj.find("b")
```
%% Cell type:markdown id: tags:
What is the type of find's return value?
%% Cell type:code id: tags:
``` python
print(type(element))
```
%% Output
<class 'bs4.element.Tag'>
%% Cell type:markdown id: tags:
How do we extract the text of the "b" element and what is its type?
%% Cell type:code id: tags:
``` python
text = element.get_text()
print(text, type(text))
```
%% Output
To Do List <class 'str'>
%% Cell type:markdown id: tags:
`find` returns None if it cannot find that element.
%% Cell type:code id: tags:
``` python
# assert that this html string has a <ul> tag
assert bs_obj.find("ul") is not None
# assert that this does not have an <a> tag
assert bs_obj.find("a") is None
```
%% Cell type:markdown id: tags:
`find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
%% Cell type:code id: tags:
``` python
element_list = bs_obj.find_all("b")
element_list
```
%% Output
[<b>To Do List</b>, <b>More</b>]
%% Cell type:markdown id: tags:
What is the type of the return value of `find_all`?
%% Cell type:code id: tags:
``` python
type(element_list)
```
%% Output
bs4.element.ResultSet
%% Cell type:code id: tags:
``` python
type(element_list[0])
```
%% Output
bs4.element.Tag
%% Cell type:markdown id: tags:
Use a for loop to print the text of each "b" element.
%% Cell type:code id: tags:
``` python
for element in element_list:
    print(element.get_text())
```
%% Output
To Do List
More
%% Cell type:markdown id: tags:
Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
%% Cell type:code id: tags:
``` python
# only searches for elements, not text
print(bs_obj.find_all("Sleep"))
# if not present returns None
print(bs_obj.find("Sleep"))
```
%% Output
[]
None
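%% Cell type:markdown id: tags:
Because `find` can return None, calling `get_text()` directly on its result raises an `AttributeError` when the element is missing. One way to guard against that is sketched below; `text_of_first` is a hypothetical helper written for this illustration, not part of BeautifulSoup.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

def text_of_first(soup, tag_name, default=""):
    """Return the text of the first <tag_name> element, or default if there is none."""
    element = soup.find(tag_name)
    if element is None:  # find found nothing
        return default
    return element.get_text()

page = BeautifulSoup("<ul><li>Eat Healthy</li><li>Exercise</li></ul>", "html.parser")
print(text_of_first(page, "li"))
print(text_of_first(page, "h1", default="(no heading)"))

# find_all never returns None, so a loop over its result is always safe
for heading in page.find_all("h1"):
    print(heading.get_text())  # not reached: there are no <h1> elements
```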
%% Cell type:markdown id: tags:
You can invoke `find` or `find_all` on any element (a `Tag` object), not just on the top-level BeautifulSoup object.

Find all `li` elements and find the `b` element inside the second `li` element.
%% Cell type:code id: tags:
``` python
li_elements = bs_obj.find_all("li")
second_li = li_elements[1]
second_li.find("b")
```
%% Output
<b>More</b>
%% Cell type:markdown id: tags:
DOM trees are hierarchical. You can use `.children` on any element to get its children.
%% Cell type:markdown id: tags:
Find all the children of the "ul" element.
%% Cell type:code id: tags:
``` python
element = bs_obj.find("ul")
children_list = list(element.children)
children_list
```
%% Output
['\n',
 <li>Eat Healthy</li>,
 '\n',
 <li>Sleep <b>More</b></li>,
 '\n',
 <li>Exercise</li>,
 '\n']
%% Cell type:markdown id: tags:
Find the text of every child element.
%% Cell type:code id: tags:
``` python
for child in children_list:
    print(child.get_text())
```
%% Output
Eat Healthy
Sleep More
Exercise
%% Cell type:markdown id: tags:
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns the plain text `Sleep More`.
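%% Cell type:markdown id: tags:
The children of the `ul` element include the newline strings between the `li` tags, not just the `li` elements. If you only want the element children, one option is to keep the children that are `Tag` objects. A minimal sketch, rebuilding the same list HTML so the cell is self-contained:
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup
from bs4.element import Tag

list_html = """
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
"""
ul = BeautifulSoup(list_html, "html.parser").find("ul")

# keep only children that are elements; the '\n' strings are not Tag objects
tag_children = [child for child in ul.children if isinstance(child, Tag)]
item_texts = [child.get_text() for child in tag_children]
print(item_texts)
```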
%% Cell type:markdown id: tags:
To understand attributes, let's go back to the table from warmup 1.
%% Cell type:code id: tags:
``` python
html_string = """
<table>
    <tr>
        <th>University</th>
        <th>Department</th>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
    </tr>
    <tr>
        <td>UC Berkeley</td>
        <td>
            <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
            </a>
        </td>
    </tr>
</table>
"""
```
%% Cell type:markdown id: tags:
Find the table headers.
%% Cell type:code id: tags:
``` python
bs_obj = BeautifulSoup(html_string, "html.parser")
th_elements = bs_obj.find_all("th") # works only if there is one table in that whole HTML
for th in th_elements:
    print(th.get_text())
```
%% Output
University
Department
%% Cell type:markdown id: tags:
Find the first anchor element, extract its text.
%% Cell type:code id: tags:
``` python
anchor = bs_obj.find("a")
print(anchor.get_text())
```
%% Output
Computer Sciences
%% Cell type:markdown id: tags:
You can get the attributes associated with an element using `.attrs` on that element object. The value is a `dict` mapping each attribute name to its value.

Now, let's get the attributes of the anchor element.
%% Cell type:code id: tags:
``` python
anchor_attributes = anchor.attrs
anchor_attributes
```
%% Output
{'href': 'https://www.cs.wisc.edu/'}
%% Cell type:markdown id: tags:
What is the type of `.attrs`?
%% Cell type:code id: tags:
``` python
print(type(anchor_attributes))
```
%% Output
<class 'dict'>
%% Cell type:markdown id: tags:
Extract the hyperlink.
%% Cell type:code id: tags:
``` python
anchor_attributes["href"]
```
%% Output
'https://www.cs.wisc.edu/'
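%% Cell type:markdown id: tags:
Since `.attrs` is an ordinary `dict`, indexing it with a missing attribute name raises a `KeyError`. The sketch below shows a few ways to read an attribute; as far as I know, indexing the element directly and calling `.get` on it are both supported by BeautifulSoup's `Tag` objects, and `.get` returns None when the attribute is absent. The `title` attribute is deliberately missing here to show that case.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

link_html = '<a href="https://www.cs.wisc.edu/">Computer Sciences</a>'
link = BeautifulSoup(link_html, "html.parser").find("a")

href_via_attrs = link.attrs["href"]   # plain dict lookup
href_via_index = link["href"]         # shorthand: index the element itself
title_or_none = link.get("title")     # None, because there is no title attribute
title_or_default = link.attrs.get("title", "(no title)")  # dict.get with a default

print(href_via_attrs)
print(title_or_none, title_or_default)
```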
%% Cell type:markdown id: tags:
Extract hyperlinks for each department and populate department name and link into a `dict`.
%% Cell type:code id: tags:
``` python
department_urls = {} # Key: department name; Value: website URL
tr_elements = bs_obj.find_all("tr")
for tr in tr_elements:
    if tr.find("td") is not None: # skips the header row, which contains only th elements
        anchor = tr.find("a")
        name = anchor.get_text()
        website = anchor.attrs["href"]
        department_urls[name] = website
department_urls
```
%% Output
{'Computer Sciences': 'https://www.cs.wisc.edu/',
 'Statistics': 'https://stat.wisc.edu/',
 'CDIS': 'https://cdis.wisc.edu/',
 'Electrical Engineering and Computer Sciences\n ': 'https://eecs.berkeley.edu/'}
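%% Cell type:markdown id: tags:
Notice that the last key in the output above ends with a newline and whitespace, because the closing `</a>` tag sits on its own line in the HTML. A small variation, sketched here on a shortened copy of the table, strips the text before using it as a key. `clean_urls` is a name chosen for this illustration.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

# shortened copy of the department table, kept small for illustration
rows_html = """
<table>
<tr><th>University</th><th>Department</th></tr>
<tr><td>UW-Madison</td><td><a href="https://stat.wisc.edu/">Statistics</a></td></tr>
<tr><td>UC Berkeley</td><td>
    <a href="https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
    </a>
</td></tr>
</table>
"""
soup = BeautifulSoup(rows_html, "html.parser")

clean_urls = {}  # Key: department name without stray whitespace; Value: website URL
for tr in soup.find_all("tr"):
    anchor = tr.find("a")
    if anchor is None:  # header row: no link to extract
        continue
    clean_urls[anchor.get_text().strip()] = anchor.attrs["href"]

print(clean_urls)
```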
%% Cell type:markdown id: tags:
#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
%% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
r = requests.get(url)
# make sure there is no error
r.raise_for_status()
# read the entire contents of the page into a single string variable
contents = r.text
# parse the page contents with BeautifulSoup and extract the title's text
bs_obj = BeautifulSoup(contents, "html.parser")
bs_obj.find("title").get_text()
```
%% Output
'CS220'
%% Cell type:markdown id: tags:
## Parsing small_movies html table to extract `small_movies.json`
%% Cell type:markdown id: tags:
### Step 1: Read `small_movies.html` content into a variable
%% Cell type:code id: tags:
``` python
f = open("small_movies.html")
small_movies_str = f.read()
f.close()
# small_movies_str
```
%% Cell type:markdown id: tags:
### Step 2: Initialize BeautifulSoup object instance
%% Cell type:code id: tags:
``` python
bs_obj = BeautifulSoup(small_movies_str, "html.parser")
```
%% Cell type:markdown id: tags:
### Step 3: Find table element
%% Cell type:code id: tags:
``` python
table = bs_obj.find("table") # works only when you have exactly 1 table
```
%% Cell type:markdown id: tags:
### Step 4: Find all th tags, to parse the table header
%% Cell type:code id: tags:
``` python
header = [th.get_text() for th in table.find_all('th')]
header
```
%% Output
['Title', 'Genre', 'Director', 'Cast', 'Year', 'Runtime', 'Rating', 'Revenue']
%% Cell type:markdown id: tags:
### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
- "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion
%% Cell type:code id: tags:
``` python
def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else: # otherwise, assume millions.
        return float(revenue) * 1e6
```
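%% Cell type:markdown id: tags:
To see what `format_revenue` does, here it is applied to a few inputs. The function is repeated so the cell runs on its own, and the inputs in `examples` are made up for illustration rather than taken from the movies file. One thing to keep in mind: an empty string would raise an `IndexError` at `revenue[-1]`.
%% Cell type:code id: tags:
```python
def format_revenue(revenue):
    if type(revenue) == float:   # already converted: return unchanged
        return revenue
    elif revenue[-1] == 'M':     # e.g. "8.01M": drop the "M", scale to dollars
        return float(revenue[:-1]) * 1e6
    else:                        # e.g. "8.01": assume the number is in millions
        return float(revenue) * 1e6

# hypothetical inputs covering each branch
examples = ["8.01M", "8.01", "0", 5.0]
converted = [format_revenue(value) for value in examples]
for value, result in zip(examples, converted):
    print(repr(value), "->", result)
```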
%% Cell type:code id: tags:
``` python
# Why second row? Because first row has the header information.
movie = {}
tr_elements = table.find_all('tr')
tr = tr_elements[1]
td_elements = tr.find_all('td')
for idx in range(len(td_elements)):
    td = td_elements[idx]
    val = td.get_text()
    if header[idx] in ["Year", "Runtime"]:
        movie[header[idx]] = int(val)
    elif header[idx] == "Revenue":
        revenue = format_revenue(val)
        movie[header[idx]] = revenue
    elif header[idx] == "Rating":
        movie[header[idx]] = float(val)
    else:
        movie[header[idx]] = val
movie
```
%% Output
{'Title': 'Guardians of the Galaxy',
 'Genre': 'Action,Adventure,Sci-Fi',
 'Director': 'James Gunn',
 'Cast': 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana',
 'Year': 2014,
 'Runtime': 121,
 'Rating': 8.1,
 'Revenue': 333130000.0}
%% Cell type:markdown id: tags:
### Step 6: Scrape all rows, convert data to appropriate types, and populate data into a row dictionary and append row dictionaries into a list
- "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion

You can compare your parsing output to `small_movies.json` file contents, to confirm your result.
%% Cell type:code id: tags:
``` python
movies_data = []
tr_elements = table.find_all('tr')
for tr in tr_elements[1:]: # Skip first row (header row)
    movie = {}
    td_elements = tr.find_all('td')
    for idx in range(len(td_elements)):
        td = td_elements[idx]
        val = td.get_text()
        if header[idx] in ["Year", "Runtime"]:
            movie[header[idx]] = int(val)
        elif header[idx] == "Revenue":
            revenue = format_revenue(val)
            movie[header[idx]] = revenue
        elif header[idx] == "Rating":
            movie[header[idx]] = float(val)
        else:
            movie[header[idx]] = val
    movies_data.append(movie)
movies_data
```
%% Output
[{'Title': 'Guardians of the Galaxy',
  'Genre': 'Action,Adventure,Sci-Fi',
  'Director': 'James Gunn',
  'Cast': 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana',
  'Year': 2014,
  'Runtime': 121,
  'Rating': 8.1,
  'Revenue': 333130000.0},
 {'Title': 'Prometheus',
  'Genre': 'Adventure,Mystery,Sci-Fi',
  'Director': 'Ridley Scott',
  'Cast': 'Noomi Rapace, Logan Marshall-Green, Michael fassbender, Charlize Theron',
  'Year': 2012,
  'Runtime': 124,
  'Rating': 7.0,
  'Revenue': 126460000.0},
 {'Title': 'Split',
  'Genre': 'Horror,Thriller',
  'Director': 'M. Night Shyamalan',
  'Cast': 'James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula',
  'Year': 2016,
  'Runtime': 117,
  'Rating': 7.3,
  'Revenue': 138120000.0},
 {'Title': 'Sing',
  'Genre': 'Animation,Comedy,Family',
  'Director': 'Christophe Lourdelet',
  'Cast': 'Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson',
  'Year': 2016,
  'Runtime': 108,
  'Rating': 7.2,
  'Revenue': 270320000.0},
 {'Title': 'Suicide Squad',
  'Genre': 'Action,Adventure,Fantasy',
  'Director': 'David Ayer',
  'Cast': 'Will Smith, Jared Leto, Margot Robbie, Viola Davis',
  'Year': 2016,
  'Runtime': 123,
  'Rating': 6.2,
  'Revenue': 325020000.0},
 {'Title': 'The Great Wall',
  'Genre': 'Action,Adventure,Fantasy',
  'Director': 'Yimou Zhang',
  'Cast': 'Matt Damon, Tian Jing, Willem Dafoe, Andy Lau',
  'Year': 2016,
  'Runtime': 103,
  'Rating': 6.1,
  'Revenue': 45130000.0},
 {'Title': 'La La Land',
  'Genre': 'Comedy,Drama,Music',
  'Director': 'Damien Chazelle',
  'Cast': 'Ryan Gosling, Emma Stone, Rosemarie DeWitt, J.K. Simmons',
  'Year': 2016,
  'Runtime': 128,
  'Rating': 8.3,
  'Revenue': 151060000.0},
 {'Title': 'Mindhorn',
  'Genre': 'Comedy',
  'Director': 'Sean Foley',
  'Cast': 'Essie Davis, Andrea Riseborough, Julian Barratt,Kenneth Branagh',
  'Year': 2016,
  'Runtime': 89,
  'Rating': 6.4,
  'Revenue': 0.0},
 {'Title': 'The Lost City of Z',
  'Genre': 'Action,Adventure,Biography',
  'Director': 'James Gray',
  'Cast': 'Charlie Hunnam, Robert Pattinson, Sienna Miller, Tom Holland',
  'Year': 2016,
  'Runtime': 141,
  'Rating': 7.1,
  'Revenue': 8010000.0},
 {'Title': 'Passengers',
  'Genre': 'Adventure,Drama,Romance',
  'Director': 'Morten Tyldum',
  'Cast': 'Jennifer Lawrence, Chris Pratt, Michael Sheen,Laurence Fishburne',
  'Year': 2016,
  'Runtime': 116,
  'Rating': 7.0,
  'Revenue': 100010000.0}]
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Final step: Convert steps 1 through 6 into a function, then use that function to parse the `full_movies.html` file.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
def parse_html(html_file):
    # Read the entire HTML file into a single string
    with open(html_file) as f:
        html_str = f.read()

    bs_obj = BeautifulSoup(html_str, "html.parser")
    table = bs_obj.find("table")  # returns the first table; assumes the page has exactly 1 table
    header = [th.get_text() for th in table.find_all("th")]

    movies_data = []
    tr_elements = table.find_all("tr")
    for tr in tr_elements[1:]:  # skip first row (header row)
        movie = {}
        td_elements = tr.find_all("td")
        for idx in range(len(td_elements)):
            val = td_elements[idx].get_text()
            if header[idx] in ["Year", "Runtime"]:
                movie[header[idx]] = int(val)
            elif header[idx] == "Revenue":
                movie[header[idx]] = format_revenue(val)  # format_revenue is defined in Step 5
            elif header[idx] == "Rating":
                movie[header[idx]] = float(val)
            else:
                movie[header[idx]] = val
        movies_data.append(movie)
    return movies_data
```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
full_movies_data = parse_html("full_movies.html")
# full_movies_data
```
......
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# Web3: Scraping Web Data # Web3: Scraping Web Data
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# import statements # import statements
import requests import requests
from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Warmup 1: HTML table and hyperlinks ### Warmup 1: HTML table and hyperlinks
In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks. In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
TODO: Add another row or two to the table below TODO: Add another row or two to the table below
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<table> <table>
<tr> <tr>
<th>University</th> <th>University</th>
<th>Department</th> <th>Department</th>
</tr> </tr>
<tr> <tr>
<td>UW-Madison</td> <td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td> <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr> </tr>
<tr> <tr>
<td>UW-Madison</td> <td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td> <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr> </tr>
<tr> <tr>
<td>UW-Madison</td> <td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td> <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr> </tr>
<tr> <tr>
<td>UC Berkeley</td> <td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td> <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr> </tr>
</table> </table>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Warmup 2: Scraping data from syllabus page ### Warmup 2: Scraping data from syllabus page
URL: https://cs220.cs.wisc.edu/s23/syllabus.html URL: https://cs220.cs.wisc.edu/s23/syllabus.html
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://cs220.cs.wisc.edu/f23/syllabus.html"
# make sure there is no error
# read the entire contents of the page into a single string variable
# split the contents into a list of strings using the newline separator
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Warmup 2a: Find all sentences that contain "CS220" #### Warmup 2a: Find all sentences that contain "CS220"
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Warmup 2b: Extract title tag's value #### Warmup 2b: Extract title tag's value
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# finally, we are able to extract the title tag's data # finally, we are able to extract the title tag's data
# Takeaway: It would be nice if there were a module that could make finding easy! # Takeaway: It would be nice if there were a module that could make finding easy!
``` ```
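As a rough sketch of what Warmup 2b involves without any helper module, the title can be located with plain string methods. The `sample_html` string and the `extract_title` helper below are made up for illustration; the real page text would come from `requests`.

```python
# Hypothetical sample page; the real syllabus HTML would come from requests
sample_html = """<html>
<head><title>Sample Syllabus</title></head>
<body><p>Welcome to the course.</p></body>
</html>"""

def extract_title(html):
    """Return the text between <title> and </title>, or None if either tag is missing."""
    start_tag = "<title>"
    end_tag = "</title>"
    start = html.find(start_tag)       # index of the opening tag, -1 if absent
    if start == -1:
        return None
    start += len(start_tag)            # move past the opening tag
    end = html.find(end_tag, start)    # search for the closing tag after it
    if end == -1:
        return None
    return html[start:end]

print(extract_title(sample_html))
```

Even for one tag this needs index arithmetic and two missing-tag checks, which is the motivation for using a parsing module instead.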
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Learning Objectives:
- Use the Document Object Model (DOM) of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an HTML string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an HTML string and use `prettify` to display them
- Use the BeautifulSoup methods `find` and `find_all` to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using `get_text()`, `.children`, and `.attrs`
- Use BeautifulSoup to scrape a live web site.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Document Object Model ### Document Object Model
To render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM), which represents the page as a hierarchy of elements.
<div> <div>
<img src="attachment:image.png" width="600"/> <img src="attachment:image.png" width="600"/>
</div> </div>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Take a look at the HTML in the below cell. ### Take a look at the HTML in the below cell.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<b>To Do List</b> <b>To Do List</b>
<ul> <ul>
<li>Eat Healthy</li> <li>Eat Healthy</li>
<li>Sleep <b>More</b></li> <li>Sleep <b>More</b></li>
<li>Exercise</li> <li>Exercise</li>
</ul> </ul>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### BeautifulSoup constructor
- takes HTML, as a string, as its first argument and parses it
- Syntax: `BeautifulSoup(<html_string>, "html.parser")`
- The second argument specifies which parser to use

New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
# bs_obj must be created before its type can be inspected
bs_obj = BeautifulSoup(html_string, "html.parser")
type(bs_obj)
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## BeautifulSoup operations
- `prettify()` returns a formatted representation of the raw HTML
### A BeautifulSoup object can be searched for elements using:
- `find("")` returns the first element matching the tag string, `None` otherwise
- `find_all("")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
### BeautifulSoup elements can be inspected by using:
- `get_text()` returns the text inside this element, including text nested inside its child elements, with the HTML tags themselves removed
- `.children` all children of this element (can be converted into a list)
- `.attrs` a `dict` of the attributes associated with that element / tag.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`prettify()` returns a formatted representation of the raw HTML `prettify()` returns a formatted representation of the raw HTML
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
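A minimal sketch of `prettify()` on the to-do list from above. Nothing here goes beyond the constructor call already described; the string is repeated only so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""

bs_obj = BeautifulSoup(html_string, "html.parser")
pretty = bs_obj.prettify()   # a str with one tag or text node per line, indented by depth
print(type(pretty))
print(pretty)
```

Printing the result (instead of only evaluating it in a cell) shows the indentation, because `print` renders the newline characters.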
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`find` returns the first HTML 'tag' matching the string "b" `find` returns the first HTML 'tag' matching the string "b"
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
What is the type of find's return value? What is the type of find's return value?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
How do we extract the text of the "b" element and what is its type? How do we extract the text of the "b" element and what is its type?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`find` returns None if it cannot find that element. `find` returns None if it cannot find that element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# assert that this html string has a <ul> tag # assert that this html string has a <ul> tag
# assert that this does not have an <a> tag # assert that this does not have an <a> tag
``` ```
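One possible way to work through the `find` questions above, written as a self-contained sketch on the same to-do list (collapsed onto one line here for brevity):

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

first_b = bs_obj.find("b")        # first <b> element in document order
print(type(first_b))              # an element object (bs4 Tag)

b_text = first_b.get_text()       # text inside the element, as a str
print(b_text, type(b_text))

# find returns None when no element matches, so it can be tested directly
assert bs_obj.find("ul") is not None
assert bs_obj.find("a") is None
```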
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
What is the type of return value of `find_all`? What is the type of return value of `find_all`?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Use a for loop to print the text of each "b" element. Use a for loop to print the text of each "b" element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
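A short sketch of `find_all` with a for loop, again on the to-do list. It is meant as one reasonable answer to the prompts above, not the only one.

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

b_elements = bs_obj.find_all("b")   # every <b> element, in document order
print(type(b_elements))
print(len(b_elements))

b_texts = []
for b in b_elements:
    print(b.get_text())
    b_texts.append(b.get_text())

# no <a> elements exist, so the result is empty rather than None
no_anchors = bs_obj.find_all("a")
print(len(no_anchors))
```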
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
# find_all only searches for elements (tags), not text, so this prints an empty iterable
print(bs_obj.find_all("Sleep"))
# find returns None when no element matches
print(bs_obj.find("Sleep"))
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
You can invoke `find` or `find_all` on any element returned by an earlier search, not only on the top-level BeautifulSoup object.
Find all `li` elements, then find the `b` element inside the second `li` element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
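A sketch of the nested search described above: first collect the `li` elements, then call `find` on one of them so that the search is limited to that element's subtree.

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

li_elements = bs_obj.find_all("li")
second_li = li_elements[1]          # index 1 is the second <li>
inner_b = second_li.find("b")       # searches only inside the second <li>
print(inner_b.get_text())

# the first <li> has no <b> inside it, so find returns None there
print(li_elements[0].find("b"))
```

Note that `bs_obj.find("b")` would return the "To Do List" element instead, because the top-level search starts from the beginning of the document.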
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
DOM trees are hierarchical. You can use `.children` on any element to get its children.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find all the children of "ul" element. Find all the children of "ul" element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find text of every child element. Find text of every child element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns the plain text `Sleep More`.
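The sketch below walks the children of the `ul` element. One detail worth knowing: with a multi-line HTML string, the whitespace between tags also shows up in `.children` as text nodes, so this example keeps only the children that are elements. Filtering with `isinstance(child, Tag)` is a choice made for this illustration, not something the lecture prescribes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

ul = bs_obj.find("ul")
children = list(ul.children)        # .children is an iterator, so convert it to a list
print(len(children))                # counts the whitespace text nodes too

# keep only child elements (tags), dropping the whitespace between them
li_children = [child for child in children if isinstance(child, Tag)]
texts = [li.get_text() for li in li_children]
print(texts)
```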
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
To understand attributes, let's go back to the table from Warmup 1.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
html_string = """
<table>
    <tr>
        <th>University</th>
        <th>Department</th>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
    </tr>
    <tr>
        <td>UC Berkeley</td>
        <td>
            <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
            </a>
        </td>
    </tr>
</table>
"""
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find the table headers. Find the table headers.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find the first anchor element, extract its text. Find the first anchor element, extract its text.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.

Now, let's get the attributes of the anchor element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
What is the return value type of `.attrs`? What is the return value type of `.attrs`?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Extract the hyperlink. Extract the hyperlink.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Extract hyperlinks for each department and populate department name and link into a `dict`. Extract hyperlinks for each department and populate department name and link into a `dict`.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
department_urls = {} # Key: department name; Value: website URL
``` ```
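A possible way to fill `department_urls`, sketched on a shortened two-row version of the table (the shortened table is only there to keep the example small). Each anchor's text becomes the key and its `href` attribute becomes the value; `.strip()` is added because anchor text that spans several lines carries surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Warmup 1 table, for illustration only
html_string = """
<table>
    <tr><th>University</th><th>Department</th></tr>
    <tr><td>UW-Madison</td><td><a href="https://www.cs.wisc.edu/">Computer Sciences</a></td></tr>
    <tr><td>UW-Madison</td><td><a href="https://stat.wisc.edu/">Statistics
    </a></td></tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

department_urls = {}  # Key: department name; Value: website URL
for a in bs_obj.find_all("a"):
    name = a.get_text().strip()        # drop the newline inside the second anchor
    department_urls[name] = a.attrs["href"]

print(department_urls)
```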
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2) #### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Get this page using requests. # Get this page using requests.
url = "https://cs220.cs.wisc.edu/s23/syllabus.html" url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
# make sure there is no error # make sure there is no error
# read the entire contents of the page into a single string variable # read the entire contents of the page into a single string variable
# use BeautifulSoup to extract title # use BeautifulSoup to extract title
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Parsing small_movies html table to extract `small_movies.json` ## Parsing small_movies html table to extract `small_movies.json`
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 1: Read `small_movies.html` content into a variable ### Step 1: Read `small_movies.html` content into a variable
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 2: Initialize BeautifulSoup object instance ### Step 2: Initialize BeautifulSoup object instance
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 3: Find table element ### Step 3: Find table element
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 4: Find all th tags, to parse the table header ### Step 4: Find all th tags, to parse the table header
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 5: Scrape the second row, convert data to appropriate types, and populate the data into a row dictionary
- "Year", "Runtime": `int` conversion
- "Revenue": `format_revenue(...)` conversion
- "Rating": `float` conversion
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else: # otherwise, assume millions.
        return float(revenue) * 1e6
```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Why second row? Because first row has the header information. # Why second row? Because first row has the header information.
``` ```
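A rough sketch of Step 5 on a made-up one-movie table. The table contents, its column set, and the `converters` dictionary are assumptions for illustration; the real `small_movies.html` may differ, and the "Revenue" column is left out here so that the snippet does not depend on `format_revenue`.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same general shape as a movies table
html_string = """
<table>
    <tr><th>Title</th><th>Year</th><th>Runtime</th><th>Rating</th></tr>
    <tr><td>Sing</td><td>2016</td><td>108</td><td>7.2</td></tr>
</table>
"""
table = BeautifulSoup(html_string, "html.parser").find("table")
header = [th.get_text() for th in table.find_all("th")]

# Why the second row? The first row holds the header information.
second_row = table.find_all("tr")[1]

# Map column name -> conversion function; other columns stay as str
converters = {"Year": int, "Runtime": int, "Rating": float}

row = {}
for name, td in zip(header, second_row.find_all("td")):
    convert = converters.get(name, str)
    row[name] = convert(td.get_text())

print(row)
```

With this layout, handling revenue would only need one more entry, `converters["Revenue"] = format_revenue`. An `if`/`elif` chain on the column name works equally well.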
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a dictionary, and append the row dictionaries to a list
- "Year", "Runtime": `int` conversion
- "Revenue": `format_revenue(...)` conversion
- "Rating": `float` conversion

To confirm your result, compare your parsing output to the contents of the `small_movies.json` file.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Final step: convert steps 1 through 6 into a function and use that function to parse `full_movies.html` file. ### Final step: convert steps 1 through 6 into a function and use that function to parse `full_movies.html` file.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
full_movies_data = parse_html("full_movies.html")
# full_movies_data
```
......
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# Web3: Scraping Web Data # Web3: Scraping Web Data
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# import statements # import statements
import requests import requests
from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Warmup 1: HTML table and hyperlinks ### Warmup 1: HTML table and hyperlinks
In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks. In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
TODO: Add another row or two to the table below TODO: Add another row or two to the table below
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<table> <table>
<tr> <tr>
<th>University</th> <th>University</th>
<th>Department</th> <th>Department</th>
</tr> </tr>
<tr> <tr>
<td>UW-Madison</td> <td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td> <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr> </tr>
<tr> <tr>
<td>UW-Madison</td> <td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td> <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr> </tr>
<tr> <tr>
<td>UW-Madison</td> <td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td> <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr> </tr>
<tr> <tr>
<td>UC Berkeley</td> <td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td> <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr> </tr>
</table> </table>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Warmup 2: Scraping data from syllabus page ### Warmup 2: Scraping data from syllabus page
URL: https://cs220.cs.wisc.edu/s23/syllabus.html URL: https://cs220.cs.wisc.edu/s23/syllabus.html
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://cs220.cs.wisc.edu/f23/syllabus.html"
# make sure there is no error
# read the entire contents of the page into a single string variable
# split the contents into a list of strings using the newline separator
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Warmup 2a: Find all sentences that contain "CS220" #### Warmup 2a: Find all sentences that contain "CS220"
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Warmup 2b: Extract title tag's value #### Warmup 2b: Extract title tag's value
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# finally, we are able to extract the title tag's data # finally, we are able to extract the title tag's data
# Takeaway: It would be nice if there were a module that could make finding easy! # Takeaway: It would be nice if there were a module that could make finding easy!
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Learning Objectives:
- Use the Document Object Model (DOM) of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an HTML string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an HTML string and use `prettify` to display them
- Use the BeautifulSoup methods `find` and `find_all` to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using `get_text()`, `.children`, and `.attrs`
- Use BeautifulSoup to scrape a live web site.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Document Object Model ### Document Object Model
To render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM), which represents the page as a hierarchy of elements.
<div> <div>
<img src="attachment:image.png" width="600"/> <img src="attachment:image.png" width="600"/>
</div> </div>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Take a look at the HTML in the below cell. ### Take a look at the HTML in the below cell.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
<b>To Do List</b> <b>To Do List</b>
<ul> <ul>
<li>Eat Healthy</li> <li>Eat Healthy</li>
<li>Sleep <b>More</b></li> <li>Sleep <b>More</b></li>
<li>Exercise</li> <li>Exercise</li>
</ul> </ul>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### BeautifulSoup constructor
- takes HTML, as a string, as its first argument and parses it
- Syntax: `BeautifulSoup(<html_string>, "html.parser")`
- The second argument specifies which parser to use

New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python
html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
# bs_obj must be created before its type can be inspected
bs_obj = BeautifulSoup(html_string, "html.parser")
type(bs_obj)
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## BeautifulSoup operations
- `prettify()` returns a formatted representation of the raw HTML
### A BeautifulSoup object can be searched for elements using:
- `find("")` returns the first element matching the tag string, `None` otherwise
- `find_all("")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
### BeautifulSoup elements can be inspected by using:
- `get_text()` returns the text inside this element, including text nested inside its child elements, with the HTML tags themselves removed
- `.children` all children of this element (can be converted into a list)
- `.attrs` a `dict` of the attributes associated with that element / tag.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`prettify()` returns a formatted representation of the raw HTML `prettify()` returns a formatted representation of the raw HTML
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`find` returns the first HTML 'tag' matching the string "b" `find` returns the first HTML 'tag' matching the string "b"
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
What is the type of find's return value? What is the type of find's return value?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
How do we extract the text of the "b" element and what is its type? How do we extract the text of the "b" element and what is its type?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
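%% Cell type:markdown id: tags:
As a rough sketch of the three questions above (first match, its type, its text), the cell below rebuilds `bs_obj` from the to-do list string so that it runs on its own. The names `first_b` and `text` are illustrative choices, not part of the lecture code.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# find returns the first <b> element in document order
first_b = bs_obj.find("b")
print(first_b)         # the element itself, tags included
print(type(first_b))   # a bs4 Tag object, not a plain string

# get_text strips the tags and returns a regular Python str
text = first_b.get_text()
print(text, type(text))
```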
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`find` returns None if it cannot find that element. `find` returns None if it cannot find that element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# assert that this html string has a <ul> tag # assert that this html string has a <ul> tag
# assert that this does not have an <a> tag # assert that this does not have an <a> tag
``` ```
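%% Cell type:markdown id: tags:
One possible way to write the two assertions described in the comments above, assuming the to-do list HTML from earlier. The cell re-creates `bs_obj` so that it is self-contained.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# the string contains a <ul> element, so find returns a Tag
assert bs_obj.find("ul") is not None

# the string contains no <a> element, so find returns None
assert bs_obj.find("a") is None
```
%% Cell type:markdown id: tags:
Comparing against `None` with `is` / `is not` states the intent directly and does not depend on how an element behaves in a boolean context.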
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
What is the type of the return value of `find_all`?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Use a for loop to print the text of each "b" element. Use a for loop to print the text of each "b" element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
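%% Cell type:markdown id: tags:
A minimal sketch of `find_all` on the same to-do list HTML, covering the return type and the for loop asked for above. The names `b_elements` and `b_texts` are assumptions made for this example.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# every <b> element, at any depth, in document order
b_elements = bs_obj.find_all("b")
print(type(b_elements))   # a ResultSet, which behaves like a list
print(len(b_elements))

# collect and print the text of each <b> element
b_texts = []
for b in b_elements:
    b_texts.append(b.get_text())
    print(b.get_text())
```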
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# find_all only searches for elements (tags), not text
print(bs_obj.find_all("Sleep"))
# find returns None if the element is not present
print(bs_obj.find("Sleep"))
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
You can also invoke `find` or `find_all` on any element returned by an earlier search, not just on the top-level BeautifulSoup object.
Find all `li` elements, then find the `b` element inside the second `li` element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
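%% Cell type:markdown id: tags:
The nested search described above could look roughly like this, assuming the same to-do list HTML. `li_elements`, `second_li`, and `inner_b` are illustrative names.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

li_elements = bs_obj.find_all("li")

# indexing works like a list; index 1 is the second <li>
second_li = li_elements[1]

# this search is limited to the contents of the second <li>
inner_b = second_li.find("b")
print(inner_b.get_text())

# the first <li> has no <b> inside it, so the same search returns None
print(li_elements[0].find("b"))
```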
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
DOM trees are hierarchical. You can use `.children` on any element to get its children.
%% Cell type:markdown id: tags:
Find all the children of the "ul" element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find text of every child element. Find text of every child element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
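%% Cell type:markdown id: tags:
A sketch of walking the children of the "ul" element, under the assumption that only element children are of interest. `.children` also yields the whitespace strings that sit between tags, so this example keeps only `Tag` objects; that filtering step is a choice made here, not something the lecture prescribes.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

ul = bs_obj.find("ul")

# .children is an iterator; convert it to a list to index or count it
children = list(ul.children)

# keep only real elements, dropping whitespace-only strings between tags
tag_children = [child for child in children if isinstance(child, Tag)]

# get_text includes the text of nested elements such as <b>More</b>
texts = [child.get_text() for child in tag_children]
print(texts)
```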
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns just the text `Sleep More`.
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
To understand attributes, let's go back to the table from Warmup 1.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
html_string = """
<table>
    <tr>
        <th>University</th>
        <th>Department</th>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
    </tr>
    <tr>
        <td>UC Berkeley</td>
        <td>
            <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
            </a>
        </td>
    </tr>
</table>
"""
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find the table headers. Find the table headers.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Find the first anchor element, extract its text. Find the first anchor element, extract its text.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.
Now, let's get the attributes of the anchor element.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
What is the return value type of `.attrs`? What is the return value type of `.attrs`?
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Extract the hyperlink. Extract the hyperlink.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
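%% Cell type:markdown id: tags:
To illustrate the anchor-element steps above, here is a small sketch on a shortened copy of the table (header row plus one data row). The names `first_a` and `link` are made up for the example.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

html_string = """
<table>
    <tr>
        <th>University</th>
        <th>Department</th>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
    </tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# first anchor element and its visible text
first_a = bs_obj.find("a")
print(first_a.get_text())

# .attrs maps each attribute name to its value
print(first_a.attrs)
print(type(first_a.attrs))

# the hyperlink is the value stored under the "href" key
link = first_a.attrs["href"]
print(link)
```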
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
Extract the hyperlink for each department and store each department name and its link in a `dict`.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
department_urls = {} # Key: department name; Value: website URL
``` ```
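%% Cell type:markdown id: tags:
One way the dictionary could be filled in, sketched on a two-row copy of the table. The call to `strip()` is an addition assumed to be useful here, because the Berkeley anchor's text ends with a line break and indentation in the source HTML.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

html_string = """
<table>
    <tr>
        <th>University</th>
        <th>Department</th>
    </tr>
    <tr>
        <td>UW-Madison</td>
        <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
    </tr>
    <tr>
        <td>UC Berkeley</td>
        <td>
            <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
            </a>
        </td>
    </tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

department_urls = {} # Key: department name; Value: website URL
for a in bs_obj.find_all("a"):
    name = a.get_text().strip()   # drop surrounding whitespace and newlines
    department_urls[name] = a.attrs["href"]

print(department_urls)
```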
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2) #### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Get this page using requests.
url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
# make sure there is no error
# read the entire contents of the page into a single string variable
# use BeautifulSoup to extract title
``` ```
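%% Cell type:markdown id: tags:
A hedged sketch of the self-practice steps, split into two hypothetical helpers so the parsing part can be tried without a network connection: `extract_title` works on any HTML string, and `fetch_title` downloads a page first. Both function names and the 10-second timeout are assumptions made for this example. The sample string below is made up and is not the syllabus page.
%% Cell type:code id: tags:
```python
import requests
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the text of the <title> element, or None if there is none."""
    title = BeautifulSoup(html, "html.parser").find("title")
    if title is None:
        return None
    return title.get_text().strip()

def fetch_title(url):
    """Download the page at url and return its title."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()    # raises an HTTPError for 4xx / 5xx responses
    return extract_title(response.text)   # response.text is the page as one string

# offline check on a made-up page
sample_html = "<html><head><title>Sample page</title></head><body></body></html>"
print(extract_title(sample_html))
```
%% Cell type:markdown id: tags:
With network access, `fetch_title("https://cs220.cs.wisc.edu/s23/syllabus.html")` would perform the actual request for the syllabus page.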
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
## Parsing small_movies html table to extract `small_movies.json` ## Parsing small_movies html table to extract `small_movies.json`
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 1: Read `small_movies.html` content into a variable ### Step 1: Read `small_movies.html` content into a variable
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 2: Initialize BeautifulSoup object instance ### Step 2: Initialize BeautifulSoup object instance
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 3: Find table element ### Step 3: Find table element
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 4: Find all th tags, to parse the table header ### Step 4: Find all th tags, to parse the table header
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
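%% Cell type:markdown id: tags:
Steps 2 through 4 might look like the sketch below. Since `small_movies.html` is not reproduced here, the cell parses a made-up inline table instead; its column names and values are assumptions and may differ from the real file.
%% Cell type:code id: tags:
```python
from bs4 import BeautifulSoup

# hypothetical stand-in for the contents of small_movies.html
sample_html = """
<table>
    <tr><th>Title</th><th>Year</th><th>Runtime</th><th>Rating</th><th>Revenue</th></tr>
    <tr><td>Example Movie A</td><td>2001</td><td>100</td><td>7.5</td><td>12.5M</td></tr>
</table>
"""

# Step 2: initialize the BeautifulSoup object
bs_obj = BeautifulSoup(sample_html, "html.parser")

# Step 3: find the table element
table = bs_obj.find("table")

# Step 4: the text of every <th> gives the column names
header = [th.get_text() for th in table.find_all("th")]
print(header)
```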
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary ### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
- "Year", "Runtime": `int` conversion - "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion - "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion - "Rating": `float` conversion
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else: # otherwise, assume millions.
        return float(revenue) * 1e6
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# Why second row? Because first row has the header information. # Why second row? Because first row has the header information.
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a row dictionary, and append the row dictionaries to a list
- "Year", "Runtime": `int` conversion - "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion - "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion - "Rating": `float` conversion
You can compare your parsing output to `small_movies.json` file contents, to confirm your result. You can compare your parsing output to `small_movies.json` file contents, to confirm your result.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
### Final step: convert steps 1 through 6 into a function and use that function to parse the `full_movies.html` file.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
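%% Cell type:markdown id: tags:
A possible shape for `parse_html`, combining steps 1 through 6. This is a sketch under stated assumptions rather than the lecture's solution: it assumes the file holds a single table whose first row is the header, and it leaves every column other than "Year", "Runtime", "Revenue", and "Rating" as a plain string, which may not match `small_movies.json` exactly. The sample table and the file name `sample_movies.html` are made up so that the cell runs without the course files.
%% Cell type:code id: tags:
```python
import os
import tempfile
from bs4 import BeautifulSoup

def format_revenue(revenue):
    if type(revenue) == float: # already converted
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else: # otherwise, assume millions.
        return float(revenue) * 1e6

def parse_html(path):
    # Step 1: read the file content into a variable
    with open(path, encoding="utf-8") as f:
        html = f.read()
    # Steps 2 and 3: build the BeautifulSoup object and find the table
    table = BeautifulSoup(html, "html.parser").find("table")
    # Step 4: column names come from the <th> elements
    header = [th.get_text() for th in table.find_all("th")]
    # Steps 5 and 6: one dictionary per data row (the first row is the header)
    rows = []
    for tr in table.find_all("tr")[1:]:
        cells = [td.get_text() for td in tr.find_all("td")]
        row = {}
        for name, value in zip(header, cells):
            if name in ("Year", "Runtime"):
                row[name] = int(value)
            elif name == "Revenue":
                row[name] = format_revenue(value)
            elif name == "Rating":
                row[name] = float(value)
            else:
                row[name] = value
        rows.append(row)
    return rows

# made-up data written to a temporary file, standing in for small_movies.html
sample_html = """
<table>
    <tr><th>Title</th><th>Year</th><th>Runtime</th><th>Rating</th><th>Revenue</th></tr>
    <tr><td>Example Movie A</td><td>2001</td><td>100</td><td>7.5</td><td>12.5M</td></tr>
    <tr><td>Example Movie B</td><td>1999</td><td>95</td><td>6.0</td><td>3</td></tr>
</table>
"""
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_movies.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(sample_html)
    movies = parse_html(sample_path)

print(movies)
```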
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
full_movies_data = parse_html("full_movies.html")
# full_movies_data
``` ```