Commit a3a565fd authored by msyamkumar's avatar msyamkumar

Merge branch 'main' of git.doit.wisc.edu:cdis/cs/courses/cs220/cs220-lecture-material

parents a4737065 12827f8d
Showing
with 841 additions and 11125 deletions
%% Cell type:markdown id: tags:
# Web3: Scraping Web Data
%% Cell type:code id: tags:
``` python
# import statements
import requests
from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
```
%% Cell type:markdown id: tags:
### Warmup 1: HTML table and hyperlinks
In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
TODO: Add another row or two to the table below
%% Cell type:markdown id: tags:
<table>
<tr>
<th>University</th>
<th>Department</th>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr>
<tr>
<td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
%% Cell type:markdown id: tags:
### Warmup 2: Scraping data from syllabus page
URL: https://www.msyamkumar.com/cs220/s22/syllabus.html
%% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://www.msyamkumar.com/cs220/s22/syllabus.html"
url = "https://cs220.cs.wisc.edu/f22/syllabus.html"
r = requests.get(url, verify=False)
# make sure there is no error
# read the entire contents of the page into a single string variable
html_str = ...
# split the contents into list of strings using newline separator
#html_lines = ...
#html_lines[:10]
```
%% Cell type:markdown id: tags:
%% Output
#### Warmup 2a: Find all sentences that contain "Meena"
/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
%% Cell type:code id: tags:
``` python
```
['<!doctype html>',
'<html lang="en">',
' <head>',
' <meta charset="utf-8">',
' <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">',
' <meta name="description" content="">',
' <meta name="author" content="">',
'',
' <!-- Google Auth stuff -->',
' <meta name="google-signin-scope" content="profile email">']
%% Cell type:markdown id: tags:
#### Warmup 2b: Extract title tag's value
#### Warmup 2: find all lines with 'Kuemmel'
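Before reaching for a parsing module, the warmup can be attempted with plain string operations. The sketch below is one possible approach, not the course solution; `html_str` is a small hand-written stand-in for the downloaded page so that the example runs without network access.

```python
# A small stand-in for the downloaded page (an assumption for illustration;
# in the notebook the page contents come from requests).
html_str = """<html>
<head><title>Syllabus</title></head>
<body>
<li>Andrew Kuemmel (Teaching Faculty) kuemmel@wisc.edu</li>
<li>Office hours are posted on the course site</li>
</body>
</html>"""

# Split the page into lines, then keep only the lines that mention the name.
html_lines = html_str.split("\n")
matching_lines = [line for line in html_lines if "Kuemmel" in line]

# Extract the title by slicing between the opening and closing tags.
start = html_str.find("<title>") + len("<title>")
end = html_str.find("</title>")
title = html_str[start:end]

print(matching_lines)
print(title)
```

Slicing works here, but it depends on the exact spelling of each tag, which is the motivation for using a parser in the rest of the lecture.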
%% Cell type:code id: tags:
``` python
# finally, we are able to extract the title tag's data
# Takeaway: It would be nice if there were a module that could make finding easy!
```
%% Output
<li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>
%% Cell type:markdown id: tags:
### Learning Objectives:
- Using the Document Object Model of web pages
- describe the 3 things a DOM element may contain, and give examples of each
- given an html string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an html string and use prettify to display
- Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
- Use BeautifulSoup to scrape a live web site.
%% Cell type:markdown id: tags:
### Document Object Model
In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.
<div>
<img src="attachment:image.png" width="600"/>
</div>
%% Cell type:markdown id: tags:
### Take a look at the HTML in the cell below.
%% Cell type:markdown id: tags:
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
%% Cell type:markdown id: tags:
### BeautifulSoup constructor
- takes an HTML string as its argument and parses it
- Syntax: `BeautifulSoup(<html_string>, "html.parser")`
- Second argument specifies what kind of parsing we want done
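As a minimal sketch of the constructor in use, the snippet below parses a short string and pretty-prints it. The string `demo_html` and the variable names are made up for illustration and are not part of the lecture material.

```python
from bs4 import BeautifulSoup

# demo_html is a hypothetical example string, not taken from the course page.
demo_html = "<p>Hello <b>world</b></p>"

# The second argument selects Python's built-in HTML parser.
soup = BeautifulSoup(demo_html, "html.parser")

# prettify() returns a string with one tag or text fragment per line.
pretty = soup.prettify()
print(type(soup))
print(pretty)
```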
%% Cell type:code id: tags:
``` python
html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(..., "html.parser")
type(bs_obj)
```
%% Output
bs4.BeautifulSoup
%% Cell type:markdown id: tags:
## BeautifulSoup operations
- `prettify()` returns a formatted representation of the raw HTML
### A BeautifulSoup object can be searched for elements using:
- `find("")` returns the first element matching the tag string, None otherwise
- `find_all("")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
### Beautiful Soup Elements can be inspected by using:
- `get_text()` returns the text inside this element, including the text of its child elements, with the HTML tags removed
- `text` returns the same text as `get_text()`
- `.children` all children of this element (can be converted into a list)
- `.attrs` the attributes associated with that element / tag.
%% Cell type:markdown id: tags:
`prettify()` returns a formatted representation of the raw HTML
%% Cell type:code id: tags:
``` python
# bs_obj.prettify()
```
%% Output
'<b>\n To Do List\n</b>\n<ul>\n <li>\n Eat Healthy\n </li>\n <li>\n Sleep\n <b>\n More\n </b>\n </li>\n <li>\n Exercise\n </li>\n</ul>'
%% Cell type:markdown id: tags:
`find` returns the first HTML 'tag' matching the string "b"
%% Cell type:code id: tags:
``` python
# bs_obj.find("b")
```
%% Output
<b>To Do List</b>
%% Cell type:markdown id: tags:
What is the type of find's return value?
%% Cell type:code id: tags:
``` python
```
%% Output
bs4.element.Tag
%% Cell type:markdown id: tags:
How do we extract the text of the "b" element and what is its type?
%% Cell type:code id: tags:
``` python
```
%% Output
'To Do List'
%% Cell type:markdown id: tags:
`find` returns None if it cannot find that element.
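Because `find` may return `None`, it helps to check the result before calling methods on it. The following is a small sketch on a made-up HTML string; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Eat Healthy</li></ul>", "html.parser")

ul_element = soup.find("ul")   # present, so this is a Tag object
a_element = soup.find("a")     # absent, so this is None

print(ul_element is not None)
print(a_element is None)

# Guard before using the result, since None has no get_text() method.
link_text = a_element.get_text() if a_element is not None else ""
print(repr(link_text))
```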
%% Cell type:code id: tags:
``` python
# assert that this html string has a <ul> tag
assert bs_obj.find("ul") ...
# assert that this does not have an <a> tag
assert bs_obj.find("a") ...
```
%% Cell type:markdown id: tags:
`find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
%% Cell type:code id: tags:
``` python
bold_elements = ...
bold_elements
```
%% Output
[<b>To Do List</b>, <b>More</b>]
%% Cell type:markdown id: tags:
What is the type of the return value of `find_all`?
%% Cell type:code id: tags:
``` python
type(bold_elements)
```
%% Output
bs4.element.ResultSet
%% Cell type:code id: tags:
``` python
type(bold_elements[0])
```
%% Output
bs4.element.Tag
%% Cell type:markdown id: tags:
Use a for loop to print the text of each "b" element.
%% Cell type:code id: tags:
``` python
for element in bold_elements:
print(...)
```
%% Output
To Do List
More
%% Cell type:markdown id: tags:
Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
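One practical consequence is that code looping over a `find_all` result needs no special case for "nothing found". A short sketch on a made-up string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Eat Healthy</li><li>Exercise</li></ul>", "html.parser")

anchors = soup.find_all("a")   # no <a> tags: an empty result, not None
items = soup.find_all("li")

# An empty result is safe to loop over and to measure with len().
anchor_texts = [a.get_text() for a in anchors]
item_texts = [li.get_text() for li in items]

print(len(anchors), anchor_texts)
print(len(items), item_texts)
```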
%% Cell type:code id: tags:
``` python
# only searches for elements, not text
print(bs_obj.find_all("Sleep"))
# print(bs_obj.find_all("Sleep"))
# if not present returns None
print(bs_obj.find("Sleep"))
# print(bs_obj.find("Sleep"))
```
%% Output
[]
None
%% Cell type:markdown id: tags:
You can invoke `find` or `find_all` on other BeautifulSoup object instances.
Find all `li` elements and find `b` element inside the second `li` element.
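A minimal sketch of a nested search, reusing the to-do list HTML from earlier in the notebook. The variable names `soup`, `second_item`, and `bold_in_second` are chosen for illustration.

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
soup = BeautifulSoup(html_string, "html.parser")

# First search the whole document for <li> elements ...
li_elements = soup.find_all("li")

# ... then search only inside the second <li> for a <b> element.
second_item = li_elements[1]
bold_in_second = second_item.find("b")

print(bold_in_second.get_text())
```

Searching from `second_item` rather than from `soup` skips the `<b>To Do List</b>` element, which lies outside the list.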
%% Cell type:code id: tags:
``` python
li_elements = ...
li_elements
```
%% Output
'More'
%% Cell type:code id: tags:
``` python
li_elements[1].find("b")
```
%% Output
<b>More</b>
%% Cell type:code id: tags:
``` python
li_elements[1].find("b").text
```
%% Output
'More'
%% Cell type:markdown id: tags:
### DOM trees are hierarchical. You can use `.children` on any element to get its children.
%% Cell type:markdown id: tags:
Find all the children of the "ul" element.
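The sketch below shows one way to walk the children of the list, again using the to-do HTML from this notebook. Note that `.children` is an iterator, so it is converted to a list here. When the HTML has whitespace between tags, the children also include those whitespace strings; this example has none.

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
soup = BeautifulSoup(html_string, "html.parser")

ul_element = soup.find("ul")

# Convert the iterator to a list so it can be indexed and printed.
children = list(ul_element.children)

# get_text() strips the tags, so the nested <b> does not appear in the text.
child_texts = [child.get_text() for child in children]

print(children)
print(child_texts)
```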
%% Cell type:code id: tags:
``` python
ul_elements = ...
ul_elements.children
```
%% Output
[<li>Eat Healthy</li>, <li>Sleep <b>More</b></li>, <li>Exercise</li>]
%% Cell type:markdown id: tags:
Find the text of every child element.
%% Cell type:code id: tags:
``` python
```
%% Output
['Eat Healthy', 'Sleep More', 'Exercise']
%% Cell type:markdown id: tags:
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`, yet `get_text()` returns the plain string `'Sleep More'`.
%% Cell type:markdown id: tags:
To understand `attribute`, let's go back to the table from warmup 1.
New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
%% Cell type:code id: tags:
``` python
html_string = """
<table>
<tr>
<th>University</th>
<th>Department</th>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://stat.wisc.edu/">Statistics</a></td>
</tr>
<tr>
<td>UW-Madison</td>
<td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
</tr>
<tr>
<td>UC Berkeley</td>
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
"""
```
%% Cell type:markdown id: tags:
Find the table headers.
%% Cell type:code id: tags:
``` python
bs_obj = BeautifulSoup(html_string, "html.parser")
th_elements = ...
th_elements
```
%% Output
[<th>University</th>, <th>Department</th>]
%% Cell type:markdown id: tags:
Find the first anchor element, extract its text.
%% Cell type:code id: tags:
``` python
anchor_element = ...
anchor_element
```
%% Output
'Computer Sciences'
%% Cell type:markdown id: tags:
You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute to its value.
Now, let's get the attributes of the anchor element.
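As a small sketch, the snippet below reads the `href` attribute of a single anchor in three ways. The one-line HTML string is cut down from the table above; the `title` attribute lookup is only there to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://www.cs.wisc.edu/">Computer Sciences</a>', "html.parser")
anchor = soup.find("a")

attrs = anchor.attrs            # a dict of attribute name -> value
href = anchor.attrs["href"]     # ordinary dict lookup
same_href = anchor["href"]      # a Tag also supports indexing by attribute name
missing = anchor.get("title")   # get() returns None instead of raising KeyError

print(attrs)
print(href)
print(missing)
```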
%% Cell type:code id: tags:
``` python
anchor_element.attrs
```
%% Output
{'href': 'https://www.cs.wisc.edu/'}
%% Cell type:markdown id: tags:
What is the return value type of `.attrs`?
%% Cell type:code id: tags:
``` python
type(anchor_element.attrs)
```
%% Output
dict
%% Cell type:markdown id: tags:
Extract the hyperlink.
%% Cell type:code id: tags:
``` python
```
%% Output
'https://www.cs.wisc.edu/'
%% Cell type:markdown id: tags:
Extract hyperlinks for each department and populate department name and link into a `dict`.
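One possible way to build the dictionary is sketched below on a shortened copy of the table (two rows only, to keep the example small). It is an illustration under that assumption, not the reference solution.

```python
from bs4 import BeautifulSoup

# A shortened version of the warmup table, used only for this sketch.
html_table = """<table>
<tr><th>University</th><th>Department</th></tr>
<tr><td>UW-Madison</td><td><a href="https://www.cs.wisc.edu/">Computer Sciences</a></td></tr>
<tr><td>UW-Madison</td><td><a href="https://stat.wisc.edu/">Statistics</a></td></tr>
</table>"""

soup = BeautifulSoup(html_table, "html.parser")

department_urls = {}  # Key: department name; Value: website URL
for anchor in soup.find_all("a"):
    # The anchor's text is the department name; its href attribute is the link.
    department_urls[anchor.get_text()] = anchor.attrs["href"]

print(department_urls)
```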
%% Cell type:code id: tags:
``` python
department_urls = {} # Key: department name; Value: website URL
anchor_elements = bs_obj.find_all("a")
anchor_elements
```
%% Output
Computer Sciences https://www.cs.wisc.edu/
Statistics https://stat.wisc.edu/
CDIS https://cdis.wisc.edu/
Electrical Engineering and Computer Sciences https://eecs.berkeley.edu/
{'Computer Sciences': 'https://www.cs.wisc.edu/',
'Statistics': 'https://stat.wisc.edu/',
'CDIS': 'https://cdis.wisc.edu/',
'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}
%% Cell type:markdown id: tags:
#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
#### Self-practice: Find all anchor links that include piazza in the CS 220 page
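The filtering step can be tried offline first. The sketch below uses a hand-written `page_html` string in place of the live page (the `example.com` link and the anchor without an `href` are made up) and keeps only the links whose URL contains "piazza".

```python
from bs4 import BeautifulSoup

# page_html is a stand-in for the downloaded syllabus page (an assumption).
page_html = """
<a href="https://piazza.com/wisc/fall2022/cs220/home">Piazza</a>
<a href="https://example.com/schedule.html">Schedule</a>
<a name="top">No link here</a>
"""

soup = BeautifulSoup(page_html, "html.parser")

# get("href", "") avoids a KeyError for anchors that have no href attribute.
piazza_urls = [
    anchor.get("href")
    for anchor in soup.find_all("a")
    if "piazza" in anchor.get("href", "")
]

print(piazza_urls)
```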
%% Cell type:code id: tags:
``` python
# Get this page using requests.
url = "https://www.msyamkumar.com/cs220/s22/syllabus.html"
url = "https://cs220.cs.wisc.edu/f22/syllabus.html"
r = ...
# make sure there is no error
# read the entire contents of the page into a single string variable
html_data = ...
# use BeautifulSoup to extract title
# create a BeautifulSoup object
bs_obj = ...
# find all anchor elements
anchor_elements = ...
# print out all URLS to piazza
```
%% Cell type:markdown id: tags:
%% Output
/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
## Parsing small_movies html table to extract `small_movies.json`
['https://piazza.com/wisc/fall2022/cs220/home',
'https://piazza.com/wisc/fall2022/cs220/home']
%% Cell type:markdown id: tags:
### https://www.msyamkumar.com/cs220/f21/syllabus.html
### Scraping Tables
### Parsing small_movies html table to extract `small_movies.json`
%% Cell type:markdown id: tags:
### Step 1: Read `small_movies.html` content into a variable
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Step 2: Initialize BeautifulSoup object instance
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Step 3: Find table element
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Step 4: Find all th tags, to parse the table header
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
- "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion
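The conversions listed above can be sketched on a single row without any HTML at all. In the snippet below, `header` and `cells` are hypothetical lists of strings standing in for the scraped header and second row; the raw cell formats in `small_movies.html` may differ. `format_revenue` repeats the helper from this notebook so that the example is self-contained.

```python
def format_revenue(revenue):
    # Same logic as the helper defined in this notebook.
    if type(revenue) == float:
        return revenue
    elif revenue[-1] == 'M':
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6

# Hypothetical scraped strings for the header and one data row.
header = ["Title", "Year", "Runtime", "Rating", "Revenue"]
cells = ["Guardians of the Galaxy", "2014", "121", "8.1", "333.13"]

row = {}
for name, value in zip(header, cells):
    if name in ("Year", "Runtime"):
        row[name] = int(value)
    elif name == "Rating":
        row[name] = float(value)
    elif name == "Revenue":
        row[name] = format_revenue(value)
    else:
        row[name] = value  # leave every other column as a string

print(row)
```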
%% Cell type:code id: tags:
``` python
def format_revenue(revenue):
if type(revenue) == float: # need this in here if we run code multiple times
return revenue
elif revenue[-1] == 'M': # some have an "M" at the end
return float(revenue[:-1]) * 1e6
else: # otherwise, assume millions.
return float(revenue) * 1e6
```
%% Cell type:code id: tags:
``` python
# Why second row? Because first row has the header information.
```
%% Cell type:markdown id: tags:
### Step 6: Scrape all rows, convert data to appropriate types, and populate data into a row dictionary and append row dictionaries into a list
- "Year", "Runtime": `int` conversion
- "Revenue": format_revenue(...) conversion
- "Rating": `float` conversion
You can compare your parsing output to `small_movies.json` file contents, to confirm your result.
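A minimal sketch of the row loop follows, using a tiny inline table rather than `small_movies.html`. The three-column layout and the `converters` dictionary are assumptions made for illustration; the real file has more columns.

```python
from bs4 import BeautifulSoup

# A tiny inline table standing in for the movies file.
html = """<table>
<tr><th>Title</th><th>Year</th><th>Rating</th></tr>
<tr><td>Split</td><td>2016</td><td>7.3</td></tr>
<tr><td>Sing</td><td>2016</td><td>7.2</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
header = [th.get_text() for th in table.find_all("th")]

# Map each column name to a conversion function; unlisted columns stay strings.
converters = {"Year": int, "Rating": float}

rows = []
for tr in table.find_all("tr")[1:]:   # skip the first row, which holds the header
    cells = [td.get_text() for td in tr.find_all("td")]
    row = {}
    for name, value in zip(header, cells):
        row[name] = converters.get(name, str)(value)
    rows.append(row)

print(rows)
```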
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Final step: convert steps 1 through 6 into a function and use that function to parse `full_movies.html` file.
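One way the steps might be combined into a function is sketched below. It assumes the file holds a single table whose first row is a `th` header row; the demo writes a made-up one-movie table to a temporary file instead of reading `full_movies.html`. `format_revenue` again repeats the notebook's helper.

```python
import os
import tempfile
from bs4 import BeautifulSoup

def format_revenue(revenue):
    # Same logic as the helper defined in this notebook.
    if type(revenue) == float:
        return revenue
    elif revenue[-1] == 'M':
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6

def parse_html(path):
    """Read an HTML file containing one table and return a list of row dicts."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    table = soup.find("table")
    header = [th.get_text() for th in table.find_all("th")]
    converters = {"Year": int, "Runtime": int,
                  "Rating": float, "Revenue": format_revenue}
    rows = []
    for tr in table.find_all("tr")[1:]:   # skip the header row
        cells = [td.get_text() for td in tr.find_all("td")]
        rows.append({name: converters.get(name, str)(value)
                     for name, value in zip(header, cells)})
    return rows

# Demo on a made-up table written to a temporary file.
sample = ("<table><tr><th>Title</th><th>Year</th><th>Runtime</th>"
          "<th>Rating</th><th>Revenue</th></tr>"
          "<tr><td>Demo Movie</td><td>2016</td><td>117</td>"
          "<td>7.3</td><td>5M</td></tr></table>")

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_movies.html")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(sample)
    movies = parse_html(demo_path)

print(movies)
```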
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
full_movies_data = parse_html("full_movies.html")
# full_movies_data
```
[
{
"Title": "Guardians of the Galaxy",
"Genre": "Action,Adventure,Sci-Fi",
"Director": "James Gunn",
"Cast": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana",
"Year": 2014,
"Runtime": 121,
"Rating": 8.1,
"Revenue": 333130000.0
},
{
"Title": "Prometheus",
"Genre": "Adventure,Mystery,Sci-Fi",
"Director": "Ridley Scott",
"Cast": "Noomi Rapace, Logan Marshall-Green, Michael fassbender, Charlize Theron",
"Year": 2012,
"Runtime": 124,
"Rating": 7.0,
"Revenue": 126460000.0
},
{
"Title": "Split",
"Genre": "Horror,Thriller",
"Director": "M. Night Shyamalan",
"Cast": "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula",
"Year": 2016,
"Runtime": 117,
"Rating": 7.3,
"Revenue": 138120000.0
},
{
"Title": "Sing",
"Genre": "Animation,Comedy,Family",
"Director": "Christophe Lourdelet",
"Cast": "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson",
"Year": 2016,
"Runtime": 108,
"Rating": 7.2,
"Revenue": 270320000.0
},
{
"Title": "Suicide Squad",
"Genre": "Action,Adventure,Fantasy",
"Director": "David Ayer",
"Cast": "Will Smith, Jared Leto, Margot Robbie, Viola Davis",
"Year": 2016,
"Runtime": 123,
"Rating": 6.2,
"Revenue": 325020000.0
},
{
"Title": "The Great Wall",
"Genre": "Action,Adventure,Fantasy",
"Director": "Yimou Zhang",
"Cast": "Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",
"Year": 2016,
"Runtime": 103,
"Rating": 6.1,
"Revenue": 45130000.0
},
{
"Title": "La La Land",
"Genre": "Comedy,Drama,Music",
"Director": "Damien Chazelle",
"Cast": "Ryan Gosling, Emma Stone, Rosemarie DeWitt, J.K. Simmons",
"Year": 2016,
"Runtime": 128,
"Rating": 8.3,
"Revenue": 151060000.0
},
{
"Title": "Mindhorn",
"Genre": "Comedy",
"Director": "Sean Foley",
"Cast": "Essie Davis, Andrea Riseborough, Julian Barratt,Kenneth Branagh",
"Year": 2016,
"Runtime": 89,
"Rating": 6.4,
"Revenue": 0.0
},
{
"Title": "The Lost City of Z",
"Genre": "Action,Adventure,Biography",
"Director": "James Gray",
"Cast": "Charlie Hunnam, Robert Pattinson, Sienna Miller, Tom Holland",
"Year": 2016,
"Runtime": 141,
"Rating": 7.1,
"Revenue": 8010000.0
},
{
"Title": "Passengers",
"Genre": "Adventure,Drama,Romance",
"Director": "Morten Tyldum",
"Cast": "Jennifer Lawrence, Chris Pratt, Michael Sheen,Laurence Fishburne",
"Year": 2016,
"Runtime": 116,
"Rating": 7.0,
"Revenue": 100010000.0
}
]
\ No newline at end of file