" <td><ahref = \"https://eecs.berkeley.edu/\">Electrical Engineering and Computer Sciences</a></td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Warmup 2: Scraping data from syllabus page"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings\n",
"In order to render a HTML page, most web browsers use a tree structure called Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.\n",
"Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of second child element's text is enclosed within `<b>More</b>`. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand `attribute`, let's go back to the table from warmup 1.\n",
"\n",
"\n",
"New syntax, you can use `\"\"\"some really long string\"\"\"` to split a string across multiple lines."
"Find the first anchor element, extract its text."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Computer Sciences'"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anchor_element = ...\n",
"anchor_element"
]
},
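{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible completion (a sketch, assuming `bs_obj` is the BeautifulSoup object built from the warmup 1 table): `find(\"a\")` returns the first anchor element, and `get_text()` extracts its text.\n",
"\n",
"```python\n",
"# sketch: first <a> element in the parsed table, and its text\n",
"anchor_element = bs_obj.find(\"a\")\n",
"anchor_element.get_text()   # 'Computer Sciences'\n",
"```"
]
},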
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get the attributes associated with an element using `.attrs` on that element object. Return value will be a `dict` mapping each attribute to its value.\n",
"\n",
"Now, let's get the attributes of the anchor element."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'href': 'https://www.cs.wisc.edu/'}"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anchor_element.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the return value type of `.attrs`?"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dict"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(anchor_element.attrs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract the hyperlink."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'https://www.cs.wisc.edu/'"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": []
},
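{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to fill in the cell above (a sketch): index into the `attrs` dict with the `\"href\"` key, or use the element's `get` method.\n",
"\n",
"```python\n",
"# sketch: pull the hyperlink out of the anchor element's attributes\n",
"anchor_element.attrs[\"href\"]   # 'https://www.cs.wisc.edu/'\n",
"# equivalently: anchor_element.get(\"href\")\n",
"```"
]
},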
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract hyperlinks for each department and populate department name and link into a `dict`."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Computer Sciences https://www.cs.wisc.edu/\n",
"Statistics https://stat.wisc.edu/\n",
"CDIS https://cdis.wisc.edu/\n",
"Electrical Engineering and Computer Sciences https://eecs.berkeley.edu/\n"
]
},
{
"data": {
"text/plain": [
"{'Computer Sciences': 'https://www.cs.wisc.edu/',\n",
" 'Statistics': 'https://stat.wisc.edu/',\n",
" 'CDIS': 'https://cdis.wisc.edu/',\n",
" 'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"department_urls = {} # Key: department name; Value: website URL\n",
"\n",
"anchor_elements = bs_obj.find_all(\"a\")\n",
"anchor_elements\n"
]
},
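{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of the loop that fills `department_urls` (assuming `bs_obj` parses the warmup 1 table): each anchor's text is the department name and its `href` attribute is the URL.\n",
"\n",
"```python\n",
"# sketch: map each department name to its URL\n",
"for anchor in anchor_elements:\n",
"    name = anchor.get_text()\n",
"    url = anchor.attrs[\"href\"]\n",
"    print(name, url)\n",
"    department_urls[name] = url\n",
"department_urls\n",
"```"
]
},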
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Self-practice: Find all anchor links that include piazza in the CS 220 page"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings\n",
<td><ahref = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
%% Cell type:markdown id: tags:
### Warmup 2: Scraping data from syllabus page
%% Cell type:code id: tags:
``` python
# Get this page using requests.
url="https://cs220.cs.wisc.edu/f22/syllabus.html"
r=requests.get(url,verify=False)
# make sure there is no error
# read the entire contents of the page into a single string variable
html_str=...
# split the contents into list of strings using newline separator
#html_lines = ...
#html_lines[:10]
```
%% Output
/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
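%% Cell type:markdown id: tags:
A possible completion of the cell above (a sketch): `raise_for_status()` raises an error for a bad status code, `r.text` reads the whole page into one string, and `split("\n")` breaks it into lines.
%% Cell type:code id: tags:
``` python
# sketch: check for errors, then read and split the page contents
r.raise_for_status()               # raises an exception if the request failed
html_str = r.text                  # entire page as a single string
html_lines = html_str.split("\n")  # list of lines
html_lines[:10]
```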
%% Cell type:code id: tags:
``` python
# Takeaway: It would be nice if there were a module that could make finding easy!
```
%% Output
<li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>
%% Cell type:markdown id: tags:
### Learning Objectives:
- Using the Document Object Model of web pages
- describe the 3 things a DOM element may contain, and give examples of each
- given an HTML string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an HTML string and use `prettify` to display them
- Use the BeautifulSoup methods `find` and `find_all` to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using `get_text()`, `children`, and `attrs`
- Use BeautifulSoup to scrape a live website.
%% Cell type:markdown id: tags:
### Document Object Model
In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.
<div>
<imgsrc="attachment:image.png"width="600"/>
</div>
%% Cell type:markdown id: tags:
### Take a look at the HTML in the below cell.
%% Cell type:markdown id: tags:
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
%% Cell type:markdown id: tags:
### BeautifulSoup constructor
- takes HTML, as a string, as an argument and parses it
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`.
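%% Cell type:markdown id: tags:
A minimal sketch of these ideas, re-using the To Do List HTML above (the variable names here are illustrative): the constructor parses the string, `prettify()` displays the tree, and `find()`, `children`, and `get_text()` expose an element's contents.
%% Cell type:code id: tags:
``` python
from bs4 import BeautifulSoup

todo_html = """<b>To Do List</b>
<ul>
  <li>Eat Healthy</li>
  <li>Sleep <b>More</b></li>
  <li>Exercise</li>
</ul>"""

todo_obj = BeautifulSoup(todo_html, "html.parser")
print(todo_obj.prettify())        # display the parsed DOM tree

ul_element = todo_obj.find("ul")  # first <ul> element
for child in ul_element.children: # children include <li> elements and whitespace strings
    print(repr(child))

print(ul_element.get_text())      # text only; tags like <b>More</b> are stripped
```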
%% Cell type:markdown id: tags:
To understand `attribute`, let's go back to the table from warmup 1.
New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
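A possible next step (a sketch, assuming `bs_obj` is the BeautifulSoup object built from the warmup 1 table) is to grab the first anchor element and extract its text:
%% Cell type:code id: tags:
``` python
# sketch: first <a> element in the parsed warmup 1 table, and its text
anchor_element = bs_obj.find("a")
anchor_element.get_text()   # 'Computer Sciences'
```
%% Cell type:markdown id: tags: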
You can get the attributes associated with an element using `.attrs` on that element object. The return value will be a `dict` mapping each attribute to its value.
Now, let's get the attributes of the anchor element.
%% Cell type:code id: tags:
``` python
anchor_element.attrs
```
%% Output
{'href': 'https://www.cs.wisc.edu/'}
%% Cell type:markdown id: tags:
What is the return value type of `.attrs`?
%% Cell type:code id: tags:
``` python
type(anchor_element.attrs)
```
%% Output
dict
%% Cell type:markdown id: tags:
Extract the hyperlink.
%% Cell type:code id: tags:
``` python
```
%% Output
'https://www.cs.wisc.edu/'
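%% Cell type:markdown id: tags:
One way to fill in the cell above (a sketch): index into the `attrs` dict with the `"href"` key, or use the element's `get` method.
%% Cell type:code id: tags:
``` python
# sketch: pull the hyperlink out of the anchor element's attributes
anchor_element.attrs["href"]   # 'https://www.cs.wisc.edu/'
# equivalently: anchor_element.get("href")
```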
%% Cell type:markdown id: tags:
Extract the hyperlink for each department and populate a `dict` mapping each department name to its URL.
%% Cell type:code id: tags:
``` python
department_urls = {} # Key: department name; Value: website URL
anchor_elements = bs_obj.find_all("a")
anchor_elements
```
%% Output
Computer Sciences https://www.cs.wisc.edu/
Statistics https://stat.wisc.edu/
CDIS https://cdis.wisc.edu/
Electrical Engineering and Computer Sciences https://eecs.berkeley.edu/
{'Computer Sciences': 'https://www.cs.wisc.edu/',
'Statistics': 'https://stat.wisc.edu/',
'CDIS': 'https://cdis.wisc.edu/',
'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}
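%% Cell type:markdown id: tags:
A sketch of the loop that produces the printed lines and the dict above (assuming `bs_obj` parses the warmup 1 table): each anchor's text is the department name and its `href` attribute is the URL.
%% Cell type:code id: tags:
``` python
# sketch: map each department name to its URL
for anchor in anchor_elements:
    name = anchor.get_text()
    url = anchor.attrs["href"]
    print(name, url)
    department_urls[name] = url
department_urls
```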
%% Cell type:markdown id: tags:
#### Self-practice: Find all anchor links that include piazza in the CS 220 page
%% Cell type:code id: tags:
``` python
# read the entire contents of the page into a single string variable
html_data = ...
# create a BeautifulSoup object
bs_obj = ...
# find all anchor elements
anchor_elements = ...
# print out all URLs to piazza
```
%% Output
/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
warnings.warn(
['https://piazza.com/wisc/fall2022/cs220/home',
'https://piazza.com/wisc/fall2022/cs220/home']
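%% Cell type:markdown id: tags:
A possible solution (a sketch, assuming `r` is the response for the syllabus page fetched earlier and `BeautifulSoup` is imported): parse the page, find every anchor, and keep the `href` values that contain "piazza".
%% Cell type:code id: tags:
``` python
# sketch: collect the hrefs of anchors that link to piazza
html_data = r.text
bs_obj = BeautifulSoup(html_data, "html.parser")
anchor_elements = bs_obj.find_all("a")
piazza_urls = []
for anchor in anchor_elements:
    url = anchor.get("href")
    if url is not None and "piazza" in url:
        piazza_urls.append(url)
piazza_urls
```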
%% Cell type:markdown id: tags:
### Scraping Tables
### Parsing the `small_movies.html` table to extract `small_movies.json`
%% Cell type:markdown id: tags:
### Step 1: Read `small_movies.html` content into a variable