"In order to render a HTML page, most web browsers use a tree structure called Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.\n",
"- Second argument specifies what kind of parsing we want done"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"html_string = \"<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>\"\n",
"\n",
"\n",
"\n",
"type(bs_obj)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## BeautifulSoup operations\n",
"- `prettify()` returns a formatted representation of the raw HTML\n",
"\n",
"### A BeautifulSoup object can be searched for elements using:\n",
"- `find(\"\")` returns the first element matching the tag string, None otherwise\n",
"- `find_all(\"\")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise\n",
"\n",
"### Beautiful Soup Elements can be inspected by using:\n",
"- `get_text()` returns the text associated with this element, if applicable; does not return the child elements associated with that element\n",
"- `.children` all children of this element (can be converted into a list)\n",
"- `.attrs` the atribute associated with that element / tag."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`prettify()` returns a formatted representation of the raw HTML"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`find` returns the first HTML 'tag' matching the string \"b\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the type of find's return value?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do we extract the text of the \"b\" element and what is its type?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`find` returns None if it cannot find that element."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# assert that this html string has a <ul> tag\n",
"\n",
"# assert that this does not have an <a> tag\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`find_all` returns an iterable of all matching elements (HTML 'tags') matching the string \"b\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the type of return value of `find_all`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use a for loop to print the text of each \"b\" element."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike `find`, `find_all` returns an empty iterable, when there are no matching elements."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# only searches for elements, not text\n",
"print(bs_obj.find_all(\"Sleep\")) \n",
"# if not present returns None\n",
"print(bs_obj.find(\"Sleep\")) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can invoke `find` or `find_all` on other BeautifulSoup object instances.\n",
"\n",
"Find all `li` elements and find `b` element inside the second `li` element."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DOM trees are hierarchical. You can use `.children` on any element to gets its children."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find all the children of \"ul\" element."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find text of every child element."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of second child element's text is enclosed within `<b>More</b>`. Despite that `get_text()`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To understand `attribute`, let's go back to the table from warmup 1.\n",
"\n",
"\n",
"New syntax, you can use `\"\"\"some really long string\"\"\"` to split a string across multiple lines."
" <td><a href = \"https://eecs.berkeley.edu/\">Electrical Engineering and Computer Sciences</a></td>\n",
" </tr>\n",
"</table>\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the table headers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the first anchor element, extract its text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get the attributes associated with an element using `.attrs` on that element object. Return value will be a `dict` mapping each attribute to its value.\n",
"\n",
"Now, let's get the attributes of the anchor element."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What is the return value type of `.attrs`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract the hyperlink."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract hyperlinks for each department and populate department name and link into a `dict`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"department_urls = {} # Key: department name; Value: website URL\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)"
# read the entire contents of the page into a single string variable
# split the contents into list of strings using newline separator
```
%% Cell type:markdown id: tags:
#### Warmup 2a: Find all sentences that contain "Meena"
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### Warmup 2b: Extract title tag's value
%% Cell type:code id: tags:
``` python
# finally, we are able to extract the title tag's data
# Takeaway: It would be nice if there were a module that could make finding easy!
```
%% Cell type:markdown id: tags:
### Learning Objectives:
- Using the Document Object Model of web pages
- describe the 3 things a DOM element may contain, and give examples of each
- given an html string, identify the correct DOM tree of elements
- Create BeautifulSoup objects from an html string and use `prettify` to display them
- Use the BeautifulSoup methods `find` and `find_all` to find particular elements by their tag
- Inspect a BeautifulSoup element to determine the contents of a web page using `get_text()`, `.children`, and `.attrs`
- Use BeautifulSoup to scrape a live web site.
%% Cell type:markdown id: tags:
### Document Object Model
In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the page as a hierarchy of elements.
<div>
<img src="attachment:image.png" width="600"/>
</div>
%% Cell type:markdown id: tags:
### Take a look at the HTML in the below cell.
%% Cell type:markdown id: tags:
<b>To Do List</b>
<ul>
<li>Eat Healthy</li>
<li>Sleep <b>More</b></li>
<li>Exercise</li>
</ul>
%% Cell type:markdown id: tags:
### BeautifulSoup constructor
- Takes an HTML document, as a string, as its first argument and parses it
- The second argument names the parser to use, for example `"html.parser"`
%% Cell type:code id: tags:
``` python
html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
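from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)
bs_obj = BeautifulSoup(html_string, "html.parser")  # parse with Python's built-in HTML parser
type(bs_obj)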
```
%% Cell type:markdown id: tags:
## BeautifulSoup operations
- `prettify()` returns a formatted representation of the raw HTML
### A BeautifulSoup object can be searched for elements using:
- `find("")` returns the first element matching the tag string, None otherwise
- `find_all("")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
### BeautifulSoup elements can be inspected by using:
- `get_text()` returns the text of this element and of all its descendants, with the HTML tags removed
- `.children` an iterator over the direct children of this element (can be converted into a list)
- `.attrs` a `dict` of the attributes associated with that element / tag.
%% Cell type:markdown id: tags:
`prettify()` returns a formatted representation of the raw HTML
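
One way the next cell might be filled in, as a minimal sketch. It assumes the `bs4` package is installed and rebuilds `bs_obj` from the to-do list HTML with the built-in `"html.parser"`, so the snippet runs on its own.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# prettify() returns a str with one tag per line, indented by nesting depth
pretty = bs_obj.prettify()
print(pretty)
```

Because `prettify()` returns a plain string, it is best wrapped in `print` so that the line breaks are rendered.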
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
`find` returns the first HTML 'tag' matching the string "b"
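
A possible answer for the cells that follow, assuming the same to-do list HTML and the built-in `"html.parser"`. The name `first_b` is introduced here only for illustration.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# find() stops at the first <b> element in document order
first_b = bs_obj.find("b")
print(first_b)        # the whole element, tags included
print(type(first_b))  # an element object, not a string
```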
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
What is the type of find's return value?
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
How do we extract the text of the "b" element and what is its type?
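
A minimal sketch of one answer, again assuming the to-do list HTML from earlier; `text` is a name chosen for this example.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# get_text() drops the <b> tags and keeps only the text inside them
text = bs_obj.find("b").get_text()
print(text)
print(type(text))
```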
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
`find` returns None if it cannot find that element.
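
The two assertions requested in the next cell could look like the sketch below, which assumes the same to-do list HTML.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# The string contains a <ul> element, so find() returns an element
assert bs_obj.find("ul") is not None

# There is no <a> element anywhere, so find() returns None
assert bs_obj.find("a") is None
```

Comparing against `None` with `is` / `is not` is the idiomatic check here.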
%% Cell type:code id: tags:
``` python
# assert that this html string has a <ul> tag
# assert that this does not have an <a> tag
```
%% Cell type:markdown id: tags:
`find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
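
A sketch of how the next few cells might be completed, assuming the to-do list HTML from earlier; `b_elements` is an illustrative name.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# find_all() collects every <b> element, at any depth, in document order
b_elements = bs_obj.find_all("b")
print(type(b_elements))
print(len(b_elements))

# Each item is an element, so get_text() works on it
for b in b_elements:
    print(b.get_text())
```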
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
What is the type of the return value of `find_all`?
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Use a for loop to print the text of each "b" element.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Unlike `find`, which returns None, `find_all` returns an empty iterable when there are no matching elements.
%% Cell type:code id: tags:
``` python
# only searches for elements, not text
print(bs_obj.find_all("Sleep"))
# if not present returns None
print(bs_obj.find("Sleep"))
```
%% Cell type:markdown id: tags:
You can also invoke `find` or `find_all` on any element that BeautifulSoup returns, not only on the top-level object.
Find all `li` elements, then find the `b` element inside the second `li` element.
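
One possible solution, sketched under the assumption that the to-do list HTML is unchanged. The names `li_elements`, `second_li` and `inner_b` are hypothetical.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

li_elements = bs_obj.find_all("li")
second_li = li_elements[1]  # index 1 is the second <li>

# Searching from second_li only looks inside that element
inner_b = second_li.find("b")
print(inner_b.get_text())
```

Searching from `second_li` skips the `<b>To Do List</b>` element, which lies outside the list.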
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
DOM trees are hierarchical. You can use `.children` on any element to get its children.
%% Cell type:markdown id: tags:
Find all the children of the "ul" element.
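
A minimal sketch covering this exercise and the text-extraction one after it, assuming the to-do list HTML from earlier (which has no whitespace between the `li` tags, so every child is an element). `ul` and `children` are illustrative names.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

ul = bs_obj.find("ul")

# .children is an iterator over direct children; list() makes it reusable
children = list(ul.children)
print(children)

# get_text() on each child returns its text without any tags
for child in children:
    print(child.get_text())
```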
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Find the text of every child element.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns the plain text `Sleep More`, without the `<b>` tags.
%% Cell type:markdown id: tags:
To understand attributes, let's go back to the table from warmup 1.
New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
<td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
</tr>
</table>
"""
```
%% Cell type:markdown id: tags:
Find the table headers.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Find the first anchor element and extract its text.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.
Now, let's get the attributes of the anchor element.
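
The sketch below shows how `.attrs` might be used in the next few cells. The full table from warmup 1 is not reproduced here, so `table_html` is a hypothetical one-row stand-in built around the single row that is visible; `table_obj` and `anchor` are also names chosen for this example.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

# Hypothetical stand-in for the warmup table: only this row is known
table_html = """
<table>
  <tr>
    <td><a href="https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
  </tr>
</table>
"""
table_obj = BeautifulSoup(table_html, "html.parser")

anchor = table_obj.find("a")
print(anchor.attrs)          # dict of attribute name -> value
print(type(anchor.attrs))
print(anchor.attrs["href"])  # the hyperlink itself
```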
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
What is the return value type of `.attrs`?
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Extract the hyperlink.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Extract hyperlinks for each department and populate department name and link into a `dict`.
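
One way to approach this, as a sketch only. The table below is a hypothetical stand-in: the first row is the one visible in the notebook, and the second row ("Example Department") is invented purely so the loop has more than one anchor to visit.

```python
from bs4 import BeautifulSoup  # third-party package (beautifulsoup4)

# Hypothetical table; the second row is made-up illustration data
table_html = """
<table>
  <tr>
    <td><a href="https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
  </tr>
  <tr>
    <td><a href="https://example.com/">Example Department</a></td>
  </tr>
</table>
"""
table_obj = BeautifulSoup(table_html, "html.parser")

department_urls = {}  # Key: department name; Value: website URL
for anchor in table_obj.find_all("a"):
    # The anchor's text is the name; its href attribute is the link
    department_urls[anchor.get_text()] = anchor.attrs["href"]

print(department_urls)
```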
%% Cell type:code id: tags:
``` python
department_urls = {} # Key: department name; Value: website URL
```
%% Cell type:markdown id: tags:
#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)