diff --git a/f22/andy_lec_notes/lec31_Nov21_Web3/lec_31_web3_completed.ipynb b/f22/andy_lec_notes/lec31_Nov21_Web3/lec_31_web3_completed.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..41d5165bdacf83ab63abfddb53843f5d8b2fba76 --- /dev/null +++ b/f22/andy_lec_notes/lec31_Nov21_Web3/lec_31_web3_completed.ipynb @@ -0,0 +1,1298 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Web3: Scraping Web Data" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [], + "source": [ + "# import statements\n", + "import requests\n", + "from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Warmup 1: HTML table and hyperlinks\n", + "In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.\n", + "\n", + "TODO: Add another row or two to the table below" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<table>\n", + " <tr>\n", + " <th>University</th>\n", + " <th>Department</th>\n", + " </tr>\n", + " <tr>\n", + " <td>UW-Madison</td>\n", + " <td><a href = \"https://www.cs.wisc.edu/\">Computer Sciences</a></td>\n", + " </tr>\n", + " <tr>\n", + " <td>UW-Madison</td>\n", + " <td><a href = \"https://stat.wisc.edu/\">Statistics</a></td>\n", + " </tr>\n", + " <tr>\n", + " <td>UW-Madison</td>\n", + " <td><a href = \"https://cdis.wisc.edu/\">CDIS</a></td>\n", + " </tr>\n", + " <tr>\n", + " <td>UC Berkeley</td>\n", + " <td><a href = \"https://eecs.berkeley.edu/\">Electrical Engineering and Computer Sciences</a></td>\n", + " </tr>\n", + "\n", + "</table>" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Warmup 2: Scraping data from syllabus page" + ] + }, + { + "cell_type": "code", + "execution_count": 80, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stderr", + "output_type": 
"stream", + "text": [ + "/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings\n", + " warnings.warn(\n" + ] + }, + { + "data": { + "text/plain": [ + "['<!doctype html>',\n", + " '<html lang=\"en\">',\n", + " ' <head>',\n", + " ' <meta charset=\"utf-8\">',\n", + " ' <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, shrink-to-fit=no\">',\n", + " ' <meta name=\"description\" content=\"\">',\n", + " ' <meta name=\"author\" content=\"\">',\n", + " '',\n", + " ' <!-- Google Auth stuff -->',\n", + " ' <meta name=\"google-signin-scope\" content=\"profile email\">']" + ] + }, + "execution_count": 80, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Get this page using requests. \n", + "url = \"https://cs220.cs.wisc.edu/f22/syllabus.html\"\n", + "r = requests.get(url, verify=False)\n", + "\n", + "# make sure there is no error\n", + "r.raise_for_status()\n", + "\n", + "# read the entire contents of the page into a single string variable\n", + "html_str = r.text\n", + "html_str[:100]\n", + "\n", + "\n", + "# split the contents into list of strings using newline separator\n", + "html_lines = html_str.split('\\n')\n", + "html_lines[:10]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Warmup 2: find all lines with 'Kuemmel'" + ] + }, + { + "cell_type": "code", + "execution_count": 81, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "<li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>\n" + ] + } + ], + "source": [ + "for line in html_lines:\n", + " if \"Kuemmel\" in line:\n", + " print(line)\n", + "\n", + "# Takeaway: It would be nice if there 
were a module that could make finding easy!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Learning Objectives:\n", + "\n", + "- Using the Document Object Model of web pages\n", + "  - describe the 3 things a DOM element may contain, and give examples of each\n", + "  - given an html string, identify the correct DOM tree of elements\n", + "- Create BeautifulSoup objects from an html string and use prettify to display\n", + "- Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag\n", + "- Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs\n", + "- Use BeautifulSoup to scrape a live web site. " + ] + }, + { + "attachments": { + "image.png": { + "image/png": "" + } + }, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Document Object Model\n", + "\n", + "In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the page as a hierarchy of elements.\n", + "\n", + "<div>\n", + "<img src=\"attachment:image.png\" width=\"600\"/>\n", + "</div>" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Take a look at the HTML in the cell below." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "<b>To Do List</b>\n", + "<ul>\n", + " <li>Eat Healthy</li>\n", + " <li>Sleep <b>More</b></li>\n", + " <li>Exercise</li>\n", + "</ul>" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### BeautifulSoup constructor\n", + "- takes an HTML document, as a string, as its first argument and parses it\n", + "- Syntax: `BeautifulSoup(<html_string>, \"html.parser\")`\n", + "- The second argument specifies which parser to use" + ] + }, + { + "cell_type": "code", + "execution_count": 82, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "bs4.BeautifulSoup" + ] + }, + "execution_count": 82, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "html_string = \"<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>\"\n", + "\n", + "bs_obj = BeautifulSoup(html_string, \"html.parser\")\n", + "\n", + "type(bs_obj)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## BeautifulSoup operations\n", + "- `prettify()` returns a formatted representation of the raw HTML\n", + "\n", + "### A BeautifulSoup object can be searched for elements using:\n", + "- `find(\"\")` returns the first element matching the tag string, None otherwise\n", + "- `find_all(\"\")` returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise\n", + "\n", + "### BeautifulSoup elements can be inspected by using:\n", + "- `.text` returns the text inside this element, including the text of nested elements, without the HTML tags\n", + "- `.children` all children of this element (can be converted into a list)\n", + "- `.attrs` a `dict` of the attributes of that element / tag." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`prettify()` returns a formatted representation of the raw HTML" + ] + }, + { + "cell_type": "code", + "execution_count": 83, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'<b>\\n To Do List\\n</b>\\n<ul>\\n <li>\\n Eat Healthy\\n </li>\\n <li>\\n Sleep\\n <b>\\n More\\n </b>\\n </li>\\n <li>\\n Exercise\\n </li>\\n</ul>'" + ] + }, + "execution_count": 83, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bs_obj.prettify()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`find` returns the first HTML 'tag' matching the string \"b\"" + ] + }, + { + "cell_type": "code", + "execution_count": 84, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<b>To Do List</b>" + ] + }, + "execution_count": 84, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bs_obj.find(\"b\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the type of find's return value?" + ] + }, + { + "cell_type": "code", + "execution_count": 85, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "bs4.element.Tag" + ] + }, + "execution_count": 85, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(bs_obj.find(\"b\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "How do we extract the text of the \"b\" element and what is its type?" + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'To Do List'" + ] + }, + "execution_count": 86, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bs_obj.find(\"b\").text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`find` returns None if it cannot find that element." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 88, + "metadata": {}, + "outputs": [], + "source": [ + "# assert that this html string has a <ul> tag\n", + "assert bs_obj.find(\"ul\") is not None\n", + "\n", + "# assert that this does not have an <a> tag\n", + "assert bs_obj.find(\"a\") is None" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`find_all` returns an iterable of all elements (HTML 'tags') matching the string \"b\"" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[<b>To Do List</b>, <b>More</b>]" + ] + }, + "execution_count": 89, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bold_elements = bs_obj.find_all(\"b\")\n", + "bold_elements" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the type of the return value of `find_all`?" + ] + }, + { + "cell_type": "code", + "execution_count": 90, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "bs4.element.ResultSet" + ] + }, + "execution_count": 90, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(bold_elements)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "bs4.element.Tag" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(bold_elements[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use a for loop to print the text of each \"b\" element." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 92, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "To Do List\n", + "More\n" + ] + } + ], + "source": [ + "for element in bold_elements:\n", + "    print(element.text)\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unlike `find`, which returns None, `find_all` returns an empty iterable when there are no matching elements." + ] + }, + { + "cell_type": "code", + "execution_count": 94, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[]\n", + "None\n" + ] + } + ], + "source": [ + "# find_all matches tag names, not text, so this is an empty list\n", + "print(bs_obj.find_all(\"Sleep\")) \n", + "# find returns None when no tag matches\n", + "print(bs_obj.find(\"Sleep\")) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also invoke `find` or `find_all` on any element (`Tag`) returned by an earlier search.\n", + "\n", + "Find all `li` elements, then find the `b` element inside the second `li` element." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 98, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'More'" + ] + }, + "execution_count": 98, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "li_elements = bs_obj.find_all(\"li\")\n", + "li_elements[1].find(\"b\").text" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<b>More</b>" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "li_elements[1].find(\"b\")" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'More'" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "li_elements[1].find(\"b\").text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### DOM trees are hierarchical. You can use `.children` on any element to get its children.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Find all the children of the \"ul\" element." + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[<li>Eat Healthy</li>, <li>Sleep <b>More</b></li>, <li>Exercise</li>]" + ] + }, + "execution_count": 100, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ul_element = bs_obj.find(\"ul\")\n", + "list(ul_element.children)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Find the text of every child element." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 102, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Eat Healthy', 'Sleep More', 'Exercise']" + ] + }, + "execution_count": 102, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "[e.get_text() for e in ul_element.children]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that `get_text()` returns only the text and not the HTML markup. For example, part of the second child element's text is enclosed within `<b>More</b>`, yet `get_text()` returns plain `Sleep More`. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To understand attributes, let's go back to the table from warmup 1.\n", + "\n", + "\n", + "New syntax: you can use triple quotes, `\"\"\"some really long string\"\"\"`, to write a string that spans multiple lines." + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "metadata": {}, + "outputs": [], + "source": [ + "html_string = \"\"\"\n", + "<table>\n", + " <tr>\n", + " <th>University</th>\n", + " <th>Department</th>\n", + " </tr>\n", + " <tr>\n", + " <td>UW-Madison</td>\n", + " <td><a href = \"https://www.cs.wisc.edu/\">Computer Sciences</a></td>\n", + " </tr>\n", + " <tr>\n", + " <td>UW-Madison</td>\n", + " <td><a href = \"https://stat.wisc.edu/\">Statistics</a></td>\n", + " </tr>\n", + " <tr>\n", + " <td>UW-Madison</td>\n", + " <td><a href = \"https://cdis.wisc.edu/\">CDIS</a></td>\n", + " </tr>\n", + " <tr>\n", + " <td>UC Berkeley</td>\n", + " <td><a href = \"https://eecs.berkeley.edu/\">Electrical Engineering and Computer Sciences</a></td>\n", + " </tr>\n", + "</table>\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Find the table headers." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 104, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[<th>University</th>, <th>Department</th>]" + ] + }, + "execution_count": 104, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "bs_obj = BeautifulSoup(html_string, \"html.parser\")\n", + "th_elements = bs_obj.find_all(\"th\")\n", + "th_elements" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Find the first anchor element, extract its text." + ] + }, + { + "cell_type": "code", + "execution_count": 106, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<a href=\"https://www.cs.wisc.edu/\">Computer Sciences</a>" + ] + }, + "execution_count": 106, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anchor_element = bs_obj.find(\"a\")\n", + "anchor_element" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can get the attributes associated with an element using `.attrs` on that element object. Return value will be a `dict` mapping each attribute to its value.\n", + "\n", + "Now, let's get the attributes of the anchor element." + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'href': 'https://www.cs.wisc.edu/'}" + ] + }, + "execution_count": 107, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anchor_element.attrs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What is the type of `.attrs`?" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "dict" + ] + }, + "execution_count": 108, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(anchor_element.attrs)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extract the hyperlink." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 109, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'https://www.cs.wisc.edu/'" + ] + }, + "execution_count": 109, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anchor_element.attrs['href']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extract hyperlinks for each department and populate department name and link into a `dict`." + ] + }, + { + "cell_type": "code", + "execution_count": 114, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'Computer Sciences': 'https://www.cs.wisc.edu/',\n", + " 'Statistics': 'https://stat.wisc.edu/',\n", + " 'CDIS': 'https://cdis.wisc.edu/',\n", + " 'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}" + ] + }, + "execution_count": 114, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "department_urls = {} # Key: department name; Value: website URL\n", + "\n", + "anchor_elements = bs_obj.find_all(\"a\")\n", + "anchor_elements\n", + "\n", + "for element in anchor_elements:\n", + " key = element.text\n", + " value = element.attrs['href']\n", + " #print(key, value)\n", + " department_urls[key] = value\n", + "department_urls" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Self-practice: Find all anchor links that include piazza in the CS 220 page" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "https://piazza.com/wisc/fall2022/cs220/home\n", + "https://piazza.com/wisc/fall2022/cs220/home\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. 
Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "# Get this page using requests. \n", + "url = \"https://cs220.cs.wisc.edu/f22/syllabus.html\"\n", + "r = requests.get(url, verify=False)\n", + "\n", + "# make sure there is no error\n", + "r.raise_for_status()\n", + "\n", + "# read the entire contents of the page into a single string variable\n", + "html_data = r.text\n", + "\n", + "# create a BeautifulSoup object\n", + "bs_obj = BeautifulSoup(html_data, 'html.parser')\n", + "\n", + "# find all anchor elements\n", + "anchor_elements = bs_obj.find_all(\"a\")\n", + "\n", + "# print out all URLs to piazza\n", + "for e in anchor_elements:\n", + "    link = e.attrs.get('href', '')  # '' if the anchor has no href\n", + "    if 'piazza' in link:\n", + "        print(link)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Scraping Tables\n", + "### Parsing small_movies html table to extract `small_movies.json`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1: Read `small_movies.html` content into a variable" + ] + }, + { + "cell_type": "code", + "execution_count": 125, + "metadata": {}, + "outputs": [], + "source": [ + "f = open(\"small_movies.html\")\n", + "small_movies_str = f.read()\n", + "f.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Initialize BeautifulSoup object instance" + ] + }, + { + "cell_type": "code", + "execution_count": 126, + "metadata": {}, + "outputs": [], + "source": [ + "bs_obj = BeautifulSoup(small_movies_str, \"html.parser\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 3: Find table element" + ] + }, + { + "cell_type": "code", + "execution_count": 127, + "metadata": {}, + "outputs": [], + "source": [ + "table = bs_obj.find(\"table\") # find returns the first table, so this assumes the page has exactly 1 table" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### Step 4: Find all th tags, to parse the table header" + ] + }, + { + "cell_type": "code", + "execution_count": 128, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Title', 'Genre', 'Director', 'Cast', 'Year', 'Runtime', 'Rating', 'Revenue']" + ] + }, + "execution_count": 128, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "header = [th.get_text() for th in table.find_all('th')]\n", + "header" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary\n", + "- \"Year\", \"Runtime\": `int` conversion\n", + "- \"Revenue\": format_revenue(...) conversion\n", + "- \"Rating\": `float` conversion" + ] + }, + { + "cell_type": "code", + "execution_count": 129, + "metadata": {}, + "outputs": [], + "source": [ + "def format_revenue(revenue):\n", + " if type(revenue) == float: # need this in here if we run code multiple times\n", + " return revenue\n", + " elif revenue[-1] == 'M': # some have an \"M\" at the end\n", + " return float(revenue[:-1]) * 1e6\n", + " else: # otherwise, assume millions.\n", + " return float(revenue) * 1e6" + ] + }, + { + "cell_type": "code", + "execution_count": 130, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'Title': 'Guardians of the Galaxy',\n", + " 'Genre': 'Action,Adventure,Sci-Fi',\n", + " 'Director': 'James Gunn',\n", + " 'Cast': 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana',\n", + " 'Year': 2014,\n", + " 'Runtime': 121,\n", + " 'Rating': 8.1,\n", + " 'Revenue': 333130000.0}" + ] + }, + "execution_count": 130, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Why second row? 
Because the first row has the header information.\n", + "\n", + "\n", + "movie = {}\n", + "\n", + "tr_elements = table.find_all('tr')\n", + "tr = tr_elements[1]\n", + "td_elements = tr.find_all('td')\n", + "for idx in range(len(td_elements)):\n", + "    td = td_elements[idx]\n", + "    val = td.get_text()\n", + "    if header[idx] in [\"Year\", \"Runtime\"]:\n", + "        movie[header[idx]] = int(val)\n", + "    elif header[idx] == \"Revenue\":\n", + "        revenue = format_revenue(val)\n", + "        movie[header[idx]] = revenue\n", + "    elif header[idx] == \"Rating\":\n", + "        movie[header[idx]] = float(val)\n", + "    else:\n", + "        movie[header[idx]] = val\n", + "\n", + "movie" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a dictionary, and append the dictionaries to a list\n", + "- \"Year\", \"Runtime\": `int` conversion\n", + "- \"Revenue\": format_revenue(...) conversion\n", + "- \"Rating\": `float` conversion\n", + "\n", + "You can compare your parsing output to the contents of `small_movies.json` to confirm your result." + ] + }, + { + "cell_type": "code", + "execution_count": 131, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'Title': 'Guardians of the Galaxy',\n", + " 'Genre': 'Action,Adventure,Sci-Fi',\n", + " 'Director': 'James Gunn',\n", + " 'Cast': 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana',\n", + " 'Year': 2014,\n", + " 'Runtime': 121,\n", + " 'Rating': 8.1,\n", + " 'Revenue': 333130000.0},\n", + " {'Title': 'Prometheus',\n", + " 'Genre': 'Adventure,Mystery,Sci-Fi',\n", + " 'Director': 'Ridley Scott',\n", + " 'Cast': 'Noomi Rapace, Logan Marshall-Green, Michael fassbender, Charlize Theron',\n", + " 'Year': 2012,\n", + " 'Runtime': 124,\n", + " 'Rating': 7.0,\n", + " 'Revenue': 126460000.0},\n", + " {'Title': 'Split',\n", + " 'Genre': 'Horror,Thriller',\n", + " 'Director': 'M. 
Night Shyamalan',\n", + " 'Cast': 'James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula',\n", + " 'Year': 2016,\n", + " 'Runtime': 117,\n", + " 'Rating': 7.3,\n", + " 'Revenue': 138120000.0},\n", + " {'Title': 'Sing',\n", + " 'Genre': 'Animation,Comedy,Family',\n", + " 'Director': 'Christophe Lourdelet',\n", + " 'Cast': 'Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson',\n", + " 'Year': 2016,\n", + " 'Runtime': 108,\n", + " 'Rating': 7.2,\n", + " 'Revenue': 270320000.0},\n", + " {'Title': 'Suicide Squad',\n", + " 'Genre': 'Action,Adventure,Fantasy',\n", + " 'Director': 'David Ayer',\n", + " 'Cast': 'Will Smith, Jared Leto, Margot Robbie, Viola Davis',\n", + " 'Year': 2016,\n", + " 'Runtime': 123,\n", + " 'Rating': 6.2,\n", + " 'Revenue': 325020000.0},\n", + " {'Title': 'The Great Wall',\n", + " 'Genre': 'Action,Adventure,Fantasy',\n", + " 'Director': 'Yimou Zhang',\n", + " 'Cast': 'Matt Damon, Tian Jing, Willem Dafoe, Andy Lau',\n", + " 'Year': 2016,\n", + " 'Runtime': 103,\n", + " 'Rating': 6.1,\n", + " 'Revenue': 45130000.0},\n", + " {'Title': 'La La Land',\n", + " 'Genre': 'Comedy,Drama,Music',\n", + " 'Director': 'Damien Chazelle',\n", + " 'Cast': 'Ryan Gosling, Emma Stone, Rosemarie DeWitt, J.K. 
Simmons',\n", + " 'Year': 2016,\n", + " 'Runtime': 128,\n", + " 'Rating': 8.3,\n", + " 'Revenue': 151060000.0},\n", + " {'Title': 'Mindhorn',\n", + " 'Genre': 'Comedy',\n", + " 'Director': 'Sean Foley',\n", + " 'Cast': 'Essie Davis, Andrea Riseborough, Julian Barratt,Kenneth Branagh',\n", + " 'Year': 2016,\n", + " 'Runtime': 89,\n", + " 'Rating': 6.4,\n", + " 'Revenue': 0.0},\n", + " {'Title': 'The Lost City of Z',\n", + " 'Genre': 'Action,Adventure,Biography',\n", + " 'Director': 'James Gray',\n", + " 'Cast': 'Charlie Hunnam, Robert Pattinson, Sienna Miller, Tom Holland',\n", + " 'Year': 2016,\n", + " 'Runtime': 141,\n", + " 'Rating': 7.1,\n", + " 'Revenue': 8010000.0},\n", + " {'Title': 'Passengers',\n", + " 'Genre': 'Adventure,Drama,Romance',\n", + " 'Director': 'Morten Tyldum',\n", + " 'Cast': 'Jennifer Lawrence, Chris Pratt, Michael Sheen,Laurence Fishburne',\n", + " 'Year': 2016,\n", + " 'Runtime': 116,\n", + " 'Rating': 7.0,\n", + " 'Revenue': 100010000.0}]" + ] + }, + "execution_count": 131, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "movies_data = []\n", + "\n", + "tr_elements = table.find_all('tr')\n", + "for tr in tr_elements[1:]: # Skip first row (header row)\n", + "    movie = {}\n", + "    td_elements = tr.find_all('td')\n", + "    for idx in range(len(td_elements)):\n", + "        td = td_elements[idx]\n", + "        val = td.get_text()\n", + "        if header[idx] in [\"Year\", \"Runtime\"]:\n", + "            movie[header[idx]] = int(val)\n", + "        elif header[idx] == \"Revenue\":\n", + "            revenue = format_revenue(val)\n", + "            movie[header[idx]] = revenue\n", + "        elif header[idx] == \"Rating\":\n", + "            movie[header[idx]] = float(val)\n", + "        else:\n", + "            movie[header[idx]] = val\n", + "    movies_data.append(movie)\n", + "movies_data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final step: convert steps 1 through 6 into a function and use that function to parse the `full_movies.html` file." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 132, + "metadata": {}, + "outputs": [], + "source": [ + "def parse_html(html_file):\n", + "    # read the entire file into a single string\n", + "    with open(html_file) as f:\n", + "        html_str = f.read()\n", + "\n", + "    bs_obj = BeautifulSoup(html_str, \"html.parser\")\n", + "\n", + "    table = bs_obj.find(\"table\") # find returns the first table, so this assumes exactly 1 table\n", + "    header = [th.get_text() for th in table.find_all('th')]\n", + "\n", + "    movies_data = []\n", + "\n", + "    tr_elements = table.find_all('tr')\n", + "    for tr in tr_elements[1:]: # Skip first row (header row)\n", + "        movie = {}\n", + "        td_elements = tr.find_all('td')\n", + "        for idx in range(len(td_elements)):\n", + "            td = td_elements[idx]\n", + "            val = td.get_text()\n", + "            if header[idx] in [\"Year\", \"Runtime\"]:\n", + "                movie[header[idx]] = int(val)\n", + "            elif header[idx] == \"Revenue\":\n", + "                revenue = format_revenue(val)\n", + "                movie[header[idx]] = revenue\n", + "            elif header[idx] == \"Rating\":\n", + "                movie[header[idx]] = float(val)\n", + "            else:\n", + "                movie[header[idx]] = val\n", + "        movies_data.append(movie)\n", + "\n", + "    return movies_data" + ] + }, + { + "cell_type": "code", + "execution_count": 133, + "metadata": {}, + "outputs": [], + "source": [ + "full_movies_data = parse_html(\"full_movies.html\")\n", + "# full_movies_data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.12" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}