Merge branch 'main' of git.doit.wisc.edu:cdis/cs/courses/cs220/cs220-lecture-material

Commit a3a565fd authored 2 years ago by msyamkumar
--- a/f22/andy_lec_notes/lec_31/.ipynb_checkpoints/lec_31_template-checkpoint.ipynb
+++ b/f22/andy_lec_notes/lec_31/.ipynb_checkpoints/lec_31_template-checkpoint.ipynb
--- a/f22/andy_lec_notes/lec_31/full_movies.html
+++ b/f22/andy_lec_notes/lec_31/full_movies.html
--- a/f22/andy_lec_notes/lec_31/lec_31_web3.ipynb
+++ b/f22/andy_lec_notes/lec_31/lec_31_web3.ipynb
--- a/f22/andy_lec_notes/lec_31/lec_31_web3_template.ipynb
+++ b/f22/andy_lec_notes/lec_31/lec_31_web3_template.ipynb
@@ -9,7 +9,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -60,56 +60,84 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### Warmup 2: Scraping data from syllabus page\n",
-    "URL: https://www.msyamkumar.com/cs220/s22/syllabus.html"
+    "### Warmup 2: Scraping data from syllabus page"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 6,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "['<!doctype html>',\n",
+       " '<html lang=\"en\">',\n",
+       " '  <head>',\n",
+       " '    <meta charset=\"utf-8\">',\n",
+       " '    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, shrink-to-fit=no\">',\n",
+       " '    <meta name=\"description\" content=\"\">',\n",
+       " '    <meta name=\"author\" content=\"\">',\n",
+       " '',\n",
+       " '    <!-- Google Auth stuff -->',\n",
+       " '    <meta name=\"google-signin-scope\" content=\"profile email\">']"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": [
    "# Get this page using requests.  \n",
-    "url = \"https://www.msyamkumar.com/cs220/s22/syllabus.html\"\n",
+    "url = \"https://cs220.cs.wisc.edu/f22/syllabus.html\"\n",
+    "r = requests.get(url, verify=False)\n",
    "\n",
    "# make sure there is no error\n",
    "\n",
+    "\n",
    "# read the entire contents of the page into a single string variable\n",
+    "html_str = ...\n",
    "\n",
-    "# split the contents into list of strings using newline separator\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "#### Warmup 2a: Find all sentences that contain \"Meena\""
+    "\n",
+    "# split the contents into list of strings using newline separator\n",
+    "#html_lines = ...\n",
+    "#html_lines[:10]"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "#### Warmup 2b: Extract title tag's value"
+    "#### Warmup 2: find all lines with 'Kuemmel'"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "<li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>\n"
+     ]
+    }
+   ],
   "source": [
    "\n",
    "\n",
-    "# finally, we are able to extract the title tag's data\n",
    "# Takeaway:  It would be nice if there were a module that could make finding easy!"
   ]
  },
@@ -177,13 +205,24 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "bs4.BeautifulSoup"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": [
    "html_string = \"<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>\"\n",
    "\n",
-    "\n",
+    "bs_obj = BeautifulSoup(..., \"html.parser\")\n",
    "\n",
    "type(bs_obj)"
   ]
@@ -200,7 +239,7 @@
    "- `find_all(\"\")`     returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise\n",
    "\n",
    "### Beautiful Soup Elements can be inspected by using:\n",
-    "- `get_text()`     returns the text associated with this element, if applicable; does not return the child elements associated with that element\n",
+    "- `text`    returns the text of this element, including the text inside its child elements, but none of the HTML tags\n",
    "- `.children`      all children of this element (can be converted into a list)\n",
    "- `.attrs`          the attributes associated with that element / tag, as a `dict`"
   ]
@@ -214,10 +253,23 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'<b>\\n To Do List\\n</b>\\n<ul>\\n <li>\\n  Eat Healthy\\n </li>\\n <li>\\n  Sleep\\n  <b>\\n   More\\n  </b>\\n </li>\\n <li>\\n  Exercise\\n </li>\\n</ul>'"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# bs_obj.prettify()"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -228,10 +280,23 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<b>To Do List</b>"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# bs_obj.find(\"b\")"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -242,9 +307,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "bs4.element.Tag"
+      ]
+     },
+     "execution_count": 17,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": []
  },
  {
@@ -256,9 +332,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 23,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'To Do List'"
+      ]
+     },
+     "execution_count": 23,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": []
  },
  {
@@ -270,13 +357,15 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "# assert that this html string has a <ul> tag\n",
+    "assert bs_obj.find(\"ul\") ...\n",
    "\n",
-    "# assert that this does not have an <a> tag\n"
+    "# assert that this does not have an <a> tag\n",
+    "assert bs_obj.find(\"a\") ..."
   ]
  },
  {
@@ -288,10 +377,24 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<b>To Do List</b>, <b>More</b>]"
+      ]
+     },
+     "execution_count": 33,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bold_elements = ...\n",
+    "bold_elements"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -302,17 +405,43 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "bs4.element.ResultSet"
+      ]
+     },
+     "execution_count": 34,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "type(bold_elements)"
+   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 35,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "bs4.element.Tag"
+      ]
+     },
+     "execution_count": 35,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "type(bold_elements[0])"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -323,10 +452,23 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 36,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "To Do List\n",
+      "More\n"
+     ]
+    }
+   ],
+   "source": [
+    "for element in bold_elements:\n",
+    "    print(...)\n",
+    "    "
+   ]
  },
  {
   "cell_type": "markdown",
@@ -337,14 +479,23 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 37,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "[]\n",
+      "None\n"
+     ]
+    }
+   ],
   "source": [
    "# only searches for elements, not text\n",
-    "print(bs_obj.find_all(\"Sleep\"))  \n",
+    "# print(bs_obj.find_all(\"Sleep\"))  \n",
    "# if not present returns None\n",
-    "print(bs_obj.find(\"Sleep\"))      "
+    "# print(bs_obj.find(\"Sleep\"))      "
   ]
  },
  {
@@ -358,16 +509,71 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 41,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'More'"
+      ]
+     },
+     "execution_count": 41,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "li_elements = ...\n",
+    "li_elements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 42,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<b>More</b>"
+      ]
+     },
+     "execution_count": 42,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "li_elements[1].find(\"b\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 43,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'More'"
+      ]
+     },
+     "execution_count": 43,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "li_elements[1].find(\"b\").text"
+   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "DOM trees are hierarchical. You can use `.children` on any element to gets its children."
+    "### DOM trees are hierarchical. You can use `.children` on any element to get its children.\n",
+    "\n"
   ]
  },
  {
@@ -379,10 +585,24 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<li>Eat Healthy</li>, <li>Sleep <b>More</b></li>, <li>Exercise</li>]"
+      ]
+     },
+     "execution_count": 48,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "ul_elements = ...\n",
+    "ul_elements.children"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -393,16 +613,27 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 49,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['Eat Healthy', 'Sleep More', 'Exercise']"
+      ]
+     },
+     "execution_count": 49,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of second child element's text is enclosed within `<b>More</b>`. Despite that `get_text()`"
+    "Notice that `get_text()` returns only the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`, but only `More` appears in the result."
   ]
  },
  {
@@ -417,7 +648,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -456,10 +687,25 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 52,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[<th>University</th>, <th>Department</th>]"
+      ]
+     },
+     "execution_count": 52,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "bs_obj = BeautifulSoup(html_string, \"html.parser\")\n",
+    "th_elements = ...\n",
+    "th_elements"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -470,10 +716,24 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 53,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'Computer Sciences'"
+      ]
+     },
+     "execution_count": 53,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "anchor_element = ...\n",
+    "anchor_element"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -486,10 +746,23 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 54,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'href': 'https://www.cs.wisc.edu/'}"
+      ]
+     },
+     "execution_count": 54,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "anchor_element.attrs"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -500,10 +773,23 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
+   "execution_count": 55,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "dict"
+      ]
+     },
+     "execution_count": 55,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "type(anchor_element.attrs)"
+   ]
  },
  {
   "cell_type": "markdown",
@@ -514,9 +800,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 56,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'https://www.cs.wisc.edu/'"
+      ]
+     },
+     "execution_count": 56,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": []
  },
  {
@@ -528,48 +825,97 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 64,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Computer Sciences https://www.cs.wisc.edu/\n",
+      "Statistics https://stat.wisc.edu/\n",
+      "CDIS https://cdis.wisc.edu/\n",
+      "Electrical Engineering and Computer Sciences https://eecs.berkeley.edu/\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "{'Computer Sciences': 'https://www.cs.wisc.edu/',\n",
+       " 'Statistics': 'https://stat.wisc.edu/',\n",
+       " 'CDIS': 'https://cdis.wisc.edu/',\n",
+       " 'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}"
+      ]
+     },
+     "execution_count": 64,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": [
-    "department_urls = {} # Key: department name; Value: website URL\n"
+    "department_urls = {} # Key: department name; Value: website URL\n",
+    "\n",
+    "anchor_elements = bs_obj.find_all(\"a\")\n",
+    "anchor_elements\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)"
+    "#### Self-practice: Find all anchor links that include piazza in the CS 220 page"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
+   "execution_count": 71,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings\n",
+      "  warnings.warn(\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "['https://piazza.com/wisc/fall2022/cs220/home',\n",
+       " 'https://piazza.com/wisc/fall2022/cs220/home']"
+      ]
+     },
+     "execution_count": 71,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
   "source": [
    "# Get this page using requests.  \n",
-    "url = \"https://www.msyamkumar.com/cs220/s22/syllabus.html\"\n",
-    "\n",
+    "url = \"https://cs220.cs.wisc.edu/f22/syllabus.html\"\n",
+    "r = ...\n",
    "# make sure there is no error\n",
    "\n",
+    "\n",
    "# read the entire contents of the page into a single string variable\n",
+    "html_data = ...\n",
    "\n",
-    "# use BeautifulSoup to extract title\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Parsing small_movies html table to extract `small_movies.json`"
+    "# create a BeautifulSoup object\n",
+    "bs_obj = ...\n",
+    "\n",
+    "# find all anchor elements\n",
+    "anchor_elements = ...\n",
+    "\n",
+    "# print out all URLS to piazza"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### https://www.msyamkumar.com/cs220/f21/syllabus.html\n"
+    "### Scraping Tables\n",
+    "### Parsing small_movies html table to extract `small_movies.json`"
   ]
  },
  {
@@ -723,7 +1069,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.9.7"
+   "version": "3.9.12"
  }
 },
 "nbformat": 4,

 %% Cell type:markdown id: tags:

 # Web3: Scraping Web Data

 %% Cell type:code id: tags:

 ``` python
 # import statements
 import requests
 from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
 ```

 %% Cell type:markdown id: tags:

 ### Warmup 1: HTML table and hyperlinks
 In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.

 TODO: Add another row or two to the table below

 %% Cell type:markdown id: tags:

 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
    </tr>
 </table>

 %% Cell type:markdown id: tags:

 ### Warmup 2: Scraping data from syllabus page
-URL: https://www.msyamkumar.com/cs220/s22/syllabus.html

 %% Cell type:code id: tags:

 ``` python
 # Get this page using requests.
-url = "https://www.msyamkumar.com/cs220/s22/syllabus.html"
+url = "https://cs220.cs.wisc.edu/f22/syllabus.html"
+r = requests.get(url, verify=False)

 # make sure there is no error

+
 # read the entire contents of the page into a single string variable
+html_str = ...
+

 # split the contents into list of strings using newline separator
+#html_lines = ...
+#html_lines[:10]
 ```

-%% Cell type:markdown id: tags:
+%% Output

-#### Warmup 2a: Find all sentences that contain "Meena"
+    /Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
+      warnings.warn(

-%% Cell type:code id: tags:
-
-``` python
-```
+    ['<!doctype html>',
+     '<html lang="en">',
+     '  <head>',
+     '    <meta charset="utf-8">',
+     '    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">',
+     '    <meta name="description" content="">',
+     '    <meta name="author" content="">',
+     '',
+     '    <!-- Google Auth stuff -->',
+     '    <meta name="google-signin-scope" content="profile email">']

 %% Cell type:markdown id: tags:

-#### Warmup 2b: Extract title tag's value
+#### Warmup 2: find all lines with 'Kuemmel'

 %% Cell type:code id: tags:

 ``` python


-# finally, we are able to extract the title tag's data
 # Takeaway:  It would be nice if there were a module that could make finding easy!
 ```

+%% Output
+
+    <li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>
+
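The string-only approach from this warmup can be sketched as follows. This is a minimal illustration, not the notebook's solution: `html_str` is a hypothetical stand-in for the downloaded page text, and the second `<li>` line is made up for contrast.

```python
# Hypothetical stand-in for the page text; in the notebook it comes from requests.
html_str = (
    "<ul>\n"
    "<li>Andrew Kuemmel (Teaching Faculty - Department of Computer Sciences) kuemmel@wisc.edu</li>\n"
    "<li>Office hours: see the course calendar</li>\n"  # made-up line for contrast
    "</ul>"
)

# Split the contents into a list of strings using the newline separator.
html_lines = html_str.split("\n")

# Keep only the lines that mention the search term.
matching_lines = [line for line in html_lines if "Kuemmel" in line]
for line in matching_lines:
    print(line)
```

Searching line by line finds the text but still leaves the surrounding `<li>` tags in place, which is the motivation for using a parsing module.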
 %% Cell type:markdown id: tags:

 ### Learning Objectives:

 - Using the Document Object Model of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an html string, identify the correct DOM tree of elements
 - Create BeautifulSoup objects from an html string and use prettify to display
 - Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
 - Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
 - Use BeautifulSoup to scrape a live web site.

 %% Cell type:markdown id: tags:

 ### Document Object Model

 In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the page as a hierarchy of elements.

 <div>
 <img src="attachment:image.png" width="600"/>
 </div>

 %% Cell type:markdown id: tags:

 ### Take a look at the HTML in the below cell.

 %% Cell type:markdown id: tags:

 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>

 %% Cell type:markdown id: tags:

 ### BeautifulSoup constructor
 - takes HTML, as a string, as its argument and parses it
 - Syntax: `BeautifulSoup(<html_string>, "html.parser")`
 - Second argument specifies what kind of parsing we want done
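
As a rough sketch of the constructor call described above, using the to-do list HTML from this notebook (the variable name `pretty` is an assumption made for illustration):

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"

# First argument: the HTML as a string. Second argument: which parser to use.
bs_obj = BeautifulSoup(html_string, "html.parser")
print(type(bs_obj))

# prettify() returns a str with one tag or text item per line, indented by depth.
pretty = bs_obj.prettify()
print(pretty)
```

`"html.parser"` is the parser that ships with Python, so this sketch needs no parser library beyond `bs4` itself.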

 %% Cell type:code id: tags:

 ``` python
 html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"

-
+bs_obj = BeautifulSoup(..., "html.parser")

 type(bs_obj)
 ```

+%% Output
+
+    bs4.BeautifulSoup
+
 %% Cell type:markdown id: tags:

 ## BeautifulSoup operations
 - `prettify()`        returns a formatted representation of the raw HTML

 ### A  BeautifulSoup object can be searched for elements using:
 - `find("")`         returns the first element matching the tag string, None otherwise
 - `find_all("")`     returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise

 ### Beautiful Soup Elements can be inspected by using:
- `get_text()`     returns the text associated with this element, if applicable; does not return the child elements associated with that element
+- `text`    returns the text of this element, including the text inside its child elements, but none of the HTML tags
 - `.children`      all children of this element (can be converted into a list)
 - `.attrs`          the attributes associated with that element / tag, as a `dict`
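
The search operations listed above can be tried together on the to-do list HTML. This is one possible sketch; the names `first_bold`, `bold_texts`, and `nested_bold` are illustrative and not part of the notebook template.

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# find returns only the first matching element.
first_bold = bs_obj.find("b")
print(first_bold.text)

# find_all returns every matching element; the result can be looped over or indexed.
bold_elements = bs_obj.find_all("b")
bold_texts = [element.text for element in bold_elements]
print(bold_texts)

# find and find_all also work on an individual element, searching only inside it.
li_elements = bs_obj.find_all("li")
nested_bold = li_elements[1].find("b")
print(nested_bold.text)
```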

 %% Cell type:markdown id: tags:

 `prettify()` returns a formatted representation of the raw HTML

 %% Cell type:code id: tags:

 ``` python
+# bs_obj.prettify()
 ```

+%% Output
+
+    '<b>\n To Do List\n</b>\n<ul>\n <li>\n  Eat Healthy\n </li>\n <li>\n  Sleep\n  <b>\n   More\n  </b>\n </li>\n <li>\n  Exercise\n </li>\n</ul>'
+
 %% Cell type:markdown id: tags:

 `find` returns the first HTML 'tag' matching the string "b"

 %% Cell type:code id: tags:

 ``` python
+# bs_obj.find("b")
 ```

+%% Output
+
+    <b>To Do List</b>
+
 %% Cell type:markdown id: tags:

 What is the type of find's return value?

 %% Cell type:code id: tags:

 ``` python
 ```

+%% Output
+
+    bs4.element.Tag
+
 %% Cell type:markdown id: tags:

 How do we extract the text of the "b" element and what is its type?

 %% Cell type:code id: tags:

 ``` python
 ```

+%% Output
+
+    'To Do List'
+
 %% Cell type:markdown id: tags:

 `find` returns None if it cannot find that element.

 %% Cell type:code id: tags:

 ``` python
 # assert that this html string has a <ul> tag
+assert bs_obj.find("ul") ...

 # assert that this does not have an <a> tag
+assert bs_obj.find("a") ...
 ```

 %% Cell type:markdown id: tags:

 `find_all` returns an iterable of all matching elements (HTML 'tags') matching the string "b"

 %% Cell type:code id: tags:

 ``` python
+bold_elements = ...
+bold_elements
 ```

+%% Output
+
+    [<b>To Do List</b>, <b>More</b>]
+
 %% Cell type:markdown id: tags:

 What is the type of return value of `find_all`?

 %% Cell type:code id: tags:

 ``` python
+type(bold_elements)
 ```

+%% Output
+
+    bs4.element.ResultSet
+
 %% Cell type:code id: tags:

 ``` python
+type(bold_elements[0])
 ```

+%% Output
+
+    bs4.element.Tag
+
 %% Cell type:markdown id: tags:

 Use a for loop to print the text of each "b" element.

 %% Cell type:code id: tags:

 ``` python
+for element in bold_elements:
+    print(...)
+
 ```

+%% Output
+
+    To Do List
+    More
+
 %% Cell type:markdown id: tags:

 Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.

 %% Cell type:code id: tags:

 ``` python
 # only searches for elements, not text
-print(bs_obj.find_all("Sleep"))
+# print(bs_obj.find_all("Sleep"))
 # if not present returns None
-print(bs_obj.find("Sleep"))
+# print(bs_obj.find("Sleep"))
 ```

+%% Output
+
+    []
+    None
+
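The difference between the two "not found" results matters when writing checks. A minimal sketch, assuming the same to-do list HTML (the names `missing_one` and `missing_all` are illustrative):

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

# find gives back None when no element has the requested tag name.
missing_one = bs_obj.find("a")
print(missing_one)

# find_all gives back an empty result instead, so len() and for loops still work.
missing_all = bs_obj.find_all("a")
print(len(missing_all))

# One way to check for presence or absence of a tag is to compare against None.
has_ul = bs_obj.find("ul") is not None
has_anchor = bs_obj.find("a") is not None
print(has_ul, has_anchor)
```

Because `find` can return `None`, calling `.text` directly on its result raises an `AttributeError` when the tag is absent, so a `None` check before using the result is a reasonable habit.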
 %% Cell type:markdown id: tags:

 You can invoke `find` or `find_all` on other BeautifulSoup object instances.

 Find all `li` elements and find `b` element inside the second `li` element.

 %% Cell type:code id: tags:

 ``` python
+li_elements = ...
+li_elements
+```
+
+%% Output
+
+    'More'
+
+%% Cell type:code id: tags:
+
+``` python
+li_elements[1].find("b")
+```
+
+%% Output
+
+    <b>More</b>
+
+%% Cell type:code id: tags:
+
+``` python
+li_elements[1].find("b").text
 ```

+%% Output
+
+    'More'
+
 %% Cell type:markdown id: tags:

-DOM trees are hierarchical. You can use `.children` on any element to gets its children.
+### DOM trees are hierarchical. You can use `.children` on any element to get its children.
+

 %% Cell type:markdown id: tags:

 Find all the children of "ul" element.

 %% Cell type:code id: tags:

 ``` python
+ul_elements = ...
+ul_elements.children
 ```

+%% Output
+
+    [<li>Eat Healthy</li>, <li>Sleep <b>More</b></li>, <li>Exercise</li>]
+
 %% Cell type:markdown id: tags:

 Find text of every child element.

 %% Cell type:code id: tags:

 ``` python
 ```

+%% Output
+
+    ['Eat Healthy', 'Sleep More', 'Exercise']
+
 %% Cell type:markdown id: tags:

-Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of second child element's text is enclosed within `<b>More</b>`. Despite that `get_text()`
+Notice that `get_text()` returns only the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`, but only `More` appears in the result.
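
A short sketch of walking the children of the `ul` element and collecting their text, assuming the to-do list HTML from earlier (the names `ul_element` and `child_texts` are illustrative):

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Eat Healthy</li><li>Sleep <b>More</b></li><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")

ul_element = bs_obj.find("ul")

# .children is an iterator, so convert it to a list to index it or take its length.
children = list(ul_element.children)
print(len(children))

# get_text() joins the text of an element and of everything nested inside it.
child_texts = [child.get_text() for child in children]
print(child_texts)
```

This HTML string has no whitespace between tags. When a page does have newlines or spaces between tags, `.children` also yields those text pieces, so the list can be longer than the number of child tags.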

 %% Cell type:markdown id: tags:

 To understand `attribute`, let's go back to the table from warmup 1.


 New syntax: you can use `"""some really long string"""` to split a string across multiple lines.

 %% Cell type:code id: tags:

 ``` python
 html_string = """
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
    </tr>
 </table>
 """
 ```

 %% Cell type:markdown id: tags:

 Find the table headers.

 %% Cell type:code id: tags:

 ``` python
+bs_obj = BeautifulSoup(html_string, "html.parser")
+th_elements = ...
+th_elements
 ```

+%% Output
+
+    [<th>University</th>, <th>Department</th>]
+
 %% Cell type:markdown id: tags:

 Find the first anchor element, extract its text.

 %% Cell type:code id: tags:

 ``` python
+anchor_element = ...
+anchor_element
 ```

+%% Output
+
+    'Computer Sciences'
+
 %% Cell type:markdown id: tags:

 You can get the attributes associated with an element using `.attrs` on that element object. Return value will be a `dict` mapping each attribute to its value.

 Now, let's get the attributes of the anchor element.

 %% Cell type:code id: tags:

 ``` python
+anchor_element.attrs
 ```

+%% Output
+
+    {'href': 'https://www.cs.wisc.edu/'}
+
 %% Cell type:markdown id: tags:

 What is the return value type of `.attrs`?

 %% Cell type:code id: tags:

 ``` python
+type(anchor_element.attrs)
 ```

+%% Output
+
+    dict
+
 %% Cell type:markdown id: tags:

 Extract the hyperlink.

 %% Cell type:code id: tags:

 ``` python
 ```

+%% Output
+
+    'https://www.cs.wisc.edu/'
+
 %% Cell type:markdown id: tags:

 Extract hyperlinks for each department and populate department name and link into a `dict`.

 %% Cell type:code id: tags:

 ``` python
 department_urls = {} # Key: department name; Value: website URL
+
+anchor_elements = bs_obj.find_all("a")
+anchor_elements
 ```

+%% Output
+
+    Computer Sciences https://www.cs.wisc.edu/
+    Statistics https://stat.wisc.edu/
+    CDIS https://cdis.wisc.edu/
+    Electrical Engineering and Computer Sciences https://eecs.berkeley.edu/
+
+    {'Computer Sciences': 'https://www.cs.wisc.edu/',
+     'Statistics': 'https://stat.wisc.edu/',
+     'CDIS': 'https://cdis.wisc.edu/',
+     'Electrical Engineering and Computer Sciences': 'https://eecs.berkeley.edu/'}
+
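One way the dictionary above could be built is sketched here on a shortened copy of the warmup table (two of the four data rows). The loop variable name `anchor` is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the warmup table, kept to two data rows.
html_string = """
<table>
  <tr><th>University</th><th>Department</th></tr>
  <tr><td>UW-Madison</td><td><a href="https://www.cs.wisc.edu/">Computer Sciences</a></td></tr>
  <tr><td>UW-Madison</td><td><a href="https://stat.wisc.edu/">Statistics</a></td></tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

department_urls = {}  # Key: department name; Value: website URL
for anchor in bs_obj.find_all("a"):
    # .attrs is a dict, so the link is looked up with the "href" key.
    department_urls[anchor.get_text()] = anchor.attrs["href"]

print(department_urls)
```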
 %% Cell type:markdown id: tags:

-#### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
+#### Self-practice: Find all anchor links that include piazza in the CS 220 page

 %% Cell type:code id: tags:

 ``` python
 # Get this page using requests.
-url = "https://www.msyamkumar.com/cs220/s22/syllabus.html"
-
+url = "https://cs220.cs.wisc.edu/f22/syllabus.html"
+r = ...
 # make sure there is no error

+
 # read the entire contents of the page into a single string variable
+html_data = ...

-# use BeautifulSoup to extract title
+# create a BeautifulSoup object
+bs_obj = ...
+
+# find all anchor elements
+anchor_elements = ...
+
+# print out all URLs that point to Piazza
 ```

-%% Cell type:markdown id: tags:
+%% Output
+
+    /Users/andrewkuemmel/opt/anaconda3/lib/python3.9/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'cs220.cs.wisc.edu'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
+      warnings.warn(

-## Parsing small_movies html table to extract `small_movies.json`
+    ['https://piazza.com/wisc/fall2022/cs220/home',
+     'https://piazza.com/wisc/fall2022/cs220/home']

 %% Cell type:markdown id: tags:

-### https://www.msyamkumar.com/cs220/f21/syllabus.html
+### Scraping Tables
+### Parsing the `small_movies.html` table to reproduce `small_movies.json`

 %% Cell type:markdown id: tags:

 ### Step 1: Read `small_movies.html` content into a variable

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:markdown id: tags:

 ### Step 2: Initialize BeautifulSoup object instance

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:markdown id: tags:

 ### Step 3: Find table element

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:markdown id: tags:

 ### Step 4: Find all th tags, to parse the table header

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:markdown id: tags:

 ### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
 - "Year", "Runtime": `int` conversion
 - "Revenue": `format_revenue(...)` conversion
 - "Rating": `float` conversion

 %% Cell type:code id: tags:

 ``` python
 def format_revenue(revenue):
    if isinstance(revenue, float): # already converted, e.g. when the cell is re-run
        return revenue
    elif revenue[-1] == 'M': # some values end with "M" (millions)
        return float(revenue[:-1]) * 1e6
    else:                    # otherwise, assume the number is in millions
        return float(revenue) * 1e6
 ```

 %% Cell type:code id: tags:

 ``` python
 # Why second row? Because first row has the header information.

 ```
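A sketch of the type conversions for a single row, assuming the header names and cell texts have already been pulled out of the table as lists of strings. The `header` and `cells` values below are illustrative assumptions (the raw text in `small_movies.html` may differ), and `format_revenue` is repeated only so the snippet runs on its own.

```python
def format_revenue(revenue):
    """Convert a revenue string given in millions into a float number of dollars."""
    if isinstance(revenue, float):
        return revenue
    elif revenue[-1] == "M":
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6

# Assumed column names and raw cell text for one table row.
header = ["Title", "Year", "Runtime", "Rating", "Revenue"]
cells = ["Guardians of the Galaxy", "2014", "121", "8.1", "333.13"]

row = {}
for name, text in zip(header, cells):
    if name in ("Year", "Runtime"):
        row[name] = int(text)
    elif name == "Rating":
        row[name] = float(text)
    elif name == "Revenue":
        row[name] = format_revenue(text)
    else:
        row[name] = text  # leave every other column as a string

print(row)
```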

 %% Cell type:markdown id: tags:

 ### Step 6: Scrape all rows, convert data to appropriate types, populate each row into a dictionary, and append the dictionaries to a list
 - "Year", "Runtime": `int` conversion
 - "Revenue": `format_revenue(...)` conversion
 - "Rating": `float` conversion

 To confirm your result, compare your parsed output with the contents of the `small_movies.json` file.

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:markdown id: tags:

 ### Final step: convert steps 1 through 6 into a function and use that function to parse the `full_movies.html` file.

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:code id: tags:

 ``` python
 full_movies_data = parse_html("full_movies.html")
 # full_movies_data
 ```
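One way `parse_html` could look, as a sketch rather than the lecture's reference solution. It assumes the file contains a single table whose first row holds `th` header cells and whose remaining rows hold `td` cells. The demo writes a tiny hypothetical file (`tiny_movies.html`, with made-up cell text) to a temporary directory so the snippet runs without the course files.

```python
import os
import tempfile
from bs4 import BeautifulSoup

def format_revenue(revenue):
    """Convert a revenue string given in millions into a float number of dollars."""
    if isinstance(revenue, float):
        return revenue
    elif revenue[-1] == "M":
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6

def parse_html(file_name):
    """Return a list of row dictionaries parsed from the first table in file_name."""
    with open(file_name, encoding="utf-8") as f:   # Step 1: read the file
        html = f.read()
    bs_obj = BeautifulSoup(html, "html.parser")    # Step 2: build the object
    table = bs_obj.find("table")                   # Step 3: find the table
    header = [th.get_text().strip() for th in table.find_all("th")]  # Step 4

    rows = []
    for tr in table.find_all("tr")[1:]:            # skip the header row
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if len(cells) != len(header):              # ignore malformed rows
            continue
        row = {}
        for name, text in zip(header, cells):      # Steps 5 and 6
            if name in ("Year", "Runtime"):
                row[name] = int(text)
            elif name == "Rating":
                row[name] = float(text)
            elif name == "Revenue":
                row[name] = format_revenue(text)
            else:
                row[name] = text
        rows.append(row)
    return rows

# Hypothetical two-movie table used only to exercise the function.
sample = """<table>
  <tr><th>Title</th><th>Year</th><th>Runtime</th><th>Rating</th><th>Revenue</th></tr>
  <tr><td>Sing</td><td>2016</td><td>108</td><td>7.2</td><td>270.32M</td></tr>
  <tr><td>Mindhorn</td><td>2016</td><td>89</td><td>6.4</td><td>0</td></tr>
</table>"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_movies.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    movies = parse_html(path)

print(movies)
```

Skipping rows whose cell count does not match the header is a design choice made here for robustness; a stricter version could raise an error instead.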

--- a/f22/andy_lec_notes/lec_31/readme.md
+++ b/f22/andy_lec_notes/lec_31/readme.md
--- a/f22/andy_lec_notes/lec_31/small_movies.html
+++ b/f22/andy_lec_notes/lec_31/small_movies.html
--- a/f22/andy_lec_notes/lec_32/bus.db
+++ b/f22/andy_lec_notes/lec_32/bus.db
--- a/f22/andy_lec_notes/lec_32/lec32_database1_complete.ipynb
+++ b/f22/andy_lec_notes/lec_32/lec32_database1_complete.ipynb
--- a/f22/andy_lec_notes/lec_32/lec32_database1_template.ipynb
+++ b/f22/andy_lec_notes/lec_32/lec32_database1_template.ipynb
--- a/f22/andy_lec_notes/lec_32/readme.md
+++ b/f22/andy_lec_notes/lec_32/readme.md
--- a/f22/andy_lec_notes/lec_33/lec33_database2_complete.ipynb
+++ b/f22/andy_lec_notes/lec_33/lec33_database2_complete.ipynb
--- a/f22/andy_lec_notes/lec_33/lec33_database2_template.ipynb
+++ b/f22/andy_lec_notes/lec_33/lec33_database2_template.ipynb
--- a/f22/andy_lec_notes/lec_33/movies.db
+++ b/f22/andy_lec_notes/lec_33/movies.db
--- a/f22/andy_lec_notes/lec_33/readme.md
+++ b/f22/andy_lec_notes/lec_33/readme.md
--- a/f22/andy_lec_notes/lec_31/full_movies.json
+++ b/f22/andy_lec_notes/lec_31/full_movies.json
--- a/f22/andy_lec_notes/lec_31/small_movies.json
+++ b/f22/andy_lec_notes/lec_31/small_movies.json
-[
-  {
-    "Title": "Guardians of the Galaxy",
-    "Genre": "Action,Adventure,Sci-Fi",
-    "Director": "James Gunn",
-    "Cast": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana",
-    "Year": 2014,
-    "Runtime": 121,
-    "Rating": 8.1,
-    "Revenue": 333130000.0
-  },
-  {
-    "Title": "Prometheus",
-    "Genre": "Adventure,Mystery,Sci-Fi",
-    "Director": "Ridley Scott",
-    "Cast": "Noomi Rapace, Logan Marshall-Green, Michael         fassbender, Charlize Theron",
-    "Year": 2012,
-    "Runtime": 124,
-    "Rating": 7.0,
-    "Revenue": 126460000.0
-  },
-  {
-    "Title": "Split",
-    "Genre": "Horror,Thriller",
-    "Director": "M. Night Shyamalan",
-    "Cast": "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula",
-    "Year": 2016,
-    "Runtime": 117,
-    "Rating": 7.3,
-    "Revenue": 138120000.0
-  },
-  {
-    "Title": "Sing",
-    "Genre": "Animation,Comedy,Family",
-    "Director": "Christophe Lourdelet",
-    "Cast": "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson",
-    "Year": 2016,
-    "Runtime": 108,
-    "Rating": 7.2,
-    "Revenue": 270320000.0
-  },
-  {
-    "Title": "Suicide Squad",
-    "Genre": "Action,Adventure,Fantasy",
-    "Director": "David Ayer",
-    "Cast": "Will Smith, Jared Leto, Margot Robbie, Viola Davis",
-    "Year": 2016,
-    "Runtime": 123,
-    "Rating": 6.2,
-    "Revenue": 325020000.0
-  },
-  {
-    "Title": "The Great Wall",
-    "Genre": "Action,Adventure,Fantasy",
-    "Director": "Yimou Zhang",
-    "Cast": "Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",
-    "Year": 2016,
-    "Runtime": 103,
-    "Rating": 6.1,
-    "Revenue": 45130000.0
-  },
-  {
-    "Title": "La La Land",
-    "Genre": "Comedy,Drama,Music",
-    "Director": "Damien Chazelle",
-    "Cast": "Ryan Gosling, Emma Stone, Rosemarie DeWitt, J.K. Simmons",
-    "Year": 2016,
-    "Runtime": 128,
-    "Rating": 8.3,
-    "Revenue": 151060000.0
-  },
-  {
-    "Title": "Mindhorn",
-    "Genre": "Comedy",
-    "Director": "Sean Foley",
-    "Cast": "Essie Davis, Andrea Riseborough, Julian Barratt,Kenneth Branagh",
-    "Year": 2016,
-    "Runtime": 89,
-    "Rating": 6.4,
-    "Revenue": 0.0
-  },
-  {
-    "Title": "The Lost City of Z",
-    "Genre": "Action,Adventure,Biography",
-    "Director": "James Gray",
-    "Cast": "Charlie Hunnam, Robert Pattinson, Sienna Miller, Tom Holland",
-    "Year": 2016,
-    "Runtime": 141,
-    "Rating": 7.1,
-    "Revenue": 8010000.0
-  },
-  {
-    "Title": "Passengers",
-    "Genre": "Adventure,Drama,Romance",
-    "Director": "Morten Tyldum",
-    "Cast": "Jennifer Lawrence, Chris Pratt, Michael Sheen,Laurence Fishburne",
-    "Year": 2016,
-    "Runtime": 116,
-    "Rating": 7.0,
-    "Revenue": 100010000.0
-  }
-]
\ No newline at end of file