11-web_001.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e172ecb9",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Web 1: Selenium\n",
    "\n",
    "- Operations:\n",
    "    - `b.get(URL)`: sends HTTP GET request to the URL\n",
    "    - `b.page_source`: HTML source for the page"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f9bec30-8c27-4cc6-8e64-b44c46bd34c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from selenium.webdriver.chrome.options import Options\n",
    "from selenium.webdriver.chrome.service import Service\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.common.exceptions import NoSuchElementException\n",
    "\n",
    "from selenium import webdriver\n",
    "\n",
    "from webdriver_manager.chrome import ChromeDriverManager\n",
    "from IPython.display import display, Image\n",
    "\n",
    "import time\n",
    "import pandas as pd\n",
    "\n",
    "from collections import deque\n",
    "from graphviz import Digraph\n",
    "\n",
    "# os.system(\"pkill -f -9 chromium\")\n",
    "# os.system(\"pkill -f -9 chrome\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "371ee5f8-22fd-4bdc-8878-b8467f53223d",
   "metadata": {},
   "outputs": [],
   "source": [
    "options = Options()\n",
    "options.add_argument(\"--headless\")\n",
    "options.add_argument(\"--no-sandbox\")\n",
    "options.add_argument(\"--disable-dev-shm-usage\")\n",
    "\n",
    "service = Service(ChromeDriverManager().install())\n",
    "\n",
    "b = webdriver.Chrome(options=options, service=service)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "722db1fa-b151-4ede-b895-6162aafc4843",
   "metadata": {},
   "source": [
    "## Tricky pages"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcabc9ed-2b56-4b84-b66f-f1af5c556743",
   "metadata": {},
   "source": [
    "### page1.html: JavaScript table example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8e7031b",
   "metadata": {},
   "source": [
    "### Selenium operations\n",
    "\n",
    "- Operations:\n",
    "    - `b.get(URL)`: sends HTTP GET request to the URL\n",
    "    - `b.page_source`: HTML source for the page\n",
    "    - `b.find_elements(\"id\", <ID>)`: finds all elements whose `id` attribute matches the given ID\n",
    "    - `b.find_elements(\"tag name\", <TAG>)`: finds all elements with the given tag name\n",
    "    - `b.find_element` versus `b.find_elements`:\n",
    "        - `find_element` gives first match\n",
    "        - `find_elements` gives all matches\n",
    "    - `<element obj>.text`: gives text associated with that element"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15b97efd-7f40-4038-bf12-78c1a745d276",
   "metadata": {},
   "source": [
    "### POLLING: How would we know when the updated page becomes available?\n",
    "- keep checking regularly until you get all the details you are looking for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "662a2ae1-8077-45e1-bde9-9e179720d26e",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://cs320.cs.wisc.edu/tricky/page1.html\"\n",
    "b.get(url)\n",
    "\n",
    "while True:\n",
    "    tbls = b.find_elements(\"tag name\", \"table\")\n",
    "    print(\"Tables:\", len(tbls))\n",
    "        \n",
    "    if len(tbls) == 2:\n",
    "        print(tbls)\n",
    "        break\n",
    "    \n",
    "    time.sleep(0.1) # sleep for 0.1 second"
   ]
  },
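  {
   "cell_type": "markdown",
   "id": "4b1e9a70",
   "metadata": {},
   "source": [
    "The polling loop above never stops if the second table fails to appear. As a minimal sketch, the same idea can be wrapped in a helper that gives up after a timeout. `poll_until`, `check`, `timeout`, and `interval` are hypothetical names introduced for illustration; they are not part of Selenium, which ships its own `WebDriverWait` class for this kind of waiting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d2c5f13",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "def poll_until(check, timeout=5.0, interval=0.1):\n",
    "    \"\"\"Call check() until it returns a truthy value; raise TimeoutError after `timeout` seconds.\n",
    "\n",
    "    Hypothetical helper for illustration, not part of Selenium.\n",
    "    \"\"\"\n",
    "    deadline = time.monotonic() + timeout\n",
    "    while True:\n",
    "        result = check()\n",
    "        if result:\n",
    "            return result\n",
    "        if time.monotonic() >= deadline:\n",
    "            raise TimeoutError(f\"condition not met within {timeout} seconds\")\n",
    "        time.sleep(interval)\n",
    "\n",
    "# Possible use with the browser `b` from the cells above (assumes page1.html is loaded):\n",
    "# def two_tables():\n",
    "#     tbls = b.find_elements(\"tag name\", \"table\")\n",
    "#     return tbls if len(tbls) == 2 else None\n",
    "#\n",
    "# tbls = poll_until(two_tables, timeout=10)"
   ]
  },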
  {
   "cell_type": "markdown",
   "id": "c70f9430-9ae5-4d70-9355-4b113b9fc20a",
   "metadata": {},
   "source": [
    "### Let's extract the 2nd table information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c626766-4a91-482f-bd01-77a56a7a2c0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "tbl = tbls[-1]\n",
    "\n",
    "# TODO: find all tr elements\n",
    "trs = tbl.find_elements(\"tag name\", \"tr\")\n",
    "\n",
    "# TODO: find all td elements\n",
    "# TODO: extract text for all td elements into a list of lists\n",
    "rows = []\n",
    "\n",
    "for tr in trs:\n",
    "    tds = tr.find_elements(\"tag name\", \"td\")\n",
    "    assert len(tds) == 2\n",
    "    rows.append([tds[0].text, tds[1].text])\n",
    "    \n",
    "rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "358cfbab-5db5-4f9b-a205-d78a74cf04ca",
   "metadata": {},
   "source": [
    "### Converting `rows` into a `DataFrame`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5084534-db73-4766-bded-2cf50a3fad0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9358137-a2d7-4ebb-96a3-4bb8513084a4",
   "metadata": {},
   "source": [
    "### How can we visually see the page on the VM?\n",
    "\n",
    "- Operations:\n",
    "    - `b.save_screenshot(\"some_file.png\")`: saves a screenshot of the rendered page\n",
    "    - `b.set_window_size(<width>, <height>)`: sets the browser window size, which controls the size of the screenshot\n",
    "    - import statement: `from IPython.display import display, Image`: helps us show the screenshot as an image inside the notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0eb2a2cf-02ee-4d8a-baaf-48c7b1f118fc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38316989-871e-4b84-b877-b06740a09533",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3e124fd0-bf7c-4e0a-9d38-39dac9186e97",
   "metadata": {},
   "source": [
    "### Combining taking screenshot and displaying it\n",
    "- useful for p3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "607f4c72-e1eb-41ea-8124-7618f8e8efd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_screen(width=500, height=500):\n",
    "    # resize first so the new size applies to this screenshot\n",
    "    b.set_window_size(width, height)\n",
    "    b.save_screenshot(\"out.png\")\n",
    "    display(Image(\"out.png\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a742134-466f-45be-9289-eafb9029d765",
   "metadata": {},
   "source": [
    "### page2.html: \"Show More!\" button example\n",
    "\n",
    "- Operations:\n",
    "    - `button_object.click()`: clicks the button"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6a13d39-ae0f-4bb7-9146-ae27d5255fbd",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://cs320.cs.wisc.edu/tricky/page2.html\"\n",
    "b.get(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe8866c9-7d2e-4bab-b533-6abd4d00d3ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the id for the more button (inspect element on browser)\n",
    "button = b.find_element(\"id\", \"???\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6744a392-304c-410e-88ad-4dfd457547a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: click the button\n",
    "\n",
    "# keep running this cell repeatedly\n",
    "# once all data is retrieved, we will run into NoSuchElementException"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f234387-417d-4cc1-8571-5a02778db2cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "b.get(url)\n",
    "\n",
    "while True:\n",
    "    try:\n",
    "        button = b.find_element(\"id\", \"more\")\n",
    "        button.click()\n",
    "        show_screen(500, 500)\n",
    "        print(\"============================================================\")\n",
    "    except NoSuchElementException:\n",
    "        print(\"We have all the data!\")\n",
    "        break\n",
    "    time.sleep(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "603af6f2-a8a1-41ec-8c6e-a24286482e4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(b.page_source)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f57b79ca-5e8d-48d1-82d4-1456598ca5a0",
   "metadata": {},
   "source": [
    "### page3.html: password protection example\n",
    "\n",
    "- Operations:\n",
    "    - `text_object.send_keys()`: sends text to a textbox"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf91d59d-6468-4e78-866b-a1f9351e7413",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://cs320.cs.wisc.edu/tricky/page3.html\"\n",
    "b.get(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb734feb-df0d-4d19-89b2-1cc7034f00aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the id for password box (inspect element on browser)\n",
    "# TODO: find the id for the login button (inspect element on browser)\n",
    "text = b.find_element(\"id\", \"\")\n",
    "button = b.find_element(\"id\", \"\")\n",
    "\n",
    "# TODO: send the password (plain text just for example purposes)\n",
    "\n",
    "show_screen()\n",
    "\n",
    "# TODO: click the button\n",
    "\n",
    "show_screen()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d85f4319-a345-4d30-b7aa-2d344dc658c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(b.page_source)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba660e42-1d67-4c1e-b69f-6bcf25be4b28",
   "metadata": {},
   "source": [
    "### page4.html: search data for a year\n",
    "\n",
    "- Operations:\n",
    "    - `text_object.clear()`: clears the text currently in the textbox"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8ea2227-6d6f-405b-8610-01f627d82b5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://cs320.cs.wisc.edu/tricky/page4.html\"\n",
    "b.get(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "276c988a-bacc-4535-9a5f-2e68fc2910b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the id for year box (inspect element on browser)\n",
    "# TODO: find the id for the search button (inspect element on browser)\n",
    "text = b.find_element(\"id\", \"\")\n",
    "button = b.find_element(\"id\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff31ce65-12a9-4629-aee5-56e0bbbf9698",
   "metadata": {},
   "outputs": [],
   "source": [
    "text.send_keys(\"1952\")\n",
    "button.click()\n",
    "show_screen()\n",
    "# TODO: run this cell twice"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a892d964-7e87-431e-8f45-81664f159e10",
   "metadata": {},
   "source": [
    "#### How many hurricanes were there each year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab822af8-6d66-46b6-bf19-28d332e4b01e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "for year in range(1950, 1960):\n",
    "    text.clear()\n",
    "    text.send_keys(str(year))\n",
    "    button.click()\n",
    "    show_screen()\n",
    "    \n",
    "    # TODO: find all tr elements and count hurricanes for each year\n",
    "    \n",
    "    # TODO: subtract 1 from the tr count to exclude the header row\n",
    "    \n",
    "    \n",
    "# ax = hurricane_counts.plot.line()\n",
    "# ax.set_xlabel(\"Year\")\n",
    "# ax.set_ylabel(\"Hurricane count\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8543c5b",
   "metadata": {},
   "source": [
    "## Recursive Crawl\n",
    "\n",
    "- crawling: process of finding all the webpages inside a website"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e60c940-2624-4d78-a730-03056af72297",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: initialize url, send GET request, and display page source\n",
    "url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
    "b.get(url)\n",
    "print(b.page_source)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ada2baa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: show the screen\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f3c1cfc",
   "metadata": {},
   "source": [
    "### Find all hyperlinks\n",
    "\n",
    "- Selenium operations:\n",
    "    - `b.get(URL)`: sends HTTP GET request to the URL\n",
    "    - `b.page_source`: HTML source for the page\n",
    "    - `b.find_elements(\"id\", <ID>)`: finds all elements whose `id` attribute matches the given ID\n",
    "    - `b.find_elements(\"tag name\", <TAG>)`: finds all elements with the given tag name\n",
    "    - `b.find_element` versus `b.find_elements`:\n",
    "        - `find_element` gives first match\n",
    "        - `find_elements` gives all matches\n",
    "    - `<element obj>.text`: gives text associated with that element\n",
    "    - `<element obj>.get_attribute(<attribute>)`: gives attribute value; for example, `<anchor_obj>.get_attribute(\"href\")`\n",
    "    - `b.save_screenshot(\"some_file.png\")`: saves a screenshot of the rendered page\n",
    "    - `b.set_window_size(<width>, <height>)`: sets the browser window size, which controls the size of the screenshot\n",
    "    - import statement: `from IPython.display import display, Image`: helps us show the screenshot as an image inside the notebook\n",
    "    - `button_object.click()`: enables us to click the button\n",
    "    - `text_object.send_keys()`: enables us to send data to textbox"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddab23c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find all a elements\n",
    "# TODO: loop over the a elements, printing each one's text and its href value (via get_attribute)\n",
    "a_elements = b.find_elements(\"tag name\", \"a\")\n",
    "for a_element in a_elements:\n",
    "    print(a_element.text, a_element.get_attribute(\"href\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a9a54b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Generalize to a function\n",
    "def get_children(url):\n",
    "    \"\"\"\n",
    "    Finds all hyperlinks on the page at the given URL by sending a GET request and parsing the page source.\n",
    "    Returns a list of child URLs.\n",
    "    \"\"\"\n",
    "    pass\n",
    "\n",
    "url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
    "get_children(url)"
   ]
  },
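  {
   "cell_type": "markdown",
   "id": "9e8a1b46",
   "metadata": {},
   "source": [
    "To make concrete what `get_children` has to produce, here is a sketch that pulls link targets out of an HTML string. It deliberately swaps in the standard library's `html.parser` for Selenium, so it runs without a browser but does not execute JavaScript. `LinkCollector`, `extract_links`, and the sample HTML are made up for illustration; relative links are resolved against the page URL with `urljoin`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2f6d3c85",
   "metadata": {},
   "outputs": [],
   "source": [
    "from html.parser import HTMLParser\n",
    "from urllib.parse import urljoin\n",
    "\n",
    "class LinkCollector(HTMLParser):\n",
    "    \"\"\"Collects the href value of every <a> tag it sees (hypothetical helper).\"\"\"\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.hrefs = []\n",
    "\n",
    "    def handle_starttag(self, tag, attrs):\n",
    "        if tag == \"a\":\n",
    "            for name, value in attrs:\n",
    "                if name == \"href\" and value:\n",
    "                    self.hrefs.append(value)\n",
    "\n",
    "def extract_links(html, base_url):\n",
    "    \"\"\"Return absolute URLs for all hyperlinks in html, resolved against base_url.\"\"\"\n",
    "    parser = LinkCollector()\n",
    "    parser.feed(html)\n",
    "    return [urljoin(base_url, href) for href in parser.hrefs]\n",
    "\n",
    "# made-up page body, not the real content of 1.html\n",
    "sample_html = '<a href=\"2.html\">Page 2</a> <a href=\"3.html\">Page 3</a>'\n",
    "base = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
    "print(extract_links(sample_html, base))"
   ]
  },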
  {
   "cell_type": "markdown",
   "id": "3741f5aa",
   "metadata": {},
   "source": [
    "### Breadth First Search\n",
    "\n",
    "- for crawling, there is no specific \"destination\", as we need to find all the webpages."
   ]
  },
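  {
   "cell_type": "markdown",
   "id": "5a7b0e92",
   "metadata": {},
   "source": [
    "A minimal sketch of breadth-first crawling on a made-up, in-memory link graph, so it runs without a browser. `fake_site`, `bfs_crawl`, and the page names are hypothetical; in a real crawl, the dictionary lookup would be replaced by a function that fetches a page and returns its links. The `(parent, child)` pairs it returns are the kind of edges one could pass to `Digraph.edge`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c4f2d61",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import deque\n",
    "\n",
    "# Made-up site: each page maps to the pages it links to (note the cycle C -> A).\n",
    "fake_site = {\n",
    "    \"A\": [\"B\", \"C\"],\n",
    "    \"B\": [\"D\"],\n",
    "    \"C\": [\"A\", \"D\"],\n",
    "    \"D\": [],\n",
    "}\n",
    "\n",
    "def bfs_crawl(start, get_children):\n",
    "    \"\"\"Return (pages in BFS visiting order, list of (parent, child) edges).\"\"\"\n",
    "    visited = {start}        # a set gives fast membership checks and prevents revisits\n",
    "    queue = deque([start])   # pages discovered but not yet expanded\n",
    "    order, edges = [], []\n",
    "    while queue:\n",
    "        page = queue.popleft()\n",
    "        order.append(page)\n",
    "        for child in get_children(page):\n",
    "            edges.append((page, child))\n",
    "            if child not in visited:\n",
    "                visited.add(child)\n",
    "                queue.append(child)\n",
    "    return order, edges\n",
    "\n",
    "order, edges = bfs_crawl(\"A\", lambda page: fake_site.get(page, []))\n",
    "print(order)\n",
    "print(edges)"
   ]
  },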
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8c61a4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "start_url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
    "#start_url = \"https://cs320.cs.wisc.edu/crawl/practice7/1.html\"\n",
    "\n",
    "# Why use a set to keep track of visited nodes?\n",
    "\n",
    "# TODO: create a Digraph\n",
    "\n",
    "\n",
    "    # TODO: add current node to digraph\n",
    "    \n",
    "    # TODO: how do we get all the children?\n",
    "    \n",
    "    \n",
    "        # TODO: add an edge\n",
    "        "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}