Skip to content
Snippets Groups Projects
11-web_001.ipynb 15.4 KiB
Newer Older
gsingh58's avatar
gsingh58 committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e172ecb9",
gsingh58's avatar
gsingh58 committed
   "metadata": {
    "tags": []
   },
gsingh58's avatar
gsingh58 committed
   "source": [
    "# Web 1: Selenium\n",
    "\n",
    "- Operations:\n",
    "    - `b.get(URL)`: sends HTTP GET request to the URL\n",
    "    - `b.page_source`: HTML source for the page"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f9bec30-8c27-4cc6-8e64-b44c46bd34c6",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "import os\n",
gsingh58's avatar
gsingh58 committed
    "from selenium.webdriver.chrome.options import Options\n",
    "from selenium.webdriver.chrome.service import Service\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.common.exceptions import NoSuchElementException\n",
gsingh58's avatar
gsingh58 committed
    "\n",
gsingh58's avatar
gsingh58 committed
    "from selenium import webdriver\n",
    "\n",
gsingh58's avatar
gsingh58 committed
    "from webdriver_manager.chrome import ChromeDriverManager\n",
gsingh58's avatar
gsingh58 committed
    "from IPython.display import display, Image\n",
    "\n",
    "import time\n",
gsingh58's avatar
gsingh58 committed
    "import pandas as pd\n",
    "\n",
    "from collections import deque\n",
    "from graphviz import Digraph\n",
    "\n",
    "# os.system(\"pkill -f -9 chromium\")\n",
    "# os.system(\"pkill -f -9 chrome\")"
gsingh58's avatar
gsingh58 committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "371ee5f8-22fd-4bdc-8878-b8467f53223d",
   "metadata": {},
   "outputs": [],
   "source": [
    "options = Options()\n",
gsingh58's avatar
gsingh58 committed
    "options.add_argument(\"--headless\")\n",
    "options.add_argument(\"--no-sandbox\")\n",
    "options.add_argument(\"--disable-dev-shm-usage\")\n",
    "\n",
    "service = Service(ChromeDriverManager().install())\n",
    "\n",
gsingh58's avatar
gsingh58 committed
    "b = webdriver.Chrome(options=options, service=service)"
   ]
  },
gsingh58's avatar
gsingh58 committed
  {
   "cell_type": "markdown",
   "id": "722db1fa-b151-4ede-b895-6162aafc4843",
   "metadata": {},
   "source": [
    "## Tricky pages"
   ]
  },
gsingh58's avatar
gsingh58 committed
  {
   "cell_type": "markdown",
   "id": "fcabc9ed-2b56-4b84-b66f-f1af5c556743",
   "metadata": {},
   "source": [
    "### page1.html: Javascript table example"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8e7031b",
   "metadata": {},
   "source": [
    "### Selenium operations\n",
    "\n",
    "- Operations:\n",
    "    - `b.get(URL)`: sends HTTP GET request to the URL\n",
    "    - `b.page_source`: HTML source for the page\n",
    "    - `b.find_elements(\"id\", <ID>)`: searches for a specific element that matches the \"id\"\n",
    "    - `b.find_elements(\"tag name\", <TAG>)`: searches for a specific element using corresponding tag name\n",
    "    - `b.find_element` versus `b.find_elements`:\n",
    "        - `find_element` gives first match\n",
    "        - `find_elements` gives all matches\n",
    "    - `<element obj>.text`: gives text associated with that element"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15b97efd-7f40-4038-bf12-78c1a745d276",
   "metadata": {},
   "source": [
    "### POLLING: How would we know when the updated page becomes available?\n",
    "- keep checking regularly until you get all the details you are looking for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "662a2ae1-8077-45e1-bde9-9e179720d26e",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "url = \"https://cs320.cs.wisc.edu/tricky/page1.html\"\n",
gsingh58's avatar
gsingh58 committed
    "b.get(url)\n",
    "\n",
    "while True:\n",
    "    tbls = b.find_elements(\"tag name\", \"table\")\n",
    "    print(\"Tables:\", len(tbls))\n",
    "        \n",
    "    if len(tbls) == 2:\n",
    "        print(tbls)\n",
    "        break\n",
    "    \n",
    "    time.sleep(0.1) # sleep for 0.1 second"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c70f9430-9ae5-4d70-9355-4b113b9fc20a",
   "metadata": {},
   "source": [
    "### Let's extract the 2nd table information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c626766-4a91-482f-bd01-77a56a7a2c0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "tbl = tbls[-1]\n",
    "\n",
    "# TODO: find all tr elements\n",
    "trs = tbl.find_elements(\"tag name\", \"tr\")\n",
    "\n",
    "# TODO: find all td elements\n",
    "# TODO: extract text for all td elements into a list of list\n",
    "rows = []\n",
    "\n",
    "for tr in trs:\n",
    "    tds = tr.find_elements(\"tag name\", \"td\")\n",
    "    assert len(tds) == 2\n",
    "    rows.append([tds[0].text, tds[1].text])\n",
    "    \n",
    "rows"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "358cfbab-5db5-4f9b-a205-d78a74cf04ca",
   "metadata": {},
   "source": [
    "### Converting `rows` into a `DataFrame`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5084534-db73-4766-bded-2cf50a3fad0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9358137-a2d7-4ebb-96a3-4bb8513084a4",
   "metadata": {},
   "source": [
    "### How can we visually see the page on the VM?\n",
    "\n",
    "- Operations:\n",
    "    - `b.save_screenshot(\"some_file.png\")`: saves a screenshot of the rendered page\n",
    "    - `b.set_window_size(<width>, <height>)`: controls size of the image\n",
    "    - import statement: `from IPython.display import display, Image`: helps us show the screenshot as an image inside the notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0eb2a2cf-02ee-4d8a-baaf-48c7b1f118fc",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38316989-871e-4b84-b877-b06740a09533",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "3e124fd0-bf7c-4e0a-9d38-39dac9186e97",
   "metadata": {},
   "source": [
    "### Combining taking screenshot and displaying it\n",
    "- useful for p3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "607f4c72-e1eb-41ea-8124-7618f8e8efd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_screen(width, height):\n",
    "    b.save_screenshot(\"out.png\")\n",
    "    b.set_window_size(width, height)\n",
    "    display(Image(\"out.png\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a742134-466f-45be-9289-eafb9029d765",
   "metadata": {},
   "source": [
    "### page2.html: \"Show More!\" button example\n",
    "\n",
    "- Operations:\n",
    "    `button_oject.click()`: enables us to click the button"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6a13d39-ae0f-4bb7-9146-ae27d5255fbd",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "url = \"https://cs320.cs.wisc.edu/tricky/page2.html\"\n",
gsingh58's avatar
gsingh58 committed
    "b.get(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe8866c9-7d2e-4bab-b533-6abd4d00d3ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the id for the more button (inspect element on browser)\n",
    "button = b.find_element(\"id\", \"???\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6744a392-304c-410e-88ad-4dfd457547a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: click the button\n",
    "\n",
    "# keep running this cell reptitively\n",
    "# once all data is retrieved, we will run into NoSuchElementException"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f234387-417d-4cc1-8571-5a02778db2cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "b.get(url)\n",
    "\n",
    "while True:\n",
    "    try:\n",
    "        button = b.find_element(\"id\", \"more\")\n",
    "        button.click()\n",
    "        show_screen(500, 500)\n",
    "        print(\"============================================================\")\n",
    "    except NoSuchElementException:\n",
    "        print(\"We have all the data!\")\n",
    "        break\n",
    "    time.sleep(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "603af6f2-a8a1-41ec-8c6e-a24286482e4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(b.page_source)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f57b79ca-5e8d-48d1-82d4-1456598ca5a0",
   "metadata": {},
   "source": [
    "### page 3: password protection example\n",
    "\n",
    "- Operations:\n",
    "    `text_object.send_keys()`: enables us to send data to textbox"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf91d59d-6468-4e78-866b-a1f9351e7413",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "url = \"https://cs320.cs.wisc.edu/tricky/page3.html\"\n",
gsingh58's avatar
gsingh58 committed
    "b.get(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb734feb-df0d-4d19-89b2-1cc7034f00aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the id for password box (inspect element on browser)\n",
    "# TODO: find the id for the login button (inspect element on browser)\n",
    "text = b.find_element(\"id\", \"\")\n",
    "button = b.find_element(\"id\", \"\")\n",
    "\n",
    "# TODO: send the password (plain text just for example purposes)\n",
    "\n",
    "show_screen()\n",
    "\n",
    "# TODO: click the button\n",
    "\n",
    "show_screen()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d85f4319-a345-4d30-b7aa-2d344dc658c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(b.page_source)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba660e42-1d67-4c1e-b69f-6bcf25be4b28",
   "metadata": {},
   "source": [
    "### page 4: search data for a year\n",
    "\n",
    "- Operations:\n",
    "    `text_object.clear()`: enables us to clear the previous text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8ea2227-6d6f-405b-8610-01f627d82b5d",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "url = \"https://cs320.cs.wisc.edu/tricky/page4.html\"\n",
gsingh58's avatar
gsingh58 committed
    "b.get(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "276c988a-bacc-4535-9a5f-2e68fc2910b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the id for year box (inspect element on browser)\n",
    "# TODO: find the id for the search button (inspect element on browser)\n",
    "text = b.find_element(\"id\", \"\")\n",
    "button = b.find_element(\"id\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff31ce65-12a9-4629-aee5-56e0bbbf9698",
   "metadata": {},
   "outputs": [],
   "source": [
    "text.send_keys(\"1952\")\n",
    "button.click()\n",
    "show_screen()\n",
    "# TODO: run this cell twice"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a892d964-7e87-431e-8f45-81664f159e10",
   "metadata": {},
   "source": [
    "#### How many hurricanes were there each year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab822af8-6d66-46b6-bf19-28d332e4b01e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "for year in range(1950, 1960):\n",
    "    text.clear()\n",
    "    text.send_keys(???)\n",
    "    button.click()\n",
    "    show_screen()\n",
    "    \n",
    "    # TODO: find all tr elements and count hurricanes for each year\n",
    "    \n",
    "    # TODO: We have to subtract 1 for removing header tr element\n",
    "    \n",
    "    \n",
    "# ax = hurricane_counts.plot.line()\n",
    "# ax.set_xlabel(\"Year\")\n",
    "# ax.set_ylabel(\"Hurricane count\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8543c5b",
   "metadata": {},
   "source": [
gsingh58's avatar
gsingh58 committed
    "## Recursive Crawl\n",
gsingh58's avatar
gsingh58 committed
    "\n",
    "- crawling: process of finding all the webpages inside a website"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9e60c940-2624-4d78-a730-03056af72297",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "# TODO: initialize url, send GET request, and display page source\n",
    "url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
    "\n",
    "print(b.page_source)"
gsingh58's avatar
gsingh58 committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ada2baa",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "# TODO: show the screen\n"
gsingh58's avatar
gsingh58 committed
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f3c1cfc",
   "metadata": {},
   "source": [
    "### Final all hyperlinks\n",
    "\n",
    "- Selenium operations:\n",
    "    - `b.get(URL)`: sends HTTP GET request to the URL\n",
    "    - `b.page_source`: HTML source for the page\n",
    "    - `b.find_elements(\"id\", <ID>)`: searches for a specific element that matches the \"id\"\n",
    "    - `b.find_elements(\"tag name\", <TAG>)`: searches for a specific element using corresponding tag name\n",
    "    - `b.find_element` versus `b.find_elements`:\n",
    "        - `find_element` gives first match\n",
    "        - `find_elements` gives all matches\n",
    "    - `<element obj>.text`: gives text associated with that element   \n",
    "    - `<element obj>.get_attribute(<attribute>)`: gives attribute value; for ex: `<anchor_obj>.get_attribute(\"href\")`\n",
    "    \n",
    "    - `b.save_screenshot(\"some_file.png\")`: saves a screenshot of the rendered page\n",
    "    - `b.set_window_size(<width>, <height>)`: controls size of the image\n",
    "    - import statement: `from IPython.display import display, Image`: helps us show the screenshot as an image inside the notebook\n",
    "    - `button_oject.click()`: enables us to click the button\n",
    "    - `text_object.send_keys()`: enables us to send data to textbox"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddab23c0",
   "metadata": {},
   "outputs": [],
gsingh58's avatar
gsingh58 committed
   "source": [
    "# TODO: find all a elements, and then \n",
    "# TODO: loop over all the a elements to print text and use get_attribute to print href value of each a element\n",
    "a_elements = b.find_elements(\"tag name\", \"a\")\n",
    "for a_element in a_elements:\n",
    "    print(a_element.text, a_element.get_attribute(\"href\"))"
   ]
gsingh58's avatar
gsingh58 committed
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a9a54b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Generalize to a function\n",
    "def get_children(url):\n",
    "    \"\"\"\n",
    "    Finds all hyperlinks in the given url by sending GET request and parsing page source.\n",
    "    Returns a list of children URLs.\n",
    "    \"\"\"\n",
    "    pass\n",
    "\n",
gsingh58's avatar
gsingh58 committed
    "url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
gsingh58's avatar
gsingh58 committed
    "get_children(url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3741f5aa",
   "metadata": {},
   "source": [
    "### Breadth First Search\n",
    "\n",
    "- for crawling, there is no specific \"destination\", as we need to find all the webpages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8c61a4f",
   "metadata": {},
   "outputs": [],
   "source": [
gsingh58's avatar
gsingh58 committed
    "start_url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n",
    "#start_url = \"https://cs320.cs.wisc.edu/crawl/practice7/1.html\"\n",
gsingh58's avatar
gsingh58 committed
    "\n",
    "# Why use a set to keep track of visited nodes?\n",
    "\n",
    "# TODO: create a Digraph\n",
    "\n",
    "\n",
    "    # TODO: add current node to digraph\n",
    "    \n",
    "    # TODO: how do we get all the children?\n",
    "    \n",
    "    \n",
    "        # TODO: add an edge\n",
    "        "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
gsingh58's avatar
gsingh58 committed
   "version": "3.10.12"
gsingh58's avatar
gsingh58 committed
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}