{ "cells": [ { "cell_type": "markdown", "id": "e172ecb9", "metadata": { "tags": [] }, "source": [ "# Web 1: Selenium\n", "\n", "- Operations:\n", " - `b.get(URL)`: sends HTTP GET request to the URL\n", " - `b.page_source`: HTML source for the page" ] }, { "cell_type": "code", "execution_count": null, "id": "6f9bec30-8c27-4cc6-8e64-b44c46bd34c6", "metadata": {}, "outputs": [], "source": [ "import os\n", "from selenium.webdriver.chrome.options import Options\n", "from selenium.webdriver.chrome.service import Service\n", "from selenium.webdriver.common.by import By\n", "from selenium.common.exceptions import NoSuchElementException\n", "\n", "from selenium import webdriver\n", "\n", "from webdriver_manager.chrome import ChromeDriverManager\n", "from IPython.display import display, Image\n", "\n", "import time\n", "import pandas as pd\n", "\n", "from collections import deque\n", "from graphviz import Digraph\n", "\n", "# os.system(\"pkill -f -9 chromium\")\n", "# os.system(\"pkill -f -9 chrome\")" ] }, { "cell_type": "code", "execution_count": null, "id": "371ee5f8-22fd-4bdc-8878-b8467f53223d", "metadata": {}, "outputs": [], "source": [ "options = Options()\n", "options.add_argument(\"--headless\")\n", "options.add_argument(\"--no-sandbox\")\n", "options.add_argument(\"--disable-dev-shm-usage\")\n", "\n", "service = Service(ChromeDriverManager().install())\n", "\n", "b = webdriver.Chrome(options=options, service=service)" ] }, { "cell_type": "markdown", "id": "722db1fa-b151-4ede-b895-6162aafc4843", "metadata": {}, "source": [ "## Tricky pages" ] }, { "cell_type": "markdown", "id": "fcabc9ed-2b56-4b84-b66f-f1af5c556743", "metadata": {}, "source": [ "### page1.html: Javascript table example" ] }, { "cell_type": "markdown", "id": "b8e7031b", "metadata": {}, "source": [ "### Selenium operations\n", "\n", "- Operations:\n", " - `b.get(URL)`: sends HTTP GET request to the URL\n", " - `b.page_source`: HTML source for the page\n", " - `b.find_elements(\"id\", <ID>)`: searches for a specific element that matches the \"id\"\n", " - `b.find_elements(\"tag name\", <TAG>)`: searches for a specific element using corresponding tag name\n", " - `b.find_element` versus `b.find_elements`:\n", " - `find_element` gives first match\n", " - `find_elements` gives all matches\n", " - `<element obj>.text`: gives text associated with that element" ] }, { "cell_type": "markdown", "id": "15b97efd-7f40-4038-bf12-78c1a745d276", "metadata": {}, "source": [ "### POLLING: How would we know when the updated page becomes available?\n", "- keep checking regularly until you get all the details you are looking for." ] }, { "cell_type": "code", "execution_count": null, "id": "662a2ae1-8077-45e1-bde9-9e179720d26e", "metadata": {}, "outputs": [], "source": [ "url = \"https://cs320.cs.wisc.edu/tricky/page1.html\"\n", "b.get(url)\n", "\n", "while True:\n", " tbls = b.find_elements(\"tag name\", \"table\")\n", " print(\"Tables:\", len(tbls))\n", " \n", " if len(tbls) == 2:\n", " print(tbls)\n", " break\n", " \n", " time.sleep(0.1) # sleep for 0.1 second" ] }, { "cell_type": "markdown", "id": "c70f9430-9ae5-4d70-9355-4b113b9fc20a", "metadata": {}, "source": [ "### Let's extract the 2nd table information" ] }, { "cell_type": "code", "execution_count": null, "id": "0c626766-4a91-482f-bd01-77a56a7a2c0f", "metadata": {}, "outputs": [], "source": [ "tbl = tbls[-1]\n", "\n", "# TODO: find all tr elements\n", "trs = tbl.find_elements(\"tag name\", \"tr\")\n", "\n", "# TODO: find all td elements\n", "# TODO: extract text for all td elements into a list of list\n", "rows = []\n", "\n", "for tr in trs:\n", " tds = tr.find_elements(\"tag name\", \"td\")\n", " assert len(tds) == 2\n", " rows.append([tds[0].text, tds[1].text])\n", " \n", "rows" ] }, { "cell_type": "markdown", "id": "358cfbab-5db5-4f9b-a205-d78a74cf04ca", "metadata": {}, "source": [ "### Converting `rows` into a `DataFrame`" ] }, { "cell_type": "code", "execution_count": null, "id": "d5084534-db73-4766-bded-2cf50a3fad0c", "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(rows)" ] }, { "cell_type": "markdown", "id": "a9358137-a2d7-4ebb-96a3-4bb8513084a4", "metadata": {}, "source": [ "### How can we visually see the page on the VM?\n", "\n", "- Operations:\n", " - `b.save_screenshot(\"some_file.png\")`: saves a screenshot of the rendered page\n", " - `b.set_window_size(<width>, <height>)`: controls size of the image\n", " - import statement: `from IPython.display import display, Image`: helps us show the screenshot as an image inside the notebook" ] }, { "cell_type": "code", "execution_count": null, "id": "0eb2a2cf-02ee-4d8a-baaf-48c7b1f118fc", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "id": "38316989-871e-4b84-b877-b06740a09533", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "3e124fd0-bf7c-4e0a-9d38-39dac9186e97", "metadata": {}, "source": [ "### Combining taking screenshot and displaying it\n", "- useful for p3" ] }, { "cell_type": "code", "execution_count": null, "id": "607f4c72-e1eb-41ea-8124-7618f8e8efd3", "metadata": {}, "outputs": [], "source": [ "def show_screen(width, height):\n", " b.save_screenshot(\"out.png\")\n", " b.set_window_size(width, height)\n", " display(Image(\"out.png\"))" ] }, { "cell_type": "markdown", "id": "3a742134-466f-45be-9289-eafb9029d765", "metadata": {}, "source": [ "### page2.html: \"Show More!\" button example\n", "\n", "- Operations:\n", " `button_oject.click()`: enables us to click the button" ] }, { "cell_type": "code", "execution_count": null, "id": "e6a13d39-ae0f-4bb7-9146-ae27d5255fbd", "metadata": {}, "outputs": [], "source": [ "url = \"https://cs320.cs.wisc.edu/tricky/page2.html\"\n", "b.get(url)" ] }, { "cell_type": "code", "execution_count": null, "id": "fe8866c9-7d2e-4bab-b533-6abd4d00d3ef", "metadata": {}, "outputs": [], "source": [ "# TODO: find the id for the more button (inspect element on browser)\n", "button = b.find_element(\"id\", \"???\")" ] }, { "cell_type": "code", "execution_count": null, "id": "6744a392-304c-410e-88ad-4dfd457547a3", "metadata": {}, "outputs": [], "source": [ "# TODO: click the button\n", "\n", "# keep running this cell reptitively\n", "# once all data is retrieved, we will run into NoSuchElementException" ] }, { "cell_type": "code", "execution_count": null, "id": "3f234387-417d-4cc1-8571-5a02778db2cd", "metadata": {}, "outputs": [], "source": [ "b.get(url)\n", "\n", "while True:\n", " try:\n", " button = b.find_element(\"id\", \"more\")\n", " button.click()\n", " show_screen(500, 500)\n", " print(\"============================================================\")\n", " except NoSuchElementException:\n", " print(\"We have all the data!\")\n", " break\n", " time.sleep(1)" ] }, { "cell_type": "code", "execution_count": null, "id": "603af6f2-a8a1-41ec-8c6e-a24286482e4e", "metadata": {}, "outputs": [], "source": [ "print(b.page_source)" ] }, { "cell_type": "markdown", "id": "f57b79ca-5e8d-48d1-82d4-1456598ca5a0", "metadata": {}, "source": [ "### page 3: password protection example\n", "\n", "- Operations:\n", " `text_object.send_keys()`: enables us to send data to textbox" ] }, { "cell_type": "code", "execution_count": null, "id": "bf91d59d-6468-4e78-866b-a1f9351e7413", "metadata": {}, "outputs": [], "source": [ "url = \"https://cs320.cs.wisc.edu/tricky/page3.html\"\n", "b.get(url)" ] }, { "cell_type": "code", "execution_count": null, "id": "fb734feb-df0d-4d19-89b2-1cc7034f00aa", "metadata": {}, "outputs": [], "source": [ "# TODO: find the id for password box (inspect element on browser)\n", "# TODO: find the id for the login button (inspect element on browser)\n", "text = b.find_element(\"id\", \"\")\n", "button = b.find_element(\"id\", \"\")\n", "\n", "# TODO: send the password (plain text just for example purposes)\n", "\n", "show_screen()\n", "\n", "# TODO: click the button\n", "\n", "show_screen()" ] }, { "cell_type": "code", "execution_count": null, "id": "d85f4319-a345-4d30-b7aa-2d344dc658c5", "metadata": {}, "outputs": [], "source": [ "print(b.page_source)" ] }, { "cell_type": "markdown", "id": "ba660e42-1d67-4c1e-b69f-6bcf25be4b28", "metadata": {}, "source": [ "### page 4: search data for a year\n", "\n", "- Operations:\n", " `text_object.clear()`: enables us to clear the previous text" ] }, { "cell_type": "code", "execution_count": null, "id": "a8ea2227-6d6f-405b-8610-01f627d82b5d", "metadata": {}, "outputs": [], "source": [ "url = \"https://cs320.cs.wisc.edu/tricky/page4.html\"\n", "b.get(url)" ] }, { "cell_type": "code", "execution_count": null, "id": "276c988a-bacc-4535-9a5f-2e68fc2910b0", "metadata": {}, "outputs": [], "source": [ "# TODO: find the id for year box (inspect element on browser)\n", "# TODO: find the id for the search button (inspect element on browser)\n", "text = b.find_element(\"id\", \"\")\n", "button = b.find_element(\"id\", \"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "ff31ce65-12a9-4629-aee5-56e0bbbf9698", "metadata": {}, "outputs": [], "source": [ "text.send_keys(\"1952\")\n", "button.click()\n", "show_screen()\n", "# TODO: run this cell twice" ] }, { "cell_type": "markdown", "id": "a892d964-7e87-431e-8f45-81664f159e10", "metadata": {}, "source": [ "#### How many hurricanes were there each year?" ] }, { "cell_type": "code", "execution_count": null, "id": "ab822af8-6d66-46b6-bf19-28d332e4b01e", "metadata": {}, "outputs": [], "source": [ "\n", "for year in range(1950, 1960):\n", " text.clear()\n", " text.send_keys(???)\n", " button.click()\n", " show_screen()\n", " \n", " # TODO: find all tr elements and count hurricanes for each year\n", " \n", " # TODO: We have to subtract 1 for removing header tr element\n", " \n", " \n", "# ax = hurricane_counts.plot.line()\n", "# ax.set_xlabel(\"Year\")\n", "# ax.set_ylabel(\"Hurricane count\")" ] }, { "cell_type": "markdown", "id": "c8543c5b", "metadata": {}, "source": [ "## Recursive Crawl\n", "\n", "- crawling: process of finding all the webpages inside a website" ] }, { "cell_type": "code", "execution_count": null, "id": "9e60c940-2624-4d78-a730-03056af72297", "metadata": {}, "outputs": [], "source": [ "# TODO: initialize url, send GET request, and display page source\n", "url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n", "\n", "print(b.page_source)" ] }, { "cell_type": "code", "execution_count": null, "id": "5ada2baa", "metadata": {}, "outputs": [], "source": [ "# TODO: show the screen\n" ] }, { "cell_type": "markdown", "id": "6f3c1cfc", "metadata": {}, "source": [ "### Final all hyperlinks\n", "\n", "- Selenium operations:\n", " - `b.get(URL)`: sends HTTP GET request to the URL\n", " - `b.page_source`: HTML source for the page\n", " - `b.find_elements(\"id\", <ID>)`: searches for a specific element that matches the \"id\"\n", " - `b.find_elements(\"tag name\", <TAG>)`: searches for a specific element using corresponding tag name\n", " - `b.find_element` versus `b.find_elements`:\n", " - `find_element` gives first match\n", " - `find_elements` gives all matches\n", " - `<element obj>.text`: gives text associated with that element \n", " - `<element obj>.get_attribute(<attribute>)`: gives attribute value; for ex: `<anchor_obj>.get_attribute(\"href\")`\n", " \n", " - `b.save_screenshot(\"some_file.png\")`: saves a screenshot of the rendered page\n", " - `b.set_window_size(<width>, <height>)`: controls size of the image\n", " - import statement: `from IPython.display import display, Image`: helps us show the screenshot as an image inside the notebook\n", " - `button_oject.click()`: enables us to click the button\n", " - `text_object.send_keys()`: enables us to send data to textbox" ] }, { "cell_type": "code", "execution_count": null, "id": "ddab23c0", "metadata": {}, "outputs": [], "source": [ "# TODO: find all a elements, and then \n", "# TODO: loop over all the a elements to print text and use get_attribute to print href value of each a element\n", "a_elements = b.find_elements(\"tag name\", \"a\")\n", "for a_element in a_elements:\n", " print(a_element.text, a_element.get_attribute(\"href\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "8a9a54b4", "metadata": {}, "outputs": [], "source": [ "# TODO: Generalize to a function\n", "def get_children(url):\n", " \"\"\"\n", " Finds all hyperlinks in the given url by sending GET request and parsing page source.\n", " Returns a list of children URLs.\n", " \"\"\"\n", " pass\n", "\n", "url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n", "get_children(url)" ] }, { "cell_type": "markdown", "id": "3741f5aa", "metadata": {}, "source": [ "### Breadth First Search\n", "\n", "- for crawling, there is no specific \"destination\", as we need to find all the webpages." ] }, { "cell_type": "code", "execution_count": null, "id": "b8c61a4f", "metadata": {}, "outputs": [], "source": [ "start_url = \"https://cs320.cs.wisc.edu/crawl/practice1/1.html\"\n", "#start_url = \"https://cs320.cs.wisc.edu/crawl/practice7/1.html\"\n", "\n", "# Why use a set to keep track of visited nodes?\n", "\n", "# TODO: create a Digraph\n", "\n", "\n", " # TODO: add current node to digraph\n", " \n", " # TODO: how do we get all the children?\n", " \n", " \n", " # TODO: add an edge\n", " " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 5 }