Lec31 updated

Commit 8f41adfc authored 1 year ago by gsingh58
--- a/f23/Gurmail_Lecture_Notes/31_Web-3/lec_31_web3.ipynb
+++ b/f23/Gurmail_Lecture_Notes/31_Web-3/lec_31_web3.ipynb
@@ -71,7 +71,7 @@
   "outputs": [],
   "source": [
    "# Get this page using requests.  \n",
-    "url = \"https://cs220.cs.wisc.edu/s23/syllabus.html\"\n",
+    "url = \"https://cs220.cs.wisc.edu/f23/syllabus.html\"\n",
    "r = requests.get(url)\n",
    "\n",
    "# make sure there is no error\n",

 %% Cell type:markdown id: tags:
 # Web 3: Scraping Web Data
 %% Cell type:code id: tags:
 ``` python
 # import statements
 import requests
 from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
 ```
 %% Cell type:markdown id: tags:
 ### Warmup 1: HTML table and hyperlinks
 In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
 TODO: Add another row or two to the table below
 %% Cell type:markdown id: tags:
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
    </tr>
 </table>
 %% Cell type:markdown id: tags:
 ### Warmup 2: Scraping data from syllabus page
 URL: https://cs220.cs.wisc.edu/s23/syllabus.html
 %% Cell type:code id: tags:
 ``` python
 # Get this page using requests.
-url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
+url = "https://cs220.cs.wisc.edu/f23/syllabus.html"
 r = requests.get(url)
 # make sure there is no error
 r.raise_for_status()
 # read the entire contents of the page into a single string variable
 contents = r.text
 # split the contents into list of strings using newline separator
 content_list = contents.split("\n")
 ```
 %% Cell type:markdown id: tags:
 #### Warmup 2a: Find all sentences that contain "CS220"
 %% Cell type:code id: tags:
 ``` python
 cs220_sentences = [sentence for sentence in content_list if "CS220" in sentence]
 ```
 %% Cell type:markdown id: tags:
 #### Warmup 2b: Extract title tag's value
 %% Cell type:code id: tags:
 ``` python
 title_tag = cs220_sentences[0]
 print(title_tag)
 title_tag = title_tag.strip()
 print(title_tag)
 title_tag_parts = title_tag.split(">")
 print(title_tag_parts)
 title_details = title_tag_parts[1]
 title_detail_parts = title_details.split("<")
 title_detail_parts[0] # finally, we are able to extract the title tag's data
 # Takeaway:  It would be nice if there were a module that could make finding easy!
 ```
 %% Output
        <title>CS220</title>
    <title>CS220</title>
    ['<title', 'CS220</title', '']
    'CS220'
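The splitting steps above depend on the exact shape of one line. A slightly more general sketch, using only `str.find`, might look like the following. The helper name `text_between` is hypothetical (it is not part of the lecture code), and it still assumes the opening and closing tags sit on the same line and that the opening tag has no attributes.

```python
def text_between(line, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    begin = line.find(start)
    if begin == -1:          # opening marker not found
        return None
    begin += len(start)
    stop = line.find(end, begin)
    if stop == -1:           # closing marker not found
        return None
    return line[begin:stop]


title_line = "        <title>CS220</title>"
title = text_between(title_line, "<title>", "</title>")
print(title)
```

Even this version breaks as soon as the markup changes shape, which is the motivation for using a real HTML parser.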
 %% Cell type:markdown id: tags:
 ### Learning Objectives:
 - Using the Document Object Model of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an html string, identify the correct DOM tree of elements
 - Create BeautifulSoup objects from an html string and use prettify to display
 - Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
 - Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
 - Use BeautifulSoup to scrape a live web site.
 %% Cell type:markdown id: tags:
 ### Document Object Model
 In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.
 <div>
 <img src="attachment:image.png" width="600"/>
 </div>
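To make the idea of a hierarchy concrete, here is a minimal sketch that prints an indented outline of a small HTML string using Python's built-in `html.parser` module. The class name `OutlineParser` is made up for this illustration, and the sketch assumes every start tag has a matching end tag. It is not how browsers or BeautifulSoup build their trees; it only shows how nesting depth follows the tags.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect one indented line per element or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1      # everything until the end tag is nested inside

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:             # skip whitespace-only text between tags
            self.lines.append("  " * self.depth + repr(text))


parser = OutlineParser()
parser.feed("<ul><li>Eat Healthy</li><li>Sleep <b>More</b></li></ul>")
parser.close()
outline = parser.lines
print("\n".join(outline))
```

Each level of indentation in the printed outline corresponds to one level of the tree: the `li` elements are children of `ul`, and `b` is a child of the second `li`.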
 %% Cell type:markdown id: tags:
 ### Take a look at the HTML in the cell below.
 %% Cell type:markdown id: tags:
 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>
 %% Cell type:markdown id: tags:
 ### BeautifulSoup constructor
 - takes an HTML document, as a string, as its argument and parses it
 - Syntax: `BeautifulSoup(<html_string>, "html.parser")`
 - Second argument specifies which parser to use
 New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
 %% Cell type:code id: tags:
 ``` python
 html_string = """
 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>
 """
 bs_obj = BeautifulSoup(html_string, "html.parser")
 type(bs_obj)
 ```
 %% Output
    bs4.BeautifulSoup
 %% Cell type:markdown id: tags:
 ## BeautifulSoup operations
 - `prettify()`        returns a formatted representation of the raw HTML
 ### A  BeautifulSoup object can be searched for elements using:
 - `find("")`         returns the first element matching the tag string, None otherwise
 - `find_all("")`     returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
 ### Beautiful Soup Elements can be inspected by using:
 - `get_text()`     returns the text of this element and all of its descendants, with the HTML tags removed
 - `.children`      all direct children of this element (can be converted into a list)
 - `.attrs`          a `dict` of the attributes associated with that element / tag.
 %% Cell type:markdown id: tags:
 `prettify()` returns a formatted representation of the raw HTML
 %% Cell type:code id: tags:
 ``` python
 print(bs_obj.prettify())
 ```
 %% Output
    <b>
     To Do List
    </b>
    <ul>
     <li>
      Eat Healthy
     </li>
     <li>
      Sleep
      <b>
       More
      </b>
     </li>
     <li>
      Exercise
     </li>
    </ul>
 %% Cell type:markdown id: tags:
 `find` returns the first HTML 'tag' matching the string "b"
 %% Cell type:code id: tags:
 ``` python
 element = bs_obj.find("b")
 ```
 %% Cell type:markdown id: tags:
 What is the type of find's return value?
 %% Cell type:code id: tags:
 ``` python
 print(type(element))
 ```
 %% Output
    <class 'bs4.element.Tag'>
 %% Cell type:markdown id: tags:
 How do we extract the text of the "b" element and what is its type?
 %% Cell type:code id: tags:
 ``` python
 text = element.get_text()
 print(text, type(text))
 ```
 %% Output
    To Do List <class 'str'>
 %% Cell type:markdown id: tags:
 `find` returns None if it cannot find that element.
 %% Cell type:code id: tags:
 ``` python
 # assert that this html string has a <ul> tag
 assert bs_obj.find("ul") != None
 # assert that this does not have an <a> tag
 assert bs_obj.find("a") == None
 ```
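Because `find` returns `None` for a missing element, calling `get_text()` directly on its result raises an `AttributeError` when nothing matches. One way to guard against that is sketched below; the helper name `text_or_default` and the shortened HTML string are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html_string = "<b>To Do List</b><ul><li>Exercise</li></ul>"
bs_obj = BeautifulSoup(html_string, "html.parser")


def text_or_default(soup, tag_name, default=""):
    """Return the text of the first matching element, or `default` if absent."""
    element = soup.find(tag_name)
    if element is None:      # `is None` is the idiomatic test for None
        return default
    return element.get_text()


heading = text_or_default(bs_obj, "b")
missing = text_or_default(bs_obj, "a", default="(no link)")
print(heading, missing)
```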
 %% Cell type:markdown id: tags:
 `find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
 %% Cell type:code id: tags:
 ``` python
 element_list = bs_obj.find_all("b")
 element_list
 ```
 %% Output
    [<b>To Do List</b>, <b>More</b>]
 %% Cell type:markdown id: tags:
 What is the type of return value of `find_all`?
 %% Cell type:code id: tags:
 ``` python
 type(element_list)
 ```
 %% Output
    bs4.element.ResultSet
 %% Cell type:code id: tags:
 ``` python
 type(element_list[0])
 ```
 %% Output
    bs4.element.Tag
 %% Cell type:markdown id: tags:
 Use a for loop to print the text of each "b" element.
 %% Cell type:code id: tags:
 ``` python
 for element in element_list:
    print(element.get_text())
 ```
 %% Output
    To Do List
    More
 %% Cell type:markdown id: tags:
 Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
 %% Cell type:code id: tags:
 ``` python
 # only searches for elements, not text
 print(bs_obj.find_all("Sleep"))
 # if not present returns None
 print(bs_obj.find("Sleep"))
 ```
 %% Output
    []
    None
 %% Cell type:markdown id: tags:
 You can also invoke `find` or `find_all` on any element returned by a search, not only on the top-level BeautifulSoup object.
 Find all `li` elements and find `b` element inside the second `li` element.
 %% Cell type:code id: tags:
 ``` python
 li_elements = bs_obj.find_all("li")
 second_li = li_elements[1]
 second_li.find("b")
 ```
 %% Output
    <b>More</b>
 %% Cell type:markdown id: tags:
 DOM trees are hierarchical. You can use `.children` on any element to get its children.
 %% Cell type:markdown id: tags:
 Find all the children of "ul" element.
 %% Cell type:code id: tags:
 ``` python
 element = bs_obj.find("ul")
 children_list = list(element.children)
 children_list
 ```
 %% Output
    ['\n',
     <li>Eat Healthy</li>,
     '\n',
     <li>Sleep <b>More</b></li>,
     '\n',
     <li>Exercise</li>,
     '\n']
 %% Cell type:markdown id: tags:
 Find text of every child element.
 %% Cell type:code id: tags:
 ``` python
 for child in children_list:
    print(child.get_text())
 ```
 %% Output
    Eat Healthy
    Sleep More
    Exercise
 %% Cell type:markdown id: tags:
 Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns the plain text `Sleep More`.
 %% Cell type:markdown id: tags:
 To understand `attribute`, let's go back to the table from warmup 1.
 %% Cell type:code id: tags:
 ``` python
 html_string = """
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td>
    <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
    </a>
    </td>
    </tr>
 </table>
 """
 ```
 %% Cell type:markdown id: tags:
 Find the table headers.
 %% Cell type:code id: tags:
 ``` python
 bs_obj = BeautifulSoup(html_string, "html.parser")
 th_elements = bs_obj.find_all("th") # works only if there is one table in that whole HTML
 for th in th_elements:
    print(th.get_text())
 ```
 %% Output
    University
    Department
 %% Cell type:markdown id: tags:
 Find the first anchor element, extract its text.
 %% Cell type:code id: tags:
 ``` python
 anchor = bs_obj.find("a")
 print(anchor.get_text())
 ```
 %% Output
    Computer Sciences
 %% Cell type:markdown id: tags:
 You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.
 Now, let's get the attributes of the anchor element.
 %% Cell type:code id: tags:
 ``` python
 anchor_attributes = anchor.attrs
 anchor_attributes
 ```
 %% Output
    {'href': 'https://www.cs.wisc.edu/'}
 %% Cell type:markdown id: tags:
 What is the return value type of `.attrs`?
 %% Cell type:code id: tags:
 ``` python
 print(type(anchor_attributes))
 ```
 %% Output
    <class 'dict'>
 %% Cell type:markdown id: tags:
 Extract the hyperlink.
 %% Cell type:code id: tags:
 ``` python
 anchor_attributes["href"]
 ```
 %% Output
    'https://www.cs.wisc.edu/'
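Besides going through `.attrs`, a BeautifulSoup element can be indexed like a dictionary, and its `get` method returns `None` (or a default you supply) when the attribute is missing. A short sketch, using a one-line HTML string and a fallback value `"_self"` that are both made up for this example:

```python
from bs4 import BeautifulSoup

html_string = '<a href="https://www.cs.wisc.edu/">Computer Sciences</a>'
anchor = BeautifulSoup(html_string, "html.parser").find("a")

link = anchor["href"]                           # same value as anchor.attrs["href"]
target = anchor.get("target")                   # attribute is absent, so None
target_or_self = anchor.get("target", "_self")  # hypothetical fallback value
print(link, target, target_or_self)
```

Indexing with `anchor["target"]` would raise a `KeyError` here, so `get` is the safer choice when an attribute may be absent.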
 %% Cell type:markdown id: tags:
 Extract hyperlinks for each department and populate department name and link into a `dict`.
 %% Cell type:code id: tags:
 ``` python
 department_urls = {} # Key: department name; Value: website URL
 tr_elements = bs_obj.find_all("tr")
 for tr in tr_elements:
    if tr.find("td") != None: # skip the header row, which contains only th's
        anchor = tr.find("a")
        name = anchor.get_text()
        website = anchor.attrs["href"]
        department_urls[name] = website
 department_urls
 ```
 %% Output
    {'Computer Sciences': 'https://www.cs.wisc.edu/',
     'Statistics': 'https://stat.wisc.edu/',
     'CDIS': 'https://cdis.wisc.edu/',
     'Electrical Engineering and Computer Sciences\n    ': 'https://eecs.berkeley.edu/'}
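Notice that the last key above ends with a newline and spaces, because the closing `</a>` tag sits on its own line in the HTML. One possible cleanup, sketched here on a shortened copy of the table, is to pass `strip=True` to `get_text()`.

```python
from bs4 import BeautifulSoup

html_string = """
<table>
  <tr><th>University</th><th>Department</th></tr>
  <tr>
    <td>UC Berkeley</td>
    <td>
    <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
    </a>
    </td>
  </tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

department_urls = {}
for tr in bs_obj.find_all("tr"):
    anchor = tr.find("a")
    if anchor is None:       # the header row has no link
        continue
    name = anchor.get_text(strip=True)  # drops leading/trailing whitespace
    department_urls[name] = anchor.attrs["href"]
print(department_urls)
```

With `strip=True`, each piece of text inside the element is stripped separately before the pieces are joined, so for elements that contain nested tags `get_text().strip()` may be the better fit.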
 %% Cell type:markdown id: tags:
 #### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
 %% Cell type:code id: tags:
 ``` python
 # Get this page using requests.
 url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
 r = requests.get(url)
 # make sure there is no error
 r.raise_for_status()
 # read the entire contents of the page into a single string variable
 contents = r.text
 # use BeautifulSoup to extract title
 bs_obj = BeautifulSoup(contents, "html.parser")
 bs_obj.find("title").get_text()
 ```
 %% Output
    'CS220'
 %% Cell type:markdown id: tags:
 ## Parsing small_movies html table to extract `small_movies.json`
 %% Cell type:markdown id: tags:
 ### Step 1: Read `small_movies.html` content into a variable
 %% Cell type:code id: tags:
 ``` python
 f = open("small_movies.html")
 small_movies_str = f.read()
 f.close()
 # small_movies_str
 ```
 %% Cell type:markdown id: tags:
 ### Step 2: Initialize BeautifulSoup object instance
 %% Cell type:code id: tags:
 ``` python
 bs_obj = BeautifulSoup(small_movies_str, "html.parser")
 ```
 %% Cell type:markdown id: tags:
 ### Step 3: Find table element
 %% Cell type:code id: tags:
 ``` python
 table = bs_obj.find("table") # works only when you have exactly 1 table
 ```
 %% Cell type:markdown id: tags:
 ### Step 4: Find all th tags, to parse the table header
 %% Cell type:code id: tags:
 ``` python
 header = [th.get_text() for th in table.find_all('th')]
 header
 ```
 %% Output
    ['Title', 'Genre', 'Director', 'Cast', 'Year', 'Runtime', 'Rating', 'Revenue']
 %% Cell type:markdown id: tags:
 ### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
 - "Year", "Runtime": `int` conversion
 - "Revenue": format_revenue(...) conversion
 - "Rating": `float` conversion
 %% Cell type:code id: tags:
 ``` python
 def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else:                    # otherwise, assume millions.
        return float(revenue) * 1e6
 ```
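To see what each branch of `format_revenue` does, here is a small sketch that calls it on three made-up inputs. The function is repeated so that the example runs on its own; the input values are illustrative and are not taken from the movies file.

```python
def format_revenue(revenue):
    # Same logic as the helper above, repeated so this example is self-contained.
    if type(revenue) == float:
        return revenue
    elif revenue[-1] == "M":
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6


# The three inputs below are made up for illustration.
with_suffix = format_revenue("1.5M")        # trailing "M" is removed first
without_suffix = format_revenue("2")        # bare number is read as millions
already_float = format_revenue(1500000.0)   # floats pass through unchanged
print(with_suffix, without_suffix, already_float)
```

One caveat: an empty string would raise an `IndexError` at `revenue[-1]`, so this helper assumes every revenue cell contains some text.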
 %% Cell type:code id: tags:
 ``` python
 # Why second row? Because first row has the header information.
 movie = {}
 tr_elements = table.find_all('tr')
 tr = tr_elements[1]
 td_elements = tr.find_all('td')
 for idx in range(len(td_elements)):
    td = td_elements[idx]
    val = td.get_text()
    if header[idx] in ["Year", "Runtime"]:
        movie[header[idx]] = int(val)
    elif header[idx] == "Revenue":
        revenue = format_revenue(val)
        movie[header[idx]] = revenue
    elif header[idx] == "Rating":
        movie[header[idx]] = float(val)
    else:
        movie[header[idx]] = val
 movie
 ```
 %% Output
    {'Title': 'Guardians of the Galaxy',
     'Genre': 'Action,Adventure,Sci-Fi',
     'Director': 'James Gunn',
     'Cast': 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana',
     'Year': 2014,
     'Runtime': 121,
     'Rating': 8.1,
     'Revenue': 333130000.0}
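The `if`/`elif` chain in the loop above can also be written as a lookup table. The sketch below is an alternative, not the lecture's method: it pairs each header with its cell text using `zip` and looks up a conversion function in a dictionary. The row values, the name `converters`, and the simplified revenue lambda are all assumptions made for this example.

```python
header = ["Title", "Year", "Runtime", "Rating", "Revenue"]
cell_texts = ["Example Movie", "2016", "108", "7.5", "2.5M"]  # made-up row

# Map each column that needs conversion to the function that converts it.
converters = {
    "Year": int,
    "Runtime": int,
    "Rating": float,
    # Simplified stand-in for format_revenue: drop a trailing "M", scale to dollars.
    "Revenue": lambda text: float(text.rstrip("M")) * 1e6,
}

movie = {}
for column, text in zip(header, cell_texts):
    convert = converters.get(column, str)  # unlisted columns stay strings
    movie[column] = convert(text)
print(movie)
```

Adding a new typed column then only requires one more entry in `converters`, rather than another `elif` branch.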
 %% Cell type:markdown id: tags:
 ### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a row dictionary, and append the row dictionaries to a list
 - "Year", "Runtime": `int` conversion
 - "Revenue": format_revenue(...) conversion
 - "Rating": `float` conversion
 You can compare your parsing output to `small_movies.json` file contents, to confirm your result.
 %% Cell type:code id: tags:
 ``` python
 movies_data = []
 tr_elements = table.find_all('tr')
 for tr in tr_elements[1:]: # Skip first row (header row)
    movie = {}
    td_elements = tr.find_all('td')
    for idx in range(len(td_elements)):
        td = td_elements[idx]
        val = td.get_text()
        if header[idx] in ["Year", "Runtime"]:
            movie[header[idx]] = int(val)
        elif header[idx] == "Revenue":
            revenue = format_revenue(val)
            movie[header[idx]] = revenue
        elif header[idx] == "Rating":
            movie[header[idx]] = float(val)
        else:
            movie[header[idx]] = val
    movies_data.append(movie)
 movies_data
 ```
 %% Output
    [{'Title': 'Guardians of the Galaxy',
      'Genre': 'Action,Adventure,Sci-Fi',
      'Director': 'James Gunn',
      'Cast': 'Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana',
      'Year': 2014,
      'Runtime': 121,
      'Rating': 8.1,
      'Revenue': 333130000.0},
     {'Title': 'Prometheus',
      'Genre': 'Adventure,Mystery,Sci-Fi',
      'Director': 'Ridley Scott',
      'Cast': 'Noomi Rapace, Logan Marshall-Green, Michael         fassbender, Charlize Theron',
      'Year': 2012,
      'Runtime': 124,
      'Rating': 7.0,
      'Revenue': 126460000.0},
     {'Title': 'Split',
      'Genre': 'Horror,Thriller',
      'Director': 'M. Night Shyamalan',
      'Cast': 'James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula',
      'Year': 2016,
      'Runtime': 117,
      'Rating': 7.3,
      'Revenue': 138120000.0},
     {'Title': 'Sing',
      'Genre': 'Animation,Comedy,Family',
      'Director': 'Christophe Lourdelet',
      'Cast': 'Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson',
      'Year': 2016,
      'Runtime': 108,
      'Rating': 7.2,
      'Revenue': 270320000.0},
     {'Title': 'Suicide Squad',
      'Genre': 'Action,Adventure,Fantasy',
      'Director': 'David Ayer',
      'Cast': 'Will Smith, Jared Leto, Margot Robbie, Viola Davis',
      'Year': 2016,
      'Runtime': 123,
      'Rating': 6.2,
      'Revenue': 325020000.0},
     {'Title': 'The Great Wall',
      'Genre': 'Action,Adventure,Fantasy',
      'Director': 'Yimou Zhang',
      'Cast': 'Matt Damon, Tian Jing, Willem Dafoe, Andy Lau',
      'Year': 2016,
      'Runtime': 103,
      'Rating': 6.1,
      'Revenue': 45130000.0},
     {'Title': 'La La Land',
      'Genre': 'Comedy,Drama,Music',
      'Director': 'Damien Chazelle',
      'Cast': 'Ryan Gosling, Emma Stone, Rosemarie DeWitt, J.K. Simmons',
      'Year': 2016,
      'Runtime': 128,
      'Rating': 8.3,
      'Revenue': 151060000.0},
     {'Title': 'Mindhorn',
      'Genre': 'Comedy',
      'Director': 'Sean Foley',
      'Cast': 'Essie Davis, Andrea Riseborough, Julian Barratt,Kenneth Branagh',
      'Year': 2016,
      'Runtime': 89,
      'Rating': 6.4,
      'Revenue': 0.0},
     {'Title': 'The Lost City of Z',
      'Genre': 'Action,Adventure,Biography',
      'Director': 'James Gray',
      'Cast': 'Charlie Hunnam, Robert Pattinson, Sienna Miller, Tom Holland',
      'Year': 2016,
      'Runtime': 141,
      'Rating': 7.1,
      'Revenue': 8010000.0},
     {'Title': 'Passengers',
      'Genre': 'Adventure,Drama,Romance',
      'Director': 'Morten Tyldum',
      'Cast': 'Jennifer Lawrence, Chris Pratt, Michael Sheen,Laurence Fishburne',
      'Year': 2016,
      'Runtime': 116,
      'Rating': 7.0,
      'Revenue': 100010000.0}]
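Rather than comparing the parsed output with the JSON file by eye, the check can be scripted. The sketch below assumes the JSON file holds a list of dictionaries with the same keys and value types as the parsed rows; the helper name `matches_json` is hypothetical, and a temporary file with one made-up row stands in for `small_movies.json` so that the example runs anywhere.

```python
import json
import os
import tempfile


def matches_json(rows, json_path):
    """Return True if `rows` equals the data stored in the JSON file."""
    with open(json_path) as f:
        expected = json.load(f)
    return rows == expected


# Made-up data and a temporary file standing in for small_movies.json.
rows = [{"Title": "Example Movie", "Year": 2016, "Rating": 7.5}]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    same = matches_json(rows, path)
    different = matches_json([], path)
print(same, different)
```

With the real files in place, the call would be `matches_json(movies_data, "small_movies.json")`.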
 %% Cell type:markdown id: tags:
 ### Final step: convert steps 1 through 6 into a function and use that function to parse `full_movies.html` file.
 %% Cell type:code id: tags:
 ``` python
 def parse_html(html_file):
    with open(html_file) as f: # the with block closes the file automatically
        html_str = f.read()
    bs_obj = BeautifulSoup(html_str, "html.parser")
    table = bs_obj.find("table") # works only when you have exactly 1 table
    header = [th.get_text() for th in table.find_all('th')]
    movies_data = []
    tr_elements = table.find_all('tr')
    for tr in tr_elements[1:]: # Skip first row (header row)
        movie = {}
        td_elements = tr.find_all('td')
        for idx in range(len(td_elements)):
            td = td_elements[idx]
            val = td.get_text()
            if header[idx] in ["Year", "Runtime"]:
                movie[header[idx]] = int(val)
            elif header[idx] == "Revenue":
                revenue = format_revenue(val)
                movie[header[idx]] = revenue
            elif header[idx] == "Rating":
                movie[header[idx]] = float(val)
            else:
                movie[header[idx]] = val
        movies_data.append(movie)
    return movies_data
 ```
 %% Cell type:code id: tags:
 ``` python
 full_movies_data = parse_html("full_movies.html")
 # full_movies_data
 ```

--- a/f23/Gurmail_Lecture_Notes/31_Web-3/lec_31_web3_template_Gurmail_lec1.ipynb
+++ b/f23/Gurmail_Lecture_Notes/31_Web-3/lec_31_web3_template_Gurmail_lec1.ipynb
@@ -71,7 +71,7 @@
   "outputs": [],
   "source": [
    "# Get this page using requests.  \n",
-    "url = \"https://cs220.cs.wisc.edu/s23/syllabus.html\"\n",
+    "url = \"https://cs220.cs.wisc.edu/f23/syllabus.html\"\n",
    "\n",
    "# make sure there is no error\n",
    "\n",

 %% Cell type:markdown id: tags:
 # Web3: Scraping Web Data
 %% Cell type:code id: tags:
 ``` python
 # import statements
 import requests
 from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
 ```
 %% Cell type:markdown id: tags:
 ### Warmup 1: HTML table and hyperlinks
 In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
 TODO: Add another row or two to the table below
 %% Cell type:markdown id: tags:
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
    </tr>
 </table>
 %% Cell type:markdown id: tags:
 ### Warmup 2: Scraping data from syllabus page
 URL: https://cs220.cs.wisc.edu/s23/syllabus.html
 %% Cell type:code id: tags:
 ``` python
 # Get this page using requests.
-url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
+url = "https://cs220.cs.wisc.edu/f23/syllabus.html"
 # make sure there is no error
 # read the entire contents of the page into a single string variable
 # split the contents into list of strings using newline separator
 ```
 %% Cell type:markdown id: tags:
 #### Warmup 2a: Find all sentences that contain "CS220"
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 #### Warmup 2b: Extract title tag's value
 %% Cell type:code id: tags:
 ``` python
 # finally, we are able to extract the title tag's data
 # Takeaway:  It would be nice if there were a module that could make finding easy!
 ```
 %% Cell type:markdown id: tags:
 ### Learning Objectives:
 - Using the Document Object Model of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an html string, identify the correct DOM tree of elements
 - Create BeautifulSoup objects from an html string and use prettify to display
 - Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
 - Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
 - Use BeautifulSoup to scrape a live web site.
 %% Cell type:markdown id: tags:
 ### Document Object Model
 In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.
 <div>
 <img src="attachment:image.png" width="600"/>
 </div>
 %% Cell type:markdown id: tags:
 ### Take a look at the HTML in the cell below.
 %% Cell type:markdown id: tags:
 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>
 %% Cell type:markdown id: tags:
 ### BeautifulSoup constructor
 - takes an HTML document, as a string, as its argument and parses it
 - Syntax: `BeautifulSoup(<html_string>, "html.parser")`
 - Second argument specifies which parser to use
 New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
 %% Cell type:code id: tags:
 ``` python
 html_string = """
 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>
 """
 type(bs_obj)
 ```
 %% Cell type:markdown id: tags:
 ## BeautifulSoup operations
 - `prettify()`        returns a formatted representation of the raw HTML
 ### A  BeautifulSoup object can be searched for elements using:
 - `find("")`         returns the first element matching the tag string, None otherwise
 - `find_all("")`     returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
 ### Beautiful Soup Elements can be inspected by using:
 - `get_text()`     returns the text of this element and all of its descendants, with the HTML tags removed
 - `.children`      all direct children of this element (can be converted into a list)
 - `.attrs`          a `dict` of the attributes associated with that element / tag.
 %% Cell type:markdown id: tags:
 `prettify()` returns a formatted representation of the raw HTML
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 `find` returns the first HTML 'tag' matching the string "b"
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 What is the type of find's return value?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 How do we extract the text of the "b" element and what is its type?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 `find` returns None if it cannot find that element.
 %% Cell type:code id: tags:
 ``` python
 # assert that this html string has a <ul> tag
 # assert that this does not have an <a> tag
 ```
 %% Cell type:markdown id: tags:
 `find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 What is the type of return value of `find_all`?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Use a for loop to print the text of each "b" element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
 %% Cell type:code id: tags:
 ``` python
 # only searches for elements, not text
 print(bs_obj.find_all("Sleep"))
 # if not present returns None
 print(bs_obj.find("Sleep"))
 ```
 %% Cell type:markdown id: tags:
 You can also invoke `find` or `find_all` on any element returned by a search, not only on the top-level BeautifulSoup object.
 Find all `li` elements and find `b` element inside the second `li` element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 DOM trees are hierarchical. You can use `.children` on any element to get its children.
 %% Cell type:markdown id: tags:
 Find all the children of "ul" element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Find text of every child element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns the plain text `Sleep More`.
 %% Cell type:markdown id: tags:
 To understand `attribute`, let's go back to the table from warmup 1.
 %% Cell type:code id: tags:
 ``` python
 html_string = """
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td>
    <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
    </a>
    </td>
    </tr>
 </table>
 """
 ```
 %% Cell type:markdown id: tags:
 Find the table headers.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Find the first anchor element, extract its text.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.
 Now, let's get the attributes of the anchor element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 What is the return value type of `.attrs`?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Extract the hyperlink.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Extract hyperlinks for each department and populate department name and link into a `dict`.
 %% Cell type:code id: tags:
 ``` python
 department_urls = {} # Key: department name; Value: website URL
 ```
 %% Cell type:markdown id: tags:
 #### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
 %% Cell type:code id: tags:
 ``` python
 # Get this page using requests.
 url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
 # make sure there is no error
 # read the entire contents of the page into a single string variable
 # use BeautifulSoup to extract title
 ```
 %% Cell type:markdown id: tags:
 ## Parsing small_movies html table to extract `small_movies.json`
 %% Cell type:markdown id: tags:
 ### Step 1: Read `small_movies.html` content into a variable
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 2: Initialize BeautifulSoup object instance
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 3: Find table element
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 4: Find all th tags, to parse the table header
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
 - "Year", "Runtime": `int` conversion
 - "Revenue": format_revenue(...) conversion
 - "Rating": `float` conversion
 %% Cell type:code id: tags:
 ``` python
 def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else:                    # otherwise, assume millions.
        return float(revenue) * 1e6
 ```
 %% Cell type:code id: tags:
 ``` python
 # Why second row? Because first row has the header information.
 ```
 %% Cell type:markdown id: tags:
 ### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a row dictionary, and append the row dictionaries to a list
 - "Year", "Runtime": `int` conversion
 - "Revenue": format_revenue(...) conversion
 - "Rating": `float` conversion
 You can compare your parsing output to `small_movies.json` file contents, to confirm your result.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Final step: convert steps 1 through 6 into a function and use that function to parse `full_movies.html` file.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 full_movies_data = parse_html("full_movies.html")
 # full_movies_data
 ```

--- a/f23/Gurmail_Lecture_Notes/31_Web-3/lec_31_web3_template_Gurmail_lec2.ipynb
+++ b/f23/Gurmail_Lecture_Notes/31_Web-3/lec_31_web3_template_Gurmail_lec2.ipynb
@@ -71,7 +71,7 @@
   "outputs": [],
   "source": [
    "# Get this page using requests.  \n",
-    "url = \"https://cs220.cs.wisc.edu/s23/syllabus.html\"\n",
+    "url = \"https://cs220.cs.wisc.edu/f23/syllabus.html\"\n",
    "\n",
    "# make sure there is no error\n",
    "\n",

 %% Cell type:markdown id: tags:
 # Web3: Scraping Web Data
 %% Cell type:code id: tags:
 ``` python
 # import statements
 import requests
 from bs4 import BeautifulSoup # bs4 is the module, BeautifulSoup is the type
 ```
 %% Cell type:markdown id: tags:
 ### Warmup 1: HTML table and hyperlinks
 In order to scrape web pages, you need to know the HTML syntax for tables and hyperlinks.
 TODO: Add another row or two to the table below
 %% Cell type:markdown id: tags:
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td><a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences</a></td>
    </tr>
 </table>
 %% Cell type:markdown id: tags:
 ### Warmup 2: Scraping data from syllabus page
 URL: https://cs220.cs.wisc.edu/s23/syllabus.html
 %% Cell type:code id: tags:
 ``` python
 # Get this page using requests.
-url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
+url = "https://cs220.cs.wisc.edu/f23/syllabus.html"
 # make sure there is no error
 # read the entire contents of the page into a single string variable
 # split the contents into list of strings using newline separator
 ```
 %% Cell type:markdown id: tags:
 #### Warmup 2a: Find all sentences that contain "CS220"
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 #### Warmup 2b: Extract title tag's value
 %% Cell type:code id: tags:
 ``` python
 # finally, we are able to extract the title tag's data
 # Takeaway:  It would be nice if there were a module that could make finding easy!
 ```
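A minimal sketch of Warmup 2a and 2b using only string methods. It assumes a small hard-coded page: `sample_page` and its contents are made up for illustration and stand in for the text downloaded from the syllabus URL, so no network access is needed.

```python
# Hypothetical stand-in for the downloaded page text (not the real syllabus)
sample_page = """<html>
<head>
<title>CS220 Syllabus</title>
</head>
<body>
<p>Welcome to CS220.</p>
<p>Office hours are posted online.</p>
</body>
</html>"""

# Warmup 2a: split into lines and keep the ones that mention "CS220"
lines = sample_page.split("\n")
cs220_lines = [line for line in lines if "CS220" in line]

# Warmup 2b: slice out whatever sits between <title> and </title>
start = sample_page.find("<title>") + len("<title>")
end = sample_page.find("</title>")
title = sample_page[start:end]

print(cs220_lines)
print(title)
```

The slicing works here, but it depends on the exact spelling of the tags, which is part of why a dedicated parsing module is attractive.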
 %% Cell type:markdown id: tags:
 ### Learning Objectives:
 - Using the Document Object Model of web pages
    - describe the 3 things a DOM element may contain, and give examples of each
    - given an html string, identify the correct DOM tree of elements
 - Create BeautifulSoup objects from an html string and use prettify to display
 - Use the BeautifulSoup methods 'find' and 'find_all' to find particular elements by their tag
 - Inspect a BeautifulSoup element to determine the contents of a web page using get_text(), children, and attrs
 - Use BeautifulSoup to scrape a live web site.
 %% Cell type:markdown id: tags:
 ### Document Object Model
 In order to render an HTML page, most web browsers use a tree structure called the Document Object Model (DOM) to represent the HTML page as a hierarchy of elements.
 <div>
 <img src="attachment:image.png" width="600"/>
 </div>
 %% Cell type:markdown id: tags:
 ### Take a look at the HTML in the below cell.
 %% Cell type:markdown id: tags:
 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>
 %% Cell type:markdown id: tags:
 ### BeautifulSoup constructor
 - takes HTML, as a string, as its first argument and parses it
 - Syntax: `BeautifulSoup(<html_string>, "html.parser")`
 - The second argument specifies which parser to use (here, Python's built-in `html.parser`)
 New syntax: you can use `"""some really long string"""` to split a string across multiple lines.
 %% Cell type:code id: tags:
 ``` python
 html_string = """
 <b>To Do List</b>
 <ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
 </ul>
 """
 bs_obj = BeautifulSoup(html_string, "html.parser") # parse the HTML string
 type(bs_obj)
 ```
 %% Cell type:markdown id: tags:
 ## BeautifulSoup operations
 - `prettify()`        returns a formatted representation of the raw HTML
 ### A  BeautifulSoup object can be searched for elements using:
 - `find("")`         returns the first element matching the tag string, None otherwise
 - `find_all("")`     returns an iterable of all matching elements (HTML 'tags'), empty iterable otherwise
 ### Beautiful Soup Elements can be inspected by using:
 - `get_text()`     returns the text inside this element, including the text of its child elements, with the HTML tags stripped
 - `.children`      all children of this element (can be converted into a list)
 - `.attrs`          the attributes associated with that element / tag, as a `dict`
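As a rough sketch of how these operations fit together, the block below runs them on the to-do list HTML from earlier. The variable names `first_b` and `all_b` are illustrative and not part of the lecture code.

```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# prettify() returns a string with one tag per line, indented by depth
print(bs_obj.prettify())

# find() returns only the first <b> element in the document
first_b = bs_obj.find("b")
print(first_b.get_text())

# find_all() returns every <b> element, in document order
all_b = bs_obj.find_all("b")
for b in all_b:
    print(b.get_text())
```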
 %% Cell type:markdown id: tags:
 `prettify()` returns a formatted representation of the raw HTML
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 `find` returns the first HTML 'tag' matching the string "b"
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 What is the type of find's return value?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 How do we extract the text of the "b" element and what is its type?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 `find` returns None if it cannot find that element.
 %% Cell type:code id: tags:
 ``` python
 # assert that this html string has a <ul> tag
 # assert that this does not have an <a> tag
 ```
 %% Cell type:markdown id: tags:
 `find_all` returns an iterable of all elements (HTML 'tags') matching the string "b"
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 What is the type of return value of `find_all`?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Use a for loop to print the text of each "b" element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Unlike `find`, `find_all` returns an empty iterable when there are no matching elements.
 %% Cell type:code id: tags:
 ``` python
 # only searches for elements, not text
 print(bs_obj.find_all("Sleep"))
 # if not present returns None
 print(bs_obj.find("Sleep"))
 ```
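One way to see the difference between the two "not found" results, sketched on a shortened version of the to-do list (the one-line HTML string and the name `missing` are assumptions made for this example):

```python
from bs4 import BeautifulSoup

bs_obj = BeautifulSoup(
    "<ul><li>Eat Healthy</li><li>Sleep <b>More</b></li></ul>", "html.parser"
)

# find() gives back None for a missing tag, so check before using the result
ul_element = bs_obj.find("ul")
a_element = bs_obj.find("a")
print(ul_element is not None)
print(a_element is None)

# find_all() gives back an empty iterable, so a loop over it simply does nothing
missing = bs_obj.find_all("a")
print(len(missing))
for element in missing:
    print(element.get_text())  # never runs
```

Calling a method such as `get_text()` directly on the result of a failed `find` would raise an `AttributeError`, because the result is `None`.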
 %% Cell type:markdown id: tags:
 You can invoke `find` or `find_all` on any BeautifulSoup element, not just on the top-level object.
 Find all `li` elements, then find the `b` element inside the second `li` element.
 %% Cell type:code id: tags:
 ``` python
 ```
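A possible version of this nested search, assuming the same to-do list HTML as before (`li_elements`, `second_li`, and `bold` are names chosen for this sketch):

```python
from bs4 import BeautifulSoup

html_string = """
<b>To Do List</b>
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# Search the whole document for <li> elements
li_elements = bs_obj.find_all("li")

# Search only inside the second <li>, so the top-level <b> is not considered
second_li = li_elements[1]
bold = second_li.find("b")
print(bold.get_text())
```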
 %% Cell type:markdown id: tags:
 DOM trees are hierarchical. You can use `.children` on any element to get its children.
 %% Cell type:markdown id: tags:
 Find all the children of "ul" element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Find text of every child element.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Notice that `get_text()` only returns the actual text and not the HTML formatting. For example, part of the second child element's text is enclosed within `<b>More</b>`. Despite that, `get_text()` returns the plain text `Sleep More` for that element.
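The following sketch walks the children of the "ul" element and collects their text. It assumes the to-do list HTML from earlier. Filtering with `isinstance(child, Tag)` is a choice made for this example: with `html.parser`, the whitespace between tags also appears among the children as plain strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_string = """
<ul>
    <li>Eat Healthy</li>
    <li>Sleep <b>More</b></li>
    <li>Exercise</li>
</ul>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")
ul = bs_obj.find("ul")

# .children is an iterator; convert it to a list to inspect it
children = list(ul.children)

# Keep only the element children and drop the whitespace strings
li_children = [child for child in children if isinstance(child, Tag)]

# get_text() flattens each <li>, including the nested <b>
texts = [child.get_text() for child in li_children]
print(texts)
```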
 %% Cell type:markdown id: tags:
 To understand `attribute`, let's go back to the table from warmup 1.
 %% Cell type:code id: tags:
 ``` python
 html_string = """
 <table>
  <tr>
    <th>University</th>
    <th>Department</th>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://stat.wisc.edu/">Statistics</a></td>
  </tr>
   <tr>
    <td>UW-Madison</td>
    <td><a href = "https://cdis.wisc.edu/">CDIS</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td>
    <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
    </a>
    </td>
    </tr>
 </table>
 """
 ```
 %% Cell type:markdown id: tags:
 Find the table headers.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Find the first anchor element, extract its text.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 You can get the attributes associated with an element using `.attrs` on that element object. The return value is a `dict` mapping each attribute name to its value.
 Now, let's get the attributes of the anchor element.
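A short sketch of reading an anchor's attributes, using a one-row excerpt of the warmup table (the excerpt and the names `anchor` and `link` are chosen for this example):

```python
from bs4 import BeautifulSoup

html_string = """
<table>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

# First anchor element and its visible text
anchor = bs_obj.find("a")
print(anchor.get_text())

# .attrs is a dict, so the hyperlink is looked up by the key "href"
print(type(anchor.attrs))
link = anchor.attrs["href"]
print(link)
```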
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 What is the return value type of `.attrs`?
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Extract the hyperlink.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Extract hyperlinks for each department and populate department name and link into a `dict`.
 %% Cell type:code id: tags:
 ``` python
 department_urls = {} # Key: department name; Value: website URL
 ```
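One way to populate `department_urls`, sketched on a two-row excerpt of the table. The `strip()` call is an addition for this example: the UC Berkeley anchor spans several lines, so its text carries trailing whitespace.

```python
from bs4 import BeautifulSoup

html_string = """
<table>
  <tr>
    <td>UW-Madison</td>
    <td><a href = "https://www.cs.wisc.edu/">Computer Sciences</a></td>
  </tr>
  <tr>
    <td>UC Berkeley</td>
    <td>
    <a href = "https://eecs.berkeley.edu/">Electrical Engineering and Computer Sciences
    </a>
    </td>
  </tr>
</table>
"""
bs_obj = BeautifulSoup(html_string, "html.parser")

department_urls = {} # Key: department name; Value: website URL
for anchor in bs_obj.find_all("a"):
    name = anchor.get_text().strip()   # remove whitespace around the name
    department_urls[name] = anchor.attrs["href"]

print(department_urls)
```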
 %% Cell type:markdown id: tags:
 #### Self-practice: Extract title of the CS220 syllabus page (from warmup 2)
 %% Cell type:code id: tags:
 ``` python
 # Get this page using requests.
 url = "https://cs220.cs.wisc.edu/s23/syllabus.html"
 # make sure there is no error
 # read the entire contents of the page into a single string variable
 # use BeautifulSoup to extract title
 ```
 %% Cell type:markdown id: tags:
 ## Parsing small_movies html table to extract `small_movies.json`
 %% Cell type:markdown id: tags:
 ### Step 1: Read `small_movies.html` content into a variable
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 2: Initialize BeautifulSoup object instance
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 3: Find table element
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 4: Find all th tags, to parse the table header
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Step 5: Scrape second row, convert data to appropriate types, and populate data into a row dictionary
 - "Year", "Runtime": `int` conversion
 - "Revenue": format_revenue(...) conversion
 - "Rating": `float` conversion
 %% Cell type:code id: tags:
 ``` python
 def format_revenue(revenue):
    if type(revenue) == float: # need this in here if we run code multiple times
        return revenue
    elif revenue[-1] == 'M': # some have an "M" at the end
        return float(revenue[:-1]) * 1e6
    else:                    # otherwise, assume millions.
        return float(revenue) * 1e6
 ```
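To illustrate the three branches of `format_revenue`, here is a small usage sketch. The function is repeated so the block runs on its own, and the input values are made up for this example rather than taken from the movies files.

```python
def format_revenue(revenue):
    if type(revenue) == float: # already converted
        return revenue
    elif revenue[-1] == 'M':   # string with an "M" suffix
        return float(revenue[:-1]) * 1e6
    else:                      # bare number, assumed to be in millions
        return float(revenue) * 1e6

with_suffix = format_revenue("2.5M")    # "M" is stripped before converting
without_suffix = format_revenue("100")  # treated as 100 million
already_float = format_revenue(1.5e6)   # returned unchanged

print(with_suffix, without_suffix, already_float)
```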
 %% Cell type:code id: tags:
 ``` python
 # Why second row? Because first row has the header information.
 ```
 %% Cell type:markdown id: tags:
 ### Step 6: Scrape all rows, convert data to appropriate types, populate each row's data into a row dictionary, and append the row dictionaries to a list
 - "Year", "Runtime": `int` conversion
 - "Revenue": format_revenue(...) conversion
 - "Rating": `float` conversion
 To confirm your result, compare your parsed output with the contents of the `small_movies.json` file.
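The steps above could be sketched roughly as follows. This is not the lecture's solution: the inline `html` string is a hypothetical stand-in for `small_movies.html` (the "Title" column and all values are invented), and `parse_table` is a name chosen for this example that takes HTML text rather than a file name.

```python
from bs4 import BeautifulSoup

def format_revenue(revenue):
    if type(revenue) == float:
        return revenue
    elif revenue[-1] == 'M':
        return float(revenue[:-1]) * 1e6
    else:
        return float(revenue) * 1e6

# Hypothetical table; the real file's columns and values may differ
html = """
<table>
  <tr><th>Title</th><th>Year</th><th>Runtime</th><th>Rating</th><th>Revenue</th></tr>
  <tr><td>Movie A</td><td>2014</td><td>121</td><td>8.5</td><td>2.5M</td></tr>
  <tr><td>Movie B</td><td>2016</td><td>98</td><td>7.0</td><td>100</td></tr>
</table>
"""

def parse_table(html_text):
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    header = [th.get_text() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:   # skip the header row
        cells = [td.get_text() for td in tr.find_all("td")]
        row = {}
        for key, value in zip(header, cells):
            if key in ("Year", "Runtime"):
                row[key] = int(value)
            elif key == "Revenue":
                row[key] = format_revenue(value)
            elif key == "Rating":
                row[key] = float(value)
            else:
                row[key] = value          # leave other columns as strings
        rows.append(row)
    return rows

movies = parse_table(html)
print(movies)
```

Pairing the header names with each row's cells through `zip` keeps the conversion logic keyed by column name rather than by column position.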
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### Final step: convert steps 1 through 6 into a function and use that function to parse the `full_movies.html` file.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 full_movies_data = parse_html("full_movies.html")
 # full_movies_data
 ```