Skip to content
Snippets Groups Projects
Commit c401ade0 authored by JINLANG WANG's avatar JINLANG WANG
Browse files

Update 2 files

- /Labs/Lab9/EDGAR.md
- /Labs/Lab9/README.md
parent c99c65b5
No related branches found
No related tags found
No related merge requests found
Pipeline #798451 passed
# Lab 10: EDGAR Utilities Module
## Overview
In the US, public companies need to regularly file various statements
and reports to the SEC's (Securities and Exchange Commission) EDGAR
database. EDGAR data is publicly available online; furthermore, web
requests to EDGAR from around the world are logged and published. For
P5, you'll analyze both SEC filing HTML pages and a log of web
requests from around the world for those pages.
In this lab, you'll create an `edgar_utils.py` to help with analyzing
the pages and logs. It will contain two things: a `lookup_region`
function and a `Filing` class.
## Practice for `lookup_region`
For "practice" components of this lab, you'll do exercises in a
notebook to get the code and logic correct. You'll then use what you
learn to write your `edgar_utils.py` module.
### Exercise 1: replace letters
For the project dataset, you'll be working with some IP addresses
where some of the digits have been replaced with letters for
anonymization.
For some calculations, we need only digits, so we'll replace any
letters with "0". Complete the following regex code to get back
"101.1.1.000":
```python
import re
ipaddr = "101.1.1.abc"
re.sub(????, ????, ipaddr)
```
### Exercise 2: integer IPs
Note if you haven't installed netaddr yet from p5, please install it from the command line via:
``` pip3 install netaddr```
IP addresses are commonly represented as four-part numbers, like
"34.67.75.25". To convert an address like this to an integer, you can
use the following::
```python
import netaddr
int(netaddr.IPAddress("34.67.75.25"))
```
Modify the above to lookup the integer representation of your virtual
machine's IP address.
### Exercise 3: binary search
Consider the following sorted list:
```python
L = [1, 2, 100, 222, 900]
```
Running `150 in L` would loop over every element in the list. This is
slow, and doesn't take advantage of the fact that the list is sorted.
A better strategy when we know it is sorted would be to check the
middle (100) and infer 150 must be in the second half of the list, if
it's in the list at all; no need to check the first half.
A famous algorithm that uses this strategy is called *binary search*, and it's implemented by this function: https://docs.python.org/dev/library/bisect.html#bisect.bisect
Try it:
```python
from bisect import bisect
idx = bisect(L, 150)
idx
```
You should get `3` -- this means that if you wanted to add 150 to the
list and keep it in sorted order, you would insert 150 at index 3
(after 1, 2, and 100). This also means `L[idx-1]` is the biggest
number in the list that is **less than or equal** to 150.
*What will bisect of `L` be for 225?* Write down your prediction, then
run code to check your answer.
### Exercise 4: country/region lookup
You can generally guess what country or region a computer is in based
on its IP address. Read in `ip2location.csv` from the project to see
the IP ranges assigned to each region (this is borrowed from
https://lite.ip2location.com/database/ip-country).
```python
ips = pd.read_csv("ip2location.csv")
ips
```
Can you use (a) `bisect` on the `low` column of `ips` and (b) the
integer representation of your VM's IP address to (c) find an `idx`
for the row in `ips` corresponding to your VM?
Look at `ips.iloc[idx]` to make sure you found the correct row.
## `lookup_region` function
Write an efficient `lookup_region` function in your `edgar_utils.py`
module that takes an IP address (in string form) and returns the
country or region the corresponding computer is in. You can import it
and test it in Jupyter notebooks or Python interactive mode.
Example usage:
```python
>>> lookup_region("1.1.1.x")
'United States of America'
>>> lookup_region("101.1.1.abc")
'China'
```
Requirements:
* it needs to worked with anonymized IPs
* don't read the CSV file each time `lookup_region` is called
* don't loop over every row each time `lookup_region` is called -- your code needs to be faster than O(N)
## Practice for `Filing` class
Copy/paste the following string to a notebook:
```python
html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Last-Modified" content="Fri, 12 Feb 2016 00:05:37 GMT" />
<title>EDGAR Filing Documents for 0001050470-16-000051</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<!-- SEC Web Analytics - For information please visit: https://www.sec.gov/privacy.htm#collectedinfo -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-TD3BKV"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-TD3BKV');</script>
<!-- End SEC Web Analytics -->
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
<div id="headerTop">
<div id="Nav"><a href="/index.htm">Home</a> | <a href="/cgi-bin/browse-edgar?action=getcurrent">Latest Filings</a> | <a href="javascript:history.back()">Previous Page</a></div>
<div id="seal"><a href="/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a></div>
<div id="secWordGraphic"><img src="/images/bannerTitle.gif" alt="SEC Banner" /></div>
</div>
<div id="headerBottom">
<div id="searchHome"><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a></div>
<div id="PageTitle">Filing Detail</div>
</div>
<!-- END BANNER -->
<!-- BEGIN BREADCRUMBS -->
<div id="breadCrumbs">
<ul>
<li><a href="/index.htm">SEC Home</a> &#187;</li>
<li><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a> &#187;</li>
<li><a href="/edgar/searchedgar/companysearch.html">Company Search</a> &#187;</li>
<li class="last">Current Page</li>
</ul>
</div>
<!-- END BREADCRUMBS -->
<div id="contentDiv">
<!-- START FILING DIV -->
<div id="formDiv">
<div id="formHeader">
<div id="formName">
<strong>Form SC 13G</strong> - Statement of acquisition of beneficial ownership by individuals:
</div>
<div id="secNum">
<strong><acronym title="Securities and Exchange Commission">SEC</acronym> Accession <acronym title="Number">No.</acronym></strong> 0001050470-16-000051
</div>
</div>
<div class="formContent">
<div class="formGrouping">
<div class="infoHead">Filing Date</div>
<div class="info">2016-02-12</div>
<div class="infoHead">Accepted</div>
<div class="info">2016-02-11 19:05:37</div>
<div class="infoHead">Documents</div>
<div class="info">1</div>
</div>
<div style="clear:both"></div>
</div>
</div>
<!-- END FILING DIV -->
<!-- START DOCUMENT DIV -->
<div id="formDiv">
<div style="padding: 0px 0px 4px 0px; font-size: 12px; margin: 0px 2px 0px 5px; width: 100%; overflow:hidden">
<p>Document Format Files</p>
<table class="tableFile" summary="Document Format Files">
<tr>
<th scope="col" style="width: 5%;"><acronym title="Sequence Number">Seq</acronym></th>
<th scope="col" style="width: 40%;">Description</th>
<th scope="col" style="width: 20%;">Document</th>
<th scope="col" style="width: 10%;">Type</th>
<th scope="col">Size</th>
</tr>
<tr>
<td scope="row">1</td>
<td scope="row">LSV13G123115MEDALLION.TXT</td>
<td scope="row"><a href="/Archives/edgar/data/1000209/000105047016000051/lsv13g123115medallion.txt">lsv13g123115medallion.txt</a></td>
<td scope="row">SC 13G</td>
<td scope="row">8314</td>
</tr>
<tr class="blueRow">
<td scope="row">&nbsp;</td>
<td scope="row">Complete submission text file</td>
<td scope="row"><a href="/Archives/edgar/data/1000209/000105047016000051/0001050470-16-000051.txt">0001050470-16-000051.txt</a></td>
<td scope="row">&nbsp;</td>
<td scope="row">9803</td>
</tr>
</table>
</div>
</div>
<!-- END DOCUMENT DIV -->
<!-- START FILER DIV -->
<div id="filerDiv">
<div class="mailer">Mailing Address
<span class="mailerAddress">437 MADISON AVENUE</span>
<span class="mailerAddress">38TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
</div>
<div class="mailer">Business Address
<span class="mailerAddress">437 MADISON AVE 38 TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
<span class="mailerAddress">2123282153</span>
</div>
<div class="companyInfo">
<span class="companyName">MEDALLION FINANCIAL CORP (Subject)
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001000209&amp;action=getcompany">0001000209 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>043291176</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>SC 13G</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=005-48473&amp;action=getcompany"><strong>005-48473</strong></a> | Film No.: <strong>161413579</strong><br /><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=6199&amp;owner=include">6199</a></b> Finance Services<br />Office of Finance</p>
</div>
<div class="clear"></div>
</div>
<div id="filerDiv">
<div class="mailer">Mailing Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
</div>
<div class="mailer">Business Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
<span class="mailerAddress">312-460-2443</span>
</div>
<div class="companyInfo">
<span class="companyName">LSV ASSET MANAGEMENT (Filed by)
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001050470&amp;action=getcompany">0001050470 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>232772200</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>SC 13G</strong></p>
</div>
<div class="clear"></div>
</div>
<!-- END FILER DIV -->
</div>"""
```
### Exercise 1: dates
Write a regular expression finding all the dates with the `YYYY-MM-DD`
format in the document. It's OK if you have extra groups within your
matches.
```python
import re
re.findall(r"????", html)
```
You might have some extra matches like `0470-16-00` that are clearly not dates. Add some additional filtering so that you only count 4-digit numbers as years if they start as 19XX or 20XX. Similarly, add additional filtering for example like `2044-16-00`, so that you only only count 2-digit numbers as months if they are within 01 and 12. You could do this in the regular expression itself, or in some additional Python code that loops over the results of the regular expression.
### Exercise 2: Standard Industrial Classification (SIC) codes
Take a look at the industry codes defined here: https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list
Write a regular expression to find numbers following the text
"SIC=" in `html`.
### Exercise 3: Addresses
To find addresses, we'll look at the HTML. We want to find the
contents between `<div class="mailer">` and `</div>`. We want the
non-greedy version (meaning that if there are multiple `</div>`
instances, we want to match the first one possible).
Write a regular expression to do this:
```python
for addr_html in re.findall(????, html):
print(addr_html)
```
Note, the text between the opening and closing `div` tags generally
spans multiple lines. Remember that `.` does not match newlines. One
way to match anything is with `[\s\S]` (meaning whitespace or not
whitespace; in other words, everything). If you get very stuck on
this one, you can scroll past the expected output to see one solution
for finding `addr_html` matches.
Expected output:
```
Mailing Address
<span class="mailerAddress">437 MADISON AVENUE</span>
<span class="mailerAddress">38TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
Business Address
<span class="mailerAddress">437 MADISON AVE 38 TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
<span class="mailerAddress">2123282153</span>
Mailing Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
Business Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
<span class="mailerAddress">312-460-2443</span>
```
Now, extend your above loop so that a further regex search is
conducted for address lines within each `addr_html` match. Address
lines are between `<span class="mailerAddress">` and `</span>`:
```python
for addr_html in re.findall(r'<div class="mailer">([\s\S]+?)</div>', html):
lines = []
for line in re.findall(????, addr_html):
lines.append(line.strip())
print("\n".join(lines))
print()
```
## `Filing` class
Add a `Filing` class to your `edgar_utils.py` like this:
```python
class Filing:
def __init__(self, html):
self.dates = ????
self.sic = ????
self.addresses = ????
def state(self):
return "TODO"
```
`html` will be an HTML string, much like the one you were working with
in the practices. Fill in the missing parts and add additional lines
as needed. Much of the code from the practice exercises will be
useful here.
* `dates` should be a list of dates in the `YYYY-MM-DD` format that appear in the HTML (only count years starting as 19XX or 20XX with reasonable months and dates).
* `sic` should be an `int` indicating the Standard Industrial Classification. It should be `None` if this doesn't appear.
* `addresses` should be a list of addresses found in the HTML. Each address will contain the address lines separated by newlines, but otherwise there shouldn't be unnecessary whitespace.(i.e. `['437 MADISON AVENUE\n38TH FLOOR\nNEW YORK NY 10022','155 NORTH WACKER DRIVE\nSUITE 4600\nCHICAGO IL 60606']` note this is just an example **not** the answer.)
* `state()` should loop over the addresses. If it finds one that contains two capital letters followed by 5 digits (for example, `WI 53706`), it should return what appears to be a state abbreviation (for example `WI`). You don't need to check that the abbreviation is a valid state. If nothing that looks like a state abbreviation appears, return `None`. Note: It must be exactly 2 capital letters and 5 digits, for example, you can't get `OX 12345` in `BOX 12345`.
\ No newline at end of file
# Lab 9: Regex
1. Continue working on [Part 3 of P4](EDGAR.md).
# Screenshot Requirement
A screenshot that shows your progress
\ No newline at end of file
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment