In the US, public companies need to regularly file various statements
and reports to the SEC's (Securities and Exchange Commission) EDGAR
database. EDGAR data is publicly available online; furthermore, web
requests to EDGAR from around the world are logged and published. For
P5, you'll analyze both SEC filing HTML pages and a log of web
requests from around the world for those pages.
In this lab, you'll create an `edgar_utils.py` to help with analyzing
the pages and logs. It will contain two things: a `lookup_region`
function and a `Filing` class.
## Practice for `lookup_region`
For "practice" components of this lab, you'll do exercises in a
notebook to get the code and logic correct. You'll then use what you
learn to write your `edgar_utils.py` module.
### Exercise 1: replace letters
For the project dataset, you'll be working with some IP addresses
where some of the digits have been replaced with letters for
anonymization.
For some calculations, we need only digits, so we'll replace any
letters with "0". Complete the following regex code to get back
"101.1.1.000":
```python
import re
ipaddr = "101.1.1.abc"
re.sub(????, ????, ipaddr)
```
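As a warm-up on a different task, the sketch below shows the general shape of a `re.sub` call: pattern first, then replacement, then the input string. The phone-number text is a made-up example, not project data.

```python
import re

# Hypothetical input, unrelated to the project data.
text = "call 555-1234"

# re.sub(pattern, replacement, string) returns a new string in which every
# non-overlapping match of the pattern has been replaced.
masked = re.sub(r"\d", "#", text)
print(masked)  # call ###-####
```

The original string is left unchanged; `re.sub` always returns a new one.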
### Exercise 2: integer IPs
Note: if you haven't installed `netaddr` yet for P5, install it from the command line:
```
pip3 install netaddr
```
IP addresses are commonly represented as four-part numbers, like
"34.67.75.25". To convert an address like this to an integer, you can
use the following:
```python
import netaddr
int(netaddr.IPAddress("34.67.75.25"))
```
Modify the above to look up the integer representation of your virtual
machine's IP address.
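To see where the integer comes from, here is a minimal sketch that computes it by hand: each of the four parts is one byte, so the address can be read as a base-256 number. It uses the standard library's `ipaddress` module (not `netaddr`) only to cross-check the arithmetic, and the helper name `ip_to_int` is made up for illustration.

```python
import ipaddress

def ip_to_int(addr):
    """Convert a dotted IPv4 string to an integer (illustrative helper)."""
    total = 0
    for part in addr.split("."):
        # Shift the running total left by one byte, then add the next part.
        total = total * 256 + int(part)
    return total

addr = "34.67.75.25"
by_hand = ip_to_int(addr)
# Cross-check against the standard library's conversion.
print(by_hand == int(ipaddress.IPv4Address(addr)))  # True
```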
### Exercise 3: binary search
Consider the following sorted list:
```python
L = [1, 2, 100, 222, 900]
```
Running `150 in L` would loop over every element in the list. This is
slow, and doesn't take advantage of the fact that the list is sorted.
A better strategy when we know it is sorted would be to check the
middle (100) and infer 150 must be in the second half of the list, if
it's in the list at all; no need to check the first half.
A famous algorithm that uses this strategy is called *binary search*, and it's implemented by this function: https://docs.python.org/dev/library/bisect.html#bisect.bisect
Try it:
```python
from bisect import bisect
idx = bisect(L, 150)
idx
```
You should get `3` -- this means that if you wanted to add 150 to the
list and keep it in sorted order, you would insert 150 at index 3
(after 1, 2, and 100). This also means `L[idx-1]` is the biggest
number in the list that is **less than or equal** to 150.
*What will bisect of `L` be for 225?* Write down your prediction, then
run code to check your answer.
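For intuition, here is a rough sketch of the halving strategy written by hand. `insertion_index` is a made-up name, and this is not presented as how the standard library implements `bisect`; it only aims to return the same index for a sorted list.

```python
from bisect import bisect

def insertion_index(sorted_list, target):
    """Return the index where target could be inserted (after any equal
    items) so that sorted_list stays sorted. Illustrative sketch only."""
    lo, hi = 0, len(sorted_list)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] <= target:
            lo = mid + 1  # target belongs to the right of mid
        else:
            hi = mid      # target belongs at mid or to its left
    return lo

L = [1, 2, 100, 222, 900]
print(insertion_index(L, 150))                    # 3
print(insertion_index(L, 150) == bisect(L, 150))  # True
```

Each pass through the loop discards half of the remaining range, so the number of comparisons grows with the logarithm of the list length rather than with the length itself.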
### Exercise 4: country/region lookup
You can generally guess what country or region a computer is in based
on its IP address. Read in `ip2location.csv` from the project to see
the IP ranges assigned to each region (this is borrowed from
```python
html = """
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001000209&action=getcompany">0001000209 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>043291176</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br/>Type: <strong>SC 13G</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=005-48473&action=getcompany"><strong>005-48473</strong></a> | Film No.: <strong>161413579</strong><br/><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&SIC=6199&owner=include">6199</a></b> Finance Services<br/>Office of Finance</p>
</div>
<div class="clear"></div>
</div>
<div id="filerDiv">
<div class="mailer">Mailing Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
</div>
<div class="mailer">Business Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001050470&action=getcompany">0001050470 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>232772200</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br/>Type: <strong>SC 13G</strong></p>
</div>
<div class="clear"></div>
</div>
<!-- END FILER DIV -->
</div>"""
```
### Exercise 1: dates
Write a regular expression finding all the dates with the `YYYY-MM-DD`
format in the document. It's OK if you have extra groups within your
matches.
```python
import re
re.findall(r"????", html)
```
You might have some extra matches like `0470-16-00` that are clearly not dates. Add filtering so that a 4-digit number only counts as a year if it has the form 19XX or 20XX. Similarly, to rule out matches like `2044-16-00`, only count a 2-digit number as a month if it is between 01 and 12. You could do this in the regular expression itself, or in additional Python code that loops over the results of the regular expression.
### Exercise 2: Standard Industrial Classification (SIC) codes
Take a look at the industry codes defined here: https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list
Write a regular expression to find numbers following the text
"SIC=" in `html`.
### Exercise 3: Addresses
To find addresses, we'll look at the HTML. We want to find the
contents between `<div class="mailer">` and `</div>`. We want the
non-greedy version (meaning that if there are multiple `</div>`
instances, we want to match the first one possible).
Write a regular expression to do this:
```python
for addr_html in re.findall(????, html):
    print(addr_html)
```
Note, the text between the opening and closing `div` tags generally
spans multiple lines. Remember that `.` does not match newlines. One
way to match anything is with `[\s\S]` (meaning whitespace or not
whitespace; in other words, everything). If you get very stuck on
this one, you can scroll past the expected output to see one solution
for finding `addr_html` matches.
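The difference between `.` and `[\s\S]`, and between greedy and non-greedy matching, is easier to see on a toy string. The sketch below uses a made-up `<b>` snippet rather than the filing HTML.

```python
import re

# Hypothetical two-line snippet, not taken from the filing HTML.
text = "<b>first\nline</b> and <b>second</b>"

# "." stops at the newline, so only the single-line element matches.
lazy_dot = re.findall(r"<b>(.+?)</b>", text)
print(lazy_dot)    # ['second']

# [\s\S] matches any character, newline included; +? stops at the first </b>.
lazy_any = re.findall(r"<b>([\s\S]+?)</b>", text)
print(lazy_any)    # ['first\nline', 'second']

# Greedy + runs on to the last </b>, swallowing everything in between.
greedy_any = re.findall(r"<b>([\s\S]+)</b>", text)
print(greedy_any)  # ['first\nline</b> and <b>second']
```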
Expected output of the `addr_html` loop:
```
Mailing Address
<span class="mailerAddress">437 MADISON AVENUE</span>
<span class="mailerAddress">38TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
Business Address
<span class="mailerAddress">437 MADISON AVE 38 TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
<span class="mailerAddress">2123282153</span>
Mailing Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
Business Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
<span class="mailerAddress">312-460-2443</span>
```
Now, extend your above loop so that a further regex search is
conducted for address lines within each `addr_html` match. Address
lines are between `<span class="mailerAddress">` and `</span>`:
```python
for addr_html in re.findall(r'<div class="mailer">([\s\S]+?)</div>', html):
    lines = []
    for line in re.findall(????, addr_html):
        lines.append(line.strip())
    print("\n".join(lines))
    print()
```
## `Filing` class
Add a `Filing` class to your `edgar_utils.py` like this:
```python
class Filing:
    def __init__(self, html):
        self.dates = ????
        self.sic = ????
        self.addresses = ????

    def state(self):
        return "TODO"
```
`html` will be an HTML string, much like the one you were working with
in the practice exercises. Fill in the missing parts and add additional
lines as needed. Much of the code from the practice exercises will be
useful here.
* `dates` should be a list of dates in the `YYYY-MM-DD` format that appear in the HTML (only count years of the form 19XX or 20XX, with reasonable months and days).
* `sic` should be an `int` indicating the Standard Industrial Classification. It should be `None` if this doesn't appear.
* `addresses` should be a list of addresses found in the HTML. Each address will contain the address lines separated by newlines, but otherwise there shouldn't be unnecessary whitespace. For example: `['437 MADISON AVENUE\n38TH FLOOR\nNEW YORK NY 10022','155 NORTH WACKER DRIVE\nSUITE 4600\nCHICAGO IL 60606']`. Note that this only illustrates the format; it is **not** the answer.
* `state()` should loop over the addresses. If it finds one that contains two capital letters followed by 5 digits (for example, `WI 53706`), it should return what appears to be a state abbreviation (for example, `WI`). You don't need to check that the abbreviation is a valid state. If nothing that looks like a state abbreviation appears, return `None`. Note: the match must be exactly 2 capital letters followed by exactly 5 digits; for example, `BOX 12345` must not yield `OX`.