# Zip Files
As you deal with bigger datasets, those datasets will often be
compressed. Compressed means that the format takes advantage of
patterns and redundancy in data to store a bigger file in less space.
For example, say you have a string like this: "HAHAHAHAHAHAHAHAHAHA".
You should imagine inventing a notation for representing that string
with fewer characters (maybe something like "HA{x10}").
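To make the idea concrete, here is a toy sketch of such a notation. The function name `encode_repeats` and the `{xN}` format are made up for illustration only; real formats like zip use much more general algorithms.

```python
def encode_repeats(text, unit):
    """Toy encoder: if `text` is just `unit` repeated n times, return 'unit{xn}'."""
    n, remainder = divmod(len(text), len(unit))
    if n > 1 and remainder == 0 and unit * n == text:
        return f"{unit}{{x{n}}}"
    return text  # no repetition found, so store the text unchanged

short = encode_repeats("HA" * 10, "HA")
print(short, "uses", len(short), "characters instead of", len("HA" * 10))
```

Text without the repeated pattern comes back unchanged, which is why compression only helps when the data has redundancy.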
Zip is one common compression format. In addition to compressing
files, .zips often bundle multiple files together. In the past, you
would have run `unzip` in the terminal before starting to write your
code. However, it is also possible to directly read the contents of a
`.zip` file in Python. Doing so is often more convenient; the code
may also quite possibly be faster.
## Generating a .zip
To create an `example.zip` file, run the following (don't worry,
understanding this particular snippet isn't expected for this lab):
```python
import pandas as pd
from zipfile import ZipFile, ZIP_DEFLATED
from io import TextIOWrapper
with open("hello.txt", "w") as f:
    f.write("hello world")
with ZipFile("example.zip", "w", compression=ZIP_DEFLATED) as zf:
    with zf.open("hello.txt", "w") as f:
        f.write(bytes("hello world", "utf-8"))
    with zf.open("ha.txt", "w") as f:
        f.write(bytes("ha"*10000, "utf-8"))
    with zf.open("bugs.csv", "w") as f:
        pd.DataFrame([["Mon",7], ["Tue",4], ["Wed",3], ["Thu",6], ["Fri",9]],
                     columns=["day", "bugs"]).to_csv(TextIOWrapper(f), index=False)
```
## ZipFile
We can access the file by using the `ZipFile` type, imported from the `zipfile` module:
```python
from zipfile import ZipFile
```
ZipFiles are context managers, much like file objects. Let's try
creating one using `with`, then loop over info about the files inside
using [this
method](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.infolist):
```python
with ZipFile('example.zip') as zf:
    for info in zf.infolist():
        print(info)
```
Let's print off the size and compression ratio (uncompressed size divided by compressed size) of each file:
```python
with ZipFile('example.zip') as zf:
    for info in zf.infolist():
        orig_mb = info.file_size / (1024**2) # there are 1024**2 bytes in a MB
        ratio = info.file_size / info.compress_size
        s = "file {name:s}, {mb:.3f} MB (uncompressed), {ratio:.1f} compression ratio"
        print(s.format(name=info.filename, mb=orig_mb, ratio=ratio))
```
Take a minute to look through -- which file is largest? What is its
compression ratio?
The compression ratio is the original size divided by the compressed
size, so bigger means more savings. `ha.txt` contains "hahahahaha..."
(repeated 10 thousand times), which is highly compressible.
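As a rough sanity check that does not depend on `example.zip`, the sketch below compresses the same repetitive text in memory with the standard `zlib` module (which implements DEFLATE) and computes the ratio the same way. The exact compressed size depends on the compressor settings, so treat the number it prints as illustrative.

```python
import zlib

raw = ("ha" * 10000).encode("utf-8")   # 20,000 bytes of highly repetitive text
packed = zlib.compress(raw)            # DEFLATE-compressed bytes
ratio = len(raw) / len(packed)         # original size divided by compressed size
print(len(raw), "bytes ->", len(packed), "bytes, ratio about", round(ratio, 1))
```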
As practice, compute the overall compression ratio (sum of all
uncompressed sizes divided by sum of all compressed sizes) -- it ought
to be about 216.
## Binary Open
Ok, forget zips for a minute, and run the following:
```python
with open("hello.txt", "r") as f:
    data1 = f.read()
with open("hello.txt", "rb") as f:
    data2 = f.read()
print(type(data1), type(data2))
```
What type does `f.read()` return if we use "r" for the mode? What
about "rb"?
The "b" stands for "binary" or "bytes", so we get back type `bytes`.
If we open in text mode (the default), as in the first open, the bytes
automatically get translated to strings, using some encoding (like
"utf-8") that assigns characters to byte-represented numbers.
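The translation between strings and bytes can also be done by hand with `str.encode` and `bytes.decode`. This small sketch uses a made-up string with one non-ASCII character to show that the number of characters and the number of bytes need not match:

```python
text = "héllo"                   # 5 characters
raw = text.encode("utf-8")       # str -> bytes; "é" takes two bytes in UTF-8
back = raw.decode("utf-8")       # bytes -> str
print(type(raw), len(text), len(raw), back == text)
```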
Run this:
```python
from io import TextIOWrapper
```
`TextIOWrapper` objects "wrap" binary file objects and convert bytes
to characters on the fly. For example, try the following:
```python
with open("hello.txt", "rb") as f:
    tio = TextIOWrapper(f)
    data3 = tio.read()
print(type(data3))
```
Even though we open in binary mode, we get a string thanks to
`TextIOWrapper`! You can think of the example where we read into
`data1` as a shorthand for what we did to get `data3`.
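The same wrapping works for any binary stream, not just files on disk. As a sketch, the snippet below uses `io.BytesIO` as an in-memory stand-in for a file opened with "rb" (the variable names are just for illustration):

```python
from io import BytesIO, TextIOWrapper

binary_stream = BytesIO("hello world".encode("utf-8"))   # behaves like a file opened in "rb" mode
text_stream = TextIOWrapper(binary_stream, encoding="utf-8")
data = text_stream.read()                                # decoded on the fly
print(type(data), data)
```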
## Reading Files
A ZipFile has a method named `open` that works a lot like the `open`
function you're familiar with. A ZipFile is a context manager, and so
is the object returned by `ZipFile.open(...)`, so we'll end up with
nested `with` statements to make sure everything gets closed up
properly. Let's take a look at the compressed `hello.txt` file:
```python
with ZipFile('example.zip') as zf:
    with zf.open("hello.txt", "r") as f:
        print(f.read())
```
Woah, why do we get `b'hello world'`? For regular files, "r" mode
defaults to reading text, but for files inside a zip, it defaults to
binary mode, so we got back bytes.
TextIOWrapper saves the day:
```python
with ZipFile('example.zip') as zf:
    with zf.open("hello.txt", "r") as f:
        tio = TextIOWrapper(f)
        print(tio.read())
```
With regular files, TextIOWrapper is a bit useless (why not just open
with "r" instead of "rb"?), but for zips, it is crucial.
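Here is a self-contained sketch of the same pattern that does not need `example.zip`: it builds a small zip in memory and then reads one member line by line as text. The member name `notes.txt` and its contents are hypothetical.

```python
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

buffer = BytesIO()                                    # in-memory stand-in for a .zip on disk
with ZipFile(buffer, "w") as zf:
    zf.writestr("notes.txt", "line one\nline two\n")  # hypothetical member

with ZipFile(buffer) as zf:
    with zf.open("notes.txt") as f:                   # f yields bytes
        lines = [line.strip() for line in TextIOWrapper(f, encoding="utf-8")]
print(lines)
```

Looping over the `TextIOWrapper` gives one string per line, just like looping over a regular text file.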
## Pandas
Pandas can read a DataFrame even from a binary stream, so you can do this:
```python
with ZipFile('example.zip') as zf:
    with zf.open("bugs.csv") as f:
        df = pd.read_csv(f)
df
```
# Lab 3: Files
1. Share with your group: *What is your favorite day of the year?*
2. Practice [visual complexity analysis](./big-o)
3. Practice Python [files and JSON](./files-json)
4. Learn how to read [zip files](./files-zip) in Python
5. Start the [loans.py](./loans) module that will help you complete P2
# Loan Module
In these exercises, you'll start writing a `loans.py` module with two
Python classes you'll use for P2. It's OK if you don't finish these
classes during lab time (you can finish them with your group or alone
later when working on P2).
## loans.py
In Jupyter, do the following:
1. Go to P2
2. Right click in the file explorer and create a "New File"
3. Name it loans.py
4. Open it
Using a .py module is easy -- just run `import some_mod` to run
`some_mod.py`, loading any functions or classes it has.
In your `loans.py`, add a print and a function like this:
```python
print("Hello from loans.py!")
def hey():
    print("Hey!")
```
Now let's import it into your project notebook. Create a `p2.ipynb` in
the same directory as `loans.py`.
Run `import loans` in a cell. You should see the first print!
You can also call the `hey` function now. Try it:
```python
loans.hey()
```
If you change `hey` in `loans.py`, the new version won't automatically
reload into the notebook. Add this so it will auto-reload:
```
%load_ext autoreload
%autoreload 2
```
Note this doesn't work all the time (if there's a bug in your
loans.py, you may need to do a Restart & Run All in the notebook after
fixing your module).
Feel free to delete the print statement and `hey` function from
`loans.py` (those were just for experimentation -- we'll be adding
other content to `loans.py`).
## 1. `Applicant` class
We'll want to create a class to represent people who apply for loans. Start with this in `loans.py`:
```python
class Applicant:
    def __init__(self, age, race):
        self.age = age
        self.race = set()
        for r in race:
            ????
```
We'll be using HMDA loan data
(https://www.ffiec.gov/hmda/pdf/2023guide.pdf), which uses numeric
codes to represent race. Here are the codes from the documentation,
recorded in a dictionary:
```python
race_lookup = {
"1": "American Indian or Alaska Native",
"2": "Asian",
"3": "Black or African American",
"4": "Native Hawaiian or Other Pacific Islander",
"5": "White",
"21": "Asian Indian",
"22": "Chinese",
"23": "Filipino",
"24": "Japanese",
"25": "Korean",
"26": "Vietnamese",
"27": "Other Asian",
"41": "Native Hawaiian",
"42": "Guamanian or Chamorro",
"43": "Samoan",
"44": "Other Pacific Islander"
}
```
Paste the `dict` in your `loans.py` module, and use it to complete
your `__init__` constructor. The loop should add entries in the
`race` parameter to the `self.race` attribute of the object,
converting from the numeric codes to text in the process. The `race`
attribute is a set because applicants often identify with multiple
options.
Simply skip over any entries in the `race` parameter that don't appear
in the `race_lookup` dict (e.g., we'll see and skip "6" later because
that code indicates a missing value).
Test the code you wrote in `loans.py` from your `p2.ipynb` notebook to
make sure the `Applicant.__init__` constructor properly fills the
`race` set.
```python
applicant = loans.Applicant("20-30", ["1", "2", "3"])
applicant.race
```
You should see this set:
```python
{'American Indian or Alaska Native', 'Asian', 'Black or African American'}
```
### `__repr__`
Add a `__repr__` method to your `Applicant` class:
```python
    def __repr__(self):
        ????
        return ????
```
Putting `applicant` at the end of a cell or printing `repr(applicant)` should show this:
```
Applicant('20-30', ['American Indian or Alaska Native', 'Asian', 'Black or African American'])
```
Note: The `race` attribute should be sorted lexicographically.
### `lower_age`
You might notice that ages are given as strings rather than ints
because we need to support ranges (like "20-30").
Add a `lower_age` method that returns the lower int of an applicant's age range:
```python
    def lower_age(self):
        return ????
```
It should also support ages like "<75" (should just return the int
`75`) and ">25" (should just return the int `25`).
Try your method (you should get the int `20` since the age is "20-30"):
```python
applicant.lower_age()
```
Hints: you could use `.replace` to get rid of unhelpful characters
(like "<" and ">"). After that, splitting on "-" could help you find
the first number (it's OK to split on a character that doesn't appear
in a string -- you just get a list with one entry).
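If you haven't used these string methods much, here is a quick sketch of how they behave on some unrelated, made-up strings (a price and a clock time, not the loan data):

```python
price = "$1,200".replace("$", "").replace(",", "")   # strip characters we don't want
parts_with_sep = "10:30".split(":")                  # separator present: two pieces
parts_without_sep = "1030".split(":")                # separator absent: one piece
print(int(price), parts_with_sep, parts_without_sep)
```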
### `__lt__`
Recall that `__lt__` ("less than") lets you control what happens when
two objects get compared.
`obj1 < obj2` automatically becomes `obj1.__lt__(obj2)`, so you can
write `__lt__` to return a True/False, indicating whether `obj1` is
less.
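As a warm-up, here is a minimal sketch of `__lt__` on an unrelated, hypothetical `Box` class that compares by volume. It is only meant to show the mechanics; your `Applicant` version will compare something else.

```python
class Box:
    def __init__(self, volume):
        self.volume = volume

    def __repr__(self):
        return f"Box({self.volume})"

    def __lt__(self, other):
        # called for `self < other`; must return True or False
        return self.volume < other.volume

small, big = Box(1), Box(8)
print(small < big)
print(sorted([Box(5), big, small]))   # sorting relies on __lt__ too
```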
Complete the following for your `Applicant` class:
```python
    def __lt__(self, other):
        return ????
```
Comparisons should be based on age. Python sorting will also use your
`__lt__` method. Try it:
```python
sorted([
    loans.Applicant(">75", ["43", "44"]),
    loans.Applicant("20-30", ["1", "3"]),
    loans.Applicant("35-44", ["22"]),
    loans.Applicant("<25", ["5"]),
])
```
You should get this order:
```python
[Applicant('20-30', ['American Indian or Alaska Native', 'Black or African American']),
Applicant('<25', ['White']),
Applicant('35-44', ['Chinese']),
Applicant('>75', ['Other Pacific Islander', 'Samoan'])]
```
## 2. `Loan` class
For the project, we'll use loan data from this site:
https://cfpb.github.io/hmda-platform/#hmda-api-documentation.
Loan applications are described with dictionaries, like this:
```python
values = {'activity_year': '2021', 'lei': '549300Q76VHK6FGPX546', 'derived_msa-md': '24580', 'state_code': 'WI','county_code': '55009', 'census_tract': '55009020702', 'conforming_loan_limit': 'C', 'derived_loan_product_type': 'Conventional:First Lien', 'derived_dwelling_category': 'Single Family (1-4 Units):Site-Built', 'derived_ethnicity': 'Not Hispanic or Latino', 'derived_race': 'White', 'derived_sex': 'Joint', 'action_taken': '1', 'purchaser_type': '1', 'preapproval': '2', 'loan_type': '1', 'loan_purpose': '31', 'lien_status': '1', 'reverse_mortgage': '2', 'open-end_line_of_credit': '2', 'business_or_commercial_purpose': '2', 'loan_amount': '325000.0', 'loan_to_value_ratio': '73.409', 'interest_rate': '2.5', 'rate_spread': '0.304', 'hoepa_status': '2', 'total_loan_costs': '3932.75', 'total_points_and_fees': 'NA', 'origination_charges': '3117.5', 'discount_points': '', 'lender_credits': '', 'loan_term': '240', 'prepayment_penalty_term': 'NA', 'intro_rate_period': 'NA', 'negative_amortization': '2', 'interest_only_payment': '2', 'balloon_payment': '2', 'other_nonamortizing_features': '2', 'property_value': '445000', 'construction_method': '1', 'occupancy_type': '1', 'manufactured_home_secured_property_type': '3', 'manufactured_home_land_property_interest': '5', 'total_units': '1', 'multifamily_affordable_units': 'NA', 'income': '264', 'debt_to_income_ratio': '20%-<30%', 'applicant_credit_score_type': '2', 'co-applicant_credit_score_type': '9', 'applicant_ethnicity-1': '2', 'applicant_ethnicity-2': '', 'applicant_ethnicity-3': '', 'applicant_ethnicity-4': '', 'applicant_ethnicity-5': '', 'co-applicant_ethnicity-1': '2', 'co-applicant_ethnicity-2': '', 'co-applicant_ethnicity-3': '', 'co-applicant_ethnicity-4': '', 'co-applicant_ethnicity-5': '', 'applicant_ethnicity_observed': '2', 'co-applicant_ethnicity_observed': '2', 'applicant_race-1': '5', 'applicant_race-2': '', 'applicant_race-3': '', 'applicant_race-4': '', 'applicant_race-5': '', 'co-applicant_race-1': 
'5', 'co-applicant_race-2': '', 'co-applicant_race-3': '', 'co-applicant_race-4': '', 'co-applicant_race-5': '', 'applicant_race_observed': '2', 'co-applicant_race_observed': '2', 'applicant_sex': '1', 'co-applicant_sex': '2', 'applicant_sex_observed': '2', 'co-applicant_sex_observed': '2', 'applicant_age': '35-44', 'co-applicant_age': '35-44', 'applicant_age_above_62': 'No', 'co-applicant_age_above_62': 'No', 'submission_of_application': '1', 'initially_payable_to_institution': '1', 'aus-1': '1', 'aus-2': '', 'aus-3': '', 'aus-4': '', 'aus-5': '', 'denial_reason-1': '10', 'denial_reason-2': '', 'denial_reason-3': '', 'denial_reason-4': '', 'tract_population': '6839', 'tract_minority_population_percent': '8.85999999999999943', 'ffiec_msa_md_median_family_income': '80100', 'tract_to_msa_income_percentage': '150', 'tract_owner_occupied_units': '1701', 'tract_one_to_four_family_homes': '2056', 'tract_median_age_of_housing_units': '15'}
```
Paste the above to your notebook. We want to use a dict like the above to create a `Loan` object as follows:
```python
loan = loans.Loan(values)
```
Whereas the `__init__` for `Applicant` took a few parameters, the
`__init__` for the `Loan` class will take a single parameter,
`values`, which will contain all the data necessary to set the `Loan`
attributes.
Start with the following, then modify and add code:
```python
class Loan:
    def __init__(????, values):
        self.loan_amount = values["loan_amount"]
        # add lines here
```
Requirements:
* a `Loan` object should have four attributes: `loan_amount`, `property_value`, `interest_rate`, `applicants`
* the first three attributes are floats (you'll need to convert from the strings found in `values`)
* strings like "NA" and "Exempt" that represent missing values can be `-1` when you convert to floats
* the `applicants` attribute should be a list of `Applicant` objects. Every loan has at least one applicant, with age `values["applicant_age"]` and race(s) in the multiple `values["applicant_race-????"]` entries.
* some loans have a second applicant (but no more) -- you'll know there is a second applicant when `values["co-applicant_age"] != "9999"`. In that case, `self.applicants` should contain two `Applicant` objects, with the info from the second coming from the `values["co-applicant_age"]` and `values["co-applicant_race-????"]` entries.
Manually test your `Loan` class from your notebook with a few snippets:
* `loan.interest_rate` should be `2.5`
* `loan.applicants` should be `[Applicant('35-44', ['White']), Applicant('35-44', ['White'])]`
* choose a couple more...
### `__str__` and `__repr__`
Add a `__str__` method to your `Loan` class so that `print(loan)` gives the following:
```
<Loan: 2.5% on $445000.0 with 2 applicant(s)>
```
Add a `__repr__` that returns the same string as `__str__`.
### `yearly_amounts`
The loans have details regarding payment amount and frequency in the
terms, but for simplicity, we'll ignore that here.
The `yearly_amounts` method in the `Loan` class should be a generator
that yields loan amounts, as the loan is paid off over time. Assume
that each year, a single payment is made, after interest is
calculated. **Note:** `loan.interest_rate` is a percentage. Convert it
to a decimal before using it.
```python
    def yearly_amounts(self, yearly_payment):
        # TODO: assert interest and amount are positive
        result = []
        amt = self.loan_amount
        while amt > 0:
            result.append(amt)
            # TODO: add interest rate multiplied by amt to amt
            # TODO: subtract yearly payment from amt
        return result
```
Your job:
1. Finish the TODOs
2. Test your code from the notebook. For example, you could run this from the notebook:
```python
for amt in loan.yearly_amounts(30000):
print(amt)
```
And get this:
```
325000.0
303125.0
280703.125
257720.703125
234163.720703125
210017.81372070312
185268.2590637207
159899.96554031371
133897.46467882156
107244.90129579211
79926.02382818691
51924.174423891585
23222.278784488873
```
3. Make the method a generator. Get rid of the `result` list, and instead of appending to it, yield `amt`. Make sure the loop works the same way as before in your notebook. One advantage of the generator is that the method will work even if the payment is too small (the generator will keep yielding larger amounts as the debt keeps growing). **That last step is very important to passing the P2 tests!**
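To see why a generator copes with a loop that never ends, consider this sketch with a hypothetical `doubling` generator (not part of P2). A list-building version of the same loop would never return, but the caller of a generator can simply stop asking for more values:

```python
from itertools import islice

def doubling(start):
    amt = start
    while amt > 0:        # never becomes false for a positive start
        yield amt         # hand one value to the caller, then pause here
        amt *= 2

first_five = list(islice(doubling(1), 5))   # take only the first five values
print(first_five)
```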
# BSTs (Binary Search Trees)
In this lab, you'll create a BST that can be used to look up values by
a key (it will behave a bit like a Python dict where all the dict
values are lists of values). You'll use the BST for P2.
## Basic Node and BST classes
Start by pasting+completing the following:
```python
class Node():
    def __init__(self, key):
        self.key = ????
        self.values = []
        self.left = None
        ????
```
Let's create a `BST` class with an `add` method that automatically
places a node in a place that preserves the search property (i.e., all
keys in the left subtree are less than the parent's key, which is less
than those in the right subtree).
Add+complete with the following. Note that this is a non-recursive
version of `add`:
```python
class BST():
    def __init__(self):
        self.root = None
    def add(self, key, val):
        if self.root == None:
            self.root = ????
        curr = self.root
        while True:
            if key < curr.key:
                # go left
                if curr.left == None:
                    curr.left = Node(key)
                curr = curr.left
            elif key > curr.key:
                # go right
                ????
                ????
                ????
            else:
                # found it!
                assert curr.key == key
                break
        curr.values.append(val)
```
## Dump
Let's add some methods to `BST` to dump out all the keys and values (note
that "__" before a method name is a hint that it is for internal use
-- methods inside the class might call `__dump`, but code outside the
class probably shouldn't):
```python
    def __dump(self, node):
        if node == None:
            return
        self.__dump(node.right)             # 1
        print(node.key, ":", node.values)   # 2
        self.__dump(node.left)              # 3
    def dump(self):
        self.__dump(self.root)
```
Try it:
```python
tree = BST()
tree.add("A", 9)
tree.add("A", 5)
tree.add("B", 22)
tree.add("C", 33)
tree.dump()
```
You should see this:
```
C : [33]
B : [22]
A : [9, 5]
```
Play around with the order of lines 1, 2, and 3 in `__dump()` above. Can you
arrange those three so that the output is in ascending alphabetical
order, by key?
## Length
Add a special method `__len__` to `Node` so that we can find the size
of a tree. Count every entry in the `.values` list of each `Node`.
```python
    def __len__(self):
        size = len(self.values)
        if self.left != None:
            size += ????
        ????
        ????
        return size
```
```python
t = BST()
t.add("B", 3)
assert len(t.root) == 1
t.add("A", 2)
assert len(t.root) == 2
t.add("C", 1)
assert len(t.root) == 3
t.add("C", 4)
assert len(t.root) == 4
```
Discuss with your neighbour: why not have a `Node.__dump(self)` method
instead of the `BST.__dump(self, node)` method?
<details>
<summary>Answer</summary>
Right now, it is convenient to check at the beginning if `node` is
None. A receiver (the `self` parameter) can't be None if the
`object.method(...)` syntax is used (you would get the
"AttributeError: 'NoneType' object has no attribute 'method'" error).
We could have a `Node.__dump(self)` method, but then we would need to do the None checks on both `.left` and `.right`, which is slightly longer.
</details>
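The difference can be seen in a tiny sketch that is independent of the BST code (the name `safe_dump` is made up for illustration). Calling a method on `None` fails before any of your code runs, whereas a function that receives `None` as an ordinary argument can check for it:

```python
node = None

try:
    node.dump()                      # method-call syntax on None
except AttributeError as e:
    message = str(e)
print(message)

def safe_dump(node):
    if node is None:                 # the check is possible because node is a plain parameter
        return "nothing to dump"
    return "dumping " + str(node)

print(safe_dump(None))
```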
## Lookups
Write a `lookup` method in `Node` that returns all the values that match a given key. Some examples:
* `t.root.lookup("A")` should return `[2]`
* `t.root.lookup("C")` should return `[1, 4]`
* `t.root.lookup("Z")` should return `[]`
Some pseudocode for you to translate to Python:
```
lookup method (takes key)
    if key matches my key, return my values
    if key is less than my key and I have a left child
        call lookup on my left child and return what it returns
    if key is greater than my key and I have a right child
        call lookup on my right child and return what it returns
    otherwise return an empty list
```
## `search.py` module
If you've been developing your `BST` and `Node` classes in a notebook,
you should now move them to a module called `search.py` in your `p2`
directory.
%% Cell type:markdown id: tags:
# This just generates random data -- look at main.ipynb to debug
%% Cell type:code id: tags:
``` python
import names
import numpy as np
import pandas as pd
```
%% Cell type:code id: tags:
``` python
df = pd.DataFrame({"name": [names.get_first_name() for i in range(10)]})
for i in range(5):
    df[f"P{i+1}"] = np.random.random(size=len(df)) * 0.15 + 0.85
df[f"Final"] = np.random.random(size=len(df)) * 0.3 + 0.7
df[f"Participation"] = np.random.random(size=len(df)) * 0.1 + 0.9
df
```
%% Output
name P1 P2 P3 P4 P5 Final \
0 Elsie 0.954955 0.913779 0.921565 0.936532 0.901380 0.928387
1 Brian 0.947351 0.952920 0.925073 0.875950 0.857365 0.938826
2 Loretta 0.958606 0.891525 0.950882 0.946470 0.989340 0.933632
3 Esther 0.985102 0.872918 0.977284 0.988530 0.930378 0.724164
4 Dawn 0.966695 0.927002 0.991770 0.959826 0.895863 0.859567
5 Crystal 0.859427 0.952088 0.965462 0.899423 0.995269 0.989677
6 Clarence 0.851693 0.926668 0.906261 0.880833 0.932816 0.834454
7 Virginia 0.928037 0.934979 0.874236 0.955648 0.997138 0.715540
8 Ernest 0.893501 0.959971 0.938698 0.887911 0.881159 0.978723
9 Jane 0.922780 0.964439 0.926576 0.937013 0.853827 0.700346
Participation
0 0.914671
1 0.958818
2 0.969530
3 0.995753
4 0.908425
5 0.987587
6 0.943060
7 0.959519
8 0.928907
9 0.998765
%% Cell type:code id: tags:
``` python
df.to_csv("scores.csv", index=False)
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
import pandas as pd
```
%% Cell type:code id: tags:
``` python
# Point vs Grade Distribution:
# Projects: 10% each
# Final: 30%
# Participation: 20%
df = pd.read_csv("scores.csv")
df
```
%% Output
name P1 P2 P3 P4 P5 Final \
0 Elsie 0.954955 0.913779 0.921565 0.936532 0.901380 0.928387
1 Brian 0.947351 0.952920 0.925073 0.875950 0.857365 0.938826
2 Loretta 0.958606 0.891525 0.950882 0.946470 0.989340 0.933632
3 Esther 0.985102 0.872918 0.977284 0.988530 0.930378 0.724164
4 Dawn 0.966695 0.927002 0.991770 0.959826 0.895863 0.859567
5 Crystal 0.859427 0.952088 0.965462 0.899423 0.995269 0.989677
6 Clarence 0.851693 0.926668 0.906261 0.880833 0.932816 0.834454
7 Virginia 0.928037 0.934979 0.874236 0.955648 0.997138 0.715540
8 Ernest 0.893501 0.959971 0.938698 0.887911 0.881159 0.978723
9 Jane 0.922780 0.964439 0.926576 0.937013 0.853827 0.700346
Participation
0 0.914671
1 0.958818
2 0.969530
3 0.995753
4 0.908425
5 0.987587
6 0.943060
7 0.959519
8 0.928907
9 0.998765
%% Cell type:code id: tags:
``` python
class Student:
    def __init__(self,name):
        self.name = name
        self.grade = 0
    def compute_grade(self, category, points):
        grade = 0
        if category == "Participation":
            grade += points*20
        if "P" in category:
            grade += points*10
        if category == "Final":
            grade += points*30
        self.grade += self.grade
    def get_grade(self):
        return self.grade
```
%% Cell type:code id: tags:
``` python
cs320 = {}
for i in range(len(df)):
    student = Student(df["name"][i])
    for col in df.columns:
        student.compute_grade(col, df.at[i, col])
    cs320[student.name] = student.get_grade()
```
%% Cell type:code id: tags:
``` python
# max score should be 100; Crystal should have highest score, about 96.16
cs320
```
%% Output
{'Elsie': 0,
'Brian': 0,
'Loretta': 0,
'Esther': 0,
'Dawn': 0,
'Crystal': 0,
'Clarence': 0,
'Virginia': 0,
'Ernest': 0,
'Jane': 0}
%% Cell type:markdown id: tags:
# Hints
Debugging is about asking good questions related to the issue, then finding answers to those questions (often with print statements). Some good questions to ask here:
* what line is supposed to update `self.grade` (which appears to remain zero, incorrectly)? Does this line run? You could add a `print("DEBUG")` before to find out.
* what is getting added to `self.grade` each time `compute_grade` is called? A print can help with this question too.
* is any category getting counted more than once? You could print the category inside `compute_grade` and print "ADD" inside each `if` statement to look for double counting.
%% Cell type:code id: tags:
``` python
```
name,P1,P2,P3,P4,P5,Final,Participation
Elsie,0.9549550658129085,0.913778536271882,0.9215645170508429,0.9365319073397634,0.9013803540759843,0.9283869569882017,0.914671094034935
Brian,0.9473514567792112,0.9529196735053641,0.9250734196408118,0.8759504151368779,0.8573646255287061,0.9388263403512875,0.958818414131339
Loretta,0.9586059945221289,0.8915254116636335,0.9508819366200232,0.9464696745374475,0.9893403649836676,0.933631602588951,0.9695295462457724
Esther,0.9851022156424294,0.8729183336955997,0.977283721063346,0.9885295844870531,0.9303782853727577,0.7241638599763238,0.995752825762694
Dawn,0.9666952120841795,0.9270022970580513,0.9917699643450418,0.9598264328786889,0.8958627959043414,0.8595672515103249,0.9084248987279755
Crystal,0.8594269068766791,0.9520883722829931,0.965461916205357,0.8994228809152749,0.9952691008971792,0.9896774016995538,0.9875868086271352
Clarence,0.8516934263744823,0.926667638882182,0.9062605836013653,0.8808329786074112,0.9328156387216657,0.8344540588006799,0.9430598375038679
Virginia,0.9280369533547376,0.9349792379562916,0.8742357364765729,0.9556482581209289,0.9971375119681434,0.7155395489950207,0.9595187884592306
Ernest,0.893501309690879,0.9599711435403204,0.9386984759172361,0.8879113516975059,0.8811594481923767,0.9787231284971425,0.9289073086234929
Jane,0.9227803346581137,0.9644389789471546,0.9265759728084851,0.9370125768424054,0.8538274635091643,0.7003457399712552,0.9987653286592659
# Lab 4: BST
1. We have the detailed mortgage dataset we're using for P2 thanks to the 1975 Home Mortgage Disclosure Act. Discuss with your group: *If you could pass a law requiring the collection and release of a new dataset, what data would you choose?* Feel free to answer based on what you think would be fun or interesting, or you can think about how your dataset might bring more transparency to a societal issue (like how the HMDA data makes it easier to monitor for discriminatory lending practices).
2. Inside this folder, there is a notebook `debug/self/main.ipynb`. Open it and fix the bugs.
3. Create a [binary search tree](./bst-groups) for use in P2.
# Screenshot Requirement
1. A screenshot showing the successfully debugged `main.ipynb` notebook.
2. A screenshot of your binary search tree implementation.
# Inheritance and DFS
## Inheritance
Paste and run the following code in a new Python notebook called `debug.ipynb` in your `p3` folder:
```python
class Parent:
    def twice(self):
        self.message()
        self.message()
    def message(self):
        print("parent says hi")
class Child:
    def message(self):
        print("child says hi")
c = Child()
```
Modify `Child` so that it inherits from `Parent`.
What do you think will be printed if you call `c.twice()`? Discuss
with your group, then run it to find out.
When `self.some_method(...)` is called, and there are multiple methods
named `some_method` in your program, the type of `self` (the object the
method was originally called on) is what matters for determining which
one runs. It doesn't matter where the `self.some_method(...)` call is
(it could be in any method).
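The rule can be checked with a small sketch using hypothetical `Shape` and `Circle` classes (unrelated to the project). The methods return strings instead of printing so the results are easy to compare:

```python
class Shape:
    def describe(self):
        # which name() runs depends on the type of self, not on this class
        return "I am a " + self.name()

    def name(self):
        return "shape"

class Circle(Shape):
    def name(self):
        return "circle"

print(Shape().describe())
print(Circle().describe())   # describe() is inherited, but name() comes from Circle
```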
## GraphSearcher
Copy and paste the following starter code (which you'll build on in the project):
```python
class GraphSearcher:
    def __init__(self):
        self.visited = set()
        self.order = []
    def visit_and_get_children(self, node):
        """ Record the node value in self.order, and return its children
        param: node
        return: children of the given node
        """
        raise Exception("must be overridden in sub classes -- don't change me here!")
    def dfs_search(self, node):
        # 1. clear out visited set and order list
        # 2. start recursive search by calling dfs_visit
    def dfs_visit(self, node):
        # 1. if this node has already been visited, just `return` (no value necessary)
        # 2. mark node as visited by adding it to the set
        # 3. call self.visit_and_get_children(node) to get the children
        # 4. in a loop, call dfs_visit on each of the children
```
The graphs we search on come in many shapes and formats
(e.g. matrices, files or web), but it would be nice if we could use
the same depth-first search (DFS) code when we want to search
different kinds of graphs. Therefore, we would like to implement a
base class `GraphSearcher` and implement the DFS algorithm in it.
For our purposes, we aren't using DFS to find a specific path. We
just want to see what nodes are reachable from a given starting
`node`, so these methods don't need to return any value. Your job is
to replace the comments in `dfs_search` and `dfs_visit` with code
(some comments may require a couple lines of code).
The `dfs_visit` method will call `visit_and_get_children` to record
the node value and determine the children of a given node. Subclasses
of `GraphSearcher` can override `visit_and_get_children` to lookup the
children of a node in different kinds of graphs (e.g. matrices, files
or web).
Try your code:
```python
g = GraphSearcher()
g.dfs_search("A")
```
You should get an exception. The purpose of `GraphSearcher` is not to
create objects directly; it is to let other classes inherit
`dfs_search` (we'll do the inheritance soon).
## Matrix Format
Paste and run the following:
```python
import pandas as pd
df = pd.DataFrame([
[0,1,0,1],
[0,0,1,0],
[0,0,0,1],
[0,0,1,0],
], index=["A", "B", "C", "D"], columns=["A", "B", "C", "D"])
df
```
A grid of ones and zeros like this is a common way to represent
directed graphs. A `1` in the "C" column of the "B" row means that
there is an edge from node B to node C.
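The same reading rule (row = source, column = destination) can be sketched without pandas on a smaller, hypothetical three-node graph. The labels and matrix below are made up and are not the graph from the DataFrame above:

```python
labels = ["X", "Y", "Z"]
matrix = [
    [0, 1, 1],   # row X: edges X -> Y and X -> Z
    [0, 0, 1],   # row Y: edge Y -> Z
    [0, 0, 0],   # row Z: no outgoing edges
]

# a 1 at (row r, column c) means an edge from labels[r] to labels[c]
edges = [
    (labels[r], labels[c])
    for r, row in enumerate(matrix)
    for c, has_edge in enumerate(row)
    if has_edge
]
print(edges)
```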
Try drawing a directed graph on a piece of paper based on the above
grid.
`df.loc["????"]` looks up a row in a DataFrame. Use it to lookup the
children of node B.
Complete the following to print all the children of "B" (should only be "C"):
```python
for node, has_edge in df.loc["B"].items():
    if ????:
        print(????)
```
Let's create a class that inherits from `GraphSearcher` and works with
graphs represented as matrices:
```python
class MatrixSearcher(????):
    def __init__(self, df):
        super().????() # call constructor method of parent class
        self.df = df
    def visit_and_get_children(self, node):
        # TODO: Record the node value in self.order
        children = []
        # TODO: use `self.df` to determine what children the node has and append them
        return children
```
Complete the `????` and `TODO` parts. Test it, checking what nodes
are reachable from each starting point:
```python
m = MatrixSearcher(df)
m.dfs_search(????)
m.order
```
From "A", for example, `m.order` should be `['A', 'B', 'C', 'D']`. Look
back at the picture you drew of the graph and make sure you're getting
what you expect when starting from other nodes.
## scrape.py
If you've been doing this work in a notebook, you should now move your
code to a new module called `scrape.py` in your `p3` directory.
# DFS vs. BFS
In this lab, you'll get practice with depth-first search and
breadth-first search with some interactive exercises.
Start a new notebook on your virtual machine, then paste+run this code
in a cell (you don't need to read it):
```python
from IPython.display import display, HTML
from graphviz import Digraph

class test_graph:
    def __init__(self):
        self.nodes = {}
        self.traverse_order = None # in what order were nodes checked?
        self.next_guess = 0
        self.colors = {}

    def node(self, name):
        name = str(name).upper()
        self.nodes[name] = Node(self, name)

    def edge(self, src, dst):
        src, dst = str(src).upper(), str(dst).upper()
        for name in [src, dst]:
            if not name in self.nodes:
                self.node(name)
        self.nodes[src].children.append(self.nodes[dst])

    def _repr_svg_(self):
        g = Digraph(engine='neato')
        for n in self.nodes:
            g.node(n, fillcolor=self.colors.get(n, "white"), style="filled")
            children = self.nodes[n].children
            for i, child in enumerate(children):
                g.edge(n, child.name, penwidth=str(len(children) - i), len="1.5")
        return g._repr_image_svg_xml()

    def dfs(self, src, dst):
        src, dst = str(src).upper(), str(dst).upper()
        self.traverse_order = []
        self.next_guess = 0
        self.colors = {}
        self.visited = set()
        self.path = self.nodes[src].dfs(dst)
        display(HTML("now call .visit(???) to identify the first node explored"))
        display(self)

    def bfs(self, src, dst):
        src, dst = str(src).upper(), str(dst).upper()
        self.traverse_order = []
        self.next_guess = 0
        self.colors = {}
        self.path = self.nodes[src].bfs(dst)
        display(HTML("now call .visit(???) to identify the first node explored"))
        display(self)

    def visit(self, name):
        name = str(name).upper()
        if self.traverse_order == None:
            print("please call dfs or bfs first")
            return # nothing to check until a search has been run
        if self.next_guess == len(self.traverse_order):
            print("no more nodes to explore")
            return
        self.colors = {}
        for n in self.traverse_order[:self.next_guess]:
            self.colors[n] = "yellow"
        if name == self.traverse_order[self.next_guess]:
            display(HTML("Correct..."))
            self.colors[name] = "yellow"
            self.next_guess += 1
        else:
            display(HTML("<b>Oops!</b> Please guess again."))
            self.colors[name] = "red"
        display(self)
        if self.next_guess == len(self.traverse_order):
            if self.path == None:
                display(HTML("You're done, there is no path!"))
            else:
                seq = input("What path was found? [enter nodes, comma separated]: ")
                seq = tuple(map(str.strip, seq.upper().split(",")))
                if seq == tuple(map(str.upper, self.path)):
                    print("Awesome!!!")
                else:
                    print("actually, expected was: ", ",".join(self.path))

class Node:
    def __init__(self, graph, name):
        self.graph = graph
        self.name = name
        self.children = []

    def __repr__(self):
        return "node %s" % self.name

    def dfs(self, dst):
        if self.name in self.graph.visited:
            return None
        self.graph.traverse_order.append(self.name)
        self.graph.visited.add(self.name)
        if self.name == dst:
            return (self.name, )
        for child in self.children:
            childpath = child.dfs(dst)
            if childpath:
                return (self.name, ) + childpath
        return None

    def backtrace(self):
        nodes = []
        node = self
        while node != None:
            nodes.append(node.name)
            node = node.back
        return tuple(reversed(nodes))

    def bfs(self, dst):
        added = set()
        todo = [self]
        self.back = None
        added.add(self.name)
        while len(todo) > 0:
            curr = todo.pop(0)
            self.graph.traverse_order.append(curr.name)
            if curr.name == dst:
                return curr.backtrace()
            else:
                for child in curr.children:
                    if not child.name in added:
                        todo.append(child)
                        child.back = curr
                        added.add(child.name)
        return None
```
## Problem 1 [4-node, DFS]
Paste the following to a cell:
```python
g = test_graph()
g.edge(1, 2)
g.edge(4, 3)
g.edge(1, 3)
g.edge(2, 4)
g
```
It should look something like this:
<img src="1.png" width=300>
Node 1 has two children: nodes 2 and 3. The thicker line to node 2
indicates node 2 is in the `children` list before node 3.
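Child order matters because both searches try a node's children in list order. The sketch below is independent of the `test_graph` class above and uses a hypothetical dict-based graph with made-up node names. Unlike the lab's `dfs` and `bfs`, it traverses everything reachable instead of stopping at a destination:

```python
from collections import deque

# Hypothetical graph: each key maps to its children, in list order.
graph = {
    "P": ["Q", "R"],
    "Q": ["S"],
    "R": ["S"],
    "S": [],
}

def dfs_order(graph, start):
    """Visit order for a recursive depth-first traversal."""
    order, visited = [], set()
    def visit(node):
        if node in visited:
            return
        visited.add(node)
        order.append(node)
        for child in graph[node]:
            visit(child)
    visit(start)
    return order

def bfs_order(graph, start):
    """Visit order for a queue-based breadth-first traversal."""
    order, added = [], {start}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        order.append(node)
        for child in graph[node]:
            if child not in added:
                added.add(child)
                todo.append(child)
    return order

print(dfs_order(graph, "P"))  # ['P', 'Q', 'S', 'R']
print(bfs_order(graph, "P"))  # ['P', 'Q', 'R', 'S']
```

DFS follows the first child as deep as it can before backing up, while BFS finishes every node at one distance from the start before moving farther out.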
Let's do a DFS from node 1 to 3. Paste the following:
```python
g.dfs(1, 3)
```
You should see something like this:
<img src="2.png" width=500>
Try calling the visit function with
```python
g.visit(1)
```
The visited node should look like this:
<img src="3.png" width=300>
Keep making `g.visit(????)` calls until you complete the depth-first search.
Once the target node is reached, you'll be prompted to enter the path
from source to destination. Do so and press Enter to check your
answer:
<img src="4.png" width=600>
## Problem 2 [4-node, BFS]
Paste+run the following (same graph structure as last time, but you'll
visit the nodes in a different order by doing a BFS):
```python
g = test_graph()
g.edge(1, 2)
g.edge(4, 3)
g.edge(1, 3)
g.edge(2, 4)
g.bfs(1, 3)
```
## Problem 3 [7-node, DFS+BFS]
Paste+run the following:
```python
g = test_graph()
for i in range(5):
    g.edge(i, i+1)
    g.edge(i, 6)
    g.edge(6, i)
g.dfs(0, 4)
```
Then change `dfs` to `bfs` and try again.
## Problem 4 [6-node, BFS]
```python
g = test_graph()
for i in range(0, 4, 2):
    g.edge(i, i+2)
    g.edge(i+1, i+3)
    g.edge(i, i+1)
g.edge(4, 5)
g.bfs(2, 1)
```
# Lab 5: Graph Search
1. Practice [graph search order](./dfs-vs-bfs)
2. Start the [module](./dfs-class) you'll be building for P3
# Screenshot Requirement
Submit a screenshot of the `dfs_search` and `dfs_visit` results.
# Lab 6: Selenium
1. Install Selenium and Chromium by following directions in Part 3 of P3.
2. Continue working on P3.
# Screenshot Requirement
A screenshot that shows Selenium is installed successfully.
# Lab 7: Flask
1. Install Flask by following the directions in the group part of P4.
2. Start working on the group part of P4 and build the web pages "index.html", "browse.html", and "donate.html".
# Screenshot Requirement
A screenshot that shows Flask is installed successfully.
# Lab 8: Geopandas
1. Install geopandas and geopy, then go to [City of Madison Open Data](https://data-cityofmadison.opendata.arcgis.com/) and find an interesting dataset to plot using geopandas.
2. Continue working on Part 1 and Part 2 of P4 (Part 3 will be for the next lab after regex is covered during the lectures).
# Screenshot Requirement
A screenshot of your plot from step 1.
# Lab 9: EDGAR Utilities Module
## Overview
In the US, public companies need to regularly file various statements
and reports to the SEC's (Securities and Exchange Commission) EDGAR
database. EDGAR data is publicly available online; furthermore, web
requests to EDGAR from around the world are logged and published. For
P5, you'll analyze both SEC filing HTML pages and a log of web
requests from around the world for those pages.
In this lab, you'll create an `edgar_utils.py` to help with analyzing
the pages and logs. It will contain two things: a `lookup_region`
function and a `Filing` class.
## Practice for `lookup_region`
For "practice" components of this lab, you'll do exercises in a
notebook to get the code and logic correct. You'll then use what you
learn to write your `edgar_utils.py` module.
### Exercise 1: replace letters
For the project dataset, you'll be working with some IP addresses
where some of the digits have been replaced with letters for
anonymization.
For some calculations, we need only digits, so we'll replace any
letters with "0". Complete the following regex code to get back
"101.1.1.000":
```python
import re
ipaddr = "101.1.1.abc"
re.sub(????, ????, ipaddr)
```
### Exercise 2: integer IPs
Note: if you haven't installed `netaddr` yet for P5, install it from the command line with `pip3 install netaddr`.
IP addresses are commonly represented as four-part numbers, like
"34.67.75.25". To convert an address like this to an integer, you can
use the following:
```python
import netaddr
int(netaddr.IPAddress("34.67.75.25"))
```
Modify the above to look up the integer representation of your virtual
machine's IP address.
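To see where that integer comes from, each of the four parts can be treated as one base-256 digit. The following is a minimal sketch that does the arithmetic by hand and cross-checks it against the standard library's `ipaddress` module (used here instead of `netaddr` only so the snippet has no extra dependency). The helper name `ip_to_int` is a hypothetical one:

```python
import ipaddress

def ip_to_int(addr):
    """Convert a dotted-quad IPv4 string to an integer by hand."""
    a, b, c, d = (int(part) for part in addr.split("."))
    return a * 256**3 + b * 256**2 + c * 256 + d

addr = "34.67.75.25"
print(ip_to_int(addr))
# Cross-check against the standard library.
print(ip_to_int(addr) == int(ipaddress.IPv4Address(addr)))  # True
```

Because the leftmost part carries the most weight, sorting addresses by their integer value keeps addresses in the same range next to each other.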
### Exercise 3: binary search
Consider the following sorted list:
```python
L = [1, 2, 100, 222, 900]
```
Running `150 in L` would loop over every element in the list. This is
slow, and doesn't take advantage of the fact that the list is sorted.
A better strategy when we know it is sorted would be to check the
middle (100) and infer 150 must be in the second half of the list, if
it's in the list at all; no need to check the first half.
A famous algorithm that uses this strategy is called *binary search*, and it's implemented by this function: https://docs.python.org/dev/library/bisect.html#bisect.bisect
Try it:
```python
from bisect import bisect
idx = bisect(L, 150)
idx
```
You should get `3` -- this means that if you wanted to add 150 to the
list and keep it in sorted order, you would insert 150 at index 3
(after 1, 2, and 100). This also means `L[idx-1]` is the biggest
number in the list that is **less than or equal** to 150.
*What will bisect of `L` be for 225?* Write down your prediction, then
run code to check your answer.
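The `L[idx-1]` observation is what makes `bisect` useful for range lookups. Here is a minimal sketch on made-up data. The names `starts`, `labels`, and `label_for` are hypothetical, and the sketch assumes the value is never smaller than the first start:

```python
from bisect import bisect

# Hypothetical sorted range starts and a label for each range.
starts = [0, 10, 50, 200]
labels = ["tiny", "small", "medium", "large"]

def label_for(value):
    """Return the label of the range whose start is the largest one <= value."""
    idx = bisect(starts, value)  # insertion point to the right of equal values
    return labels[idx - 1]       # assumes value >= starts[0]

print(label_for(7))    # tiny
print(label_for(50))   # medium
print(label_for(999))  # large
```

Each lookup takes O(log N) steps rather than scanning every range.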
### Exercise 4: country/region lookup
You can generally guess what country or region a computer is in based
on its IP address. Read in `ip2location.csv` from the project to see
the IP ranges assigned to each region (this is borrowed from
https://lite.ip2location.com/database/ip-country).
```python
import pandas as pd
ips = pd.read_csv("ip2location.csv")
ips
```
Can you use (a) `bisect` on the `low` column of `ips` and (b) the
integer representation of your VM's IP address to (c) find an `idx`
for the row in `ips` corresponding to your VM?
Look at `ips.iloc[idx]` to make sure you found the correct row.
## `lookup_region` function
Write an efficient `lookup_region` function in your `edgar_utils.py`
module that takes an IP address (in string form) and returns the
country or region the corresponding computer is in. You can import it
and test it in Jupyter notebooks or Python interactive mode.
Example usage:
```python
>>> lookup_region("1.1.1.x")
'United States of America'
>>> lookup_region("101.1.1.abc")
'China'
```
Requirements:
* it needs to work with anonymized IPs
* don't read the CSV file each time `lookup_region` is called
* don't loop over every row each time `lookup_region` is called -- your code needs to be faster than O(N)
## Practice for `Filing` class
Copy/paste the following string to a notebook:
```python
html = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Last-Modified" content="Fri, 12 Feb 2016 00:05:37 GMT" />
<title>EDGAR Filing Documents for 0001050470-16-000051</title>
<link rel="stylesheet" type="text/css" href="/include/interactive.css" />
</head>
<body style="margin: 0">
<!-- SEC Web Analytics - For information please visit: https://www.sec.gov/privacy.htm#collectedinfo -->
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-TD3BKV"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-TD3BKV');</script>
<!-- End SEC Web Analytics -->
<noscript><div style="color:red; font-weight:bold; text-align:center;">This page uses Javascript. Your browser either doesn't support Javascript or you have it turned off. To see this page as it is meant to appear please use a Javascript enabled browser.</div></noscript>
<!-- BEGIN BANNER -->
<div id="headerTop">
<div id="Nav"><a href="/index.htm">Home</a> | <a href="/cgi-bin/browse-edgar?action=getcurrent">Latest Filings</a> | <a href="javascript:history.back()">Previous Page</a></div>
<div id="seal"><a href="/index.htm"><img src="/images/sealTop.gif" alt="SEC Seal" border="0" /></a></div>
<div id="secWordGraphic"><img src="/images/bannerTitle.gif" alt="SEC Banner" /></div>
</div>
<div id="headerBottom">
<div id="searchHome"><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a></div>
<div id="PageTitle">Filing Detail</div>
</div>
<!-- END BANNER -->
<!-- BEGIN BREADCRUMBS -->
<div id="breadCrumbs">
<ul>
<li><a href="/index.htm">SEC Home</a> &#187;</li>
<li><a href="/edgar/searchedgar/webusers.htm">Search the Next-Generation EDGAR System</a> &#187;</li>
<li><a href="/edgar/searchedgar/companysearch.html">Company Search</a> &#187;</li>
<li class="last">Current Page</li>
</ul>
</div>
<!-- END BREADCRUMBS -->
<div id="contentDiv">
<!-- START FILING DIV -->
<div id="formDiv">
<div id="formHeader">
<div id="formName">
<strong>Form SC 13G</strong> - Statement of acquisition of beneficial ownership by individuals:
</div>
<div id="secNum">
<strong><acronym title="Securities and Exchange Commission">SEC</acronym> Accession <acronym title="Number">No.</acronym></strong> 0001050470-16-000051
</div>
</div>
<div class="formContent">
<div class="formGrouping">
<div class="infoHead">Filing Date</div>
<div class="info">2016-02-12</div>
<div class="infoHead">Accepted</div>
<div class="info">2016-02-11 19:05:37</div>
<div class="infoHead">Documents</div>
<div class="info">1</div>
</div>
<div style="clear:both"></div>
</div>
</div>
<!-- END FILING DIV -->
<!-- START DOCUMENT DIV -->
<div id="formDiv">
<div style="padding: 0px 0px 4px 0px; font-size: 12px; margin: 0px 2px 0px 5px; width: 100%; overflow:hidden">
<p>Document Format Files</p>
<table class="tableFile" summary="Document Format Files">
<tr>
<th scope="col" style="width: 5%;"><acronym title="Sequence Number">Seq</acronym></th>
<th scope="col" style="width: 40%;">Description</th>
<th scope="col" style="width: 20%;">Document</th>
<th scope="col" style="width: 10%;">Type</th>
<th scope="col">Size</th>
</tr>
<tr>
<td scope="row">1</td>
<td scope="row">LSV13G123115MEDALLION.TXT</td>
<td scope="row"><a href="/Archives/edgar/data/1000209/000105047016000051/lsv13g123115medallion.txt">lsv13g123115medallion.txt</a></td>
<td scope="row">SC 13G</td>
<td scope="row">8314</td>
</tr>
<tr class="blueRow">
<td scope="row">&nbsp;</td>
<td scope="row">Complete submission text file</td>
<td scope="row"><a href="/Archives/edgar/data/1000209/000105047016000051/0001050470-16-000051.txt">0001050470-16-000051.txt</a></td>
<td scope="row">&nbsp;</td>
<td scope="row">9803</td>
</tr>
</table>
</div>
</div>
<!-- END DOCUMENT DIV -->
<!-- START FILER DIV -->
<div id="filerDiv">
<div class="mailer">Mailing Address
<span class="mailerAddress">437 MADISON AVENUE</span>
<span class="mailerAddress">38TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
</div>
<div class="mailer">Business Address
<span class="mailerAddress">437 MADISON AVE 38 TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
<span class="mailerAddress">2123282153</span>
</div>
<div class="companyInfo">
<span class="companyName">MEDALLION FINANCIAL CORP (Subject)
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001000209&amp;action=getcompany">0001000209 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>043291176</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>SC 13G</strong> | Act: <strong>34</strong> | File No.: <a href="/cgi-bin/browse-edgar?filenum=005-48473&amp;action=getcompany"><strong>005-48473</strong></a> | Film No.: <strong>161413579</strong><br /><acronym title="Standard Industrial Code">SIC</acronym>: <b><a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=6199&amp;owner=include">6199</a></b> Finance Services<br />Office of Finance</p>
</div>
<div class="clear"></div>
</div>
<div id="filerDiv">
<div class="mailer">Mailing Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
</div>
<div class="mailer">Business Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
<span class="mailerAddress">312-460-2443</span>
</div>
<div class="companyInfo">
<span class="companyName">LSV ASSET MANAGEMENT (Filed by)
<acronym title="Central Index Key">CIK</acronym>: <a href="/cgi-bin/browse-edgar?CIK=0001050470&amp;action=getcompany">0001050470 (see all company filings)</a></span>
<p class="identInfo"><acronym title="Internal Revenue Service Number">IRS No.</acronym>: <strong>232772200</strong> | State of Incorp.: <strong>DE</strong> | Fiscal Year End: <strong>1231</strong><br />Type: <strong>SC 13G</strong></p>
</div>
<div class="clear"></div>
</div>
<!-- END FILER DIV -->
</div>"""
```
### Exercise 1: dates
Write a regular expression finding all the dates with the `YYYY-MM-DD`
format in the document. It's OK if you have extra groups within your
matches.
```python
import re
re.findall(r"????", html)
```
You might have some extra matches like `0470-16-00` that are clearly not dates. Add some additional filtering so that you only count 4-digit numbers as years if they have the form 19XX or 20XX. Similarly, to exclude matches like `2044-16-00`, only count 2-digit numbers as months if they are between 01 and 12. You could do this in the regular expression itself, or in some additional Python code that loops over the results of the regular expression.
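If you choose the second route (filtering in Python after the regex runs), the check might look roughly like the sketch below. The `candidates` list and the helper name `looks_like_date` are hypothetical, and the sketch assumes every candidate already has the `YYYY-MM-DD` shape. It does not validate the day:

```python
# Hypothetical candidate strings, as a date-shaped regex might return them.
candidates = ["2016-02-12", "0470-16-00", "2044-16-00", "1999-12-31"]

def looks_like_date(text):
    """Keep YYYY-MM-DD strings with a 19XX/20XX year and a month from 01 to 12."""
    year, month, day = text.split("-")
    return year[:2] in ("19", "20") and 1 <= int(month) <= 12

print([c for c in candidates if looks_like_date(c)])
# ['2016-02-12', '1999-12-31']
```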
### Exercise 2: Standard Industrial Classification (SIC) codes
Take a look at the industry codes defined here: https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list
Write a regular expression to find numbers following the text
"SIC=" in `html`.
### Exercise 3: Addresses
To find addresses, we'll look at the HTML. We want to find the
contents between `<div class="mailer">` and `</div>`. We want the
non-greedy version (meaning that if there are multiple `</div>`
instances, we want to match the first one possible).
Write a regular expression to do this:
```python
for addr_html in re.findall(????, html):
    print(addr_html)
```
Note, the text between the opening and closing `div` tags generally
spans multiple lines. Remember that `.` does not match newlines. One
way to match anything is with `[\s\S]` (meaning whitespace or not
whitespace; in other words, everything). If you get very stuck on
this one, you can scroll past the expected output to see one solution
for finding `addr_html` matches.
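The difference between `.` and `[\s\S]`, and between greedy and non-greedy matching, is easiest to see on a tiny string. This sketch uses hypothetical `<b>` markup rather than the filing HTML:

```python
import re

# Hypothetical markup with two <b> elements, the first spanning two lines.
text = "<b>one\ntwo</b> and <b>three</b>"

# `.` cannot cross the newline, so the first element is missed.
print(re.findall(r"<b>(.+?)</b>", text))       # ['three']

# [\s\S] matches any character; +? stops at the first closing tag.
print(re.findall(r"<b>([\s\S]+?)</b>", text))  # ['one\ntwo', 'three']

# Greedy + runs on to the last closing tag instead.
print(re.findall(r"<b>([\s\S]+)</b>", text))   # ['one\ntwo</b> and <b>three']
```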
Expected output:
```
Mailing Address
<span class="mailerAddress">437 MADISON AVENUE</span>
<span class="mailerAddress">38TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
Business Address
<span class="mailerAddress">437 MADISON AVE 38 TH FLOOR</span>
<span class="mailerAddress">
NEW YORK NY 10022 </span>
<span class="mailerAddress">2123282153</span>
Mailing Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
Business Address
<span class="mailerAddress">155 NORTH WACKER DRIVE</span>
<span class="mailerAddress">SUITE 4600</span>
<span class="mailerAddress">
CHICAGO IL 60606 </span>
<span class="mailerAddress">312-460-2443</span>
```
Now, extend your above loop so that a further regex search is
conducted for address lines within each `addr_html` match. Address
lines are between `<span class="mailerAddress">` and `</span>`:
```python
for addr_html in re.findall(r'<div class="mailer">([\s\S]+?)</div>', html):
    lines = []
    for line in re.findall(????, addr_html):
        lines.append(line.strip())
    print("\n".join(lines))
    print()
```
## `Filing` class
Add a `Filing` class to your `edgar_utils.py` like this:
```python
class Filing:
    def __init__(self, html):
        self.dates = ????
        self.sic = ????
        self.addresses = ????

    def state(self):
        return "TODO"
```
`html` will be an HTML string, much like the one you were working with
in the practices. Fill in the missing parts and add additional lines
as needed. Much of the code from the practice exercises will be
useful here.
* `dates` should be a list of dates in the `YYYY-MM-DD` format that appear in the HTML (only count years starting as 19XX or 20XX with reasonable months and dates).
* `sic` should be an `int` indicating the Standard Industrial Classification. It should be `None` if this doesn't appear.
* `addresses` should be a list of addresses found in the HTML. Each address will contain the address lines separated by newlines, but otherwise there shouldn't be unnecessary whitespace (e.g. `['437 MADISON AVENUE\n38TH FLOOR\nNEW YORK NY 10022','155 NORTH WACKER DRIVE\nSUITE 4600\nCHICAGO IL 60606']`; note this is just an example, **not** the answer).
* `state()` should loop over the addresses. If it finds one that contains two capital letters followed by 5 digits (for example, `WI 53706`), it should return what appears to be a state abbreviation (for example `WI`). You don't need to check that the abbreviation is a valid state. If nothing that looks like a state abbreviation appears, return `None`. Note: the match must be exactly 2 capital letters followed by 5 digits; for example, `BOX 12345` must not yield `OX`.
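The "exactly 2 capital letters and 5 digits" requirement can be expressed in several ways. One possible approach, shown here only as a sketch on made-up strings, uses lookarounds. The `samples` list and `pattern` are hypothetical, and the sketch assumes a single space separates the letters from the digits:

```python
import re

# Hypothetical address strings, not taken from any filing.
samples = ["MADISON WI 53706", "PO BOX 12345", "SUITE 4600"]

# (?<![A-Z]) rejects a match when another capital letter comes right before,
# and (?!\d) rejects one when a sixth digit follows.
pattern = r"(?<![A-Z])([A-Z]{2}) \d{5}(?!\d)"

for s in samples:
    m = re.search(pattern, s)
    print(s, "->", m.group(1) if m else None)
```

Only the first sample produces a match (`WI`); the other two print `None`.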
# Lab 9: Regex
1. Continue working on [Part 3 of P4](EDGAR.md).
# Screenshot Requirement
A screenshot that shows your progress.