Commits on Source (83)
Showing
with 5465 additions and 0 deletions
# Read-only Access
We've opened up the `autobadger` tool in an attempt to make things more visible to you and less of a "black box". We've done this by making the repository *read-only*, meaning you should be able to `git clone` the repo but not `git push` to it.
To start, navigate to a directory outside of any class project. I'd recommend cloning to the same directory as your projects.
```bash
git clone https://oauth2:glpat-CSTX_tgpf38eJHyUW213@git.doit.wisc.edu/cdis/cs/courses/cs544/s25/tools/autobadger.git
```
> **NOTE**: if you want to use this method throughout the semester, you'll need to `git pull` to get up-to-date code for each project.
Your folder structure should look something like
```
some-directory/
autobadger/
p1/
p2/
... # other projects
```
# Making Changes
You can change the code inside of `autobadger`. The only files that will be of interest to you are inside the `projects/` directory (i.e. `projects/*.py`). Your changes will be for debugging, e.g. `print()` or `breakpoint()` statements.
#### Using `pip`
For whatever project you're working on, you will need to *apply* any changes you make by reinstalling with `pip`.
For example, assuming
- I'm working on `p2`
- in my `p2` directory
- and have my `venv` activated
I would do something like:
```bash
pip3 install ../autobadger/.
```
This would install and replace my local version of `autobadger`. Now when I run
```
autobadger --project=p2
```
I will see my changes in effect.
# Breakpoints
Since `breakpoint()` is less well known and less straightforward than `print()`, I will explain it here.
> **NOTE**: It is not required to use `breakpoint()`. You are also welcome to use `print()` instead. `breakpoint()` has a **steeper learning curve**, but may **help you iterate more quickly and save you time** once the basic concepts are well-understood.
### What is a breakpoint?
`breakpoint()` is a built-in function in Python and starts the **debugger** at the point where it is called. It allows developers to inspect variables, step through code, and debug interactively.
#### Simple Example:
```python
# Inside of /path/to/file.py
def calculate_sum(a, b):
    breakpoint()  # Debugger starts here
    return a + b

calculate_sum(3, 5)  # execute function
```
Adding a `breakpoint()` will pause execution, allowing you to inspect `a` and `b` before proceeding. I would see something like:
```
> /path/to/file.py(3)calculate_sum()
-> return a + b
```
in the terminal, which displays
1. the next line to be executed `return a + b`
2. `(3)calculate_sum()` tells me the line number and the function name (if applicable)
3. `/path/to/file.py` tells me the current file
### Navigating the debugger
While the Python debugger is active, you can use several commands to navigate through your program and investigate.
- `Variable name`: I can type the name of any variable that is in scope to get its value.
- Ex: Typing `a` in the previous example would return the *value* of `a`
- **NOTE**: if a variable name coincides with a debugger command, you need to use `print(<variable_name>)` instead. `b` is one of those commands (it sets or lists breakpoints), so to print the value of `b` to the terminal, I would need to do `print(b)`.
- `Evaluation`: I can also evaluate expressions (e.g. add two numbers):
```
In [3]: calculate_sum(3, 5)
> <ipython-input-2-443b6e8e0b0a>(3)calculate_sum()
-> return a + b
(Pdb) print(a)
3
(Pdb) print(b)
5
(Pdb) print(a + b)
8
```
- `n`: Steps to the next line of my program
- `c`: Continues execution of the program until the next breakpoint, or until the program ends.
- `s`: Steps *into* a function or method call
- `exit`: kills the debugger and ends the program
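The session above can also be replayed without typing, which is handy for seeing what the commands print. This is only a sketch: it assumes the standard-library `pdb.Pdb` class with its `stdin`/`stdout` arguments, and it uses pdb's `p` command (equivalent to `print(...)` here) so that `p b` is not mistaken for the `b` breakpoint command. The command list and the `("sum", ...)` tuple are my own illustration.

```python
import io
import pdb

# Commands the debugger reads instead of keyboard input (illustrative only).
commands = io.StringIO('p a\np b\np ("sum", a + b)\nc\n')
captured = io.StringIO()
debugger = pdb.Pdb(stdin=commands, stdout=captured)

def calculate_sum(a, b):
    debugger.set_trace()  # pauses like breakpoint(), but reads the scripted commands
    return a + b

result = calculate_sum(3, 5)
session = captured.getvalue()
print(session)  # shows the (Pdb) prompts and the printed values
```

The final `c` lets the function finish, so `result` still holds the normal return value.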
# An example
### Using breakpoints
Suppose I want to investigate `Q4` for `p2`. I can add `breakpoint()` statements to the Q4 test method for the `ProjectTwoTest` class.
Navigating to `projects/p2.py` inside of `autobadger`, I find:
```python
@graded(Q=4, points=10)
def test_simple_http(self) -> int | TestError:
    address = self._test_cache_server("-cache-1")
    if isinstance(address, TestError):
        return address
    r = requests.get(f"{address}/lookup/53706")
    r.raise_for_status()
    result = r.json()
    if "addrs" not in result or "source" not in result:
        return TestError(
            message=f"Result body should be JSON with 'addrs' and 'source' fields, but got {result}.",
            earned=5,
        )
    return 10
```
> Note: This is Q4 since I have `Q=4` in the decorator.
**I can edit this method by adding *breakpoints*!**
```python
@graded(Q=4, points=10)
def test_simple_http(self) -> int | TestError:
    breakpoint()
    address = self._test_cache_server("-cache-1")
    if isinstance(address, TestError):
        return address
    r = requests.get(f"{address}/lookup/53706")
    breakpoint()
    r.raise_for_status()
    result = r.json()
    if "addrs" not in result or "source" not in result:
        return TestError(
            message=f"Result body should be JSON with 'addrs' and 'source' fields, but got {result}.",
            earned=5,
        )
    return 10
```
Now, after I update with `pip` as mentioned above, I can run `autobadger --project=p2` and get:
```
> /Users/.../p2.py(103)test_simple_http()
-> address = self._test_cache_server("-cache-1")
```
Note that in this situation, typing `address` would give me an error because it is **not yet defined**:
```
(Pdb) address
*** NameError: name 'address' is not defined
```
###### Using `n` (next line)
`address` is defined on the *next line* to be executed. So, I use the `n` command to step!
```
(Pdb) n
> /Users/.../p2.py(104)test_simple_http()
-> if isinstance(address, TestError):
(Pdb) address
'http://localhost:64879'
```
###### Using `s` (step into)
I could have also used `s` to *step into* `self._test_cache_server(...)` if I had wanted to investigate further:
```
> /Users/.../p2.py(103)test_simple_http()
-> address = self._test_cache_server("-cache-1")
(Pdb) s
--Call--
> /Users/.../p2.py(118)_test_cache_server()
-> def _test_cache_server(self, server_suffix: str) -> str | TestError:
# Now in a new method — _test_cache_server
(Pdb) n
> /Users/.../p2.py(119)_test_cache_server()
-> cache_server = [c for c in self.containers if c["Name"].endswith(server_suffix)]
```
###### Using `c` (continue)
I can also *continue* till the next breakpoint, which is quite convenient if you don't need to step over every line of code:
```
> /Users/.../p2.py(103)test_simple_http()
-> address = self._test_cache_server("-cache-1")
(Pdb) c
> /Users/.../p2.py(108)test_simple_http()
-> r.raise_for_status()
(Pdb) print(r.json())
{'addrs': [...], 'error': None, 'source': '...'}
```
Using `c` jumped from line `103` to line `108`, where I had my two breakpoints defined.
> **NOTE**: using `c` again would continue the Python program till the end of its execution since I have no other `breakpoint()` statements
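If I leave `breakpoint()` statements in the code and want one run without pausing, one option is the `PYTHONBREAKPOINT` environment variable, which Python's default `sys.breakpointhook()` checks each time `breakpoint()` is called. A minimal sketch, reusing `calculate_sum` from earlier; that this fits the `autobadger` workflow is my assumption, not something the tool documents:

```python
import os

def calculate_sum(a, b):
    breakpoint()  # skipped while PYTHONBREAKPOINT is "0"
    return a + b

# Setting the variable to "0" turns every breakpoint() call into a no-op.
os.environ["PYTHONBREAKPOINT"] = "0"

result = calculate_sum(3, 5)
print(result)  # the function returns normally, no (Pdb) prompt appears
```

From a shell, prefixing the command (e.g. `PYTHONBREAKPOINT=0 autobadger --project=p2`) should have the same effect for that one run.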
File added
%% Cell type:code id:3b8d1ed9-9568-473e-bfa2-3b08f9e4587f tags:
``` python
import threading
import time

def task():
    print("hi from thread ID:", threading.get_native_id())

t = threading.Thread(target=task)
t.start()
print("hi from main thread, ID:", threading.get_native_id())
```
%% Output
hi from thread ID:hi from main thread, ID: 65
125
%% Cell type:code id:651dc205-4fdc-4054-be28-b19e2e798017 tags:
``` python
total = 0

def task(count):
    global total
    for i in range(count):
        total += 1

t = threading.Thread(target=task, args=[1_000_000])
t.start()
t.join() # wait until it exits
print(total)
```
%% Output
1000000
%% Cell type:code id:5eb4a3e1-decf-4352-b61c-b10e53aec763 tags:
``` python
total
```
%% Output
1000
%% Cell type:code id:ceeb7904-ff2c-424c-bf7c-724d1b285e3b tags:
``` python
total = 0

def task(count):
    global total
    for i in range(count):
        total += 1

t1 = threading.Thread(target=task, args=[1_000_000])
t1.start()
t2 = threading.Thread(target=task, args=[1_000_000])
t2.start()
t1.join()
t2.join()
total
```
%% Output
1084635
%% Cell type:code id:48a5226c-6d63-4062-a2be-f7dd890c4dc1 tags:
``` python
import dis
dis.dis("total += 1")
```
%% Output
0 RESUME 0
1 LOAD_NAME 0 (total)
LOAD_CONST 0 (1)
BINARY_OP 13 (+=)
STORE_NAME 0 (total)
RETURN_CONST 1 (None)
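The bytecode above suggests why two threads lose updates: `total += 1` is a separate load, add, and store, and another thread can run between those steps. The sketch below does not use real threads; it replays one possible interleaving by hand (the `t1_`/`t2_` variable names are hypothetical) so the lost update is reproducible.

```python
total = 0

# Two "threads" each perform load -> add -> store, with their steps interleaved.
t1_loaded = total           # T1: LOAD_NAME total  (reads 0)
t2_loaded = total           # T2: LOAD_NAME total  (also reads 0, T1 has not stored yet)
t1_result = t1_loaded + 1   # T1: BINARY_OP +=
t2_result = t2_loaded + 1   # T2: BINARY_OP +=
total = t1_result           # T1: STORE_NAME total (total becomes 1)
total = t2_result           # T2: STORE_NAME total (overwrites with 1, T1's update is lost)

print(total)  # 1, even though two increments ran
```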
%% Cell type:code id:3b8d1ed9-9568-473e-bfa2-3b08f9e4587f tags:
``` python
import threading
import time
```
%% Cell type:code id:da885575-cedf-4bb4-8bbd-44400974d3f8 tags:
``` python
import dis
dis.dis("total += 1")
```
%% Output
0 RESUME 0
1 LOAD_NAME 0 (total)
LOAD_CONST 0 (1)
BINARY_OP 13 (+=)
STORE_NAME 0 (total)
RETURN_CONST 1 (None)
%% Cell type:code id:ceeb7904-ff2c-424c-bf7c-724d1b285e3b tags:
``` python
%%time
# 133 ms with no locks
# 348 ms with locks (fine grained)
# 124 ms with locks (coarse grained)
lock = threading.Lock() # this protects the "total" variable
total = 0

def task(count):
    global total
    lock.acquire()
    for i in range(count):
        total += 1
    lock.release()

t1 = threading.Thread(target=task, args=[1_000_000])
t1.start()
t2 = threading.Thread(target=task, args=[1_000_000])
t2.start()
t1.join()
t2.join()
total
```
%% Output
CPU times: user 133 ms, sys: 144 μs, total: 133 ms
Wall time: 129 ms
2000000
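The timing comments in the cell above mention a fine-grained variant. A sketch of what that might look like, assuming the lock is taken once per increment instead of once per thread (the smaller count is my choice to keep the run short):

```python
import threading

lock = threading.Lock()  # protects total
total = 0

def task(count):
    global total
    for i in range(count):
        with lock:  # fine-grained: acquire and release around every increment
            total += 1

threads = [threading.Thread(target=task, args=[100_000]) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total)  # no increments are lost
```

Both variants give a correct total; the fine-grained one pays for an acquire/release on every iteration, which is consistent with the slower time recorded in the comments.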
%% Cell type:code id:042356c8-7d34-4fa5-bef1-0c43e829fc13 tags:
``` python
import threading

bank_accounts = {"x": 25, "y": 100, "z": 200} # in dollars
lock = threading.Lock() # protects bank_accounts

def transfer(src, dst, amount):
    with lock: # automatically acquire now, and release after the with statement
        success = False
        if bank_accounts[src] >= amount:
            bank_accounts[src] -= amount
            bank_accounts[dst] += amount
            success = True
        print("transferred" if success else "denied")
        print("locked inside with?", lock.locked())
    print("locked after with?", lock.locked())
```
%% Cell type:code id:1878620a-553b-4f25-9606-1e2605b318ac tags:
``` python
transfer("x", "y", 20)
```
%% Output
transferred
locked inside with? True
locked after with? False
%% Cell type:code id:8901ec24-7471-40e1-8182-585d3cc760c7 tags:
``` python
bank_accounts
```
%% Output
{'x': 5, 'y': 120, 'z': 200}
%% Cell type:code id:9cee2370-0f3d-4ac0-930a-c59f9f0d7f1b tags:
``` python
transfer("x", "z", 10)
```
%% Output
denied
locked inside with? True
locked after with? False
%% Cell type:code id:3a583c7c-57a3-4f42-8301-8ce601a2249e tags:
``` python
bank_accounts
```
%% Output
{'x': 5, 'y': 120, 'z': 200}
%% Cell type:code id:3346b86d-5f80-4007-8879-b452c4826c50 tags:
``` python
transfer("w", "z", 10)
```
%% Output
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[8], line 1
----> 1 transfer("w", "z", 10)
Cell In[3], line 9, in transfer(src, dst, amount)
7 with lock:
8 success = False
----> 9 if bank_accounts[src] >= amount:
10 bank_accounts[src] -= amount
11 bank_accounts[dst] += amount
KeyError: 'w'
%% Cell type:code id:02262baa-6f4f-455e-b913-accfa894cc18 tags:
``` python
transfer("z", "x", 3)
```
%% Output
transferred
locked inside with? True
locked after with? False
%% Cell type:code id:1b338a2c-1274-41e8-9179-89b797e4ec2f tags:
``` python
bank_accounts
```
%% Output
{'x': 8, 'y': 120, 'z': 197}
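The transfer after the `KeyError` still worked because `with lock:` releases the lock even when the body raises. A small sketch contrasting that with manual `acquire()`/`release()`; the `safe_lookup` and `unsafe_lookup` helpers and the one-entry `accounts` dict are hypothetical, not part of the notebook:

```python
import threading

lock = threading.Lock()
accounts = {"x": 25}

def safe_lookup(name):
    with lock:  # released on normal exit and on exceptions
        return accounts[name]

def unsafe_lookup(name):
    lock.acquire()
    balance = accounts[name]  # a KeyError here skips the release below
    lock.release()
    return balance

try:
    safe_lookup("w")
except KeyError:
    pass
after_with = lock.locked()

try:
    unsafe_lookup("w")
except KeyError:
    pass
after_manual = lock.locked()

print(after_with, after_manual)
lock.release()  # clean up the lock left held by unsafe_lookup
```

With the manual version, any later caller would block forever on the lock that was never released.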
%% Cell type:code id:de014d74-4685-4f16-85fd-f571fd5a91db tags:
``` python
import threading
```
%% Cell type:code id:af3da760-5f6e-40c7-94cd-b505abe59012 tags:
``` python
def task():
    print("hello from thread ID:", threading.get_native_id())

#task()
t = threading.Thread(target=task)
t.start()
print("hello from main thread, with ID:", threading.get_native_id())
```
%% Output
hello from thread ID:hello from main thread, with ID: 589
602
%% Cell type:code id:3dd19fcb-2c4a-4b26-b72f-3b87bd8e5bb5 tags:
``` python
total = 0

def task(count):
    global total
    for i in range(count):
        total += 1

t = threading.Thread(target=task, args=[1_000_000])
t.start()
t.join() # wait until the thread is done before we continue
total
```
%% Output
1000000
%% Cell type:code id:de6ce736-5bdf-42b9-b98c-afee738b8f94 tags:
``` python
total
```
%% Output
1000000
%% Cell type:code id:b735264e-f27a-4d27-ba5f-4d6a1f53543e tags:
``` python
total = 0

def task(count):
    global total
    for i in range(count):
        total += 1

t1 = threading.Thread(target=task, args=[1_000_000])
t1.start()
t2 = threading.Thread(target=task, args=[1_000_000])
t2.start()
t1.join()
t2.join()
total
```
%% Output
1100428
%% Cell type:code id:c3d4a396-1793-4504-b2ff-fb8b81ae48d5 tags:
``` python
import dis
dis.dis("total += 1")
```
%% Output
0 RESUME 0
1 LOAD_NAME 0 (total)
LOAD_CONST 0 (1)
BINARY_OP 13 (+=)
STORE_NAME 0 (total)
RETURN_CONST 1 (None)
%% Cell type:code id:de014d74-4685-4f16-85fd-f571fd5a91db tags:
``` python
import threading
```
%% Cell type:code id:82b2f08c-1a13-4a0f-9546-e5d624dbee5b tags:
``` python
import dis
dis.dis("total += 1")
```
%% Output
0 RESUME 0
1 LOAD_NAME 0 (total)
LOAD_CONST 0 (1)
BINARY_OP 13 (+=)
STORE_NAME 0 (total)
RETURN_CONST 1 (None)
%% Cell type:code id:ae617ed8-dd51-4154-ad83-308df527d1f1 tags:
``` python
import threading
```
%% Cell type:code id:b735264e-f27a-4d27-ba5f-4d6a1f53543e tags:
``` python
%%time
# 141 ms (no locks)
# 340 ms (fine-grained locking)
# 122 ms (coarse-grained locking)
lock = threading.Lock() # this protects total
total = 0

def task(count):
    global total
    lock.acquire()
    for i in range(count):
        total += 1
    lock.release()

t1 = threading.Thread(target=task, args=[1_000_000])
t1.start()
t2 = threading.Thread(target=task, args=[1_000_000])
t2.start()
t1.join()
t2.join()
total
```
%% Output
CPU times: user 151 ms, sys: 0 ns, total: 151 ms
Wall time: 148 ms
2000000
%% Cell type:code id:7cf83d46-b7db-4ab6-b0bb-4b35fecba5b2 tags:
``` python
bank_accounts = {"x": 25, "y": 100, "z": 200} # in dollars
lock = threading.Lock() # protects bank_accounts

def transfer(src, dst, amount):
    with lock: # automatically acquire now, automatically release after the with
        success = False
        if bank_accounts[src] >= amount:
            bank_accounts[src] -= amount
            bank_accounts[dst] += amount
            success = True
        print("transferred" if success else "denied")
        print("is it locked inside the with?", lock.locked())
    print("is it locked after the with?", lock.locked())
```
%% Cell type:code id:c9f19a3d-172f-4103-b631-11b42a8dc94c tags:
``` python
transfer("x", "y", 20)
bank_accounts
```
%% Output
transferred
is it locked inside the with? True
is it locked after the with? False
{'x': 5, 'y': 120, 'z': 200}
%% Cell type:code id:bdd94c73-dd01-4b9e-aecc-70faff00685d tags:
``` python
transfer("x", "z", 10)
bank_accounts
```
%% Output
denied
is it locked inside the with? True
is it locked after the with? False
{'x': 5, 'y': 120, 'z': 200}
%% Cell type:code id:41f9d7d2-86f8-433c-bf56-724a7f445753 tags:
``` python
transfer("w", "z", 10) # there is no "w" bank account
bank_accounts
```
%% Output
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[8], line 1
----> 1 transfer("w", "z", 10) # there is no "w" bank account
2 bank_accounts
Cell In[5], line 7, in transfer(src, dst, amount)
5 with lock: # automatically acquire now, automatically release after the with
6 success = False
----> 7 if bank_accounts[src] >= amount:
8 bank_accounts[src] -= amount
9 bank_accounts[dst] += amount
KeyError: 'w'
%% Cell type:code id:f193975b-6f75-4c3c-b789-c494ad1130a6 tags:
``` python
transfer("z", "y", 50)
bank_accounts
```
%% Output
transferred
is it locked inside the with? True
is it locked after the with? False
{'x': 5, 'y': 170, 'z': 150}
%% Cell type:code id:cd9dd92a-3d2c-47c7-85f1-4bec952d1a67 tags:
``` python
```
%% Cell type:code id:2dbe2890-3b90-4978-9648-625dc8d7a949 tags:
``` python
# !wget https://pages.cs.wisc.edu/~harter/cs544/data/hdma-wi-2021.zip
# !unzip hdma-wi-2021.zip
```
%% Cell type:code id:3a4cfa35-6b4e-48bc-acc6-09d426cd6c3e tags:
``` python
import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet
```
%% Cell type:code id:e5acbdd4-a266-477a-b720-a68e16e6c8f6 tags:
``` python
%%time
t = pa.csv.read_csv("hdma-wi-2021.csv")
```
%% Output
CPU times: user 1.18 s, sys: 996 ms, total: 2.18 s
Wall time: 575 ms
%% Cell type:code id:dd5941ac-fc66-4037-81f8-4f32b5b9e4bc tags:
``` python
pa.parquet.write_table(t, "hdma-wi-2021.parquet")
```
%% Cell type:code id:ed698328-e057-41a5-89d9-351166331a6e tags:
``` python
# point 1: Parquet lets us skip slow schema inference
```
%% Cell type:code id:3c3b8e74-d1d6-412e-9dc9-f6f1cc252ce7 tags:
``` python
%%time
t = pa.parquet.read_table("hdma-wi-2021.parquet")
```
%% Output
CPU times: user 396 ms, sys: 102 ms, total: 498 ms
Wall time: 147 ms
%% Cell type:code id:d582cd7a-27ed-4c09-a22e-8663bd24e231 tags:
``` python
# point 2: Parquet uses a binary encoding (CSV is plain text)
```
%% Cell type:code id:9e2d33ce-3a07-405a-b379-ef86e6e0e076 tags:
``` python
with open("hdma-wi-2021.csv", "rb") as f:
    print(f.read(100))
```
%% Output
b'activity_year,lei,derived_msa-md,state_code,county_code,census_tract,conforming_loan_limit,derived_l'
%% Cell type:code id:76debfbf-a7ca-48f6-8ca0-c9a00fbd3dd9 tags:
``` python
with open("hdma-wi-2021.parquet", "rb") as f:
    print(f.read(100))
```
%% Output
b'PAR1\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c\xe5\x07\x00\x00\x00\x00\x00\x00\x15\x00\x15\x1a\x15\x1e,\x15\x8e\xce6\x15\x10\x15\x06\x15\x06\x1c\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x16\x00(\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\r0\x04\x00\x00\x00\x8e\xce6'
%% Cell type:code id:d891d58c-8d8b-4dc9-bb2f-d12a03db465e tags:
``` python
# point 3: Parquet files are column oriented
```
%% Cell type:code id:3159c561-122d-46af-96fb-0093368f6086 tags:
``` python
%%time
t2 = pa.parquet.read_table("hdma-wi-2021.parquet", columns=["lei", "census_tract"])
```
%% Output
CPU times: user 26.3 ms, sys: 2.61 ms, total: 29 ms
Wall time: 20.8 ms
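Reading two columns is much faster than reading the whole table because a column-oriented file keeps each column's values together. The sketch below only mimics that idea with plain Python lists; it is not how Parquet is implemented, and the three sample rows are illustrative values in the style of the dataset.

```python
# Row-oriented layout: each record keeps all of its fields together (like CSV).
rows = [
    {"lei": "54930034MNPILHP25H80", "census_tract": 55027961800, "state_code": "WI"},
    {"lei": "54930034MNPILHP25H80", "census_tract": 55001950501, "state_code": "WI"},
    {"lei": "254900X6OAHFW6BUT219", "census_tract": 55109121000, "state_code": "WI"},
]

# Column-oriented layout: one list per column (the idea behind Parquet).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Selecting columns touches only the requested lists; state_code is never read.
wanted = {name: columns[name] for name in ["lei", "census_tract"]}
print(wanted)
```

In the row layout, pulling out `lei` and `census_tract` would require visiting every record and skipping the fields that are not needed.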
%% Cell type:code id:51ad1ba1-9ea7-4188-b96d-aaa1847f3e95 tags:
``` python
t
```
%% Output
pyarrow.Table
activity_year: int64
lei: string
derived_msa-md: int64
state_code: string
county_code: int64
census_tract: int64
conforming_loan_limit: string
derived_loan_product_type: string
derived_dwelling_category: string
derived_ethnicity: string
derived_race: string
derived_sex: string
action_taken: int64
purchaser_type: int64
preapproval: int64
loan_type: int64
loan_purpose: int64
lien_status: int64
reverse_mortgage: int64
open-end_line_of_credit: int64
business_or_commercial_purpose: int64
loan_amount: double
loan_to_value_ratio: string
interest_rate: string
rate_spread: string
hoepa_status: int64
total_loan_costs: string
total_points_and_fees: string
origination_charges: string
discount_points: string
lender_credits: string
loan_term: string
prepayment_penalty_term: string
intro_rate_period: string
negative_amortization: int64
interest_only_payment: int64
balloon_payment: int64
other_nonamortizing_features: int64
property_value: string
construction_method: int64
occupancy_type: int64
manufactured_home_secured_property_type: int64
manufactured_home_land_property_interest: int64
total_units: string
multifamily_affordable_units: string
income: int64
debt_to_income_ratio: string
applicant_credit_score_type: int64
co-applicant_credit_score_type: int64
applicant_ethnicity-1: int64
applicant_ethnicity-2: int64
applicant_ethnicity-3: int64
applicant_ethnicity-4: int64
applicant_ethnicity-5: int64
co-applicant_ethnicity-1: int64
co-applicant_ethnicity-2: int64
co-applicant_ethnicity-3: int64
co-applicant_ethnicity-4: int64
co-applicant_ethnicity-5: null
applicant_ethnicity_observed: int64
co-applicant_ethnicity_observed: int64
applicant_race-1: int64
applicant_race-2: int64
applicant_race-3: int64
applicant_race-4: int64
applicant_race-5: int64
co-applicant_race-1: int64
co-applicant_race-2: int64
co-applicant_race-3: int64
co-applicant_race-4: int64
co-applicant_race-5: int64
applicant_race_observed: int64
co-applicant_race_observed: int64
applicant_sex: int64
co-applicant_sex: int64
applicant_sex_observed: int64
co-applicant_sex_observed: int64
applicant_age: string
co-applicant_age: string
applicant_age_above_62: string
co-applicant_age_above_62: string
submission_of_application: int64
initially_payable_to_institution: int64
aus-1: int64
aus-2: int64
aus-3: int64
aus-4: int64
aus-5: int64
denial_reason-1: int64
denial_reason-2: int64
denial_reason-3: int64
denial_reason-4: int64
tract_population: int64
tract_minority_population_percent: double
ffiec_msa_md_median_family_income: int64
tract_to_msa_income_percentage: int64
tract_owner_occupied_units: int64
tract_one_to_four_family_homes: int64
tract_median_age_of_housing_units: int64
----
activity_year: [[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021]]
lei: [["54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80",...,"254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219"],["254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219",...,"549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46"],["549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46",...,"ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18"],["ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18",...,"54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80"]]
derived_msa-md: [[99999,99999,99999,29404,11540,...,33460,20740,33460,33460,99999],[99999,33460,33460,33460,20740,...,99999,33340,33340,33340,33340],[99999,33340,39540,33340,39540,...,36780,36780,11540,33340,33340],[29100,31540,99999,22540,99999,...,31540,99999,31540,99999,31540]]
state_code: [["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"]]
county_code: [[55027,55001,55013,55059,55087,...,55109,55017,55093,55109,55033],[55095,55109,55109,55109,55017,...,55027,55079,55133,55133,55079],[55027,55133,55101,55079,55101,...,55139,55139,55087,55131,55079],[55063,55021,55011,55039,55097,...,55025,55029,55025,55051,55021]]
census_tract: [[55027961800,55001950501,55013970400,55059002000,55087013300,...,55109121000,55017011100,55093960700,55109120904,55033970400],[55095960500,55109120700,55109121000,55109120904,55017010700,...,55027961500,55079090300,55133203305,55133203406,55079000303],[55027960800,55133201600,55101000901,55079016100,55101002402,...,55139001100,55139001803,55087012100,55131450104,55079150301],[55063010201,55021970100,55011960400,55039041300,55097960600,...,55025011301,55029100800,55025012300,55051180300,55021970300]]
conforming_loan_limit: [["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"]]
derived_loan_product_type: [["Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien",...,"Conventional:First Lien","Conventional:First Lien","FSA/RHS:First Lien","Conventional:Subordinate Lien","Conventional:First Lien"],["Conventional:First Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien",...,"Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:First Lien"],["Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien",...,"Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien"],["Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:First Lien",...,"Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien"]]
derived_dwelling_category: [["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"]]
derived_ethnicity: [["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Joint",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Ethnicity Not Available","Not Hispanic or Latino",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Hispanic or Latino","Hispanic or Latino",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Ethnicity Not Available",...,"Ethnicity Not Available","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"]]
...
%% Cell type:code id:28a0a2d6-95a1-4c86-9577-ef62721c19d6 tags:
``` python
# point 4: Parquet is compressed, with snappy by default
```
%% Cell type:code id:9ee9812e-76d9-4bf3-9194-4556f83bf0ff tags:
``` python
%%time
pa.parquet.write_table(t, "hdma-wi-2021.parquet", compression="snappy")
```
%% Output
CPU times: user 698 ms, sys: 18.6 ms, total: 717 ms
Wall time: 730 ms
%% Cell type:code id:487c9b6f-559c-4777-b498-b4f41cd8667a tags:
``` python
%%time
pa.parquet.write_table(t, "hdma-wi-2021-gzip.parquet", compression="gzip")
```
%% Output
CPU times: user 2.21 s, sys: 13.9 ms, total: 2.22 s
Wall time: 2.22 s
%% Cell type:code id:f2c3e5a3-7bf2-4f92-8019-e13b8cc28c28 tags:
``` python
!ls -lh
```
%% Output
total 216M
-rw-rw-r-- 1 tharter tharter 333 Feb 20 15:45 Dockerfile
-rw-r----- 1 tharter tharter 167M Nov 1 2022 hdma-wi-2021.csv
-rw-rw-r-- 1 tharter tharter 13M Feb 24 09:04 hdma-wi-2021-gzip.parquet
-rw-rw-r-- 1 tharter tharter 16M Feb 24 09:03 hdma-wi-2021.parquet
-rw-rw-r-- 1 tharter tharter 21M Jan 5 2023 hdma-wi-2021.zip
-rw-rw-r-- 1 tharter tharter 17K Feb 24 09:03 lec1.ipynb
drwxrwxr-x 3 tharter tharter 4.0K Feb 21 09:26 old
-rw-rw-r-- 1 tharter tharter 1.8K Feb 20 15:45 requirements.txt
drwxrwxr-x 3 tharter tharter 4.0K Feb 21 11:39 shared
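The size difference in the listing comes partly from compression, which works well on columns with many repeated values such as `state_code`. A rough sketch using the standard-library `zlib` module as a stand-in (snappy is not in the standard library, so this is not the codec pyarrow used above); the repeated `"WI"` column is made up for illustration:

```python
import zlib

# A column of 100,000 identical values, stored one per line as bytes.
column = ("WI\n" * 100_000).encode()

compressed = zlib.compress(column)
restored = zlib.decompress(compressed)

print(len(column), len(compressed))  # raw size vs. compressed size in bytes
```

Real columns are less repetitive than this, so the ratio here overstates what a full dataset would achieve.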
%% Cell type:code id:92459058-9a4b-4da7-8820-65a6ebb70801 tags:
``` python
```
%% Cell type:code id:803ee3d0-11db-4b3c-901f-c09969c42a99 tags:
``` python
# !wget https://pages.cs.wisc.edu/~harter/cs544/data/hdma-wi-2021.zip
# !unzip hdma-wi-2021.zip
```
%% Cell type:code id:dc2864a3-8f39-440e-9cc1-a88e6fcf764b tags:
``` python
import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet
```
%% Cell type:code id:ddb42c11-e332-42cd-be7f-4ce2ce74ed6b tags:
``` python
%%time
t = pa.csv.read_csv("hdma-wi-2021.csv")
```
%% Output
CPU times: user 1.09 s, sys: 810 ms, total: 1.9 s
Wall time: 506 ms
%% Cell type:code id:a582a756-1c75-4869-8f27-b849320213e1 tags:
``` python
pa.parquet.write_table(t, "hdma-wi-2021.parquet")
```
%% Cell type:code id:80d2c0a6-9c6e-475d-a56b-99a4e7fbb05d tags:
``` python
# point 1: we don't need to do slow schema inference with parquet
```
%% Cell type:code id:c2f34f52-60c6-48c5-b29f-4f6131fb593e tags:
``` python
%%time
t = pa.parquet.read_table("hdma-wi-2021.parquet")
```
%% Output
CPU times: user 400 ms, sys: 118 ms, total: 517 ms
Wall time: 157 ms
%% Cell type:code id:99005a4b-4eb8-4970-a1cd-65422bb6dad6 tags:
``` python
# point 2: parquet uses a binary encoding
```
%% Cell type:code id:abc45de9-6b4d-4b4e-a2ff-127a69a9e3c0 tags:
``` python
with open("hdma-wi-2021.csv", "rb") as f:
    print(f.read(100))
```
%% Output
b'activity_year,lei,derived_msa-md,state_code,county_code,census_tract,conforming_loan_limit,derived_l'
%% Cell type:code id:e642e241-078c-45ba-8cf0-6a7587720e17 tags:
``` python
with open("hdma-wi-2021.parquet", "rb") as f:
    print(f.read(100))
```
%% Output
b'PAR1\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c\xe5\x07\x00\x00\x00\x00\x00\x00\x15\x00\x15\x1a\x15\x1e,\x15\x8e\xce6\x15\x10\x15\x06\x15\x06\x1c\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x16\x00(\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\r0\x04\x00\x00\x00\x8e\xce6'
%% Cell type:code id:912485a1-23c5-49cd-bd75-16bd860f8d3a tags:
``` python
# point 3: parquet is column oriented
```
%% Cell type:code id:4095de52-6859-440c-b614-cdeb9c28ebe6 tags:
``` python
%%time
t2 = pa.parquet.read_table("hdma-wi-2021.parquet", columns=["lei", "census_tract"])
```
%% Output
CPU times: user 26.8 ms, sys: 4.41 ms, total: 31.2 ms
Wall time: 22.3 ms
%% Cell type:code id:d277e063-d31c-495d-8fe1-92883447c451 tags:
``` python
t
```
%% Output
pyarrow.Table
activity_year: int64
lei: string
derived_msa-md: int64
state_code: string
county_code: int64
census_tract: int64
conforming_loan_limit: string
derived_loan_product_type: string
derived_dwelling_category: string
derived_ethnicity: string
derived_race: string
derived_sex: string
action_taken: int64
purchaser_type: int64
preapproval: int64
loan_type: int64
loan_purpose: int64
lien_status: int64
reverse_mortgage: int64
open-end_line_of_credit: int64
business_or_commercial_purpose: int64
loan_amount: double
loan_to_value_ratio: string
interest_rate: string
rate_spread: string
hoepa_status: int64
total_loan_costs: string
total_points_and_fees: string
origination_charges: string
discount_points: string
lender_credits: string
loan_term: string
prepayment_penalty_term: string
intro_rate_period: string
negative_amortization: int64
interest_only_payment: int64
balloon_payment: int64
other_nonamortizing_features: int64
property_value: string
construction_method: int64
occupancy_type: int64
manufactured_home_secured_property_type: int64
manufactured_home_land_property_interest: int64
total_units: string
multifamily_affordable_units: string
income: int64
debt_to_income_ratio: string
applicant_credit_score_type: int64
co-applicant_credit_score_type: int64
applicant_ethnicity-1: int64
applicant_ethnicity-2: int64
applicant_ethnicity-3: int64
applicant_ethnicity-4: int64
applicant_ethnicity-5: int64
co-applicant_ethnicity-1: int64
co-applicant_ethnicity-2: int64
co-applicant_ethnicity-3: int64
co-applicant_ethnicity-4: int64
co-applicant_ethnicity-5: null
applicant_ethnicity_observed: int64
co-applicant_ethnicity_observed: int64
applicant_race-1: int64
applicant_race-2: int64
applicant_race-3: int64
applicant_race-4: int64
applicant_race-5: int64
co-applicant_race-1: int64
co-applicant_race-2: int64
co-applicant_race-3: int64
co-applicant_race-4: int64
co-applicant_race-5: int64
applicant_race_observed: int64
co-applicant_race_observed: int64
applicant_sex: int64
co-applicant_sex: int64
applicant_sex_observed: int64
co-applicant_sex_observed: int64
applicant_age: string
co-applicant_age: string
applicant_age_above_62: string
co-applicant_age_above_62: string
submission_of_application: int64
initially_payable_to_institution: int64
aus-1: int64
aus-2: int64
aus-3: int64
aus-4: int64
aus-5: int64
denial_reason-1: int64
denial_reason-2: int64
denial_reason-3: int64
denial_reason-4: int64
tract_population: int64
tract_minority_population_percent: double
ffiec_msa_md_median_family_income: int64
tract_to_msa_income_percentage: int64
tract_owner_occupied_units: int64
tract_one_to_four_family_homes: int64
tract_median_age_of_housing_units: int64
----
activity_year: [[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021]]
lei: [["54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80",...,"254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219"],["254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219",...,"549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46"],["549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46",...,"ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18"],["ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18",...,"54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80"]]
derived_msa-md: [[99999,99999,99999,29404,11540,...,33460,20740,33460,33460,99999],[99999,33460,33460,33460,20740,...,99999,33340,33340,33340,33340],[99999,33340,39540,33340,39540,...,36780,36780,11540,33340,33340],[29100,31540,99999,22540,99999,...,31540,99999,31540,99999,31540]]
state_code: [["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"]]
county_code: [[55027,55001,55013,55059,55087,...,55109,55017,55093,55109,55033],[55095,55109,55109,55109,55017,...,55027,55079,55133,55133,55079],[55027,55133,55101,55079,55101,...,55139,55139,55087,55131,55079],[55063,55021,55011,55039,55097,...,55025,55029,55025,55051,55021]]
census_tract: [[55027961800,55001950501,55013970400,55059002000,55087013300,...,55109121000,55017011100,55093960700,55109120904,55033970400],[55095960500,55109120700,55109121000,55109120904,55017010700,...,55027961500,55079090300,55133203305,55133203406,55079000303],[55027960800,55133201600,55101000901,55079016100,55101002402,...,55139001100,55139001803,55087012100,55131450104,55079150301],[55063010201,55021970100,55011960400,55039041300,55097960600,...,55025011301,55029100800,55025012300,55051180300,55021970300]]
conforming_loan_limit: [["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"]]
derived_loan_product_type: [["Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien",...,"Conventional:First Lien","Conventional:First Lien","FSA/RHS:First Lien","Conventional:Subordinate Lien","Conventional:First Lien"],["Conventional:First Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien",...,"Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:First Lien"],["Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien",...,"Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien"],["Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:First Lien",...,"Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien"]]
derived_dwelling_category: [["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"]]
derived_ethnicity: [["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Joint",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Ethnicity Not Available","Not Hispanic or Latino",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Hispanic or Latino","Hispanic or Latino",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Ethnicity Not Available",...,"Ethnicity Not Available","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"]]
...
%% Cell type:code id:41baa49f-51dd-4c6f-9c32-33ba44f4de83 tags:
``` python
# point 4: Parquet files are compressed with snappy by default
```
%% Cell type:code id:5a043407-2a33-4c0c-8adf-bef72156d3ca tags:
``` python
!ls -lh
```
%% Output
total 204M
-rw-r----- 1 tharter tharter 167M Nov 1 2022 hdma-wi-2021.csv
-rw-rw-r-- 1 tharter tharter 16M Feb 24 11:08 hdma-wi-2021.parquet
-rw-rw-r-- 1 tharter tharter 21M Jan 5 2023 hdma-wi-2021.zip
-rw-rw-r-- 1 tharter tharter 18K Feb 24 10:11 lec1.ipynb
-rw-rw-r-- 1 tharter tharter 16K Feb 24 11:14 lec2.ipynb
%% Cell type:code id:2ff0caa9-6f57-45b0-8ff4-910eb0ad2359 tags:
``` python
%%time
pa.parquet.write_table(t, "hdma-wi-2021.parquet", compression="snappy")
```
%% Output
CPU times: user 716 ms, sys: 24.2 ms, total: 740 ms
Wall time: 754 ms
%% Cell type:code id:ea976df1-d258-4446-a277-9a9d750bfc49 tags:
``` python
%%time
pa.parquet.write_table(t, "hdma-wi-2021-gzip.parquet", compression="gzip")
```
%% Output
CPU times: user 2.15 s, sys: 15.7 ms, total: 2.17 s
Wall time: 2.17 s
%% Cell type:code id:c587e2e5-2167-4663-ab5c-5df51f3f9937 tags:
``` python
!ls -lh
```
%% Output
total 216M
-rw-r----- 1 tharter tharter 167M Nov 1 2022 hdma-wi-2021.csv
-rw-rw-r-- 1 tharter tharter 13M Feb 24 11:15 hdma-wi-2021-gzip.parquet
-rw-rw-r-- 1 tharter tharter 16M Feb 24 11:15 hdma-wi-2021.parquet
-rw-rw-r-- 1 tharter tharter 21M Jan 5 2023 hdma-wi-2021.zip
-rw-rw-r-- 1 tharter tharter 18K Feb 24 10:11 lec1.ipynb
-rw-rw-r-- 1 tharter tharter 16K Feb 24 11:14 lec2.ipynb
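The trade-off between the two codecs can be worked out from the numbers already shown above. The sketch below uses the sizes from the `ls -lh` listing and the wall times from the two `%%time` cells; since `ls -lh` rounds sizes, the results are approximate. The variable names are chosen here for illustration.

```python
# Sizes in MB, taken from the ls -lh output above (rounded by ls).
csv_mb = 167
snappy_mb = 16
gzip_mb = 13

# Wall times in seconds, taken from the two %%time cells above.
snappy_wall_s = 0.754
gzip_wall_s = 2.17

snappy_ratio = csv_mb / snappy_mb        # CSV size relative to snappy Parquet
gzip_ratio = csv_mb / gzip_mb            # CSV size relative to gzip Parquet
size_saving = 1 - gzip_mb / snappy_mb    # fraction saved by gzip over snappy
slowdown = gzip_wall_s / snappy_wall_s   # how much longer the gzip write took

print(f"snappy: {snappy_ratio:.1f}x smaller than CSV")
print(f"gzip:   {gzip_ratio:.1f}x smaller than CSV")
print(f"gzip saves {size_saving:.0%} over snappy but writes {slowdown:.1f}x slower")
```

In this single run, gzip trades a noticeably slower write for a modestly smaller file; other datasets and machines may show a different balance.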
%% Cell type:code id:e525a734-ac33-40b4-a8c1-dc4c30b06b02 tags:
``` python
```
services:
hdfs:
image: p4-hdfs
hostname: main
ports:
- "127.0.0.1:9870:9870"
deploy:
resources:
limits:
memory: 2g
command: sleep infinity
nb:
image: p4-nb
ports:
- "127.0.0.1:5000:5000"
volumes:
- "./nb:/nb"
deploy:
resources:
limits:
memory: 2g
FROM ubuntu:24.04
# Chain with && so a failed step stops the build instead of being ignored
RUN apt-get update && apt-get install -y wget curl openjdk-11-jdk python3-pip iproute2 nano
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6