Commit 45f13fa5 authored by TYLER CARAZA-HARTER

lec 14 example

parent 6f905184
%% Cell type:code id:2dbe2890-3b90-4978-9648-625dc8d7a949 tags:
``` python
# !wget https://pages.cs.wisc.edu/~harter/cs544/data/hdma-wi-2021.zip
# !unzip hdma-wi-2021.zip
```
%% Cell type:code id:3a4cfa35-6b4e-48bc-acc6-09d426cd6c3e tags:
``` python
import pyarrow as pa
import pyarrow.csv
import pyarrow.parquet
```
%% Cell type:code id:e5acbdd4-a266-477a-b720-a68e16e6c8f6 tags:
``` python
%%time
t = pa.csv.read_csv("hdma-wi-2021.csv")
```
%% Output
CPU times: user 1.18 s, sys: 996 ms, total: 2.18 s
Wall time: 575 ms
%% Cell type:code id:dd5941ac-fc66-4037-81f8-4f32b5b9e4bc tags:
``` python
pa.parquet.write_table(t, "hdma-wi-2021.parquet")
```
%% Cell type:code id:ed698328-e057-41a5-89d9-351166331a6e tags:
``` python
# point 1: Parquet stores the schema in the file, so reading skips the slow schema inference that CSV needs
```
%% Cell type:code id:3c3b8e74-d1d6-412e-9dc9-f6f1cc252ce7 tags:
``` python
%%time
t = pa.parquet.read_table("hdma-wi-2021.parquet")
```
%% Output
CPU times: user 396 ms, sys: 102 ms, total: 498 ms
Wall time: 147 ms
%% Cell type:code id:d582cd7a-27ed-4c09-a22e-8663bd24e231 tags:
``` python
# point 2: Parquet is a binary format (values are byte encoded, not stored as text)
```
%% Cell type:code id:9e2d33ce-3a07-405a-b379-ef86e6e0e076 tags:
``` python
with open("hdma-wi-2021.csv", "rb") as f:
    print(f.read(100))
```
%% Output
b'activity_year,lei,derived_msa-md,state_code,county_code,census_tract,conforming_loan_limit,derived_l'
%% Cell type:code id:76debfbf-a7ca-48f6-8ca0-c9a00fbd3dd9 tags:
``` python
with open("hdma-wi-2021.parquet", "rb") as f:
    print(f.read(100))
```
%% Output
b'PAR1\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c\xe5\x07\x00\x00\x00\x00\x00\x00\x15\x00\x15\x1a\x15\x1e,\x15\x8e\xce6\x15\x10\x15\x06\x15\x06\x1c\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x16\x00(\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x18\x08\xe5\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\r0\x04\x00\x00\x00\x8e\xce6'
%% Cell type:code id:d891d58c-8d8b-4dc9-bb2f-d12a03db465e tags:
``` python
# point 3: Parquet files are column oriented, so we can read just the columns we need
```
%% Cell type:code id:3159c561-122d-46af-96fb-0093368f6086 tags:
``` python
%%time
t2 = pa.parquet.read_table("hdma-wi-2021.parquet", columns=["lei", "census_tract"])
```
%% Output
CPU times: user 26.3 ms, sys: 2.61 ms, total: 29 ms
Wall time: 20.8 ms
%% Cell type:code id:51ad1ba1-9ea7-4188-b96d-aaa1847f3e95 tags:
``` python
t
```
%% Output
pyarrow.Table
activity_year: int64
lei: string
derived_msa-md: int64
state_code: string
county_code: int64
census_tract: int64
conforming_loan_limit: string
derived_loan_product_type: string
derived_dwelling_category: string
derived_ethnicity: string
derived_race: string
derived_sex: string
action_taken: int64
purchaser_type: int64
preapproval: int64
loan_type: int64
loan_purpose: int64
lien_status: int64
reverse_mortgage: int64
open-end_line_of_credit: int64
business_or_commercial_purpose: int64
loan_amount: double
loan_to_value_ratio: string
interest_rate: string
rate_spread: string
hoepa_status: int64
total_loan_costs: string
total_points_and_fees: string
origination_charges: string
discount_points: string
lender_credits: string
loan_term: string
prepayment_penalty_term: string
intro_rate_period: string
negative_amortization: int64
interest_only_payment: int64
balloon_payment: int64
other_nonamortizing_features: int64
property_value: string
construction_method: int64
occupancy_type: int64
manufactured_home_secured_property_type: int64
manufactured_home_land_property_interest: int64
total_units: string
multifamily_affordable_units: string
income: int64
debt_to_income_ratio: string
applicant_credit_score_type: int64
co-applicant_credit_score_type: int64
applicant_ethnicity-1: int64
applicant_ethnicity-2: int64
applicant_ethnicity-3: int64
applicant_ethnicity-4: int64
applicant_ethnicity-5: int64
co-applicant_ethnicity-1: int64
co-applicant_ethnicity-2: int64
co-applicant_ethnicity-3: int64
co-applicant_ethnicity-4: int64
co-applicant_ethnicity-5: null
applicant_ethnicity_observed: int64
co-applicant_ethnicity_observed: int64
applicant_race-1: int64
applicant_race-2: int64
applicant_race-3: int64
applicant_race-4: int64
applicant_race-5: int64
co-applicant_race-1: int64
co-applicant_race-2: int64
co-applicant_race-3: int64
co-applicant_race-4: int64
co-applicant_race-5: int64
applicant_race_observed: int64
co-applicant_race_observed: int64
applicant_sex: int64
co-applicant_sex: int64
applicant_sex_observed: int64
co-applicant_sex_observed: int64
applicant_age: string
co-applicant_age: string
applicant_age_above_62: string
co-applicant_age_above_62: string
submission_of_application: int64
initially_payable_to_institution: int64
aus-1: int64
aus-2: int64
aus-3: int64
aus-4: int64
aus-5: int64
denial_reason-1: int64
denial_reason-2: int64
denial_reason-3: int64
denial_reason-4: int64
tract_population: int64
tract_minority_population_percent: double
ffiec_msa_md_median_family_income: int64
tract_to_msa_income_percentage: int64
tract_owner_occupied_units: int64
tract_one_to_four_family_homes: int64
tract_median_age_of_housing_units: int64
----
activity_year: [[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021],[2021,2021,2021,2021,2021,...,2021,2021,2021,2021,2021]]
lei: [["54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80",...,"254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219"],["254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219","254900X6OAHFW6BUT219",...,"549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46"],["549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46","549300KY533JFETOYG46",...,"ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18"],["ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18","ZF85QS7OXKPBG52R7N18",...,"54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80","54930034MNPILHP25H80"]]
derived_msa-md: [[99999,99999,99999,29404,11540,...,33460,20740,33460,33460,99999],[99999,33460,33460,33460,20740,...,99999,33340,33340,33340,33340],[99999,33340,39540,33340,39540,...,36780,36780,11540,33340,33340],[29100,31540,99999,22540,99999,...,31540,99999,31540,99999,31540]]
state_code: [["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"],["WI","WI","WI","WI","WI",...,"WI","WI","WI","WI","WI"]]
county_code: [[55027,55001,55013,55059,55087,...,55109,55017,55093,55109,55033],[55095,55109,55109,55109,55017,...,55027,55079,55133,55133,55079],[55027,55133,55101,55079,55101,...,55139,55139,55087,55131,55079],[55063,55021,55011,55039,55097,...,55025,55029,55025,55051,55021]]
census_tract: [[55027961800,55001950501,55013970400,55059002000,55087013300,...,55109121000,55017011100,55093960700,55109120904,55033970400],[55095960500,55109120700,55109121000,55109120904,55017010700,...,55027961500,55079090300,55133203305,55133203406,55079000303],[55027960800,55133201600,55101000901,55079016100,55101002402,...,55139001100,55139001803,55087012100,55131450104,55079150301],[55063010201,55021970100,55011960400,55039041300,55097960600,...,55025011301,55029100800,55025012300,55051180300,55021970300]]
conforming_loan_limit: [["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"],["C","C","C","C","C",...,"C","C","C","C","C"]]
derived_loan_product_type: [["Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien",...,"Conventional:First Lien","Conventional:First Lien","FSA/RHS:First Lien","Conventional:Subordinate Lien","Conventional:First Lien"],["Conventional:First Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien",...,"Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:First Lien"],["Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien",...,"Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien"],["Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:Subordinate Lien","Conventional:First Lien",...,"Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien","Conventional:First Lien"]]
derived_dwelling_category: [["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"],["Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built",...,"Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built","Single Family (1-4 Units):Site-Built"]]
derived_ethnicity: [["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Joint",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Ethnicity Not Available","Not Hispanic or Latino",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Hispanic or Latino","Hispanic or Latino",...,"Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"],["Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Ethnicity Not Available",...,"Ethnicity Not Available","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino","Not Hispanic or Latino"]]
...
%% Cell type:code id:28a0a2d6-95a1-4c86-9577-ef62721c19d6 tags:
``` python
# point 4: Parquet is compressed, with snappy by default
```
%% Cell type:code id:9ee9812e-76d9-4bf3-9194-4556f83bf0ff tags:
``` python
%%time
pa.parquet.write_table(t, "hdma-wi-2021.parquet", compression="snappy")
```
%% Output
CPU times: user 698 ms, sys: 18.6 ms, total: 717 ms
Wall time: 730 ms
%% Cell type:code id:487c9b6f-559c-4777-b498-b4f41cd8667a tags:
``` python
%%time
pa.parquet.write_table(t, "hdma-wi-2021-gzip.parquet", compression="gzip")
```
%% Output
CPU times: user 2.21 s, sys: 13.9 ms, total: 2.22 s
Wall time: 2.22 s
%% Cell type:code id:f2c3e5a3-7bf2-4f92-8019-e13b8cc28c28 tags:
``` python
!ls -lh
```
%% Output
total 216M
-rw-rw-r-- 1 tharter tharter 333 Feb 20 15:45 Dockerfile
-rw-r----- 1 tharter tharter 167M Nov 1 2022 hdma-wi-2021.csv
-rw-rw-r-- 1 tharter tharter 13M Feb 24 09:04 hdma-wi-2021-gzip.parquet
-rw-rw-r-- 1 tharter tharter 16M Feb 24 09:03 hdma-wi-2021.parquet
-rw-rw-r-- 1 tharter tharter 21M Jan 5 2023 hdma-wi-2021.zip
-rw-rw-r-- 1 tharter tharter 17K Feb 24 09:03 lec1.ipynb
drwxrwxr-x 3 tharter tharter 4.0K Feb 21 09:26 old
-rw-rw-r-- 1 tharter tharter 1.8K Feb 20 15:45 requirements.txt
drwxrwxr-x 3 tharter tharter 4.0K Feb 21 11:39 shared
%% Cell type:code id:92459058-9a4b-4da7-8820-65a6ebb70801 tags:
``` python
```