%% Cell type:markdown id:339effc5-0376-44df-a8bd-bd8b78d88423 tags:
# Command Line Tools
%% Cell type:code id:90b52a4b-56ed-4022-a78e-7d4757b1a3a7 tags:
``` python
! hdfs dfs -mkdir hdfs://main:9000/data
```
%% Cell type:code id:e5a6bdbb-6030-4e15-86b2-60f577e77c35 tags:
``` python
# ! cat /hadoop-3.3.6/LICENSE.txt
```
%% Cell type:code id:f4368401-f24e-4a1d-a6bc-3c4abd21b5ef tags:
``` python
! hdfs dfs -cp /hadoop-3.3.6/LICENSE.txt hdfs://main:9000/data/
```
%% Cell type:code id:b51d1fb5-055c-433a-afc7-764b8e058ef6 tags:
``` python
! hdfs dfs -ls hdfs://main:9000/data
```
%% Output
Found 1 items
-rw-r--r-- 3 root supergroup 15217 2025-03-05 15:01 hdfs://main:9000/data/LICENSE.txt
%% Cell type:code id:3f785eeb-427d-4bed-9090-44e6c4f42bac tags:
``` python
! hdfs dfs -cat hdfs://main:9000/data/LICENSE.txt
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
This product bundles various third-party components under other open source
licenses. This section summarizes those components and their licenses.
See licenses/ for text of these licenses.
Apache Software Foundation License 2.0
--------------------------------------
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/AbstractFuture.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/TimeoutFuture.java
BSD 2-Clause
------------
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/lz4/lz4.{c|h}
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/fuse-dfs/util/tree.h
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/compat/{fstatat|openat|unlinkat}.h
BSD 3-Clause
------------
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/bloom/*
hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc
hadoop-common-project/hadoop-common/src/main/native/gtest/include/gtest/gtest.h
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/protobuf/protobuf/cpp_helpers.h
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/gmock-1.7.0/*/*.{cc|h}
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/d3.v3.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/d3-v4.1.1.min.js
MIT License
-----------
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/bootstrap-3.4.1
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dataTables.bootstrap.css
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dataTables.bootstrap.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dust-full-2.0.0.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dust-helpers-1.1.1.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/jquery-3.6.0.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/jquery.dataTables.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/moment.min.js
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/bootstrap.min.js
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/jquery.js
hadoop-tools/hadoop-sls/src/main/html/css/bootstrap.min.css
hadoop-tools/hadoop-sls/src/main/html/css/bootstrap-responsive.min.css
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-webapp/node_modules/.bin/r.js
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-1.10.18/*
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/jquery
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/jt/jquery.jstree.js
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/resources/TERMINAL
uriparser2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/uriparser2)
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/utils/cJSON.[ch]
Boost Software License, Version 1.0
-------------
asio-1.10.2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/asio-1.10.2)
rapidxml-1.13 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/rapidxml-1.13)
tr2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/tr2)
Public Domain
-------------
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/json-bignum.js
%% Cell type:code id:93cd8c10-b2e4-4b0d-8d2f-cbcc400d5fba tags:
``` python
! hdfs dfs -du -h hdfs://main:9000/data/LICENSE.txt
```
%% Output
14.9 K 44.6 K hdfs://main:9000/data/LICENSE.txt
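%% Cell type:markdown tags:
In the `-du -h` output above, the first column is the logical file size and the second is the total disk consumed once the replication factor is accounted for (here the target is 3 replicas). The arithmetic can be sketched as follows; the `human_size` helper is ours, not an HDFS API, and mimics HDFS's base-1024 "K" formatting.
%% Cell type:code tags:
``` python
# Sketch of the arithmetic behind the -du columns (helper name is ours, not an HDFS API)
def human_size(n_bytes):
    # mimic HDFS's base-1024 "K" formatting with one decimal place
    return f"{n_bytes / 1024:.1f} K"

file_size = 15217              # logical size of LICENSE.txt in bytes
replication = 3                # target replication factor
print(human_size(file_size))                # first column: logical size
print(human_size(file_size * replication))  # second column: size across all replicas
```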
%% Cell type:code id:981a565c-e79f-459f-aca3-5ef37a1a2f7c tags:
``` python
! hdfs fsck hdfs://main:9000/data/LICENSE.txt
```
%% Output
Connecting to namenode via http://main:9870/fsck?ugi=root&path=%2Fdata%2FLICENSE.txt
FSCK started by root (auth:SIMPLE) from /172.18.0.3 for path /data/LICENSE.txt at Wed Mar 05 15:05:03 GMT 2025
/data/LICENSE.txt: Under replicated BP-570661815-172.18.0.2-1741186484563:blk_1073741825_1001. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
Status: HEALTHY
Number of data-nodes: 1
Number of racks: 1
Total dirs: 0
Total symlinks: 0
Replicated Blocks:
Total size: 15217 B
Total files: 1
Total blocks (validated): 1 (avg. block size 15217 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 1.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 2 (66.666664 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Mar 05 15:05:03 GMT 2025 in 11 milliseconds
The filesystem under path '/data/LICENSE.txt' is HEALTHY
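%% Cell type:markdown tags:
fsck flags the block as under-replicated because the target is 3 replicas but this cluster has only one DataNode. The "Missing replicas: 2 (66.666664 %)" line is just the ratio of missing to target replicas, printed with single-precision rounding; the same ratio can be reproduced directly (plain arithmetic, not an HDFS API).
%% Cell type:code tags:
``` python
# Reproduce fsck's "Missing replicas" math (plain arithmetic, not an HDFS API)
target_replicas = 3   # dfs.replication default reported by fsck
live_replicas = 1     # the cluster has a single DataNode
missing = target_replicas - live_replicas
pct = 100 * missing / target_replicas
print(f"Missing replicas: {missing} ({pct:.4f} %)")
```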
%% Cell type:code id:7ed30c45-7dbb-41bb-b1ee-cfb2a3609ae1 tags:
``` python
! hdfs dfs -D dfs.replication=1 -cp /hadoop-3.3.6/LICENSE.txt hdfs://main:9000/data/v2.txt
```
%% Cell type:code id:c3b13531-bff1-4d65-9e48-6a4e137ad319 tags:
``` python
! hdfs fsck hdfs://main:9000/data/v2.txt
```
%% Output
Connecting to namenode via http://main:9870/fsck?ugi=root&path=%2Fdata%2Fv2.txt
FSCK started by root (auth:SIMPLE) from /172.18.0.3 for path /data/v2.txt at Wed Mar 05 15:06:46 GMT 2025
Status: HEALTHY
Number of data-nodes: 1
Number of racks: 1
Total dirs: 0
Total symlinks: 0
Replicated Blocks:
Total size: 15217 B
Total files: 1
Total blocks (validated): 1 (avg. block size 15217 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 1.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Mar 05 15:06:46 GMT 2025 in 1 milliseconds
The filesystem under path '/data/v2.txt' is HEALTHY
%% Cell type:markdown id:597697fe-ec8d-4a0c-aa1a-5bf53b5d9378 tags:
# WebHDFS
%% Cell type:code id:8ce8d375-2b02-433d-8136-9d17bd7b82ca tags:
``` python
! curl -i "http://main:9870/webhdfs/v1/data?op=LISTSTATUS"
```
%% Output
HTTP/1.1 200 OK
Date: Wed, 05 Mar 2025 15:09:48 GMT
Cache-Control: no-cache
Expires: Wed, 05 Mar 2025 15:09:48 GMT
Date: Wed, 05 Mar 2025 15:09:48 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Type: application/json
Transfer-Encoding: chunked
{"FileStatuses":{"FileStatus":[
{"accessTime":1741186880583,"blockSize":134217728,"childrenNum":0,"fileId":16387,"group":"supergroup","length":15217,"modificationTime":1741186881164,"owner":"root","pathSuffix":"LICENSE.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1741187194367,"blockSize":134217728,"childrenNum":0,"fileId":16388,"group":"supergroup","length":15217,"modificationTime":1741187194453,"owner":"root","pathSuffix":"v2.txt","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}
]}}
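%% Cell type:markdown tags:
The `accessTime` and `modificationTime` fields in each `FileStatus` are milliseconds since the Unix epoch. As a quick sanity check (the value below is copied from the output above), converting `modificationTime` for LICENSE.txt should land on the same 2025-03-05 15:01 timestamp that `hdfs dfs -ls` printed earlier.
%% Cell type:code tags:
``` python
from datetime import datetime, timezone

# FileStatus timestamps are epoch milliseconds (value copied from the LISTSTATUS output above)
mod_time_ms = 1741186881164
dt = datetime.fromtimestamp(mod_time_ms / 1000, tz=timezone.utc)
print(dt.isoformat())
```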
%% Cell type:code id:c2e6ec63-89a3-4193-82a2-f255265f951d tags:
``` python
# ! curl -i "http://main:9870/webhdfs/v1/data?op=LISTSTATUS"
```
%% Cell type:code id:38854f06-83cc-4beb-85aa-8fce77d56207 tags:
``` python
import requests
```
%% Cell type:code id:7980b944-2f78-49af-b561-1a7e4c865f27 tags:
``` python
r = requests.get("http://main:9870/webhdfs/v1/data?op=LISTSTATUS")
r.raise_for_status()
```
%% Cell type:code id:dde2f2b2-dd02-4682-a074-ca83f544926e tags:
``` python
r.content # binary data
```
%% Output
b'{"FileStatuses":{"FileStatus":[\n{"accessTime":1741186880583,"blockSize":134217728,"childrenNum":0,"fileId":16387,"group":"supergroup","length":15217,"modificationTime":1741186881164,"owner":"root","pathSuffix":"LICENSE.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},\n{"accessTime":1741187194367,"blockSize":134217728,"childrenNum":0,"fileId":16388,"group":"supergroup","length":15217,"modificationTime":1741187194453,"owner":"root","pathSuffix":"v2.txt","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}\n]}}\n'
%% Cell type:code id:3ac379e7-773e-4231-9efd-55e73bbffeb4 tags:
``` python
r.text # binary data converted to text
```
%% Output
'{"FileStatuses":{"FileStatus":[\n{"accessTime":1741186880583,"blockSize":134217728,"childrenNum":0,"fileId":16387,"group":"supergroup","length":15217,"modificationTime":1741186881164,"owner":"root","pathSuffix":"LICENSE.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},\n{"accessTime":1741187194367,"blockSize":134217728,"childrenNum":0,"fileId":16388,"group":"supergroup","length":15217,"modificationTime":1741187194453,"owner":"root","pathSuffix":"v2.txt","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}\n]}}\n'
%% Cell type:code id:6270b16c-0082-4140-9c5a-ffef6ad227f7 tags:
``` python
for file_entry in r.json()['FileStatuses']['FileStatus']:
    print(file_entry["pathSuffix"])
```
%% Output
LICENSE.txt
v2.txt
%% Cell type:code id:0cf35fa9-9eb8-4ae8-8e38-ef75cb968f30 tags:
``` python
# TODO: read v2.txt
```
%% Cell type:code id:5308cf9a-07fd-442f-b777-6755d735ed1d tags:
``` python
# curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
# [&offset=<LONG>][&length=<LONG>][&buffersize=<INT>][&noredirect=<true|false>]"
```
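%% Cell type:markdown tags:
The OPEN template above can be filled in programmatically instead of by string concatenation. A minimal sketch, assuming this notebook's host (`main`), NameNode HTTP port (9870), and path; the `open_url` helper is ours, not part of any WebHDFS client library.
%% Cell type:code tags:
``` python
from urllib.parse import urlencode

# Assemble an OPEN URL matching the template above (helper name is ours)
def open_url(host, port, path, **params):
    query = urlencode({"op": "OPEN", **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = open_url("main", 9870, "/data/v2.txt", offset=0, length=200)
print(url)
```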
%% Cell type:code id:0021e12c-5b0b-4969-841e-58702f175c73 tags:
``` python
# without -L, curl does not follow the NameNode's 307 redirect, so we only see the Location header
! curl -i "http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN&offset=0&length=200"
```
%% Output
HTTP/1.1 307 Temporary Redirect
Date: Wed, 05 Mar 2025 15:17:03 GMT
Cache-Control: no-cache
Expires: Wed, 05 Mar 2025 15:17:03 GMT
Date: Wed, 05 Mar 2025 15:17:03 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Location: http://main:9864/webhdfs/v1/data/v2.txt?op=OPEN&namenoderpcaddress=main:9000&length=200&offset=0
Content-Type: application/octet-stream
Content-Length: 0
%% Cell type:code id:5351837e-4aaa-4102-b0a5-9c8cf6ecebcb tags:
``` python
! curl -i -L "http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN&offset=0&length=200"
```
%% Output
HTTP/1.1 307 Temporary Redirect
Date: Wed, 05 Mar 2025 15:16:00 GMT
Cache-Control: no-cache
Expires: Wed, 05 Mar 2025 15:16:00 GMT
Date: Wed, 05 Mar 2025 15:16:00 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Location: http://main:9864/webhdfs/v1/data/v2.txt?op=OPEN&namenoderpcaddress=main:9000&length=200&offset=0
Content-Type: application/octet-stream
Content-Length: 0
HTTP/1.1 200 OK
Access-Control-Allow-Methods: GET
Access-Control-Allow-Origin: *
Content-Type: application/octet-stream
Connection: close
Content-Length: 200
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUC
%% Cell type:code id:f92b6a36-3471-4fc9-91b9-5e89cec9dd16 tags:
``` python
! curl -L "http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN&offset=0&length=200"
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUC
%% Cell type:code id:7e5adb1a-fb12-433f-8245-5ab0957e9022 tags:
``` python
! curl -L "http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN&offset=5&length=200"
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION,
%% Cell type:code id:5930825b-3701-49da-aa2c-ac920890609d tags:
``` python
! curl "http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN&offset=5&length=200&noredirect=true"
```
%% Output
{"Location":"http://main:9864/webhdfs/v1/data/v2.txt?op=OPEN&namenoderpcaddress=main:9000&length=200&offset=5"}
%% Cell type:code id:7a191a51-95a2-487a-9881-6ff545bff6ed tags:
``` python
r = requests.get("http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN&offset=5&length=200&noredirect=true")
r.raise_for_status()
r.json()["Location"]
```
%% Output
'http://main:9864/webhdfs/v1/data/v2.txt?op=OPEN&namenoderpcaddress=main:9000&length=200&offset=5'
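%% Cell type:markdown tags:
With `noredirect=true`, reading a file is a two-step protocol: ask the NameNode for a `Location`, then GET that DataNode URL for the actual bytes. A minimal sketch of that flow; `session` is anything with a requests-style `.get()`, and the stub classes below stand in for a real cluster so the flow can be exercised offline.
%% Cell type:code tags:
``` python
# Two-step noredirect read; `session` is anything with a requests-style .get()
def read_noredirect(session, namenode_url):
    r1 = session.get(namenode_url + "&noredirect=true")  # step 1: NameNode returns JSON with a Location
    r2 = session.get(r1.json()["Location"])              # step 2: fetch the bytes from the DataNode
    return r2.content

# tiny stubs standing in for requests + the cluster (hypothetical URLs/payloads)
class StubResponse:
    def __init__(self, payload): self.payload = payload
    def json(self): return self.payload
    @property
    def content(self): return self.payload

class StubSession:
    def get(self, url):
        if "noredirect=true" in url:
            return StubResponse({"Location": "http://datanode:9864/fake"})
        return StubResponse(b"Apache License")

print(read_noredirect(StubSession(), "http://main:9870/webhdfs/v1/data/v2.txt?op=OPEN"))
```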
%% Cell type:code id:9b6636b9-d39d-4267-bfec-c0ecae80cab9 tags:
``` python
# where are the blocks?
```
%% Cell type:code id:0af3ff8f-d040-4f37-8b96-ddf46e1d9d4c tags:
``` python
!curl -i "http://main:9870/webhdfs/v1/data/v2.txt?op=GETFILEBLOCKLOCATIONS"
```
%% Output
HTTP/1.1 200 OK
Date: Wed, 05 Mar 2025 15:20:26 GMT
Cache-Control: no-cache
Expires: Wed, 05 Mar 2025 15:20:26 GMT
Date: Wed, 05 Mar 2025 15:20:26 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Type: application/json
Transfer-Encoding: chunked
{"BlockLocations":{"BlockLocation":[{"topologyPaths":["/default-rack/172.18.0.2:9866"],"corrupt":false,"cachedHosts":[],"names":["172.18.0.2:9866"],"offset":0,"hosts":["main"],"length":15217,"storageTypes":["DISK"]}]}}
%% Cell type:code id:80e34752-aa80-4738-8212-4bda5a0d3abe tags:
``` python
r = requests.get("http://main:9870/webhdfs/v1/data/v2.txt?op=GETFILEBLOCKLOCATIONS")
r.raise_for_status()
```
%% Cell type:code id:4921b17f-e877-48c6-8cd0-4d0ef8bbd664 tags:
``` python
for logical_block in r.json()["BlockLocations"]["BlockLocation"]:
    print("DataNodes for the block:", logical_block["hosts"])
```
%% Output
DataNodes for the block: ['main']
%% Cell type:markdown id:a74d220a-60f9-4da5-bc25-7bbb4a630939 tags:
# PyArrow
%% Cell type:code id:96a157bf-d622-4877-ba54-80002ea5800f tags:
``` python
import pyarrow as pa
import pyarrow.fs
```
%% Cell type:code id:3da59b5b-967c-4f6a-88d2-d878f7e62637 tags:
``` python
# other options: replication=????, default_block_size=????
hdfs = pa.fs.HadoopFileSystem(host="main", port=9000)
```
%% Output
2025-03-05 15:23:44,357 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%% Cell type:code id:cea9e827-47ec-43a6-92bf-031b6dd78df0 tags:
``` python
from io import BufferedReader, TextIOWrapper
```
%% Cell type:code id:50b45eb3-d03b-4f5f-8d8a-1f42d1dcc09d tags:
``` python
# for reading: open_input_file; for writing: open_output_stream
linenum = 0
with hdfs.open_input_file("/data/v2.txt") as f:
    #print(f.read_at(200, 0))
    reader = TextIOWrapper(BufferedReader(f))
    for line in reader:
        print(line, end="")
        linenum += 1
        if linenum > 10:
            break
# on P4: pq.write_table(?????, f) ????? = pq.read_table(f)
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
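%% Cell type:markdown tags:
The `BufferedReader`/`TextIOWrapper` layering above works over any binary file object, not just PyArrow's HDFS handle. The same pattern can be exercised without a cluster by substituting an in-memory `BytesIO` (the sample bytes below are ours, standing in for the HDFS file).
%% Cell type:code tags:
``` python
from io import BytesIO, BufferedReader, TextIOWrapper

# Same wrapping as the HDFS cell above, but over in-memory bytes standing in for the file
raw = BytesIO(b"Apache License\nVersion 2.0, January 2004\n")
reader = TextIOWrapper(BufferedReader(raw))   # bytes -> buffered bytes -> decoded text lines
lines = [line.rstrip("\n") for line in reader]
print(lines)
```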
%% Cell type:code id:cc8ca953-7eae-4119-8f78-e8fd5cfad764 tags:
``` python
```
%% Cell type:code id:1ed33b62-f09e-44e5-84c2-9b7be07998be tags:
``` python
! ls
```
%% Output
lec1.ipynb lec2.ipynb
%% Cell type:markdown id:3c8a9067-a138-43b9-ae21-e6bc6eb5a133 tags:
# Command Line Examples
%% Cell type:code id:11602a88-79ab-4d51-8def-ab1b3bca6a31 tags:
``` python
!hdfs dfs -mkdir hdfs://main:9000/data
```
%% Cell type:code id:59dca956-2612-49b6-a94c-92dc1dbe4df7 tags:
``` python
!hdfs dfs -ls hdfs://main:9000/
```
%% Output
Found 1 items
drwxr-xr-x - root supergroup 0 2025-03-05 17:12 hdfs://main:9000/data
%% Cell type:code id:9946811e-f9bb-4c50-a39a-2ebddb543ffa tags:
``` python
!hdfs dfs -cp /hadoop-3.3.6/LICENSE.txt hdfs://main:9000/data/
```
%% Cell type:code id:9faef755-777b-47f2-91e3-83366d97d552 tags:
``` python
!hdfs dfs -ls hdfs://main:9000/data
```
%% Output
Found 1 items
-rw-r--r-- 3 root supergroup 15217 2025-03-05 17:13 hdfs://main:9000/data/LICENSE.txt
%% Cell type:code id:64807b1e-f323-4271-bf19-7a3aed50cc86 tags:
``` python
!hdfs dfs -cat hdfs://main:9000/data/LICENSE.txt
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
This product bundles various third-party components under other open source
licenses. This section summarizes those components and their licenses.
See licenses/ for text of these licenses.
Apache Software Foundation License 2.0
--------------------------------------
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/AbstractFuture.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/TimeoutFuture.java
BSD 2-Clause
------------
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/lz4/lz4.{c|h}
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/fuse-dfs/util/tree.h
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/compat/{fstatat|openat|unlinkat}.h
BSD 3-Clause
------------
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/bloom/*
hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc
hadoop-common-project/hadoop-common/src/main/native/gtest/include/gtest/gtest.h
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/protobuf/protobuf/cpp_helpers.h
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/gmock-1.7.0/*/*.{cc|h}
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/d3.v3.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/d3-v4.1.1.min.js
MIT License
-----------
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/bootstrap-3.4.1
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dataTables.bootstrap.css
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dataTables.bootstrap.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dust-full-2.0.0.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dust-helpers-1.1.1.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/jquery-3.6.0.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/jquery.dataTables.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/moment.min.js
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/bootstrap.min.js
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/jquery.js
hadoop-tools/hadoop-sls/src/main/html/css/bootstrap.min.css
hadoop-tools/hadoop-sls/src/main/html/css/bootstrap-responsive.min.css
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-webapp/node_modules/.bin/r.js
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-1.10.18/*
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/jquery
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/jt/jquery.jstree.js
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/resources/TERMINAL
uriparser2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/uriparser2)
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/utils/cJSON.[ch]
Boost Software License, Version 1.0
-------------
asio-1.10.2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/asio-1.10.2)
rapidxml-1.13 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/rapidxml-1.13)
tr2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/tr2)
Public Domain
-------------
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/json-bignum.js
%% Cell type:code id:7649de66-99fc-468f-b3bb-d53ac6104651 tags:
``` python
!hdfs dfs -du -h hdfs://main:9000/data/LICENSE.txt
```
%% Output
14.9 K 44.6 K hdfs://main:9000/data/LICENSE.txt
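%% Cell type:markdown id:b3f1c2d4-1a2b-4c3d-9e0f-a1b2c3d4e501 tags:
The two columns reported by `-du -h` are the logical file size and the space the file will consume once fully replicated; with the target replication factor of 3, the second is simply three times the first (values copied from the output above):
%% Cell type:code id:b3f1c2d4-1a2b-4c3d-9e0f-a1b2c3d4e502 tags:
``` python
logical = 15217                   # bytes, logical file size -> 14.9 K
replication = 3                   # target replication factor
consumed = logical * replication  # bytes consumed across all replicas
print(round(logical / 1024, 1))   # 14.9
print(round(consumed / 1024, 1))  # 44.6
```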
%% Cell type:code id:077c4587-942a-40d8-80be-9c663ee10c3f tags:
``` python
!hdfs fsck hdfs://main:9000/data/LICENSE.txt
```
%% Output
Connecting to namenode via http://main:9870/fsck?ugi=root&path=%2Fdata%2FLICENSE.txt
FSCK started by root (auth:SIMPLE) from /172.18.0.3 for path /data/LICENSE.txt at Wed Mar 05 17:16:04 GMT 2025
/data/LICENSE.txt: Under replicated BP-478178705-172.18.0.2-1741194447522:blk_1073741825_1001. Target Replicas is 3 but found 1 live replica(s), 0 decommissioned replica(s), 0 decommissioning replica(s).
Status: HEALTHY
Number of data-nodes: 1
Number of racks: 1
Total dirs: 0
Total symlinks: 0
Replicated Blocks:
Total size: 15217 B
Total files: 1
Total blocks (validated): 1 (avg. block size 15217 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 1 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 1.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 2 (66.666664 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Mar 05 17:16:04 GMT 2025 in 13 milliseconds
The filesystem under path '/data/LICENSE.txt' is HEALTHY
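%% Cell type:markdown id:c4d2e3f5-2b3c-4d4e-8f10-b2c3d4e5f601 tags:
fsck flags the file as under-replicated because the target is 3 replicas but there is only 1 DataNode to hold them; the "Missing replicas" percentage is measured against the target count:
%% Cell type:code id:c4d2e3f5-2b3c-4d4e-8f10-b2c3d4e5f602 tags:
``` python
# numbers taken from the fsck report above
target_replicas = 3
live_replicas = 1
missing = target_replicas - live_replicas
print(missing)                                    # 2
print(round(100 * missing / target_replicas, 1))  # 66.7
```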
%% Cell type:code id:e2c9c022-158b-4181-bd4c-fa451ba6c783 tags:
``` python
!hdfs dfs -D dfs.replication=1 -cp /hadoop-3.3.6/LICENSE.txt hdfs://main:9000/data/test.txt
```
%% Cell type:code id:e0d4797c-8bde-4588-96af-3ad1134dcb2b tags:
``` python
!hdfs fsck hdfs://main:9000/data/test.txt
```
%% Output
Connecting to namenode via http://main:9870/fsck?ugi=root&path=%2Fdata%2Ftest.txt
FSCK started by root (auth:SIMPLE) from /172.18.0.3 for path /data/test.txt at Wed Mar 05 17:18:35 GMT 2025
Status: HEALTHY
Number of data-nodes: 1
Number of racks: 1
Total dirs: 0
Total symlinks: 0
Replicated Blocks:
Total size: 15217 B
Total files: 1
Total blocks (validated): 1 (avg. block size 15217 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 1.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Mar 05 17:18:35 GMT 2025 in 2 milliseconds
The filesystem under path '/data/test.txt' is HEALTHY
%% Cell type:markdown id:50efdb28-cfdb-4172-8904-7718fce705fc tags:
# WebHDFS Examples
%% Cell type:code id:ff78a66f-9300-463d-a508-677231832932 tags:
``` python
! curl -i "http://main:9870/webhdfs/v1/data?op=LISTSTATUS"
```
%% Output
HTTP/1.1 200 OK
Date: Wed, 05 Mar 2025 17:22:12 GMT
Cache-Control: no-cache
Expires: Wed, 05 Mar 2025 17:22:12 GMT
Date: Wed, 05 Mar 2025 17:22:12 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Content-Type: application/json
Transfer-Encoding: chunked
{"FileStatuses":{"FileStatus":[
{"accessTime":1741194819485,"blockSize":134217728,"childrenNum":0,"fileId":16387,"group":"supergroup","length":15217,"modificationTime":1741194820075,"owner":"root","pathSuffix":"LICENSE.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1741195034403,"blockSize":134217728,"childrenNum":0,"fileId":16388,"group":"supergroup","length":15217,"modificationTime":1741195034487,"owner":"root","pathSuffix":"single.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},
{"accessTime":1741195111543,"blockSize":134217728,"childrenNum":0,"fileId":16389,"group":"supergroup","length":15217,"modificationTime":1741195111626,"owner":"root","pathSuffix":"test.txt","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}
]}}
%% Cell type:code id:3d957b70-a8e0-4147-838d-e4e9eb137ed4 tags:
``` python
# curl -i "http://main:9870/webhdfs/v1/data?op=LISTSTATUS"
```
%% Cell type:code id:42dc77db-565e-4359-a52f-b869a642ad08 tags:
``` python
import requests
r = requests.get("http://main:9870/webhdfs/v1/data?op=LISTSTATUS")
r.raise_for_status()
r.content
```
%% Output
b'{"FileStatuses":{"FileStatus":[\n{"accessTime":1741194819485,"blockSize":134217728,"childrenNum":0,"fileId":16387,"group":"supergroup","length":15217,"modificationTime":1741194820075,"owner":"root","pathSuffix":"LICENSE.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},\n{"accessTime":1741195034403,"blockSize":134217728,"childrenNum":0,"fileId":16388,"group":"supergroup","length":15217,"modificationTime":1741195034487,"owner":"root","pathSuffix":"single.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},\n{"accessTime":1741195111543,"blockSize":134217728,"childrenNum":0,"fileId":16389,"group":"supergroup","length":15217,"modificationTime":1741195111626,"owner":"root","pathSuffix":"test.txt","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}\n]}}\n'
%% Cell type:code id:6dc4b0f6-a691-4e19-8d5c-0715d7a46cd1 tags:
``` python
r.text
```
%% Output
'{"FileStatuses":{"FileStatus":[\n{"accessTime":1741194819485,"blockSize":134217728,"childrenNum":0,"fileId":16387,"group":"supergroup","length":15217,"modificationTime":1741194820075,"owner":"root","pathSuffix":"LICENSE.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},\n{"accessTime":1741195034403,"blockSize":134217728,"childrenNum":0,"fileId":16388,"group":"supergroup","length":15217,"modificationTime":1741195034487,"owner":"root","pathSuffix":"single.txt","permission":"644","replication":3,"storagePolicy":0,"type":"FILE"},\n{"accessTime":1741195111543,"blockSize":134217728,"childrenNum":0,"fileId":16389,"group":"supergroup","length":15217,"modificationTime":1741195111626,"owner":"root","pathSuffix":"test.txt","permission":"644","replication":1,"storagePolicy":0,"type":"FILE"}\n]}}\n'
%% Cell type:code id:e04a4e12-6478-435c-a07e-dc7b024b0719 tags:
``` python
for entry in r.json()['FileStatuses']['FileStatus']:
print(entry["pathSuffix"])
```
%% Output
LICENSE.txt
single.txt
test.txt
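%% Cell type:markdown id:d5e3f4a6-3c4d-4e5f-9a21-c3d4e5f6a701 tags:
Beyond `pathSuffix`, the same JSON can drive simple checks; for example, flagging files whose `replication` is below the cluster default (response structure copied from the LISTSTATUS output above, with most fields trimmed):
%% Cell type:code id:d5e3f4a6-3c4d-4e5f-9a21-c3d4e5f6a702 tags:
``` python
# trimmed copy of the LISTSTATUS response structure shown above
resp = {"FileStatuses": {"FileStatus": [
    {"pathSuffix": "LICENSE.txt", "replication": 3},
    {"pathSuffix": "single.txt",  "replication": 3},
    {"pathSuffix": "test.txt",    "replication": 1},
]}}

default_replication = 3
under = [f["pathSuffix"]
         for f in resp["FileStatuses"]["FileStatus"]
         if f["replication"] < default_replication]
print(under)  # ['test.txt']
```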
%% Cell type:code id:5800cd5f-b93b-4c42-af89-5558075b1f1d tags:
``` python
# open+read a file
```
%% Cell type:code id:1addea41-bc58-40c7-93ca-52f0b6bc1f9a tags:
``` python
#curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
# [&offset=<LONG>][&length=<LONG>][&buffersize=<INT>][&noredirect=<true|false>]"
```
%% Cell type:code id:a9602f15-a008-43b0-b76c-2ba0da2c3685 tags:
``` python
! curl -i -L "http://main:9870/webhdfs/v1/data/test.txt?op=OPEN&offset=0&length=200"
```
%% Output
HTTP/1.1 307 Temporary Redirect
Date: Wed, 05 Mar 2025 17:31:15 GMT
Cache-Control: no-cache
Expires: Wed, 05 Mar 2025 17:31:15 GMT
Date: Wed, 05 Mar 2025 17:31:15 GMT
Pragma: no-cache
X-Content-Type-Options: nosniff
X-FRAME-OPTIONS: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Location: http://main:9864/webhdfs/v1/data/test.txt?op=OPEN&namenoderpcaddress=main:9000&length=200&offset=0
Content-Type: application/octet-stream
Content-Length: 0
HTTP/1.1 200 OK
Access-Control-Allow-Methods: GET
Access-Control-Allow-Origin: *
Content-Type: application/octet-stream
Connection: close
Content-Length: 200
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUC
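%% Cell type:markdown id:e6f4a5b7-4d5e-4f6a-8b32-d4e5f6a7b801 tags:
The `-L` flag hides the two-step protocol visible in the headers above: the NameNode answers with a 307 redirect whose `Location` header names a DataNode, and the client then fetches the bytes from there. Pulling that redirect URL apart (copied from the output above):
%% Cell type:code id:e6f4a5b7-4d5e-4f6a-8b32-d4e5f6a7b802 tags:
``` python
from urllib.parse import urlparse, parse_qs

# Location header returned by the NameNode (from the curl output above)
location = ("http://main:9864/webhdfs/v1/data/test.txt"
            "?op=OPEN&namenoderpcaddress=main:9000&length=200&offset=0")

u = urlparse(location)
print(u.netloc)  # main:9864 -- a DataNode port, not the NameNode's 9870
print(parse_qs(u.query)["namenoderpcaddress"])  # ['main:9000']
```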
%% Cell type:code id:13543696-8fff-455d-9d90-d25210893b27 tags:
``` python
r = requests.get("http://main:9870/webhdfs/v1/data/test.txt?op=OPEN&offset=0&length=200")
r.raise_for_status()
print(r.text)
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUC
%% Cell type:code id:313a7b35-82ac-4221-86a8-8ebdf50a69bc tags:
``` python
r = requests.get("http://main:9870/webhdfs/v1/data/test.txt?op=OPEN&offset=5&length=200")
r.raise_for_status()
print(r.text)
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION,
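%% Cell type:markdown id:f7a5b6c8-5e6f-4a7b-9c43-e5f6a7b8c901 tags:
`offset` and `length` behave like a byte slice: the read starts `offset` bytes into the file and returns up to `length` bytes, which is why the offset-5 read above ends five characters later than the offset-0 read. The same semantics, mimicked on a local bytes object:
%% Cell type:code id:f7a5b6c8-5e6f-4a7b-9c43-e5f6a7b8c902 tags:
``` python
data = b"Hello, HDFS offset/length demo"

def webhdfs_style_read(buf, offset=0, length=None):
    # mimic WebHDFS OPEN: start at `offset`, return up to `length` bytes
    end = None if length is None else offset + length
    return buf[offset:end]

print(webhdfs_style_read(data, offset=0, length=10))  # b'Hello, HDF'
print(webhdfs_style_read(data, offset=5, length=10))  # b', HDFS off'
```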
%% Cell type:code id:b8cbf271-28b5-4af8-b0a8-06d09b5993fc tags:
``` python
# where are all the DataNodes for all the blocks of a file?
```
%% Cell type:code id:7ce73613-cc98-42f8-9c20-95eba1346240 tags:
``` python
r = requests.get("http://main:9870/webhdfs/v1/data/test.txt?op=GETFILEBLOCKLOCATIONS")
r.raise_for_status()
r.json()
```
%% Output
{'BlockLocations': {'BlockLocation': [{'topologyPaths': ['/default-rack/172.18.0.2:9866'],
'corrupt': False,
'cachedHosts': [],
'names': ['172.18.0.2:9866'],
'offset': 0,
'hosts': ['main'],
'length': 15217,
'storageTypes': ['DISK']}]}}
%% Cell type:code id:367e11ec-ef10-4a0a-b90d-ecbabc3dcb8d tags:
``` python
for logical_block in r.json()['BlockLocations']['BlockLocation']:
print("DataNodes for this block:", logical_block["hosts"])
```
%% Output
DataNodes for this block: ['main']
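%% Cell type:markdown id:a8b6c7d9-6f7a-4b8c-8d54-f6a7b8c9da01 tags:
This file fits in a single 128 MB block, so there is only one entry; a larger file would have one `BlockLocation` per block, each with its own `offset`, `length`, and host list. A small sketch (hypothetical two-block layout, same structure as the response above) of finding which DataNodes hold a given byte:
%% Cell type:code id:a8b6c7d9-6f7a-4b8c-8d54-f6a7b8c9da02 tags:
``` python
# hypothetical two-block file; structure mirrors GETFILEBLOCKLOCATIONS above
blocks = [
    {"offset": 0,         "length": 134217728, "hosts": ["node1", "node2"]},
    {"offset": 134217728, "length": 50000000,  "hosts": ["node2", "node3"]},
]

def hosts_for(byte_pos):
    # return the DataNodes holding the block that covers byte_pos
    for b in blocks:
        if b["offset"] <= byte_pos < b["offset"] + b["length"]:
            return b["hosts"]
    return None  # past the end of the file

print(hosts_for(0))            # ['node1', 'node2']
print(hosts_for(140_000_000))  # ['node2', 'node3']
```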
%% Cell type:markdown id:28d9691f-0ef6-497d-8cd5-808c46b2844c tags:
# PyArrow Examples
%% Cell type:code id:0e9e9a93-0943-44e3-b282-a978a3802d85 tags:
``` python
import pyarrow as pa
import pyarrow.fs
```
%% Cell type:code id:f25a555d-93d0-4e01-8b37-bea11603ddfd tags:
``` python
# other options: replication=????, default_block_size=????
hdfs = pa.fs.HadoopFileSystem("main", 9000)
```
%% Output
2025-03-05 17:37:01,372 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%% Cell type:code id:fc8df378-1a0a-43cc-bf89-eada00bc3355 tags:
``` python
from io import BufferedReader, TextIOWrapper
```
%% Cell type:code id:4776be11-61af-46ca-9493-4485c97589d1 tags:
``` python
# for reading: open_input_file, for writing: open_output_stream
with hdfs.open_input_file("/data/test.txt") as f:
#print(f.read_at(200, 0))
reader = TextIOWrapper(BufferedReader(f))
for line in reader:
print(line, end="")
# for P4: pq.read_table(f) OR pq.write_table(????, f)
```
%% Output
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--------------------------------------------------------------------------------
This product bundles various third-party components under other open source
licenses. This section summarizes those components and their licenses.
See licenses/ for text of these licenses.
Apache Software Foundation License 2.0
--------------------------------------
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/AbstractFuture.java
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/checker/TimeoutFuture.java
BSD 2-Clause
------------
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-nativetask/src/main/native/lz4/lz4.{c|h}
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/fuse-dfs/util/tree.h
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/compat/{fstatat|openat|unlinkat}.h
BSD 3-Clause
------------
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/bloom/*
hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc
hadoop-common-project/hadoop-common/src/main/native/gtest/include/gtest/gtest.h
hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/util/bulk_crc32_x86.c
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/protobuf/protobuf/cpp_helpers.h
hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/gmock-1.7.0/*/*.{cc|h}
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/d3.v3.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/d3-v4.1.1.min.js
MIT License
-----------
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/bootstrap-3.4.1
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dataTables.bootstrap.css
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dataTables.bootstrap.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dust-full-2.0.0.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/dust-helpers-1.1.1.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/jquery-3.6.0.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/jquery.dataTables.min.js
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/moment.min.js
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/bootstrap.min.js
hadoop-tools/hadoop-sls/src/main/html/js/thirdparty/jquery.js
hadoop-tools/hadoop-sls/src/main/html/css/bootstrap.min.css
hadoop-tools/hadoop-sls/src/main/html/css/bootstrap-responsive.min.css
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-webapp/node_modules/.bin/r.js
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/dt-1.10.18/*
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/jquery
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/webapps/static/jt/jquery.jstree.js
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/resources/TERMINAL
uriparser2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/uriparser2)
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/utils/cJSON.[ch]
Boost Software License, Version 1.0
-------------
asio-1.10.2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/asio-1.10.2)
rapidxml-1.13 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/rapidxml-1.13)
tr2 (hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/third_party/tr2)
Public Domain
-------------
hadoop-hdfs-project/hadoop-hdfs/src/main/webapps/static/json-bignum.js
FROM ubuntu:24.04
RUN apt-get update; apt-get install -y wget curl openjdk-11-jdk python3-pip nano
# SPARK
RUN wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
# Jupyter
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 --break-system-packages
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6
services:
nb:
image: spark-demo
ports:
- "127.0.0.1:5000:5000"
- "127.0.0.1:4040:4040"
volumes:
- "./nb:/nb"
command: python3 -m jupyterlab --no-browser --ip=0.0.0.0 --port=5000 --allow-root --NotebookApp.token=''
nn:
image: spark-demo
hostname: nn
command: sh -c "hdfs namenode -format -force && hdfs namenode -D dfs.replication=1 -fs hdfs://nn:9000"
dn:
image: spark-demo
command: hdfs datanode -fs hdfs://nn:9000
spark-boss:
image: spark-demo
hostname: boss
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-master.sh && sleep infinity"
spark-worker:
image: spark-demo
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://boss:7077 -c 2 -m 2g && sleep infinity"
deploy:
replicas: 2
%% Cell type:code id:e7fed83d-4fd4-4370-a619-db07eb78df21 tags:
``` python
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("cs544")
.master("spark://boss:7077")
.config("spark.executor.memory", "512M")
.getOrCreate())
```
%% Output
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/09 17:52:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%% Cell type:code id:192912cb-d5b3-494e-8b93-e8bc17562680 tags:
``` python
sc = spark.sparkContext # provides direct RDD access
```
%% Cell type:code id:3a4b48c6-5173-41ac-8640-dcd52b712108 tags:
``` python
nums = list(range(0, 10_000_000))
nums[:5]
```
%% Output
[0, 1, 2, 3, 4]
%% Cell type:code id:9b776732-09d8-4686-9ce4-15e83344118d tags:
``` python
rdd = sc.parallelize(nums)
```
%% Cell type:code id:8b04b8f1-2fe7-4ffd-801c-b69d5cbb0978 tags:
``` python
inverses = rdd.map(lambda x: 1/x) # TRANSFORMATION (lazy)
```
%% Cell type:code id:49f94295-d242-4b43-aeb0-61fe7319fcaf tags:
``` python
# head = inverses.take(10) # ACTION (actually does the work)
```
%% Cell type:code id:28a6ec21-419c-4db6-9f88-51b481e12838 tags:
``` python
# inverses.mean() # ACTION
```
%% Cell type:code id:f78e9613-9dd6-45a5-9bf2-e171c9e4b0d1 tags:
``` python
inverses = rdd.filter(lambda x: x > 0).map(lambda x: 1/x)
```
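Transformations like `filter` and `map` are lazy: nothing runs until an action forces a result. A rough plain-Python analogy (not Spark itself) uses generators, which likewise defer work until consumed:

``` python
nums = range(1, 11)

# lazy pipeline, analogous to chained transformations: nothing computed yet
inverses = (1 / x for x in nums if x > 0)

# consuming the generator forces the work, analogous to calling an action
total = sum(inverses)
```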
%% Cell type:code id:ae670238-47cf-4d6f-a299-18c1d1b1a536 tags:
``` python
# 4 tasks, 0 are done, and 4 are in progress
# [Stage 3:> (0 + 4) / 4]
inverses.mean()
```
%% Output
25/03/09 17:59:52 WARN TaskSetManager: Stage 3 contains a task of very large size (12152 KiB). The maximum recommended task size is 1000 KiB.
1.669531293539298e-06
%% Cell type:code id:627099bd-69f1-41f9-81ca-31f73eaae90c tags:
``` python
# inverses.collect() # ACTION: be careful, if it's too big, we could run out of memory!
```
%% Cell type:code id:f336fdc8-a273-49ad-b23c-73f14ccb48d4 tags:
``` python
rdd.getNumPartitions()
```
%% Output
4
%% Cell type:code id:3655dbe6-b8ee-4f02-863a-ebb3a48b6614 tags:
``` python
# [Stage 5:======================> (20 + 4) / 50]
rdd = sc.parallelize(nums, 50)
inverses = rdd.filter(lambda x: x > 0).map(lambda x: 1/x)
inverses.mean()
```
%% Output
1.6695312935391358e-06
%% Cell type:code id:c81343cd-83ef-4cae-b5ce-d2eba0b055f1 tags:
``` python
import time
```
%% Cell type:code id:754b4356-8026-4e03-91be-82436a8759f3 tags:
``` python
sample = rdd.sample(True, 0.01) # TRANSFORMATION
```
%% Cell type:code id:1735153d-6a9b-45f9-bac7-674853b9f274 tags:
``` python
# 1st without cache
t0 = time.time()
print(sample.mean())
t1 = time.time()
print(t1-t0)
```
%% Output
[Stage 6:=================================================> (43 + 4) / 50]
4995621.997385424
1.9789588451385498
%% Cell type:code id:0496d02b-6dd5-4df7-97cf-60b286217af4 tags:
``` python
# 2nd without cache
t0 = time.time()
print(sample.mean())
t1 = time.time()
print(t1-t0)
```
%% Output
[Stage 7:===================================================> (45 + 4) / 50]
4995621.997385424
1.9979238510131836
%% Cell type:code id:2051edd6-92f5-4c5c-a8a1-1547bfeeb0c8 tags:
``` python
sample.cache()
```
%% Output
PythonRDD[11] at RDD at PythonRDD.scala:53
%% Cell type:code id:c50ed49f-07f7-494d-885b-56de91a0c554 tags:
``` python
# 1st with cache
t0 = time.time()
print(sample.mean())
t1 = time.time()
print(t1-t0)
```
%% Output
[Stage 8:==================================================> (44 + 4) / 50]
4995621.997385424
2.9729838371276855
%% Cell type:code id:3d602170-51b3-4dbf-a4ba-3a242d9b9ba2 tags:
``` python
# 2nd with cache
t0 = time.time()
print(sample.mean())
t1 = time.time()
print(t1-t0)
```
%% Output
[Stage 9:===================================> (31 + 4) / 50]
4995621.997385424
1.2610077857971191
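Caching trades memory for recomputation. Memoization in plain Python is a loose analogy (this is not how Spark's block cache works, just the same idea of paying for the work once):

``` python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def expensive(x):
    global calls
    calls += 1       # count how often the real work actually runs
    return x * x

expensive(5)
expensive(5)         # second call is served from the cache
print(calls)  # → 1
```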
%% Cell type:code id:3f0c6999-f48c-4895-876e-8c60eff5a74d tags:
``` python
sample = rdd.sample(True, 0.01).repartition(4)
```
%% Cell type:code id:e3d61398-bb60-4f7f-8594-ab876b1aa51b tags:
``` python
# 1st with cache, and fewer partitions
t0 = time.time()
print(sample.mean())
t1 = time.time()
print(t1-t0)
```
%% Output
[Stage 10:====================================================> (47 + 3) / 50]
5005973.481531257
3.3820056915283203
%% Cell type:code id:201504ef-5bb4-4436-acfd-03e28c5bfd92 tags:
``` python
# 2nd with cache, and fewer partitions
t0 = time.time()
print(sample.mean())
t1 = time.time()
print(t1-t0)
```
%% Output
5005973.481531262
0.2796463966369629
%% Cell type:code id:b47df2f0-98c8-46e5-8c88-2f3600188edb tags:
``` python
# note about determinism
# ideally: given the same seed, sample the same way
# in reality: even with seed, you can get different results if the number of input partitions changes
# suggestion: be careful, and save your sampled results to a separate file to ensure determinism!
sample = rdd.sample(True, 0.01, 544)
```
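To see why partitioning interacts with seeding, here is a toy plain-Python model (`sample_partitioned` is hypothetical, not a Spark API): each partition samples with its own seed-derived RNG, so the same seed over a different partition layout consumes the random streams differently and can select different rows.

``` python
import random

def sample_partitioned(data, num_partitions, seed, fraction=0.5):
    """Toy model: split data into partitions, sample each with a per-partition RNG."""
    size = len(data) // num_partitions
    parts = [data[i * size:(i + 1) * size] for i in range(num_partitions)]
    out = []
    for i, part in enumerate(parts):
        rng = random.Random(seed + i)  # per-partition seed, as a stand-in
        out.extend(x for x in part if rng.random() < fraction)
    return out

data = list(range(100))
a = sample_partitioned(data, 4, seed=544)   # reproducible for this layout
b = sample_partitioned(data, 10, seed=544)  # same seed, different layout
```

Each result is reproducible given its own layout, but `a` and `b` need not be equal to each other.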
%% Cell type:markdown id:93036a5e-2066-461e-ba12-8bc549294bde tags:
# Spark DataFrames: reading a text file
Spark DataFrames build on Spark SQL, which builds on Spark RDDs.
%% Cell type:code id:9141f941-fdaa-4d40-a744-9b831cb84e2b tags:
``` python
# !wget https://pages.cs.wisc.edu/~harter/cs544/data/ghcnd-stations.txt
```
%% Cell type:code id:175cb54b-eb63-4a2e-9877-5a4b1d5d12bf tags:
``` python
# read/write:
# spark.read.????.????.load() or .text()
# df.write.????.????.saveAsTable()
```
%% Cell type:code id:8e79a508-4dec-45d1-9dd3-58da5b83063c tags:
``` python
! ls /nb/ghcnd-stations.txt
```
%% Output
/nb/ghcnd-stations.txt
%% Cell type:code id:7d4516f2-9802-490a-9fca-22111624b63d tags:
``` python
# df = spark.read.text("ghcnd-stations.txt")
```
%% Cell type:code id:970432fa-981b-4824-bced-95eeef96afb5 tags:
``` python
# df.take(3)
```
%% Cell type:code id:fa910718-36f0-41e1-9ff8-40754a9ae707 tags:
``` python
!hdfs dfs -cp ghcnd-stations.txt hdfs://nn:9000/ghcnd-stations.txt
```
%% Cell type:code id:df70d8a3-f302-4663-bcda-adee88155131 tags:
``` python
!hdfs dfs -ls hdfs://nn:9000/
```
%% Output
Found 1 items
-rw-r--r-- 3 root supergroup 10607756 2025-03-09 18:51 hdfs://nn:9000/ghcnd-stations.txt
%% Cell type:code id:12493c05-e5f6-4120-a104-8b5fedf1a623 tags:
``` python
df = spark.read.text("hdfs://nn:9000/ghcnd-stations.txt")
```
%% Cell type:code id:eefc522f-7e8d-4cb6-8785-8064a8938c9c tags:
``` python
df.take(3)
```
%% Output
[Row(value='ACW00011604 17.1167 -61.7833 10.1 ST JOHNS COOLIDGE FLD '),
Row(value='ACW00011647 17.1333 -61.7833 19.2 ST JOHNS '),
Row(value='AE000041196 25.3330 55.5170 34.0 SHARJAH INTER. AIRP GSN 41196')]
%% Cell type:code id:65842e71-e8d3-4196-a217-005e98af68ae tags:
``` python
type(df)
```
%% Output
pyspark.sql.dataframe.DataFrame
%% Cell type:code id:ee8e9d09-6896-4257-899b-2e549e01337a tags:
``` python
type(df.rdd)
```
%% Output
pyspark.rdd.RDD
%% Cell type:code id:252fa425-9d8c-4792-8e26-e4ffd384a3c5 tags:
``` python
df.take(1)[0].value
```
%% Output
'ACW00011604 17.1167 -61.7833 10.1 ST JOHNS COOLIDGE FLD '
%% Cell type:code id:80d499d6-552e-4afa-ac38-111d31eb6d14 tags:
``` python
# the first 11 characters are the station ID
df.rdd.map(lambda row: row.value[:11]).take(10)
```
%% Output
['ACW00011604',
'ACW00011647',
'AE000041196',
'AEM00041194',
'AEM00041217',
'AEM00041218',
'AF000040930',
'AFM00040938',
'AFM00040948',
'AFM00040990']
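The same extraction works in plain Python on one of the lines above; only the station ID relies on the fixed 11-character prefix, while the numeric fields can be pulled out with `split`:

``` python
line = "ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD"
station = line[:11]                       # fixed-width station ID
lat, lon = map(float, line.split()[1:3])  # latitude, longitude
print(station, lat, lon)  # → ACW00011604 17.1167 -61.7833
```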
%% Cell type:code id:10f9c364-5516-409d-8893-4e55bd031d98 tags:
``` python
# how would we do this in Pandas? Extract the station names, add that as a column
```
%% Cell type:code id:2ce8af7b-b4da-4e71-9c5a-a99487d2010e tags:
``` python
pandas_df = df.limit(6).toPandas()
pandas_df
```
%% Output
value
0 ACW00011604 17.1167 -61.7833 10.1 ST JO...
1 ACW00011647 17.1333 -61.7833 19.2 ST JO...
2 AE000041196 25.3330 55.5170 34.0 SHARJ...
3 AEM00041194 25.2550 55.3640 10.4 DUBAI...
4 AEM00041217 24.4330 54.6510 26.8 ABU D...
5 AEM00041218 24.2620 55.6090 264.9 AL AI...
%% Cell type:code id:29be9b4f-0a70-4b29-bd59-111098da36dd tags:
``` python
pandas_df["station"] = pandas_df["value"].str[:11]
pandas_df
```
%% Output
value station
0 ACW00011604 17.1167 -61.7833 10.1 ST JO... ACW00011604
1 ACW00011647 17.1333 -61.7833 19.2 ST JO... ACW00011647
2 AE000041196 25.3330 55.5170 34.0 SHARJ... AE000041196
3 AEM00041194 25.2550 55.3640 10.4 DUBAI... AEM00041194
4 AEM00041217 24.4330 54.6510 26.8 ABU D... AEM00041217
5 AEM00041218 24.2620 55.6090 264.9 AL AI... AEM00041218
%% Cell type:code id:664669e9-2624-4710-b58e-8303af3e1b07 tags:
``` python
# not allowed, because df wraps rdd, which is immutable!
# df["station"] = ????
```
%% Cell type:code id:465d5d75-4720-488d-b3fa-60e4246eca16 tags:
``` python
from pyspark.sql.functions import col, expr
```
%% Cell type:code id:516597a7-fb56-456a-bef6-304ff9cc25f5 tags:
``` python
col("x")
```
%% Output
Column<'x'>
%% Cell type:code id:db232b87-d14f-43ba-9536-39c1aad1fac5 tags:
``` python
col("x") + 1
```
%% Output
Column<'(x + 1)'>
%% Cell type:code id:1ceae7a8-f355-4e5e-9857-ce451997e0f1 tags:
``` python
expr("x + 1")
```
%% Output
Column<'(x + 1)'>
%% Cell type:code id:08fc588a-97c5-4908-8eb0-2631b3c4bf8f tags:
``` python
expr("x + 1").alias("plusone") # similar to SQL "AS"
```
%% Output
Column<'(x + 1) AS plusone'>
%% Cell type:code id:93e8e7e9-c9fa-4f72-b086-3e31d3d3c31d tags:
``` python
df2 = df.withColumn("station", expr("substring(value, 0, 11)")) # transformation!
```
%% Cell type:code id:f4d812d8-f1fa-4090-bbf2-590463e69c6d tags:
``` python
df
```
%% Output
DataFrame[value: string]
%% Cell type:code id:f6ca7f4b-31ae-468c-adf8-0d1d6166820e tags:
``` python
df2
```
%% Output
DataFrame[value: string, station: string]
%% Cell type:code id:7f2a04c3-dc21-4dbd-8490-dff279a20df6 tags:
``` python
df2.show() # action
```
%% Output
+--------------------+-----------+
| value| station|
+--------------------+-----------+
|ACW00011604 17.1...|ACW00011604|
|ACW00011647 17.1...|ACW00011647|
|AE000041196 25.3...|AE000041196|
|AEM00041194 25.2...|AEM00041194|
|AEM00041217 24.4...|AEM00041217|
|AEM00041218 24.2...|AEM00041218|
|AF000040930 35.3...|AF000040930|
|AFM00040938 34.2...|AFM00040938|
|AFM00040948 34.5...|AFM00040948|
|AFM00040990 31.5...|AFM00040990|
|AG000060390 36.7...|AG000060390|
|AG000060590 30.5...|AG000060590|
|AG000060611 28.0...|AG000060611|
|AG000060680 22.8...|AG000060680|
|AGE00135039 35.7...|AGE00135039|
|AGE00147704 36.9...|AGE00147704|
|AGE00147705 36.7...|AGE00147705|
|AGE00147706 36.8...|AGE00147706|
|AGE00147707 36.8...|AGE00147707|
|AGE00147708 36.7...|AGE00147708|
+--------------------+-----------+
only showing top 20 rows
%% Cell type:code id:afaff554-089f-43cd-85c3-8f2af5384545 tags:
``` python
df2.limit(10).toPandas() # limit is a transformation, toPandas is the action!
```
%% Output
value station
0 ACW00011604 17.1167 -61.7833 10.1 ST JO... ACW00011604
1 ACW00011647 17.1333 -61.7833 19.2 ST JO... ACW00011647
2 AE000041196 25.3330 55.5170 34.0 SHARJ... AE000041196
3 AEM00041194 25.2550 55.3640 10.4 DUBAI... AEM00041194
4 AEM00041217 24.4330 54.6510 26.8 ABU D... AEM00041217
5 AEM00041218 24.2620 55.6090 264.9 AL AI... AEM00041218
6 AF000040930 35.3170 69.0170 3366.0 NORTH... AF000040930
7 AFM00040938 34.2100 62.2280 977.2 HERAT... AFM00040938
8 AFM00040948 34.5660 69.2120 1791.3 KABUL... AFM00040948
9 AFM00040990 31.5000 65.8500 1010.0 KANDA... AFM00040990
%% Cell type:markdown id:f49d5318-f826-4fa0-832e-dc05be8c9426 tags:
# Spark: CSVs and Parquet
%% Cell type:code id:f2a328d7-982e-4e47-b145-2a2a79ef744e tags:
``` python
! wget https://pages.cs.wisc.edu/~harter/cs544/data/sf.zip
```
%% Output
--2025-03-10 01:40:01-- https://pages.cs.wisc.edu/~harter/cs544/data/sf.zip
Resolving pages.cs.wisc.edu (pages.cs.wisc.edu)... 128.105.7.9
Connecting to pages.cs.wisc.edu (pages.cs.wisc.edu)|128.105.7.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 534803160 (510M) [application/zip]
Saving to: ‘sf.zip’
sf.zip 100%[===================>] 510.03M 21.5MB/s in 26s
2025-03-10 01:40:27 (19.8 MB/s) - ‘sf.zip’ saved [534803160/534803160]
%% Cell type:code id:9b7b7bce-6d2e-4278-829a-a97c1b491195 tags:
``` python
! unzip sf.zip
```
%% Output
Archive: sf.zip
inflating: sf.csv
%% Cell type:code id:cc7d6890-85d5-45f2-844e-644c9b6a21f2 tags:
``` python
! hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv
```
%% Cell type:code id:3d0d1512-1dcb-4b09-9ace-6348791ae0b5 tags:
``` python
# read/write:
# spark.read.????.????.load() or .text()
# df.write.????.????.saveAsTable()
```
%% Cell type:code id:494488f6-6186-4400-a830-89e7528e1357 tags:
``` python
%%time
df = spark.read.format("csv").option("header", True).load("hdfs://nn:9000/sf.csv")
```
%% Output
[Stage 32:> (0 + 1) / 1]
CPU times: user 2.54 ms, sys: 2.23 ms, total: 4.76 ms
Wall time: 3.37 s
%% Cell type:code id:61599f7d-6b07-4ad1-baa3-ddea0b2091c7 tags:
``` python
%%time
df.count()
```
%% Output
[Stage 27:=================================================> (15 + 2) / 17]
CPU times: user 8.32 ms, sys: 1.02 ms, total: 9.34 ms
Wall time: 4.31 s
6016057
%% Cell type:code id:19af30b2-327a-49eb-bfe9-3777b2ae5120 tags:
``` python
df.limit(5).toPandas()
```
%% Output
Call Number Unit ID Incident Number Call Type Call Date Watch Date \
0 221210313 E36 22054955 Outside Fire 05/01/2022 04/30/2022
1 220190150 E29 22008871 Alarms 01/19/2022 01/18/2022
2 211233271 T07 21053032 Alarms 05/03/2021 05/03/2021
3 212933533 B02 21127914 Alarms 10/20/2021 10/20/2021
4 221202543 E41 22054815 Alarms 04/30/2022 04/30/2022
Received DtTm Entry DtTm Dispatch DtTm \
0 05/01/2022 02:58:25 AM 05/01/2022 02:59:15 AM 05/01/2022 02:59:25 AM
1 01/19/2022 01:42:12 AM 01/19/2022 01:44:13 AM 01/19/2022 01:44:28 AM
2 05/03/2021 09:28:12 PM 05/03/2021 09:28:12 PM 05/03/2021 09:28:17 PM
3 10/20/2021 10:08:47 PM 10/20/2021 10:09:53 PM 10/20/2021 10:10:07 PM
4 04/30/2022 06:35:58 PM 04/30/2022 06:37:28 PM 04/30/2022 06:37:43 PM
Response DtTm ... Call Type Group Number of Alarms Unit Type \
0 05/01/2022 03:01:06 AM ... Fire 1 ENGINE
1 01/19/2022 01:46:47 AM ... Alarm 1 ENGINE
2 05/03/2021 09:29:10 PM ... Alarm 1 TRUCK
3 10/20/2021 10:11:55 PM ... Alarm 1 CHIEF
4 04/30/2022 06:38:17 PM ... Alarm 1 ENGINE
Unit sequence in call dispatch Fire Prevention District Supervisor District \
0 1 2 5
1 1 3 10
2 2 2 9
3 3 3 6
4 4 4 2
Neighborhooods - Analysis Boundaries RowID \
0 Hayes Valley 221210313-E36
1 Potrero Hill 220190150-E29
2 Mission 211233271-T07
3 Tenderloin 212933533-B02
4 Russian Hill 221202543-E41
case_location Analysis Neighborhoods
0 POINT (-122.42316555403964 37.77781524520032) 9
1 POINT (-122.39469970274361 37.76460987856451) 26
2 POINT (-122.42057572093252 37.76418194637148) 20
3 POINT (-122.41243514072728 37.78347684038771) 36
4 POINT (-122.4233369425531 37.799534868680034) 32
[5 rows x 35 columns]
%% Cell type:code id:9685c540-f710-4c91-ae79-d3ea75f72201 tags:
``` python
df.dtypes
```
%% Output
[('Call Number', 'string'),
('Unit ID', 'string'),
('Incident Number', 'string'),
('Call Type', 'string'),
('Call Date', 'string'),
('Watch Date', 'string'),
('Received DtTm', 'string'),
('Entry DtTm', 'string'),
('Dispatch DtTm', 'string'),
('Response DtTm', 'string'),
('On Scene DtTm', 'string'),
('Transport DtTm', 'string'),
('Hospital DtTm', 'string'),
('Call Final Disposition', 'string'),
('Available DtTm', 'string'),
('Address', 'string'),
('City', 'string'),
('Zipcode of Incident', 'string'),
('Battalion', 'string'),
('Station Area', 'string'),
('Box', 'string'),
('Original Priority', 'string'),
('Priority', 'string'),
('Final Priority', 'string'),
('ALS Unit', 'string'),
('Call Type Group', 'string'),
('Number of Alarms', 'string'),
('Unit Type', 'string'),
('Unit sequence in call dispatch', 'string'),
('Fire Prevention District', 'string'),
('Supervisor District', 'string'),
('Neighborhooods - Analysis Boundaries', 'string'),
('RowID', 'string'),
('case_location', 'string'),
('Analysis Neighborhoods', 'string')]
%% Cell type:code id:50c40bca-2cde-4d40-abdb-398e4e1b50e3 tags:
``` python
%%time
df = spark.read.format("csv").option("header", True).option("inferSchema", True).load("hdfs://nn:9000/sf.csv")
```
%% Output
[Stage 35:====================================================> (16 + 1) / 17]
CPU times: user 7.83 ms, sys: 3.52 ms, total: 11.4 ms
Wall time: 10.9 s
%% Cell type:code id:577e1773-4308-45b8-9ddc-c74f43f4c3f8 tags:
``` python
df.dtypes
```
%% Output
[('Call Number', 'int'),
('Unit ID', 'string'),
('Incident Number', 'int'),
('Call Type', 'string'),
('Call Date', 'string'),
('Watch Date', 'string'),
('Received DtTm', 'string'),
('Entry DtTm', 'string'),
('Dispatch DtTm', 'string'),
('Response DtTm', 'string'),
('On Scene DtTm', 'string'),
('Transport DtTm', 'string'),
('Hospital DtTm', 'string'),
('Call Final Disposition', 'string'),
('Available DtTm', 'string'),
('Address', 'string'),
('City', 'string'),
('Zipcode of Incident', 'int'),
('Battalion', 'string'),
('Station Area', 'string'),
('Box', 'string'),
('Original Priority', 'string'),
('Priority', 'string'),
('Final Priority', 'int'),
('ALS Unit', 'boolean'),
('Call Type Group', 'string'),
('Number of Alarms', 'int'),
('Unit Type', 'string'),
('Unit sequence in call dispatch', 'int'),
('Fire Prevention District', 'string'),
('Supervisor District', 'string'),
('Neighborhooods - Analysis Boundaries', 'string'),
('RowID', 'string'),
('case_location', 'string'),
('Analysis Neighborhoods', 'int')]
%% Cell type:code id:3f5725e7-2e67-4685-8780-b999c0f7aa17 tags:
``` python
df.rdd.getNumPartitions()
```
%% Output
17
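Where does 17 come from? Spark splits input files into chunks of at most `spark.sql.files.maxPartitionBytes` (128 MB by default), so the partition count is roughly ceil(file size / 128 MB). Assuming sf.csv decompresses to roughly 2.2 GB (an estimate, not measured here), the arithmetic matches:

``` python
import math

max_partition_bytes = 128 * 1024 * 1024  # Spark default spark.sql.files.maxPartitionBytes
file_size = 2_200_000_000                # assumed approximate size of sf.csv
partitions = math.ceil(file_size / max_partition_bytes)
print(partitions)  # → 17
```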
%% Cell type:code id:5a599aa5-48f4-46ce-8ee2-84d57d4453e1 tags:
``` python
# example 1: how can we clean up the strings (upper case) and get date types?
```
%% Cell type:code id:699bb766-6ba5-40a6-9a18-c292f37f8066 tags:
``` python
df.select("Call Type", "Call Date").limit(3).toPandas()
```
%% Output
Call Type Call Date
0 Outside Fire 05/01/2022
1 Alarms 01/19/2022
2 Alarms 05/03/2021
%% Cell type:code id:b306d906-0bd7-4cbf-b0fa-856ded1f0725 tags:
``` python
from pyspark.sql.functions import col, expr
df.select(
expr("upper(`Call Type`)").alias("Call_Type"),
expr("to_date(`Call Date`, 'MM/dd/yyyy')").alias("Call_Date")
).limit(3).toPandas()
```
%% Output
Call_Type Call_Date
0 OUTSIDE FIRE 2022-05-01
1 ALARMS 2022-01-19
2 ALARMS 2021-05-03
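Spark's `to_date` uses Java date patterns; the equivalent parse with the Python standard library (for comparison only, not part of the Spark job) is:

``` python
from datetime import datetime

# Spark's 'MM/dd/yyyy' corresponds to '%m/%d/%Y' in strptime
d = datetime.strptime("05/01/2022", "%m/%d/%Y").date()
print(d)  # → 2022-05-01
```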
%% Cell type:code id:f6b0d597-ffcf-44e7-8817-81e850435104 tags:
``` python
# example 2: convert the CSV to Parquet, with no spaces in the column names
```
%% Cell type:code id:ba66cbd3-8692-4dfd-b95c-c934f689660d tags:
``` python
col("Call Number").alias("Call_Number")
```
%% Output
Column<'Call Number AS Call_Number'>
%% Cell type:code id:c1503637-6e74-4211-91d0-24881b666ccb tags:
``` python
(
df
.select([col(c).alias(c.replace(" ", "_")) for c in df.columns])
.write
.mode("overwrite")
.format("parquet")
.save("hdfs://nn:9000/sf.parquet")
)
```
%% Output
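The rename step is ordinary string manipulation; applied to a few of the SF column names it gives:

``` python
columns = ["Call Number", "Unit ID", "Incident Number", "Call Type"]
renamed = [c.replace(" ", "_") for c in columns]
print(renamed)  # → ['Call_Number', 'Unit_ID', 'Incident_Number', 'Call_Type']
```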
%% Cell type:code id:9842dfa1-b43e-4890-9873-fff11ce76b8a tags:
``` python
! hdfs dfs -ls hdfs://nn:9000/sf.parquet/
```
%% Output
Found 18 items
-rw-r--r-- 3 root supergroup 0 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/_SUCCESS
-rw-r--r-- 3 root supergroup 27806510 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00000-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 27789781 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00001-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 40478442 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00002-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 36017328 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00003-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 36033379 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00004-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 36082202 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00005-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 35944952 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00006-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 35912043 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00007-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 36436328 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00008-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 35368134 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00009-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 34238988 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00010-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 33948649 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00011-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 33488640 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00012-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 34900131 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00013-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 35715813 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00014-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 35769206 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00015-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
-rw-r--r-- 3 root supergroup 29363174 2025-03-10 01:54 hdfs://nn:9000/sf.parquet/part-00016-cade4df6-026c-40f3-9f8d-8394a0e146ef-c000.snappy.parquet
%% Cell type:code id:cef40087-bfd5-43c2-9c6d-9e6644a7079d tags:
``` python
%%time
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")
```
%% Output
CPU times: user 2.69 ms, sys: 960 μs, total: 3.65 ms
Wall time: 231 ms
%% Cell type:code id:ccc39826-d6f1-433a-a005-269d8fa88e9f tags:
``` python
%%time
df.count()
```
%% Output
CPU times: user 719 μs, sys: 993 μs, total: 1.71 ms
Wall time: 443 ms
6016056
%% Cell type:code id:574746ea-5e08-4585-be3c-e6d151262e20 tags:
``` python
df
```
%% Output
DataFrame[Call_Number: int, Unit_ID: string, Incident_Number: int, Call_Type: string, Call_Date: string, Watch_Date: string, Received_DtTm: string, Entry_DtTm: string, Dispatch_DtTm: string, Response_DtTm: string, On_Scene_DtTm: string, Transport_DtTm: string, Hospital_DtTm: string, Call_Final_Disposition: string, Available_DtTm: string, Address: string, City: string, Zipcode_of_Incident: int, Battalion: string, Station_Area: string, Box: string, Original_Priority: string, Priority: string, Final_Priority: int, ALS_Unit: boolean, Call_Type_Group: string, Number_of_Alarms: int, Unit_Type: string, Unit_sequence_in_call_dispatch: int, Fire_Prevention_District: string, Supervisor_District: string, Neighborhooods_-_Analysis_Boundaries: string, RowID: string, case_location: string, Analysis_Neighborhoods: int]
%% Cell type:code id:67086ea3-b1e9-4e17-8cbe-acd155371d80 tags:
``` python
df.rdd.getNumPartitions()
```
%% Output
6
FROM ubuntu:24.04
RUN apt-get update; apt-get install -y wget curl openjdk-11-jdk python3-pip nano
# SPARK
#RUN wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
RUN wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz && tar -xf spark-3.5.5-bin-hadoop3.tgz && rm spark-3.5.5-bin-hadoop3.tgz
# HDFS
RUN wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && tar -xf hadoop-3.3.6.tar.gz && rm hadoop-3.3.6.tar.gz
# Jupyter
RUN pip3 install jupyterlab==4.3.5 pandas==2.2.3 pyspark==3.5.5 matplotlib==3.10.1 --break-system-packages
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV PATH="${PATH}:/hadoop-3.3.6/bin"
ENV HADOOP_HOME=/hadoop-3.3.6
services:
nb:
image: spark-demo
ports:
- "127.0.0.1:5000:5000"
- "127.0.0.1:4040:4040"
volumes:
- "./nb:/nb"
command: python3 -m jupyterlab --no-browser --ip=0.0.0.0 --port=5000 --allow-root --NotebookApp.token=''
nn:
image: spark-demo
hostname: nn
command: sh -c "hdfs namenode -format -force && hdfs namenode -D dfs.replication=1 -fs hdfs://nn:9000"
dn:
image: spark-demo
command: hdfs datanode -fs hdfs://nn:9000
spark-boss:
image: spark-demo
hostname: boss
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-master.sh && sleep infinity"
spark-worker:
image: spark-demo
command: sh -c "/spark-3.5.5-bin-hadoop3/sbin/start-worker.sh spark://boss:7077 -c 2 -m 2g && sleep infinity"
deploy:
replicas: 2
date,holiday
01/01/2013,New Year's Day
01/01/2014,New Year's Day
01/01/2015,New Year's Day
01/01/2016,New Year's Day
01/01/2018,New Year's Day
01/01/2019,New Year's Day
01/01/2020,New Year's Day
01/01/2021,New Year's Day
01/02/2012,New Year's Day
01/02/2017,New Year's Day
01/15/2018,"Birthday of Martin Luther King, Jr."
01/16/2012,"Birthday of Martin Luther King, Jr."
01/16/2017,"Birthday of Martin Luther King, Jr."
01/17/2011,"Birthday of Martin Luther King, Jr."
01/17/2022,"Birthday of Martin Luther King, Jr."
01/18/2016,"Birthday of Martin Luther King, Jr."
01/18/2021,"Birthday of Martin Luther King, Jr."
01/19/2015,"Birthday of Martin Luther King, Jr."
01/20/2014,"Birthday of Martin Luther King, Jr."
01/20/2020,"Birthday of Martin Luther King, Jr."
01/20/2021,Inauguration Day
01/21/2013,"Birthday of Martin Luther King, Jr."
01/21/2019,"Birthday of Martin Luther King, Jr."
02/15/2016,Washington's Birthday
02/15/2021,Washington's Birthday
02/16/2015,Washington's Birthday
02/17/2014,Washington's Birthday
02/17/2020,Washington's Birthday
02/18/2013,Washington's Birthday
02/18/2019,Washington's Birthday
02/19/2018,Washington's Birthday
02/20/2012,Washington's Birthday
02/20/2017,Washington's Birthday
02/21/2011,Washington's Birthday
02/21/2022,Washington's Birthday
05/25/2015,Memorial Day
05/25/2020,Memorial Day
05/26/2014,Memorial Day
05/27/2013,Memorial Day
05/27/2019,Memorial Day
05/28/2012,Memorial Day
05/28/2018,Memorial Day
05/29/2017,Memorial Day
05/30/2011,Memorial Day
05/30/2016,Memorial Day
05/30/2022,Memorial Day
05/31/2021,Memorial Day
06/18/2021,Juneteenth National Independence Day
06/20/2022,Juneteenth National Independence Day
07/03/2015,Independence Day
07/03/2020,Independence Day
07/04/2011,Independence Day
07/04/2012,Independence Day
07/04/2013,Independence Day
07/04/2014,Independence Day
07/04/2016,Independence Day
07/04/2017,Independence Day
07/04/2018,Independence Day
07/04/2019,Independence Day
07/04/2022,Independence Day
07/05/2021,Independence Day
09/01/2014,Labor Day
09/02/2013,Labor Day
09/02/2019,Labor Day
09/03/2012,Labor Day
09/03/2018,Labor Day
09/04/2017,Labor Day
09/05/2011,Labor Day
09/05/2016,Labor Day
09/05/2022,Labor Day
09/06/2021,Labor Day
09/07/2015,Labor Day
09/07/2020,Labor Day
10/08/2012,Columbus Day
10/08/2018,Columbus Day
10/09/2017,Columbus Day
10/10/2011,Columbus Day
10/10/2016,Columbus Day
10/10/2022,Columbus Day
10/11/2021,Columbus Day
10/12/2015,Columbus Day
10/12/2020,Columbus Day
10/13/2014,Columbus Day
10/14/2013,Columbus Day
10/14/2019,Columbus Day
11/10/2017,Veterans Day
11/11/2011,Veterans Day
11/11/2013,Veterans Day
11/11/2014,Veterans Day
11/11/2015,Veterans Day
11/11/2016,Veterans Day
11/11/2019,Veterans Day
11/11/2020,Veterans Day
11/11/2021,Veterans Day
11/11/2022,Veterans Day
11/12/2012,Veterans Day
11/12/2018,Veterans Day
11/22/2012,Thanksgiving Day
11/22/2018,Thanksgiving Day
11/23/2017,Thanksgiving Day
11/24/2011,Thanksgiving Day
11/24/2016,Thanksgiving Day
11/24/2022,Thanksgiving Day
11/25/2021,Thanksgiving Day
11/26/2015,Thanksgiving Day
11/26/2020,Thanksgiving Day
11/27/2014,Thanksgiving Day
11/28/2013,Thanksgiving Day
11/28/2019,Thanksgiving Day
12/24/2021,Christmas Day
12/25/2012,Christmas Day
12/25/2013,Christmas Day
12/25/2014,Christmas Day
12/25/2015,Christmas Day
12/25/2017,Christmas Day
12/25/2018,Christmas Day
12/25/2019,Christmas Day
12/25/2020,Christmas Day
12/26/2011,Christmas Day
12/26/2016,Christmas Day
12/26/2022,Christmas Day
12/31/2022,New Year's Day
%% Cell type:code id:c8dca847-54af-4284-97d8-0682e88a6e8d tags:
``` python
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("cs544")
.master("spark://boss:7077")
.config("spark.executor.memory", "2G")
.config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
.enableHiveSupport()
.getOrCreate())
```
%% Output
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/27 01:41:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
%% Cell type:code id:2294e4e0-ab19-496c-980f-31df757e7837 tags:
``` python
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv
```
%% Cell type:code id:cb54bacc-b52a-4c25-93d2-2ba0f61de9b0 tags:
``` python
df = (spark.read.format("csv")
.option("header", True)
.option("inferSchema", True)
.load("hdfs://nn:9000/sf.csv"))
```
%% Output
%% Cell type:code id:c1298818-83f6-444b-b8a0-4be5b16fd6fb tags:
``` python
from pyspark.sql.functions import col, expr
cols = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
df.select(cols).write.format("parquet").save("hdfs://nn:9000/sf.parquet")
```
%% Output
23/10/27 01:43:57 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
%% Cell type:code id:37d1ded3-ed8a-4e39-94cb-dd3a3272af91 tags:
``` python
!hdfs dfs -rm hdfs://nn:9000/sf.csv
```
%% Cell type:code id:abea48b5-e012-4ae2-a53a-e40350f94e20 tags:
``` python
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")
```
# DRAFT! Don't start yet.
# P4 (4% of grade): SQL and HDFS
## Overview
Before starting, please review the [general project directions](../projects.md).
## Corrections/Clarifications
- Mar 5: A hint about HDFS environment variables added; a dataflow diagram added; some minor typos fixed.
- Mar 5: Fix the wrong expected file size in Part 1 and sum of blocks in Part 2.
- Mar 6: Released `autobadger` for `p4` (`0.1.6`)
- Mar 7:
  - Some minor updates to the p4 `README.md`.
  - Updated `autobadger` to version `0.1.7`:
    - Fixed exception handling, so Autobadger now correctly prints error messages.
    - Expanded the expected file size range in test4 `test_Hdfs_size`.
    - Made the error messages clearer.
## Introduction
You'll need to deploy a system including 6 docker containers like this:
<img src="arch.png" width=600>
The data flow is roughly as follows:
<img src="dataflow.png" width=600>
We have provided the other components; you only need to complete the work within the gRPC server and its Dockerfile.
### Client
This project will use `docker exec` to run the client on the gRPC server's container. Usage of `client.py` is as follows:
```
#Inside the server container
python3 client.py DbToHdfs
```

```
export PROJECT=p4
```
**Hint 2:** Think about whether there is any .sh script that will help you quickly test code changes. For example, you may want it to rebuild your Dockerfiles, cleanup an old Compose cluster, and deploy a new cluster.
**Hint 3:** If you're low on disk space, consider running `docker system prune --volumes -f`
## Part 1: `DbToHdfs` gRPC Call
In this part, your task is to implement the `DbToHdfs` gRPC call (you can find the interface definition in the proto file).
**DbToHdfs:** To be more specific, you need to:
1. Connect to the SQL server, with the database name as `CS544` and the password as `abc`. There are two tables in the database: `loans` and `loan_types`. The former records all information related to loans, while the latter maps the numbers in the `loan_type` column of the `loans` table to their corresponding loan types. There should be **447367** rows in table `loans`. It looks like this:
```mysql
mysql> show tables;
+-----------------+
| Tables_in_CS544 |
+-----------------+
| loan_types |
| loans |
+-----------------+
mysql> select count(*) from loans;
+----------+
| count(*) |
+----------+
| 447367 |
+----------+
```
2. What are the actual types for those loans?
   Perform an inner join on these two tables so that a new column `loan_type_name` is added to the `loans` table, where its value is the corresponding `loan_type_name` from the `loan_types` table, matched on the `loan_type_id` in `loans`.
3. Keep only the rows where `loan_amount` is **greater than 30,000** and **less than 800,000**. After filtering, the table should have **426716** rows.
4. Upload the generated table to `/hdma-wi-2021.parquet` in HDFS, with **2x** replication and a **1-MB** block size, using [PyArrow](https://arrow.apache.org/docs/python/generated/pyarrow.fs.HadoopFileSystem.html).
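The steps above might be sketched as follows. This is a minimal illustration, not the required implementation: the toy rows, the `id` key column in `loan_types`, and the commented-out connection/upload details are all assumptions — check the real schema yourself.

```python
import pandas as pd

# Toy stand-ins for the MySQL tables. The real data comes from the CS544
# database; using `id` as the loan_types key column is an assumption.
loans = pd.DataFrame({
    "loan_type_id": [1, 2, 1],
    "loan_amount": [50_000, 20_000, 900_000],
})
loan_types = pd.DataFrame({
    "id": [1, 2],
    "loan_type_name": ["Conventional", "FHA"],
})

# Step 2: an inner join attaches loan_type_name to each loan.
joined = loans.merge(loan_types, left_on="loan_type_id", right_on="id")

# Step 3: keep rows with 30,000 < loan_amount < 800,000.
filtered = joined[(joined["loan_amount"] > 30_000) &
                  (joined["loan_amount"] < 800_000)]

# Step 4 (sketch): upload to HDFS with 2x replication and 1-MB blocks.
# import pyarrow as pa, pyarrow.parquet as pq
# from pyarrow import fs
# hdfs = fs.HadoopFileSystem("nn", 9000, replication=2,
#                            default_block_size=1024 * 1024)
# with hdfs.open_output_stream("/hdma-wi-2021.parquet") as f:
#     pq.write_table(pa.Table.from_pandas(filtered), f)
```

Only the toy join/filter runs as-is; in your server you would read from MySQL instead of constructing DataFrames by hand.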
To check whether the upload was correct, you can use `docker exec -it <container_name> bash` to enter the gRPC server's container and use the HDFS command `hdfs dfs -du -h <path>` to see the file size. The expected result should look like:
```
14.4 M 28.9 M hdfs://nn:9000/hdma-wi-2021.parquet
```
Note: your file size may differ slightly from this.
>That's because when we join two tables, rows from one table get matched with rows in the other, but the order of output rows is not guaranteed. If we have the same rows in a different order, the compressibility of snappy (used by Parquet by default) will vary because it is based on compression windows, and there may be more or less redundancy in a window depending on row ordering.
**Hint 1:** We used similar tables in lecture: https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/tree/main/lec/15-sql
**Hint 2:** To get more familiar with these tables, you can use SQL queries to print the table schema or retrieve sample data. A convenient way is to use `docker exec -it <container name> bash` to enter the SQL Server's container, then run the MySQL client (`mysql -p CS544`) to issue queries.
**Hint 3:** After `docker compose up`, the SQL Server needs some time to load the data before it is ready. Therefore, you need to wait for a while, or preferably, add a retry mechanism for the SQL connection.
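For the retry mechanism in Hint 3, a small helper along these lines may be enough (the helper name and defaults are ours, not part of the project):

```python
import time

def connect_with_retry(connect, attempts=10, delay=2):
    """Call `connect` (any zero-arg callable, e.g. a lambda wrapping
    mysql.connector.connect) until it succeeds or attempts run out."""
    last_err = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as e:
            # The SQL server may still be loading data; wait and retry.
            last_err = e
            time.sleep(delay)
    raise last_err
```

Your server can then do something like `conn = connect_with_retry(lambda: mysql.connector.connect(...))` at startup.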
In this part, your task is to implement the `BlockLocations` gRPC call (you can find the interface definition in the proto file).
For example, running `docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet` should show something like this:
```
{'7eb74ce67e75': 15, 'f7747b42d254': 7, '39750756065d': 8}
```
Note: each DataNode location is the randomly generated container ID of the container running that DataNode.
The documents [here](https://hadoop.apache.org/docs/r3.3.6/hadoop-project-dist/h...) describe the WebHDFS REST API.
Use a `GETFILEBLOCKLOCATIONS` operation to find the block locations.
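Once you have the JSON response, a counting helper like the following may be useful (the response shape follows the WebHDFS docs; the commented-out request, including the `nn:9870` port, is an assumption about your setup):

```python
from collections import Counter

def blocks_per_datanode(resp):
    """Tally how many block replicas each DataNode holds, given a parsed
    GETFILEBLOCKLOCATIONS response."""
    counts = Counter()
    for block in resp["BlockLocations"]["BlockLocation"]:
        for host in block["hosts"]:
            counts[host] += 1
    return dict(counts)

# Hypothetical usage against the NameNode's WebHDFS endpoint:
# import requests
# r = requests.get("http://nn:9870/webhdfs/v1/hdma-wi-2021.parquet",
#                  params={"op": "GETFILEBLOCKLOCATIONS"})
# print(blocks_per_datanode(r.json()))
```

With 2x replication, the per-DataNode counts should sum to twice the number of blocks.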
**Hint:** You have to set the `CLASSPATH` environment variable appropriately to access HDFS correctly. See the example [here](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/lec/18-hdfs/notebook.Dockerfile?ref_type=heads).
## Part 3: `CalcAvgLoan` gRPC Call
In this part, your task is to implement the `CalcAvgLoan` gRPC call (you can find the interface definition in the proto file).
```
docker compose up -d
```
Then run the client like this:
```
docker exec -it p4-server-1 python3 /client.py DbToHdfs
docker exec -it p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet
docker exec -it p4-server-1 python3 /client.py CalcAvgLoan -c 55001
docker exec p4-server-1 python3 /client.py DbToHdfs
docker exec p4-server-1 python3 /client.py BlockLocations -f /hdma-wi-2021.parquet
docker exec p4-server-1 python3 /client.py CalcAvgLoan -c 55001
```
Note that we will copy in the provided files (docker-compose.yml, client.py, lender.proto, hdma-wi-2021.sql.gz, etc.), overwriting anything you might have changed. Please do NOT push hdma-wi-2021.sql.gz to your repo because it is large, and we want to keep the repos small.
Please make sure you have `client.py` copied into the p4-server image. We will r...
## Tester
Please make sure your installed `autobadger` is at version `0.1.7`. You can print the version using
```bash
autobadger --info
```
See [projects.md](https://git.doit.wisc.edu/cdis/cs/courses/cs544/s25/main/-/blob/main/projects.md#testing) for more information.