Showing
with 426 additions and 17402 deletions
1) B
2) C
3) A
4) D
5) D
6) C
7) A
8) C
9) A
10) C
11) D
12) B
13) A
14) B
15) D
16) B
17) B
18) C
19) C
20) A
21) B
22) C
23) B
24) B
25) C
26) B
27) B
28) D
29) C
30) A
File added
1) A
2) B
3) D
4) A
5) A
6) D
7) C
8) C
9) B
10) C
11) A
12) B
13) C
14) B
15) B
16) A
17) A
18) C
19) A
20) C
21) D
22) B
23) D
24) C
25) C
26) D
27) C
28) C
29) C
30) C
\ No newline at end of file
File added
1) C
2) C
3) D
4) D
5) B
6) B
7) A
8) E
9) A
10) C
11) A
12) C
13) C
14) A
15) B
16) A
17) B
18) D
19) A
20) B
21) B
22) D
23) B
24) B
25) A
26) E
27) B
28) A
29) A
30) A
File added
File added
File added
File added
File added
cs320.cs.wisc.edu @ a358174d
Subproject commit a358174dbd16992bb01668111a6c50b345cee33e
Source diff could not be displayed: it is too large. Options to address this: view the blob.
%% Cell type:markdown id: tags:
# Running and Timing Programs
In this notebook, we'll learn how to write programs that can launch other programs and time how long it takes to do things (you'll often be combining these skills to time how long it takes to run a program).
Both these skills are covered in much more detail in an optional reading, Chapter 17 of Automate the Boring Stuff: https://automatetheboringstuff.com/2e/chapter17/. If you decide to read that, we recommend skipping the middle sections, "Multithreading" through "Project: Multithreaded XKCD Downloader"
## Running Programs
### Example 1: Running `pwd`
Remember that running the `pwd` program in a shell tells you what directory you're currently in. Let's write some Python code to run the `pwd` program automatically and capture the output. We'll do this with the `check_output` function in the `subprocess` module (https://docs.python.org/2/library/subprocess.html#subprocess.check_output) -- let's import that.
%% Cell type:code id: tags:
``` python
from subprocess import check_output
```
%% Cell type:markdown id: tags:
In the simplest form, we can pass a program name (as a string) to the function, which will capture and return the output:
%% Cell type:code id: tags:
``` python
output = check_output("pwd")
output
```
%% Output
b'/home/trh/lec3\n'
%% Cell type:markdown id: tags:
What type is that output? It looks like a string, but with a "b" in front. Hmmmm....
%% Cell type:code id: tags:
``` python
type(output)
```
%% Output
bytes
%% Cell type:markdown id: tags:
The `bytes` type in Python is a sequence, like a string. The difference is that a `bytes` object stores raw byte values (integers from 0 to 255), which may represent letters (as in this case) or any other kind of data. If we know the encoding of a bytes sequence, we can convert it to a string as follows:
%% Cell type:code id: tags:
``` python
str_output = str(output, encoding="utf-8")
str_output
```
%% Output
'/home/trh/lec3\n'
%% Cell type:code id: tags:
``` python
type(str_output)
```
%% Output
str
%% Cell type:markdown id: tags:
### Example 2: Checking Versions
What version of git do we have on this computer? From the command line, we could run `git --version` to find out. But let's do that in code. This is a little trickier because we have both a program name, `git`, and an argument, `--version`. The `check_output` function supports two ways of running programs with arguments.
Way 1: pass `shell=True`
%% Cell type:code id: tags:
``` python
check_output("git --version", shell=True)
```
%% Output
b'git version 2.17.1\n'
%% Cell type:markdown id: tags:
Or (preferred), we can pass the program and arguments in one list:
%% Cell type:code id: tags:
``` python
check_output(["git", "--version"])
```
%% Output
b'git version 2.17.1\n'
%% Cell type:markdown id: tags:
Let's actually do the string manipulation work to isolate the version:
%% Cell type:code id: tags:
``` python
output = str(check_output(["git", "--version"]), encoding="utf-8")
output
```
%% Output
'git version 2.17.1\n'
%% Cell type:code id: tags:
``` python
parts = output.strip().split()
parts
```
%% Output
['git', 'version', '2.17.1']
%% Cell type:code id: tags:
``` python
version = parts[-1]
version
```
%% Output
'2.17.1'
%% Cell type:markdown id: tags:
If we needed to have a specific version, we might use the above to have an assert like this:
```
assert version == "2.17.1"
```
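An exact match is often stricter than necessary. The following is a minimal sketch, assuming the version string consists of dot-separated integers (as in the output above): it converts the string into a tuple of ints so that versions compare numerically. The helper name `parse_version` is hypothetical, not part of any library.

```python
def parse_version(version):
    """Convert a dotted version string such as '2.17.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

# Tuples compare element by element, so 2.34.1 is correctly newer than 2.17.1.
# A plain string comparison would get cases like '2.9.0' vs '2.10.0' wrong.
assert parse_version("2.34.1") >= parse_version("2.17.1")
assert parse_version("2.10.0") > parse_version("2.9.0")
```

With this, a check such as `assert parse_version(version) >= (2, 17, 1)` accepts any sufficiently recent version.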
What if the program isn't installed, or we pass it some arguments that cause it to crash, as in the following example? We'll want to catch some exceptions in these scenarios:
%% Cell type:code id: tags:
``` python
import subprocess
try:
    output = str(check_output(["git", "--oops"]), encoding="utf-8")
except FileNotFoundError:
    print("program not installed?")
except subprocess.CalledProcessError as e:
    print("program crashed")
    # if there were any output before it crashed, we could look at it
    # with this:
    print("OUTPUT:", e.output)
```
%% Output
program crashed
OUTPUT: b''
%% Cell type:markdown id: tags:
### Example 3: Making Animations
A common situation is that there will be some program that does something useful that we can't directly do in Python, and we'll want to write Python code to run these external programs to make use of their features.
For example, the `ffmpeg` program can make an animated video by glueing together a bunch of `.png` image files in sequence. There are ways to make animations directly in Python, but for now let's see how we can execute `ffmpeg` with `check_output` to make a video.
First, you should install the `ffmpeg` program on Ubuntu so we can use it -- run the following in the shell:
```
sudo apt install ffmpeg
```
Now, let's write some code to make a series of plots with a red dot in different positions, and save those plots as `0.png`, `1.png`, etc. The idea is that these images are similar enough that if you flipped through them, it would look like a rough video.
%% Cell type:code id: tags:
``` python
import os
import matplotlib
from matplotlib import pyplot as plt
```
%% Cell type:code id: tags:
``` python
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
matplotlib.rcParams["font.size"] = 16
```
%% Cell type:code id: tags:
``` python
def plot_circle(filename, x, y):
    fig, ax = plt.subplots()
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.plot(x, y, 'ro', markersize=20)
    fig.savefig(os.path.join("img", filename))

if not os.path.exists("img"):
    os.mkdir("img")
plot_circle("0.png", x=0, y=0)
plot_circle("1.png", x=0.25, y=0.25)
plot_circle("2.png", x=0.5, y=0.5)
plot_circle("3.png", x=0.75, y=0.25)
plot_circle("4.png", x=1, y=0)
```
%% Output
%% Cell type:markdown id: tags:
Let's check that we created the png files in the `img` directory:
%% Cell type:code id: tags:
``` python
os.listdir("img")
```
%% Output
['0.png', '1.png', '4.png', '2.png', '3.png']
%% Cell type:markdown id: tags:
Let's also check that they look right. In the `IPython.display` module, there are `Image(...)` and `HTML(...)` functions that are useful for loading pictures and HTML directly into our notebook. Let's use the first function to check that our `0.png` file looks right.
%% Cell type:code id: tags:
``` python
from IPython.display import Image, HTML
Image(filename='img/0.png')
```
%% Output
<IPython.core.display.Image object>
%% Cell type:markdown id: tags:
Great! Now, from the command line, try running this command, inside the same directory where this notebook is running:
```
ffmpeg -y -framerate 5 -i img/%d.png out.mp4
```
If it succeeds, there should be an out.mp4 file generated. Try downloading it to your computer via the Jupyter interface (don't try to open it directly from Jupyter) and open it on your laptop. Cool, huh?
Now let's try running that same command with `check_output`. We'll need to break up all the arguments into different entries in a list:
%% Cell type:code id: tags:
``` python
check_output(['ffmpeg', '-y', '-framerate', '5', '-i', 'img/%d.png', 'out.mp4'])
```
%% Output
b''
%% Cell type:markdown id: tags:
There was no output, which is fine. But it should have created an `out.mp4` file, as before. You can embed `.mp4` video files in websites with the `<video>` tag. This is great because we can inject HTML using the `HTML(...)` function from earlier:
%% Cell type:code id: tags:
``` python
HTML("This is <b>bold</b> text.")
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
The natural thing to do is to inject some HTML to embed the `out.mp4` animation we just created:
%% Cell type:code id: tags:
``` python
HTML("""
<video width="320" height="240" controls>
<source src="out.mp4" type="video/mp4">
</video>
""")
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:markdown id: tags:
## Measuring Time
The easiest way to measure how long something takes is to check the time before and after we do it. We can check with the `time` function inside the `time` module. This function returns the number of seconds elapsed since Jan 1, 1970:
%% Cell type:code id: tags:
``` python
import time
now = time.time()
now
```
%% Output
1580137926.8447468
%% Cell type:code id: tags:
``` python
minutes = now / 60
hours = minutes / 60
days = hours / 24
years = days / 365
years # should be about number of years since 1970 -- is it?
```
%% Output
50.10584496590395
%% Cell type:markdown id: tags:
Let's use this to time how long a print call takes:
%% Cell type:code id: tags:
``` python
before = time.time()
print("I'm printing something")
after = time.time()
print("It took", (after-before), "seconds to print")
```
%% Output
I'm printing something
It took 0.0007390975952148438 seconds to print
%% Cell type:markdown id: tags:
A slightly cleaner version of the same that computes milliseconds (1 ms is 1/1000 of a second):
%% Cell type:code id: tags:
``` python
t0 = time.time()
print("I'm printing something")
t1 = time.time()
ms = (t1-t0) * 1000
print("It took", ms, "ms to print")
```
%% Output
I'm printing something
It took 0.7138252258300781 ms to print
%% Cell type:markdown id: tags:
How long does it take to append something to the end of a list?
%% Cell type:code id: tags:
``` python
L = []
t0 = time.time()
L.append("test")
t1 = time.time()
us = (t1-t0) * 1e6 # microseconds (1 second is 1,000,000 microseconds)
print("microseconds:", us)
```
%% Output
microseconds: 57.45887756347656
%% Cell type:markdown id: tags:
The problem with the above measurement is that it varies significantly each time you try it, and we can easily end up measuring something other than append time. For example, what if calling `time.time()` is much slower than calling `L.append("test")`? It is better to perform an operation many times between checking the start and stop times, and then divide to get the average cost of the operation:
%% Cell type:code id: tags:
``` python
L = []
append_count = 1000000 # do 1 million appends
t0 = time.time()
for i in range(append_count):
    L.append("test")
t1 = time.time()
us = (t1-t0) / append_count * 1e6 # microseconds (1 second is 1,000,000 microseconds)
print("microseconds:", us)
```
%% Output
microseconds: 0.11862897872924805
%% Cell type:markdown id: tags:
### Example 1: the `in` operator
The `in` operator can be used to check whether a value is in a list or a set, but it's much faster on a set. If your code needs to perform the `in` operation a lot, this is a good reason to use a set rather than a list. Let's review how `in` works on each:
%% Cell type:code id: tags:
``` python
L = ["A", "B", "C"]
S = {"A", "B", "C"}
```
%% Cell type:code id: tags:
``` python
"A" in L, "D" in L, "A" in S, "D" in S
```
%% Output
(True, False, True, False)
%% Cell type:markdown id: tags:
Let's see how fast `in` is if we are checking over 1 million numbers in a list or a set.
%% Cell type:code id: tags:
``` python
seq_size = 1000000
L = list(range(seq_size))
S = set(range(seq_size))
# return average microseconds to perform lookup
def time_lookup(data, search):
    trials = 1000
    t0 = time.time()
    for i in range(trials):
        found = search in data
    t1 = time.time()
    return (t1-t0)*1e6/trials

time_lookup(L, 0), time_lookup(S, 0)
```
%% Output
(0.03719329833984375, 0.03743171691894531)
%% Cell type:markdown id: tags:
Ok, looks like looking up `0` (the first number) is about equally fast in either data structure.
What if we look up a number that's not stored?
%% Cell type:code id: tags:
``` python
time_lookup(L, -1), time_lookup(S, -1)
```
%% Output
(10068.64070892334, 0.0438690185546875)
%% Cell type:markdown id: tags:
Whoa, now the list lookup is over 200,000 times slower than the set lookup! What if we look up the last item in the list?
%% Cell type:code id: tags:
``` python
time_lookup(L, 999999), time_lookup(S, 999999)
```
%% Output
(11480.518341064453, 0.0591278076171875)
%% Cell type:markdown id: tags:
The set is fast again, but the list is still really slow (about as slow as looking up something that doesn't exist). What if we look up a number in the middle?
%% Cell type:code id: tags:
``` python
time_lookup(L, 500000), time_lookup(S, 500000)
```
%% Output
(5739.615678787231, 0.05888938903808594)
%% Cell type:markdown id: tags:
Well, checking for something in the middle of a list is about twice as fast as checking for the last item. Can you guess why?
It turns out that while sets are designed around making `in` fast, running `in` on a list amounts to looping over every item, much like a call to the following function.
%% Cell type:code id: tags:
``` python
def is_in(L, search):
    for item in L:
        if search == item: # if this is True early in the list, the search is fast
            return True
    return False
```
%% Cell type:markdown id: tags:
How does the list size factor in when we perform an `in` and we don't find anything? Let's do an experiment to find out.
%% Cell type:code id: tags:
``` python
from pandas import Series
times = Series(dtype=float) # explicit dtype avoids an object-typed empty Series
for size in [1000, 2000, 5000, 10000]:
    L = list(range(size))
    microseconds = time_lookup(L, -1)
    times.loc[size] = microseconds
times
```
%% Output
1000 9.510994
2000 18.847942
5000 47.323465
10000 95.795155
dtype: float64
%% Cell type:code id: tags:
``` python
ax = times.plot.line(color="r")
# following makes plot look better (only necessary if we plan to share it with others)
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.set_xlabel("List Size")
ax.set_ylabel("Lookup Miss Time (μs)")
None
```
%% Output
%% Cell type:markdown id: tags:
Looking at the above, we would say that the `in` operator scales *linearly*. In other words, doubling the list size doubles the time it takes to perform the operation.
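One rough way to check the linear claim is to divide each measured time by its list size: if the scaling is linear, the cost per item should stay roughly constant. The sketch below reuses the four timings printed above (they will differ on another machine); the variable names are only illustrative.

```python
# Miss times in microseconds, copied from the experiment above, keyed by list size.
measured = {1000: 9.510994, 2000: 18.847942, 5000: 47.323465, 10000: 95.795155}

# Under linear scaling, time divided by size is (approximately) a constant.
per_item = {size: us / size for size, us in measured.items()}

for size, cost in per_item.items():
    # convert microseconds per item to nanoseconds per item for readability
    print(size, round(cost * 1000, 2), "ns per item")
```

The per-item costs differ by only a few percent across a 10x range of sizes, which is what we would expect from a loop that touches every item once.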
### Example 2: Ratio Search
Not all functions we'll encounter will scale linearly. For example, consider this one, which checks whether the ratio of any two numbers in a list matches the ratio we're searching for:
%% Cell type:code id: tags:
``` python
def ratio_search(L, ratio):
    for numerator in L:
        for denominator in L:
            if numerator / denominator == ratio:
                return True
    return False

ratio_search([1, 2, 3, 4], 0.75)
```
%% Output
True
%% Cell type:code id: tags:
``` python
ratio_search([1, 2, 3, 4], 0.2)
```
%% Output
False
%% Cell type:markdown id: tags:
Let's see how it scales when we search for a ratio we know we won't find.
%% Cell type:code id: tags:
``` python
import random, string
times = Series(dtype=float) # explicit dtype avoids an object-typed empty Series
for i in range(6):
    size = i * 1000
    L = list(range(1, size+1)) # don't include 0, because we need to divide
    t0 = time.time()
    found = ratio_search(L, -1)
    t1 = time.time()
    times.loc[size] = t1-t0
times
```
%% Output
0 0.000003
1000 0.056291
2000 0.222333
3000 0.499056
4000 0.890178
5000 1.397182
dtype: float64
%% Cell type:code id: tags:
``` python
ax = times.plot.line(color="r")
# following makes plot look better (only necessary if we plan to share it with others)
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.set_xlabel("List Size")
ax.set_ylabel("Search Miss Time (seconds)")
None
```
%% Output
%% Cell type:markdown id: tags:
The above is an example of quadratic scaling: doubling the list size quadruples the time it takes to run!
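To see where the factor of four comes from without relying on noisy timings, we can count how many numerator/denominator pairs a failed search examines. This is a small sketch; `count_ratio_checks` is a hypothetical helper that mirrors the nested loops of `ratio_search` but counts iterations instead of dividing.

```python
def count_ratio_checks(size):
    """Count the pairs that a failed ratio search over `size` numbers would examine."""
    L = list(range(1, size + 1))
    checks = 0
    for numerator in L:
        for denominator in L:
            checks += 1  # one division and comparison would happen here
    return checks

small = count_ratio_checks(100)
large = count_ratio_checks(200)
print(small, large, large / small)
```

A list of n numbers leads to n * n checks, so doubling n multiplies the work by four.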
%% Cell type:markdown id: tags:
# Conclusion
In this notebook, we've learned how to automatically run programs and time code. Together, these skills provide the empirical basis for exploring performance and scalability. Soon, we'll be learning a bit of theory (complexity analysis) and notation (big-O) for thinking about what happens to performance as we add more data.
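As a closing sketch that combines both skills, the helper below runs a command several times and reports the average time per run. The name `average_ms` is hypothetical, and the command uses `sys.executable` (the path of the running Python interpreter) only so that the example does not depend on any other program being installed.

```python
import sys
import time
from subprocess import check_output

def average_ms(command, runs=5):
    """Run `command` (a list of strings) `runs` times; return the mean time in ms."""
    t0 = time.time()
    for _ in range(runs):
        check_output(command)
    t1 = time.time()
    return (t1 - t0) * 1000 / runs

# Time how long it takes to start Python and run an empty program.
ms = average_ms([sys.executable, "-c", "pass"], runs=3)
print("average ms per run:", ms)
```

Averaging over several runs smooths out the run-to-run variation discussed earlier; more runs give a steadier number at the cost of a longer wait.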
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id:d15a3b25 tags:
# Performance 1
%% Cell type:markdown id:64bfcf90 tags:
%% Cell type:markdown id:cd7f646c tags:
### A few shortcuts
* shift + enter = execute a cell (= Run) and move to the next cell
* ctrl + enter = execute a cell and stay in the same cell
* ESC + A = add a cell above the current cell
* ESC + B = add a cell below the current cell
* ctrl + / = toggle comment(s) (that is, adds/removes #)
%% Cell type:markdown id:ea8c9210 tags:
Recommendation: include all `import` statements in a cell at the top of the notebook file or your script file (`.py`).
### Two styles of import
1. `from <module> import <some_function, some_variable>`
- invocation `some_function()`
2. `import <module>`
- invocation `<module>.some_function()`
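A short illustration of the two styles, using the standard `math` module as a stand-in example (it is not otherwise used in this lecture):

```python
# Style 1: import specific names; call them without a prefix.
from math import sqrt

# Style 2: import the whole module; call through the module name.
import math

print(sqrt(16))       # style 1 invocation
print(math.sqrt(16))  # style 2 invocation
```

Both calls run the same function; the styles differ only in how the name is written at the call site.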
%% Cell type:code id:4782ff79 tags:
``` python
# import statements
# TODO: use from style of import for importing "check_output" from subprocess
from subprocess import check_output
# TODO: use import style of import for importing "time" module
import time
```
%% Cell type:markdown id:de8ea97c tags:
### How to open documentation about a function inside `jupyter`?
Press "Shift + tab" after entering function name.
%% Cell type:code id:4a61ed24 tags:
``` python
# TODO: open documentation for check_output
check_output
```
%% Output
<function subprocess.check_output(*popenargs, timeout=None, **kwargs)>
%% Cell type:markdown id:fc3025fc tags:
### What does `check_output` do?
Enables us to run a command with or without arguments. It returns the output of the command.
- Argument: command to run
- Return value: output of the command as a `bytes` object.
%% Cell type:code id:e09ab658 tags:
``` python
# TODO: invoke check_output to execute "pwd"
pwd_output = check_output("pwd")
pwd_output
```
%% Output
b'/home/msyamkumar/temp/cs320-lecture-notes/lec_04_Performance_1\n'
b'/home/gurmail.singh\n'
%% Cell type:code id:668afc99 tags:
``` python
# TODO: use type function call to check the output type of check_output
type(pwd_output)
```
%% Output
bytes
%% Cell type:markdown id:6dffec77 tags:
### What is a `bytes` object?
- `bytes` is an example of a sequence.
- Recall that `list`, `str`, `tuple` are examples of Python sequences.
- Key sequence features:
- indexing `seq[index]`
- slicing `seq[start index:exclusive end index]`
- iteration `for val in seq:`
- length `len(seq)`
- existence / constituency match `<val> in seq`
- indexing:
- begins with 0 and increases by 1 for every value
- can use negative values: -1 represents the index of the last value, -2 the penultimate, etc.
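The sequence features listed above can be sketched on a small `bytes` value. The value below is hypothetical, chosen to resemble the output of `pwd`:

```python
data = b"/home/user\n"      # hypothetical value, similar to the output of pwd

first = data[0]             # indexing a bytes object yields an int (47 is "/")
piece = data[1:5]           # slicing yields another bytes object
last = data[-1]             # negative index: the final byte (10 is the newline)
count = len(data)           # number of bytes in the sequence
has_home = b"home" in data  # membership test with a bytes sub-sequence

for value in data[:3]:      # iteration yields ints, one per byte
    print(value)
```

Note the difference from `str`: indexing a string gives a one-character string, while indexing `bytes` gives an integer.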
%% Cell type:code id:70e41d0b-a12a-4e35-be97-083aa91af28f tags:
``` python
# TODO: use indexing to extract value at index 0
pwd_output[0]
```
%% Output
47
%% Cell type:markdown id:60504389 tags:
### `bytes` conversion to `str`
- requires details about encoding
- `str(<bytes_variable>, <encoding>)`
- Most programs in Linux use `utf-8` encoding
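A minimal sketch of the conversion in both directions, using a hypothetical bytes value. The `decode` method on `bytes` objects is an equivalent alternative to the `str(..., encoding)` form:

```python
raw = b"/home/user\n"          # hypothetical bytes, as check_output might return

text_a = str(raw, "utf-8")     # the form used in this notebook
text_b = raw.decode("utf-8")   # equivalent method on the bytes object
assert text_a == text_b

# Going the other way: encode a str back into bytes.
assert text_a.encode("utf-8") == raw
```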
%% Cell type:code id:09e5bd77 tags:
``` python
# Can we just convert bytes directly into str?
# Not really, you need to specify the encoding
str(pwd_output)
```
%% Output
"b'/home/msyamkumar/temp/cs320-lecture-notes/lec_04_Performance_1\\n'"
"b'/home/gurmail.singh\\n'"
%% Cell type:code id:fbc2ad71 tags:
``` python
# TODO: let's try utf-8 encoding
pwd_output_str = str(pwd_output, "utf-8")
pwd_output_str
```
%% Output
'/home/msyamkumar/temp/cs320-lecture-notes/lec_04_Performance_1\n'
'/home/gurmail.singh\n'
%% Cell type:markdown id:5ce854cd tags:
Recall that when you print a `str`, escape sequences such as `\n` are rendered as the characters they represent (here, a newline) rather than shown literally.
%% Cell type:code id:8e0c335c tags:
``` python
print(pwd_output_str)
```
%% Output
/home/msyamkumar/temp/cs320-lecture-notes/lec_04_Performance_1
/home/gurmail.singh
%% Cell type:code id:da2f1bc1 tags:
``` python
# You must use the correct encoding; otherwise the conversion may fail or silently produce garbled text
str(pwd_output, "cp273")
```
%% Output
'\x07Ç?_Á\x07_Ë`/_,Í_/Ê\x07ÈÁ_ø\x07[Ë\x93\x16\x90\x05%Á[ÈÍÊÁ\x05>?ÈÁË\x07%Á[^\x90\x94^&ÁÊÃ?Ê_/>[Á^\x91\x8e'
'\x07Ç?_Á\x07ÅÍÊ_/Ñ%\x06ËÑ>ÅÇ\x8e'
%% Cell type:markdown id:11ce814c tags:
### `str` methods recap
- `<str_variable>.strip()`: removes leading and trailing whitespace
- `<str_variable>.split(<separator>)`: returns list of strings split by separator
%% Cell type:code id:8d8bb61a-39ce-404b-81a1-992434ef26de tags:
``` python
# TODO: try strip method
pwd_output_str.strip()
```
%% Output
'/home/msyamkumar/temp/cs320-lecture-notes/lec_04_Performance_1'
'/home/gurmail.singh'
%% Cell type:code id:fabe2232 tags:
``` python
# TODO: try split method using "/" as separator
pwd_output_str.split("/")
```
%% Output
['',
'home',
'msyamkumar',
'temp',
'cs320-lecture-notes',
'lec_04_Performance_1\n']
['', 'home', 'gurmail.singh\n']
%% Cell type:code id:27590458-53b4-4b8d-b7db-25ea54e304c3 tags:
``` python
# You can string methods or function calls together
# TODO: first strip and then split the string
pwd_output_str.strip().split("/")
```
%% Output
['',
'home',
'msyamkumar',
'temp',
'cs320-lecture-notes',
'lec_04_Performance_1']
['', 'home', 'gurmail.singh']
%% Cell type:markdown id:a83a11b3 tags:
### What does `check_output` do when the command doesn't exist?
- `FileNotFoundError`
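If a missing program should not stop the whole notebook, the exception can be caught. This is a sketch only: `run_or_none` is a hypothetical helper, and it assumes no program named `hahaha` is installed.

```python
from subprocess import check_output

def run_or_none(command):
    """Return the command's output as bytes, or None if the program is not found."""
    try:
        return check_output(command)
    except FileNotFoundError:
        return None

result = run_or_none("hahaha")
print(result)
```

Returning `None` is one design choice among several; printing a message or re-raising with a clearer error would work too.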
%% Cell type:code id:b865a24c-bf5f-4fbf-be5b-0b578df66df8 tags:
``` python
# TODO: invoke check_output by passing "hahaha" as argument
check_output("hahaha")
```
%% Output
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[13], line 2
1 # TODO: invoke check_output by passing "hahaha" as argument
----> 2 check_output("hahaha")
File /usr/lib/python3.10/subprocess.py:420, in check_output(timeout, *popenargs, **kwargs)
417 empty = b''
418 kwargs['input'] = empty
--> 420 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
421 **kwargs).stdout
File /usr/lib/python3.10/subprocess.py:501, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
498 kwargs['stdout'] = PIPE
499 kwargs['stderr'] = PIPE
--> 501 with Popen(*popenargs, **kwargs) as process:
502 try:
503 stdout, stderr = process.communicate(input, timeout=timeout)
File /usr/lib/python3.10/subprocess.py:969, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
965 if self.text_mode:
966 self.stderr = io.TextIOWrapper(self.stderr,
967 encoding=encoding, errors=errors)
--> 969 self._execute_child(args, executable, preexec_fn, close_fds,
970 pass_fds, cwd, env,
971 startupinfo, creationflags, shell,
972 p2cread, p2cwrite,
973 c2pread, c2pwrite,
974 errread, errwrite,
975 restore_signals,
976 gid, gids, uid, umask,
977 start_new_session)
978 except:
979 # Cleanup if the child failed starting.
980 for f in filter(None, (self.stdin, self.stdout, self.stderr)):
File /usr/lib/python3.10/subprocess.py:1845, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
1843 if errno_num != 0:
1844 err_msg = os.strerror(errno_num)
-> 1845 raise child_exception_type(errno_num, err_msg, err_filename)
1846 raise child_exception_type(err_msg)
File /usr/lib/python3.10/subprocess.py:421, in check_output(timeout, *popenargs, **kwargs)
418 empty = b''
419 kwargs['input'] = empty
--> 421 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
422 **kwargs).stdout
File /usr/lib/python3.10/subprocess.py:503, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
500 kwargs['stdout'] = PIPE
501 kwargs['stderr'] = PIPE
--> 503 with Popen(*popenargs, **kwargs) as process:
504 try:
505 stdout, stderr = process.communicate(input, timeout=timeout)
File /usr/lib/python3.10/subprocess.py:971, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
967 if self.text_mode:
968 self.stderr = io.TextIOWrapper(self.stderr,
969 encoding=encoding, errors=errors)
--> 971 self._execute_child(args, executable, preexec_fn, close_fds,
972 pass_fds, cwd, env,
973 startupinfo, creationflags, shell,
974 p2cread, p2cwrite,
975 c2pread, c2pwrite,
976 errread, errwrite,
977 restore_signals,
978 gid, gids, uid, umask,
979 start_new_session)
980 except:
981 # Cleanup if the child failed starting.
982 for f in filter(None, (self.stdin, self.stdout, self.stderr)):
File /usr/lib/python3.10/subprocess.py:1863, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, gid, gids, uid, umask, start_new_session)
1861 if errno_num != 0:
1862 err_msg = os.strerror(errno_num)
-> 1863 raise child_exception_type(errno_num, err_msg, err_filename)
1864 raise child_exception_type(err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'hahaha'
%% Cell type:markdown id:8512a276 tags:
### How can we use `check_output` to execute a command with arguments?
- option 1: pass the command with arguments as a string and pass `True` as argument to parameter `shell`
- option 2: pass a list of strings; for example: `[<command>, <arg1>, <arg2>]`
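When a command starts out as a single string, the standard `shlex` module can split it into the list form needed for option 2. A small sketch:

```python
import shlex

command = "git --version"
args = shlex.split(command)   # splits the way a POSIX shell would, honoring quotes
print(args)

# Quoted arguments stay together as a single list entry.
quoted = shlex.split('echo "hello world"')
print(quoted)
```

The resulting list can be passed directly to `check_output`, which avoids launching a shell at all.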
%% Cell type:markdown id:9a5d0722 tags:
### git --version
%% Cell type:code id:fad47b1d-1f52-4c8e-8df4-0847fe1718c6 tags:
``` python
# TODO: use option 1 to run "git --version"
check_output("git --version", shell=True)
```
%% Output
b'git version 2.34.1\n'
%% Cell type:markdown id:632699d8 tags:
What would happen if we switch the order of the two arguments? Recall that positional arguments should come before keyword arguments.
%% Cell type:code id:6efc81d6 tags:
``` python
check_output(shell=True, "git --version")
```
%% Output
Cell In[15], line 1
check_output(shell=True, "git --version")
^
SyntaxError: positional argument follows keyword argument
%% Cell type:code id:847b3c3d-bd9b-45d9-acf7-745a562c579e tags:
``` python
# TODO: use option 2 to run "git --version"
check_output(["git", "--version"])
```
%% Output
b'git version 2.34.1\n'
%% Cell type:code id:7bcf8ef4 tags:
``` python
# TODO: combine check_output with str typecast
git_version_str = str(check_output(["git", "--version"]), "utf-8")
```
%% Cell type:code id:d0d6c938 tags:
``` python
# TODO: write code to extract just the version number
print(git_version_str.strip().split(" ")[-1]) # option 1
print(git_version_str[-7:-1]) # option 2
```
%% Output
2.34.1
2.34.1
%% Cell type:markdown id:2ef0b826-cb54-4d43-abe4-7dda3425ce65 tags:
### How long does it take to run code?
Let's learn about the `time` function in the `time` module. It returns the current time as the number of seconds since the epoch.
What is the epoch? On Unix-like systems, the epoch is January 1, 1970 (00:00 UTC). **FUN FACT**: the epoch is considered the beginning of time for computers.
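To make the epoch concrete, a timestamp can be converted into a readable date with the standard `datetime` module. In this sketch, timestamp 0 is the epoch itself:

```python
import time
from datetime import datetime, timezone

# Timestamp 0 corresponds to the epoch itself.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch.isoformat())

# The current timestamp, shown as a readable UTC date and time.
now = datetime.fromtimestamp(time.time(), tz=timezone.utc)
print(now.isoformat())
```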
%% Cell type:code id:89b5982c-40b5-498f-9acf-f05a9d233f59 tags:
``` python
# TODO: invoke time module time function
# keep in mind that we used import style of import
time.time()
# number of seconds since Jan 1, 1970
```
%% Output
1675231286.744164
1706617377.4253736
%% Cell type:code id:0c41fd03-d2b8-4641-a288-4b95c48b5d24 tags:
``` python
start_time = time.time()
# DO SOMETHING (e.g., check_output)
end_time = time.time()
print(end_time - start_time)
```
%% Output
3.814697265625e-05
4.5299530029296875e-05
%% Cell type:code id:31737e44 tags:
``` python
# TODO: let's convert to milliseconds
print((end_time-start_time) * 1e3)
# TODO: let's convert to microseconds
print((end_time-start_time) * 1e6)
```
%% Output
0.03814697265625
38.14697265625
0.045299530029296875
45.299530029296875
%% Cell type:markdown id:01651895 tags:
How long does it take to run simple computations (example: 4 + 5)?
%% Cell type:code id:0b739045-bfbe-439e-ac57-7e3f069db7b6 tags:
``` python
start_time = time.time()
x = 4 + 5
end_time = time.time()
print(end_time - start_time)
```
%% Output
7.677078247070312e-05
4.9591064453125e-05
%% Cell type:markdown id:1e378bfb tags:
How long does it take to print simple computations (example: 4 + 5)?
%% Cell type:code id:2665edb8-bd7d-4cf7-baf5-5143bd6f25a0 tags:
``` python
start_time = time.time()
print(4 + 5)
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
9
1.7392635345458984
0.9014606475830078
%% Cell type:markdown id:a0b5c105 tags:
Printing is a relatively slow operation. If your program prints a lot of things, its performance may suffer!
%% Cell type:markdown id:030b867b tags:
How long does it take to run a Python program?
Recall that the `-c` option lets us pass code to Python as a string on the command line, without a script file:
`python3 -c "code"`
%% Cell type:code id:991218f8 tags:
``` python
start_time = time.time()
check_output(["python3", "-c", "print(4 + 5)"])
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
30.985593795776367
31.372547149658203
%% Cell type:markdown id:1c68a8cd tags:
### Every time we run a command, we get a slightly different timing. How can we reduce the noise?
%% Cell type:markdown id:7dad0036 tags:
Let's try this with "pwd".
%% Cell type:code id:5b4dba6f tags:
``` python
start_time = time.time()
check_output("pwd")
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
2.9201507568359375
4.441499710083008
%% Cell type:markdown id:f853aab6 tags:
Recall that the built-in `range` function produces a sequence of integers starting at 0.
%% Cell type:code id:b45228b7 tags:
``` python
iters = 1000
start_time = time.time()
for i in range(iters):
    check_output("pwd")
end_time = time.time()
print((end_time-start_time) * 1e3 / iters)
```
%% Output
1.6988284587860107
1.8102648258209229
%% Cell type:markdown id:ff7e8971 tags:
### Data structures review
- lists (sequence: ordered)
- sets (not a sequence: not ordered):
- indexing doesn't work, but `in` operator works
- only stores unique values
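A quick sketch of the set properties listed above, using made-up values:

```python
values = [11, 22, 22, 33, 33, 33]
unique = set(values)          # duplicates collapse: only unique values remain
print(len(values), len(unique))

print(22 in unique)           # the `in` operator works on sets

try:
    unique[0]                 # sets have no positions, so indexing fails
except TypeError as e:
    print("indexing failed:", e)
```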
%% Cell type:code id:10a09035 tags:
``` python
# TODO: create a simple list of integers
some_numbers = [11, 22, 33]
some_numbers
```
%% Output
[11, 22, 33]
%% Cell type:code id:553b126b tags:
``` python
# TODO: use range() to produce a list containing 1000000 numbers
some_numbers = list(range(1000000))
```
%% Cell type:markdown id:a73c753a tags:
`in` operator: existence / constituency match
%% Cell type:code id:e927c0b7 tags:
``` python
100 in some_numbers
```
%% Output
True
%% Cell type:code id:896176d0 tags:
``` python
-20 in some_numbers
```
%% Output
False
%% Cell type:markdown id:bc5f912b tags:
How long does the `in` operator take? It depends on where the item we are searching for sits in the list.
%% Cell type:code id:e0eda15b tags:
``` python
# TODO: time how long it takes to find 99 in some_numbers
start_time = time.time()
99 in some_numbers
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
0.09298324584960938
0.08988380432128906
%% Cell type:code id:ae667c2f tags:
``` python
# TODO: time how long it takes to find 999999 in some_numbers
start_time = time.time()
999999 in some_numbers
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
11.307954788208008
11.118888854980469
%% Cell type:code id:e77a228d-1709-4f9a-b8ff-7c97fbda38bc tags:
``` python
# TODO: time how long it takes to find -1 in some_numbers
start_time = time.time()
-1 in some_numbers
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
11.208295822143555
9.768009185791016
%% Cell type:code id:3ba890dc tags:
``` python
# TODO: create a simple set of numbers
some_set = {11, 22, 33}
some_set
```
%% Output
{11, 22, 33}
%% Cell type:code id:64d7b73f-d8a9-43c7-a3a3-01e9aaa9a09f tags:
``` python
# TODO: convert some_numbers into set
some_set = set(some_numbers)
```
%% Cell type:code id:a55e59ef-31e3-4468-a0db-aad1c3fc291d tags:
``` python
# TODO: time how long it takes to find -1 in some_set
start_time = time.time()
-1 in some_set
end_time = time.time()
print((end_time-start_time) * 1e3)
```
%% Output
0.06175041198730469
0.11181831359863281
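%% Cell type:markdown id: tags:
A single set lookup is so fast that one `time.time()` measurement is likely dominated by timer overhead and noise. The sketch below repeats each lookup with the standard-library `timeit` module to get a steadier comparison. The list size and repeat count are arbitrary choices for this illustration.
%% Cell type:code id: tags:
```python
import timeit

numbers = list(range(100_000))
numbers_set = set(numbers)

# Search for a value that is absent, the worst case for a list scan.
# number=50 repeats each lookup 50 times; timeit returns the total in seconds.
list_ms = timeit.timeit(lambda: -1 in numbers, number=50) * 1e3 / 50
set_ms = timeit.timeit(lambda: -1 in numbers_set, number=50) * 1e3 / 50

print(f"list: {list_ms:.4f} ms per lookup")
print(f"set:  {set_ms:.4f} ms per lookup")
```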
......
%% Cell type:markdown id:d617eefb tags:
# Performance 2
%% Cell type:code id:783117c5-146f-454a-963e-ed2873b8a6d3 tags:
``` python
# known import statements
import pandas as pd
import csv
from subprocess import check_output
# new import statements
import zipfile
from io import TextIOWrapper
```
%% Cell type:markdown id:4e2be82d tags:
### Let's take a look at the files inside the current working directory.
%% Cell type:code id:4eaa8a8d tags:
``` python
str(check_output(["ls", "-lh"]), encoding="utf-8").split("\n")
```
%% Output
['total 21M',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 2.0K Jan 30 20:49 lec2.ipynb',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 5.2K Feb 1 13:08 lecture.ipynb',
'-rw------- 1 gurmail.singh gurmail.singh 230K Feb 1 13:09 nohup.out',
'drwxrwxr-x 3 gurmail.singh gurmail.singh 4.0K Jan 30 20:42 paper',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 39 Jan 25 18:32 paper1.txt',
'drwxrwxr-x 8 gurmail.singh gurmail.singh 4.0K Jan 30 14:06 s24',
'drwx------ 3 gurmail.singh gurmail.singh 4.0K Jan 30 12:31 snap',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 21M Feb 1 12:44 wi.zip',
'']
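%% Cell type:markdown id: tags:
The list above ends with an empty string because the command's output ends with a newline. A possible alternative, not used in the lecture, is to pass `text=True` so that `check_output` returns a `str`, and to call `splitlines()`, which does not produce a trailing empty entry. So that the sketch runs anywhere, it launches a tiny Python child process as a stand-in for `ls`.
%% Cell type:code id: tags:
```python
import sys
from subprocess import check_output

# Stand-in command that prints two lines; ["ls", "-lh"] could be used the same way.
cmd = [sys.executable, "-c", "print('first line'); print('second line')"]

# text=True decodes the bytes output to str using the default encoding.
output = check_output(cmd, text=True)

lines = output.splitlines()  # no trailing '' entry, unlike split("\n")
print(lines)
```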
%% Cell type:markdown id:b8c7dc7f tags:
### Let's `unzip` "wi.zip".
%% Cell type:code id:ed32cf4c tags:
``` python
check_output(["unzip", "wi.zip"])
```
%% Output
b'Archive: wi.zip\n inflating: wi.csv \n'
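%% Cell type:markdown id: tags:
The `zipfile` and `TextIOWrapper` imports at the top of this notebook suggest another option: reading a CSV directly from the archive without extracting it. The sketch below builds a tiny in-memory archive so that it is self-contained. The file name `demo.csv` and its two columns are made up for illustration. With the lecture data, one would presumably open "wi.zip" and the "wi.csv" member instead.
%% Cell type:code id: tags:
```python
import csv
import io
import zipfile
from io import TextIOWrapper

# Build a tiny archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo.csv", "state_code,loan_amount\nWI,100\nWI,250\n")

buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    # zf.open() yields bytes; TextIOWrapper decodes them so csv can read text.
    with zf.open("demo.csv") as raw:
        reader = csv.DictReader(TextIOWrapper(raw, encoding="utf-8", newline=""))
        rows = list(reader)

print(names)
print(rows)
```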
%% Cell type:markdown id:4eac1b48 tags:
### Let's take another look at the files inside the current working directory. "wi.csv" should now be present.
%% Cell type:code id:a6852e43 tags:
``` python
str(check_output(["ls", "-lh"]), encoding="utf-8").split("\n")
```
%% Output
['total 198M',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 2.0K Jan 30 20:49 lec2.ipynb',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 5.2K Feb 1 13:08 lecture.ipynb',
'-rw------- 1 gurmail.singh gurmail.singh 230K Feb 1 13:09 nohup.out',
'drwxrwxr-x 3 gurmail.singh gurmail.singh 4.0K Jan 30 20:42 paper',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 39 Jan 25 18:32 paper1.txt',
'drwxrwxr-x 8 gurmail.singh gurmail.singh 4.0K Jan 30 14:06 s24',
'drwx------ 3 gurmail.singh gurmail.singh 4.0K Jan 30 12:31 snap',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 177M Jan 14 2022 wi.csv',
'-rw-rw-r-- 1 gurmail.singh gurmail.singh 21M Feb 1 12:44 wi.zip',
'']
%% Cell type:markdown id:8ba94151 tags:
### Traditional way of reading data using pandas
%% Cell type:code id:529a4bd2 tags:
``` python
df = pd.read_csv("wi.csv")
```
%% Output
/tmp/ipykernel_36341/3756477020.py:1: DtypeWarning: Columns (22,23,24,26,27,28,29,30,31,32,33,38,43,44) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv("wi.csv")
%% Cell type:code id:570485b8 tags:
``` python
df.head(5) # Top 5 rows within the DataFrame
```
%% Output
activity_year lei derived_msa-md state_code \
0 2020 549300FX7K8PTEQUU487 31540 WI
1 2020 549300FX7K8PTEQUU487 99999 WI
2 2020 549300FX7K8PTEQUU487 99999 WI
3 2020 549300FX7K8PTEQUU487 99999 WI
4 2020 549300FX7K8PTEQUU487 33460 WI
county_code census_tract conforming_loan_limit \
0 55025.0 5.502500e+10 C
1 55013.0 5.501397e+10 C
2 55127.0 5.512700e+10 C
3 55127.0 5.512700e+10 C
4 55109.0 5.510912e+10 C
derived_loan_product_type derived_dwelling_category \
0 Conventional:First Lien Single Family (1-4 Units):Site-Built
1 Conventional:First Lien Single Family (1-4 Units):Site-Built
2 VA:First Lien Single Family (1-4 Units):Site-Built
3 Conventional:Subordinate Lien Single Family (1-4 Units):Site-Built
4 VA:First Lien Single Family (1-4 Units):Site-Built
derived_ethnicity ... denial_reason-2 denial_reason-3 \
0 Not Hispanic or Latino ... NaN NaN
1 Not Hispanic or Latino ... NaN NaN
2 Not Hispanic or Latino ... NaN NaN
3 Ethnicity Not Available ... NaN NaN
4 Not Hispanic or Latino ... NaN NaN
denial_reason-4 tract_population tract_minority_population_percent \
0 NaN 3572 41.15
1 NaN 2333 9.90
2 NaN 5943 13.26
3 NaN 5650 7.63
4 NaN 7210 4.36
ffiec_msa_md_median_family_income tract_to_msa_income_percentage \
0 96600 64
1 68000 87
2 68000 104
3 68000 124
4 97300 96
tract_owner_occupied_units tract_one_to_four_family_homes \
0 812 910
1 1000 2717
2 1394 1856
3 1712 2104
4 2101 2566
tract_median_age_of_housing_units
0 45
1 34
2 44
3 36
4 22
[5 rows x 99 columns]
%% Cell type:markdown id:bad7dce4 tags:
### How can we see all the column names?
%% Cell type:code id:d0a98751 tags:
``` python
df.columns
```
%% Output
Index(['activity_year', 'lei', 'derived_msa-md', 'state_code', 'county_code',
'census_tract', 'conforming_loan_limit', 'derived_loan_product_type',
'derived_dwelling_category', 'derived_ethnicity', 'derived_race',
'derived_sex', 'action_taken', 'purchaser_type', 'preapproval',
'loan_type', 'loan_purpose', 'lien_status', 'reverse_mortgage',
'open-end_line_of_credit', 'business_or_commercial_purpose',
'loan_amount', 'loan_to_value_ratio', 'interest_rate', 'rate_spread',
'hoepa_status', 'total_loan_costs', 'total_points_and_fees',
'origination_charges', 'discount_points', 'lender_credits', 'loan_term',
'prepayment_penalty_term', 'intro_rate_period', 'negative_amortization',
'interest_only_payment', 'balloon_payment',
'other_nonamortizing_features', 'property_value', 'construction_method',
'occupancy_type', 'manufactured_home_secured_property_type',
'manufactured_home_land_property_interest', 'total_units',
'multifamily_affordable_units', 'income', 'debt_to_income_ratio',
'applicant_credit_score_type', 'co-applicant_credit_score_type',
'applicant_ethnicity-1', 'applicant_ethnicity-2',
'applicant_ethnicity-3', 'applicant_ethnicity-4',
'applicant_ethnicity-5', 'co-applicant_ethnicity-1',
'co-applicant_ethnicity-2', 'co-applicant_ethnicity-3',
'co-applicant_ethnicity-4', 'co-applicant_ethnicity-5',
'applicant_ethnicity_observed', 'co-applicant_ethnicity_observed',
'applicant_race-1', 'applicant_race-2', 'applicant_race-3',
'applicant_race-4', 'applicant_race-5', 'co-applicant_race-1',
'co-applicant_race-2', 'co-applicant_race-3', 'co-applicant_race-4',
'co-applicant_race-5', 'applicant_race_observed',
'co-applicant_race_observed', 'applicant_sex', 'co-applicant_sex',
'applicant_sex_observed', 'co-applicant_sex_observed', 'applicant_age',
'co-applicant_age', 'applicant_age_above_62',
'co-applicant_age_above_62', 'submission_of_application',
'initially_payable_to_institution', 'aus-1', 'aus-2', 'aus-3', 'aus-4',
'aus-5', 'denial_reason-1', 'denial_reason-2', 'denial_reason-3',
'denial_reason-4', 'tract_population',
'tract_minority_population_percent',
'ffiec_msa_md_median_family_income', 'tract_to_msa_income_percentage',
'tract_owner_occupied_units', 'tract_one_to_four_family_homes',
'tract_median_age_of_housing_units'],
dtype='object')