# Zip Files

As you deal with bigger datasets, those datasets will often be
compressed.  Compressed means that the format takes advantage of
patterns and redundancy in data to store a bigger file in less space.

For example, say you have a string like this: "HAHAHAHAHAHAHAHAHAHA".
You could imagine inventing a notation for representing that string
with fewer characters (maybe something like "HA{x10}").
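
To make the idea concrete, here is a minimal sketch of that made-up notation. The helper name `encode_repeats` is hypothetical, and this is only a toy illustration of exploiting repetition, not how zip compression actually works:

```python
def encode_repeats(text, unit):
    """Collapse text that is just `unit` repeated N times into 'unit{xN}'."""
    count, remainder = divmod(len(text), len(unit))
    # Only use the short notation when text is exactly unit repeated count times.
    if remainder == 0 and unit * count == text:
        return f"{unit}{{x{count}}}"
    return text  # no pattern found, so store the text unchanged

original = "HA" * 10
encoded = encode_repeats(original, "HA")
print(encoded, len(original), len(encoded))
```

The encoded form needs 7 characters instead of 20. Text without the repeating pattern gets no benefit, which is why some files compress much better than others.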

Zip is one common compression format.  In addition to compressing
files, `.zip` archives often bundle multiple files together.  In the
past, you may have run `unzip` in the terminal before starting to
write your code.  However, it is also possible to read the contents of
a `.zip` file directly in Python.  Doing so is often more convenient,
and the code may also be faster.

## Generating a .zip

To create an `example.zip` file, run the following (don't worry,
understanding this particular snippet isn't expected for this lab):

```python
import pandas as pd
from zipfile import ZipFile, ZIP_DEFLATED
from io import TextIOWrapper

with open("hello.txt", "w") as f:
    f.write("hello world")

with ZipFile("example.zip", "w", compression=ZIP_DEFLATED) as zf:
    with zf.open("hello.txt", "w") as f:
        f.write(bytes("hello world", "utf-8"))
    with zf.open("ha.txt", "w") as f:
        f.write(bytes("ha"*10000, "utf-8"))
    with zf.open("bugs.csv", "w") as f:
        pd.DataFrame([["Mon",7], ["Tue",4], ["Wed",3], ["Thu",6], ["Fri",9]],
                     columns=["day", "bugs"]).to_csv(TextIOWrapper(f), index=False)
```

## ZipFile

We can access the file by using the `ZipFile` type, imported from the `zipfile` module:

```python
from zipfile import ZipFile
```

ZipFiles are context managers, much like file objects.  Let's try
creating one using `with`, then loop over info about the files inside
using [this
method](https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.infolist):

```python
with ZipFile('example.zip') as zf:
    for info in zf.infolist():
        print(info)
```

Let's print off the size and compression ratio (uncompressed size divided by compressed size) of each file:

```python
with ZipFile('example.zip') as zf:
    for info in zf.infolist():
        orig_mb = info.file_size / (1024**2) # there are 1024**2 bytes in a MB
        ratio = info.file_size / info.compress_size
        s = "file {name:s}, {mb:.3f} MB (uncompressed), {ratio:.1f} compression ratio"
        print(s.format(name=info.filename, mb=orig_mb, ratio=ratio))
```

Take a minute to look through -- which file is largest?  What is its
compression ratio?

The compression ratio is the original size divided by the compressed
size, so bigger means more savings.  `ha.txt` contains "hahahahaha..."
(repeated 10 thousand times), which is highly compressible.
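
To see how much the ratio depends on the data, the sketch below compresses two byte strings of the same length with the standard `zlib` module. It assumes `zlib`'s default settings are a reasonable stand-in for what `ZIP_DEFLATED` does, so the exact numbers may differ a little from those in `example.zip`. The helper name `ratio` is made up for this example:

```python
import os
import zlib

repetitive = b"ha" * 10000        # same contents as ha.txt
random_bytes = os.urandom(20000)  # same length, but no patterns to exploit

def ratio(data):
    """Uncompressed size divided by compressed size."""
    return len(data) / len(zlib.compress(data))

repetitive_ratio = ratio(repetitive)
random_ratio = ratio(random_bytes)
print(f"repetitive: {repetitive_ratio:.1f}, random: {random_ratio:.2f}")
```

The repetitive data should shrink dramatically, while the random data should stay at roughly its original size (a ratio near 1).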

As practice, compute the overall compression ratio (sum of all
uncompressed sizes divided by sum of all compressed sizes) -- it ought
to be about 216.

## Binary Open

Ok, forget zips for a minute, and run the following:

```python
with open("hello.txt", "r") as f:
    data1 = f.read()

with open("hello.txt", "rb") as f:
    data2 = f.read()

print(type(data1), type(data2))
```

What type does `f.read()` return if we use "r" for the mode?  What
about "rb"?

The "b" stands for "binary" or "bytes", so we get back type `bytes`.
If we open in text mode (the default), as in the first open, the bytes
automatically get translated to strings, using some encoding (like
"utf-8") that assigns characters to byte-represented numbers.

Run this:

```python
from io import TextIOWrapper
```

`TextIOWrapper` objects "wrap" binary file objects and convert bytes
to characters on the fly.  For example, try the following:

```python
with open("hello.txt", "rb") as f:
    tio = TextIOWrapper(f)
    data3 = tio.read()
print(type(data3))
```

Even though we open in binary mode, we get a string thanks to
`TextIOWrapper`!  You can think of the example where we read into
`data1` as a shorthand for what we did to get `data3`.
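
If you want to check that claim yourself, here is a small sketch. It writes to a scratch file (the name `scratch.txt` and the temporary directory are just for illustration) and passes `encoding="utf-8"` explicitly, which is an assumption about how the file was written:

```python
import os
import tempfile
from io import TextIOWrapper

# Write a scratch file so hello.txt is left untouched.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("café")

# Text mode hands back a TextIOWrapper around a binary reader.
with open(path, "r", encoding="utf-8") as f:
    is_wrapper = isinstance(f, TextIOWrapper)
    text = f.read()

# Binary mode returns the raw bytes; decoding them by hand gives the same string.
with open(path, "rb") as f:
    raw = f.read()

print(is_wrapper, len(text), raw)
print(raw.decode("utf-8") == text)
```

Notice that the string has 4 characters but the file holds 5 bytes: in UTF-8, "é" is stored as two bytes.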

## Reading Files

A ZipFile has a method named `open` that works a lot like the `open`
function you're familiar with.  A ZipFile is a context manager, and so
is the object returned by `ZipFile.open(...)`, so we'll end up with
nested `with` statements to make sure everything gets closed up
properly.  Let's take a look at the compressed `hello.txt` file:

```python
with ZipFile('example.zip') as zf:
    with zf.open("hello.txt", "r") as f:
        print(f.read())
```

Whoa, why do we get `b'hello world'`?  For regular files, "r" mode
defaults to reading text, but `ZipFile.open` always reads files inside
a zip in binary mode, so we got back bytes.

TextIOWrapper saves the day:

```python
with ZipFile('example.zip') as zf:
    with zf.open("hello.txt", "r") as f:
        tio = TextIOWrapper(f)
        print(tio.read())
```

With regular files, TextIOWrapper is a bit useless (why not just open
with "r" instead of "rb"?), but for zips, it is crucial.
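
Here is a slightly fuller sketch of two ways to get text out of a zip. It builds a tiny zip in memory so it doesn't depend on `example.zip`; the member name `notes.txt` and its contents are made up for illustration, and the encoding is assumed to be UTF-8:

```python
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

# Build a small zip in memory instead of on disk.
buffer = BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("notes.txt", "first line\nsecond line\n")

with ZipFile(buffer) as zf:
    # Option 1: wrap the binary stream and loop over decoded lines.
    with zf.open("notes.txt") as f:
        tio = TextIOWrapper(f, encoding="utf-8")
        lines = [line.rstrip("\n") for line in tio]
    # Option 2: read all the bytes at once, then decode them.
    whole = zf.read("notes.txt").decode("utf-8")

print(lines)
print(repr(whole))
```

Wrapping the stream lets you process one line at a time, which helps with big files; `zf.read(...)` is simpler when the whole file comfortably fits in memory.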

## Pandas

Pandas can read a DataFrame even from a binary stream, so you can do this:

```python
with ZipFile('example.zip') as zf:
    with zf.open("bugs.csv") as f:
        df = pd.read_csv(f)
df
```