diff --git a/lecture_material/04-performance2/reading.ipynb b/lecture_material/04-performance2/reading.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..24a5a877ecd74f307b8045fac4bca5b3a8072955 --- /dev/null +++ b/lecture_material/04-performance2/reading.ipynb @@ -0,0 +1,545 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Complexity of List Operations\n", + "\n", + "Last time, we learned that a \"step\" is any unit of work with bounded execution time (it doesn't keep getting slower with growing input size).\n", + "\n", + "This definition provides significant flexibility. For example, consider the following:\n", + "\n", + "```python\n", + "print(1)\n", + "print(2)\n", + "print(3)\n", + "```\n", + "\n", + "One could count each print as a single step, but given the definition, it would be equally legitimate to consider the 3-line code snippet to be a single step.\n", + "\n", + "Just like a step might correspond to multiple lines, sometimes a single line corresponds to multiple steps. For example, the following line is not a single step, or an O(1) snippet of code:\n", + "\n", + "```python\n", + "total = add_them_up(nums)\n", + "```\n", + "\n", + "It turns out the above is an O(N) operation, where N is the length of the `nums` list. Of course, you would only know that if we also show you the definition of `add_them_up`:\n", + "\n", + "```python\n", + "def add_them_up(values):\n", + "    total = 0\n", + "    for v in values:\n", + "        total += v\n", + "    return total\n", + "```\n", + "\n", + "A common misconception is that functions and operations built into Python count as single steps, but this is not so. 
Python's `sum` function works much like the above `add_them_up`, so the following line is an O(N) snippet of code, not a single step:\n", + "\n", + "```python\n", + "total = sum(nums)\n", + "```\n", + "\n", + "In this reading, we'll consider 8 common list operations, each of which is either O(N) or O(1) -- calls to the latter can be counted as a single step. This has great practical significance: your code will generally be faster if you avoid the O(N) operations except when necessary.\n", + "\n", + "1. len\n", + "2. indexing\n", + "3. append\n", + "4. insert\n", + "5. pop\n", + "6. extend\n", + "7. in\n", + "8. max\n", + "\n", + "Remember that every process keeps all its data in a big \"list\" called an address space. It's a slight simplification, but it is useful to imagine each Python list as occupying a range of spots within the address space. For example, we could imagine a list of the letters A through D occupying addresses 3-6, like this:\n", + "\n", + "```\n", + "   ABCD    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "Accessing a single address counts as a single step (by \"accessing\", we mean looking it up or changing it) -- we'll use this to reason about which of our other operations can or cannot count as a single step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## len(L)\n", + "\n", + "We already saw that `sum` is an O(N) operation. You might guess `len` is too, and it would be if it were implemented something like this:\n", + "\n", + "```python\n", + "def get_length(L):\n", + "    length = 0\n", + "    for item in L:\n", + "        length += 1\n", + "    return length\n", + "```\n", + "\n", + "Fortunately, Python lists keep some extra statistics to make this faster, something like this:\n", + "\n", + "```\n", + "  4ABCD    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "When `len(L)` is called, Python looks up the pre-computed length instead of re-counting. 
Of course, this stored length must be updated each time a new value is added:\n", + "\n", + "```\n", + "  5ABCDE   [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "You could imagine a better version of the Python list that also keeps track of the sum, as items are added/modified. For example, a list of `[5,1,2]` might look like this:\n", + "\n", + "```\n", + "  38512    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "In this case, the values would be at addresses 4-6. The length of the list would be stored at position 2 and the sum at position 3. Hypothetically, if Python lists worked like this, both `len` and `sum` would be O(1) operations; in reality, only `len` is O(1) and `sum` is O(N). One takeaway is that when it comes to performance, you need to learn about the specific data structures you are using in your specific programming language. Your intuition about how things work may apply to varying degrees across different languages and environments." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Indexing\n", + "\n", + "`L[0]` is fast (an O(1) operation) -- Python knows the address where the list values start (in this case, address 3), and it can just quickly access that location in the address space.\n", + "\n", + "```\n", + "  4ABCD    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "What about other positions, further into the list? For example, `L[2]`? This is fast too. If the list starts at location 3 in the address space, then the value referred to by `L[2]` lives at location 3+2 in the address space. If we know the address, we can quickly get the value.\n", + "\n", + "What about negative indexes? Do we need to loop over every item to find the end when we use `L[-1]`? No. `L[-k]` is the same as `L[len(L)-k]`. Computing the length is O(1) and subtraction is O(1). This converts the negative index to a positive one, and we've already discussed how accessing a positive index is O(1). 
Thus, the whole thing is O(1), or a single step." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## L.append(...)\n", + "\n", + "This is fast: O(1). Assume our list looks something like this:\n", + "\n", + "```\n", + "  4ABCD    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "Appending a value accesses a fixed number of locations (locations 2 and 7, to be precise):\n", + "\n", + "```\n", + "  5ABCDE   [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## L.insert(...)\n", + "\n", + "Inserting is like appending, except we can add the value at any index we like." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 100, 3, 4]" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "L = [1,2,3,4]\n", + "L.insert(2, 100) # in the middle\n", + "L" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[100, 1, 2, 3, 4]" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "L = [1,2,3,4]\n", + "L.insert(0, 100) # at the beginning\n", + "L" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Inserting in the middle, or worse, at the beginning, causes many values to shift over.\n", + "\n", + "```\n", + "  4ABCD    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "After `L.insert(0, \"Z\")`, it looks like this:\n", + "\n", + "```\n", + "  5ZABCD   [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "We had to move D from spot 6 to 7, C from spot 5 to 6, etc. As we have more values, it will run proportionately slower. Insert, in the worst case, is an O(N) operation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## L.pop(...)\n", + "\n", + "```\n", + "  4ABCD    [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "Popping from the end with `L.pop(-1)` is O(1):\n", + "\n", + "```\n", + "  3ABC     [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "In contrast, popping from the beginning of the remaining list with `L.pop(0)` is slow (O(N)), as we need to shift all the other values too (just like when we insert at the beginning):\n", + "\n", + "```\n", + "  2BC      [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## L.extend(...)\n", + "\n", + "`extend` is a little different from `append`. Let's review the difference." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 3, [4, 5, 6]]" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "L = [1,2,3]\n", + "L2 = [4,5,6]\n", + "L.append(L2)\n", + "L" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[1, 2, 3, 4, 5, 6]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "L = [1,2,3]\n", + "L2 = [4,5,6]\n", + "L.extend(L2)\n", + "L" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If L is `[\"A\", \"B\", \"C\"]` and L2 is `[\"X\", \"Y\"]`, the address space might look something like this:\n", + "\n", + "```\n", + "  ABC   XY [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "After extend, it would look like this:\n", + "\n", + "```\n", + "  ABCXY XY [VALUES]\n", + "0123456789 [ADDRESSES]\n", + "```\n", + "\n", + "We've been describing the input size with one variable, N. Here, we have two lists, so we should use two variables. We'll use N for len(L) and M for len(L2). 
In this case, we have to copy every item from L2 (XY in the example above). So the complexity is in terms of L2's size: O(M). The time it takes to execute the extend does not depend on the size of L (that is, N)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## in\n", + "\n", + "You can think of the `in` operator on a list as working like the `in_function` loop below:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "nums = [2,7,3,9,8,4]\n", + "4 in nums" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def in_function(values, target):\n", + "    for v in values:\n", + "        if v == target:\n", + "            return True\n", + "    return False\n", + "\n", + "in_function(nums, 4)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Thus, the `in` operator is O(N) for lists. If you want fast checks with the `in` operator, use sets:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "nums = set([2,7,3,9,8,4])\n", + "4 in nums" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The worst case complexity is still O(N), but it's actually difficult to construct a situation for this worst case. In the average, or common, case, you should think of the `in` operator for sets as an O(1) operation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## max\n", + "\n", + "The `max(...)` function is O(N), as it is similar to the following function:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "9" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def largest(values):\n", + "    best = values[0]\n", + "    for v in values:\n", + "        if v > best:\n", + "            best = v\n", + "    return best\n", + "\n", + "nums = [2,7,3,9,8,4]\n", + "largest(nums)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, `min` is O(N)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Optimization Example\n", + "\n", + "Consider the following code, which prints the percentage of each entry relative to the total:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2 is 6 percent\n", + "7 is 21 percent\n", + "3 is 9 percent\n", + "9 is 27 percent\n", + "8 is 24 percent\n", + "4 is 12 percent\n" + ] + } + ], + "source": [ + "nums = [2,7,3,9,8,4]\n", + "\n", + "for num in nums:\n", + "    print(num, \"is\", round(100 * num / sum(nums)), \"percent\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The above is $O(N^2)$. If N is the length of the list, then the code loops N times. In each loop iteration, `sum(nums)`, an O(N) operation, is called.\n", + "\n", + "We can optimize it by moving the `sum` call above the loop, calling it once, and saving the result. The following is O(N) -- yay!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2 is 6 percent\n", + "7 is 21 percent\n", + "3 is 9 percent\n", + "9 is 27 percent\n", + "8 is 24 percent\n", + "4 is 12 percent\n" + ] + } + ], + "source": [ + "nums = [2,7,3,9,8,4]\n", + "\n", + "total = sum(nums)\n", + "for num in nums:\n", + "    print(num, \"is\", round(100 * num / total), \"percent\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/lecture_material/04-performance2/solution.ipynb b/lecture_material/04-performance2/solution.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..f7d5b2441932a9c56c0583397fed02b631fb10d3 --- /dev/null +++ b/lecture_material/04-performance2/solution.ipynb @@ -0,0 +1,494 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d617eefb", + "metadata": {}, + "source": [ + "# Performance 2" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "783117c5-146f-454a-963e-ed2873b8a6d3", + "metadata": {}, + "outputs": [], + "source": [ + "# known import statements\n", + "import pandas as pd\n", + "import csv\n", + "from subprocess import check_output\n", + "\n", + "# new import statements\n", + "import zipfile\n", + "from io import TextIOWrapper" + ] + }, + { + "cell_type": "markdown", + "id": "4e2be82d", + "metadata": {}, + "source": [ + "### Let's take a look at the files inside the current working directory." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "4eaa8a8d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['total 21M',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 2.0K Jan 30 20:49 lec2.ipynb',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 5.2K Feb 1 13:08 lecture.ipynb',\n", + " '-rw------- 1 gurmail.singh gurmail.singh 230K Feb 1 13:09 nohup.out',\n", + " 'drwxrwxr-x 3 gurmail.singh gurmail.singh 4.0K Jan 30 20:42 paper',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 39 Jan 25 18:32 paper1.txt',\n", + " 'drwxrwxr-x 8 gurmail.singh gurmail.singh 4.0K Jan 30 14:06 s24',\n", + " 'drwx------ 3 gurmail.singh gurmail.singh 4.0K Jan 30 12:31 snap',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 21M Feb 1 12:44 wi.zip',\n", + " '']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "str(check_output([\"ls\", \"-lh\"]), encoding=\"utf-8\").split(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "b8c7dc7f", + "metadata": {}, + "source": [ + "### Let's `unzip` \"wi.zip\"." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ed32cf4c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "b'Archive: wi.zip\\n inflating: wi.csv \\n'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "check_output([\"unzip\", \"wi.zip\"])" + ] + }, + { + "cell_type": "markdown", + "id": "4eac1b48", + "metadata": {}, + "source": [ + "### Let's take a look at the files inside the current working directory." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "a6852e43", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['total 198M',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 2.0K Jan 30 20:49 lec2.ipynb',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 5.2K Feb 1 13:08 lecture.ipynb',\n", + " '-rw------- 1 gurmail.singh gurmail.singh 230K Feb 1 13:09 nohup.out',\n", + " 'drwxrwxr-x 3 gurmail.singh gurmail.singh 4.0K Jan 30 20:42 paper',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 39 Jan 25 18:32 paper1.txt',\n", + " 'drwxrwxr-x 8 gurmail.singh gurmail.singh 4.0K Jan 30 14:06 s24',\n", + " 'drwx------ 3 gurmail.singh gurmail.singh 4.0K Jan 30 12:31 snap',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 177M Jan 14 2022 wi.csv',\n", + " '-rw-rw-r-- 1 gurmail.singh gurmail.singh 21M Feb 1 12:44 wi.zip',\n", + " '']" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "str(check_output([\"ls\", \"-lh\"]), encoding=\"utf-8\").split(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "8ba94151", + "metadata": {}, + "source": [ + "### Traditional way of reading data using pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "529a4bd2", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/tmp/ipykernel_36341/3756477020.py:1: DtypeWarning: Columns (22,23,24,26,27,28,29,30,31,32,33,38,43,44) have mixed types. 
Specify dtype option on import or set low_memory=False.\n", + " df = pd.read_csv(\"wi.csv\")\n" + ] + } + ], + "source": [ + "df = pd.read_csv(\"wi.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "570485b8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>activity_year</th>\n", + " <th>lei</th>\n", + " <th>derived_msa-md</th>\n", + " <th>state_code</th>\n", + " <th>county_code</th>\n", + " <th>census_tract</th>\n", + " <th>conforming_loan_limit</th>\n", + " <th>derived_loan_product_type</th>\n", + " <th>derived_dwelling_category</th>\n", + " <th>derived_ethnicity</th>\n", + " <th>...</th>\n", + " <th>denial_reason-2</th>\n", + " <th>denial_reason-3</th>\n", + " <th>denial_reason-4</th>\n", + " <th>tract_population</th>\n", + " <th>tract_minority_population_percent</th>\n", + " <th>ffiec_msa_md_median_family_income</th>\n", + " <th>tract_to_msa_income_percentage</th>\n", + " <th>tract_owner_occupied_units</th>\n", + " <th>tract_one_to_four_family_homes</th>\n", + " <th>tract_median_age_of_housing_units</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>2020</td>\n", + " <td>549300FX7K8PTEQUU487</td>\n", + " <td>31540</td>\n", + " <td>WI</td>\n", + " <td>55025.0</td>\n", + " <td>5.502500e+10</td>\n", + " <td>C</td>\n", + " <td>Conventional:First Lien</td>\n", + " <td>Single Family (1-4 Units):Site-Built</td>\n", + " <td>Not Hispanic or Latino</td>\n", + " <td>...</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>3572</td>\n", + " 
<td>41.15</td>\n", + " <td>96600</td>\n", + " <td>64</td>\n", + " <td>812</td>\n", + " <td>910</td>\n", + " <td>45</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>2020</td>\n", + " <td>549300FX7K8PTEQUU487</td>\n", + " <td>99999</td>\n", + " <td>WI</td>\n", + " <td>55013.0</td>\n", + " <td>5.501397e+10</td>\n", + " <td>C</td>\n", + " <td>Conventional:First Lien</td>\n", + " <td>Single Family (1-4 Units):Site-Built</td>\n", + " <td>Not Hispanic or Latino</td>\n", + " <td>...</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>2333</td>\n", + " <td>9.90</td>\n", + " <td>68000</td>\n", + " <td>87</td>\n", + " <td>1000</td>\n", + " <td>2717</td>\n", + " <td>34</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>2020</td>\n", + " <td>549300FX7K8PTEQUU487</td>\n", + " <td>99999</td>\n", + " <td>WI</td>\n", + " <td>55127.0</td>\n", + " <td>5.512700e+10</td>\n", + " <td>C</td>\n", + " <td>VA:First Lien</td>\n", + " <td>Single Family (1-4 Units):Site-Built</td>\n", + " <td>Not Hispanic or Latino</td>\n", + " <td>...</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>5943</td>\n", + " <td>13.26</td>\n", + " <td>68000</td>\n", + " <td>104</td>\n", + " <td>1394</td>\n", + " <td>1856</td>\n", + " <td>44</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>2020</td>\n", + " <td>549300FX7K8PTEQUU487</td>\n", + " <td>99999</td>\n", + " <td>WI</td>\n", + " <td>55127.0</td>\n", + " <td>5.512700e+10</td>\n", + " <td>C</td>\n", + " <td>Conventional:Subordinate Lien</td>\n", + " <td>Single Family (1-4 Units):Site-Built</td>\n", + " <td>Ethnicity Not Available</td>\n", + " <td>...</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>5650</td>\n", + " <td>7.63</td>\n", + " <td>68000</td>\n", + " <td>124</td>\n", + " <td>1712</td>\n", + " <td>2104</td>\n", + " <td>36</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>2020</td>\n", + " 
<td>549300FX7K8PTEQUU487</td>\n", + " <td>33460</td>\n", + " <td>WI</td>\n", + " <td>55109.0</td>\n", + " <td>5.510912e+10</td>\n", + " <td>C</td>\n", + " <td>VA:First Lien</td>\n", + " <td>Single Family (1-4 Units):Site-Built</td>\n", + " <td>Not Hispanic or Latino</td>\n", + " <td>...</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>7210</td>\n", + " <td>4.36</td>\n", + " <td>97300</td>\n", + " <td>96</td>\n", + " <td>2101</td>\n", + " <td>2566</td>\n", + " <td>22</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>5 rows × 99 columns</p>\n", + "</div>" + ], + "text/plain": [ + " activity_year lei derived_msa-md state_code \\\n", + "0 2020 549300FX7K8PTEQUU487 31540 WI \n", + "1 2020 549300FX7K8PTEQUU487 99999 WI \n", + "2 2020 549300FX7K8PTEQUU487 99999 WI \n", + "3 2020 549300FX7K8PTEQUU487 99999 WI \n", + "4 2020 549300FX7K8PTEQUU487 33460 WI \n", + "\n", + " county_code census_tract conforming_loan_limit \\\n", + "0 55025.0 5.502500e+10 C \n", + "1 55013.0 5.501397e+10 C \n", + "2 55127.0 5.512700e+10 C \n", + "3 55127.0 5.512700e+10 C \n", + "4 55109.0 5.510912e+10 C \n", + "\n", + " derived_loan_product_type derived_dwelling_category \\\n", + "0 Conventional:First Lien Single Family (1-4 Units):Site-Built \n", + "1 Conventional:First Lien Single Family (1-4 Units):Site-Built \n", + "2 VA:First Lien Single Family (1-4 Units):Site-Built \n", + "3 Conventional:Subordinate Lien Single Family (1-4 Units):Site-Built \n", + "4 VA:First Lien Single Family (1-4 Units):Site-Built \n", + "\n", + " derived_ethnicity ... denial_reason-2 denial_reason-3 \\\n", + "0 Not Hispanic or Latino ... NaN NaN \n", + "1 Not Hispanic or Latino ... NaN NaN \n", + "2 Not Hispanic or Latino ... NaN NaN \n", + "3 Ethnicity Not Available ... NaN NaN \n", + "4 Not Hispanic or Latino ... 
NaN NaN \n", + "\n", + " denial_reason-4 tract_population tract_minority_population_percent \\\n", + "0 NaN 3572 41.15 \n", + "1 NaN 2333 9.90 \n", + "2 NaN 5943 13.26 \n", + "3 NaN 5650 7.63 \n", + "4 NaN 7210 4.36 \n", + "\n", + " ffiec_msa_md_median_family_income tract_to_msa_income_percentage \\\n", + "0 96600 64 \n", + "1 68000 87 \n", + "2 68000 104 \n", + "3 68000 124 \n", + "4 97300 96 \n", + "\n", + " tract_owner_occupied_units tract_one_to_four_family_homes \\\n", + "0 812 910 \n", + "1 1000 2717 \n", + "2 1394 1856 \n", + "3 1712 2104 \n", + "4 2101 2566 \n", + "\n", + " tract_median_age_of_housing_units \n", + "0 45 \n", + "1 34 \n", + "2 44 \n", + "3 36 \n", + "4 22 \n", + "\n", + "[5 rows x 99 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.head(5) # Top 5 rows within the DataFrame" + ] + }, + { + "cell_type": "markdown", + "id": "bad7dce4", + "metadata": {}, + "source": [ + "### How can we see all the column names?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "d0a98751", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['activity_year', 'lei', 'derived_msa-md', 'state_code', 'county_code',\n", + " 'census_tract', 'conforming_loan_limit', 'derived_loan_product_type',\n", + " 'derived_dwelling_category', 'derived_ethnicity', 'derived_race',\n", + " 'derived_sex', 'action_taken', 'purchaser_type', 'preapproval',\n", + " 'loan_type', 'loan_purpose', 'lien_status', 'reverse_mortgage',\n", + " 'open-end_line_of_credit', 'business_or_commercial_purpose',\n", + " 'loan_amount', 'loan_to_value_ratio', 'interest_rate', 'rate_spread',\n", + " 'hoepa_status', 'total_loan_costs', 'total_points_and_fees',\n", + " 'origination_charges', 'discount_points', 'lender_credits', 'loan_term',\n", + " 'prepayment_penalty_term', 'intro_rate_period', 'negative_amortization',\n", + " 'interest_only_payment', 'balloon_payment',\n", + " 'other_nonamortizing_features', 'property_value', 'construction_method',\n", + " 'occupancy_type', 'manufactured_home_secured_property_type',\n", + " 'manufactured_home_land_property_interest', 'total_units',\n", + " 'multifamily_affordable_units', 'income', 'debt_to_income_ratio',\n", + " 'applicant_credit_score_type', 'co-applicant_credit_score_type',\n", + " 'applicant_ethnicity-1', 'applicant_ethnicity-2',\n", + " 'applicant_ethnicity-3', 'applicant_ethnicity-4',\n", + " 'applicant_ethnicity-5', 'co-applicant_ethnicity-1',\n", + " 'co-applicant_ethnicity-2', 'co-applicant_ethnicity-3',\n", + " 'co-applicant_ethnicity-4', 'co-applicant_ethnicity-5',\n", + " 'applicant_ethnicity_observed', 'co-applicant_ethnicity_observed',\n", + " 'applicant_race-1', 'applicant_race-2', 'applicant_race-3',\n", + " 'applicant_race-4', 'applicant_race-5', 'co-applicant_race-1',\n", + " 'co-applicant_race-2', 'co-applicant_race-3', 'co-applicant_race-4',\n", + " 'co-applicant_race-5', 'applicant_race_observed',\n", + " 
'co-applicant_race_observed', 'applicant_sex', 'co-applicant_sex',\n", + " 'applicant_sex_observed', 'co-applicant_sex_observed', 'applicant_age',\n", + " 'co-applicant_age', 'applicant_age_above_62',\n", + " 'co-applicant_age_above_62', 'submission_of_application',\n", + " 'initially_payable_to_institution', 'aus-1', 'aus-2', 'aus-3', 'aus-4',\n", + " 'aus-5', 'denial_reason-1', 'denial_reason-2', 'denial_reason-3',\n", + " 'denial_reason-4', 'tract_population',\n", + " 'tract_minority_population_percent',\n", + " 'ffiec_msa_md_median_family_income', 'tract_to_msa_income_percentage',\n", + " 'tract_owner_occupied_units', 'tract_one_to_four_family_homes',\n", + " 'tract_median_age_of_housing_units'],\n", + " dtype='object')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.columns" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/lecture_material/04-performance2/template_lec_001.ipynb b/lecture_material/04-performance2/template_lec_001.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..3f63d2f0fa6253d66fd452e3e68c36ee1266e2e2 --- /dev/null +++ b/lecture_material/04-performance2/template_lec_001.ipynb @@ -0,0 +1,150 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1a6cc54c", + "metadata": {}, + "source": [ + "# Performance 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "783117c5-146f-454a-963e-ed2873b8a6d3", + "metadata": {}, + "outputs": [], + "source": [ + "# known import statements\n", + "import pandas as pd\n", + "import csv\n", + "from 
subprocess import check_output\n", + "\n", + "# new import statements\n", + "import zipfile\n", + "from io import TextIOWrapper" + ] + }, + { + "cell_type": "markdown", + "id": "66db2ad0", + "metadata": {}, + "source": [ + "### Let's take a look at the files inside the current working directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6cef713e", + "metadata": {}, + "outputs": [], + "source": [ + "str(check_output([\"ls\", \"-lh\"]), encoding=\"utf-8\").split(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "c76f819d", + "metadata": {}, + "source": [ + "### Let's `unzip` \"wi.zip\"." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e87ec01", + "metadata": {}, + "outputs": [], + "source": [ + "check_output([\"unzip\", \"wi.zip\"])" + ] + }, + { + "cell_type": "markdown", + "id": "274fa49a", + "metadata": {}, + "source": [ + "### Let's take a look at the files inside the current working directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2da3cd0", + "metadata": {}, + "outputs": [], + "source": [ + "str(check_output([\"ls\", \"-lh\"]), encoding=\"utf-8\").split(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "90b11343", + "metadata": {}, + "source": [ + "### Traditional way of reading data using pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a3175526", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"wi.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13e6e034", + "metadata": {}, + "outputs": [], + "source": [ + "df.head(5) # Top 5 rows within the DataFrame" + ] + }, + { + "cell_type": "markdown", + "id": "5c79984c", + "metadata": {}, + "source": [ + "### How can we see all the column names?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08d9501d", + "metadata": {}, + "outputs": [], + "source": [ + "df.columns" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/lecture_material/04-performance2/template_lec_002.ipynb b/lecture_material/04-performance2/template_lec_002.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..3f63d2f0fa6253d66fd452e3e68c36ee1266e2e2 --- /dev/null +++ b/lecture_material/04-performance2/template_lec_002.ipynb @@ -0,0 +1,150 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1a6cc54c", + "metadata": {}, + "source": [ + "# Performance 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "783117c5-146f-454a-963e-ed2873b8a6d3", + "metadata": {}, + "outputs": [], + "source": [ + "# known import statements\n", + "import pandas as pd\n", + "import csv\n", + "from subprocess import check_output\n", + "\n", + "# new import statements\n", + "import zipfile\n", + "from io import TextIOWrapper" + ] + }, + { + "cell_type": "markdown", + "id": "66db2ad0", + "metadata": {}, + "source": [ + "### Let's take a look at the files inside the current working directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6cef713e", + "metadata": {}, + "outputs": [], + "source": [ + "str(check_output([\"ls\", \"-lh\"]), encoding=\"utf-8\").split(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "c76f819d", + "metadata": {}, + "source": [ + "### Let's `unzip` \"wi.zip\"." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e87ec01", + "metadata": {}, + "outputs": [], + "source": [ + "check_output([\"unzip\", \"wi.zip\"])" + ] + }, + { + "cell_type": "markdown", + "id": "274fa49a", + "metadata": {}, + "source": [ + "### Let's take a look at the files inside the current working directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2da3cd0", + "metadata": {}, + "outputs": [], + "source": [ + "str(check_output([\"ls\", \"-lh\"]), encoding=\"utf-8\").split(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "id": "90b11343", + "metadata": {}, + "source": [ + "### Traditional way of reading data using pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a3175526", + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.read_csv(\"wi.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13e6e034", + "metadata": {}, + "outputs": [], + "source": [ + "df.head(5) # Top 5 rows within the DataFrame" + ] + }, + { + "cell_type": "markdown", + "id": "5c79984c", + "metadata": {}, + "source": [ + "### How can we see all the column names?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08d9501d", + "metadata": {}, + "outputs": [], + "source": [ + "df.columns" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/lecture_material/04-performance2/wi.zip b/lecture_material/04-performance2/wi.zip new file mode 100644 index 0000000000000000000000000000000000000000..3a3fe315015a5d74558a5c810a1a5eb4a45b97f0 Binary files /dev/null and b/lecture_material/04-performance2/wi.zip differ diff --git a/lecture_material/04-performance2/worksheet.pdf b/lecture_material/04-performance2/worksheet.pdf new file mode 100644 index 0000000000000000000000000000000000000000..36df0b5d8f1a667cd47649aad56c4cd48f74b3f7 Binary files /dev/null and b/lecture_material/04-performance2/worksheet.pdf differ
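The `len(L)` section of `reading.ipynb` imagines a list that also keeps track of its sum, so that both `len` and `sum` would be O(1). A minimal sketch of that idea follows. The `SummedList` class and its `total` method are hypothetical names made up for illustration; they are not part of Python or of the notebooks in this diff, and real Python lists do not work this way.

```python
class SummedList:
    """Hypothetical list wrapper that keeps a running sum of its items."""

    def __init__(self, values=()):
        self._items = []
        self._total = 0
        for v in values:
            self.append(v)

    def append(self, value):
        # O(1): one list append plus one addition to the stored sum
        self._items.append(value)
        self._total += value

    def __setitem__(self, index, value):
        # O(1): adjust the stored sum by the difference, then overwrite
        self._total += value - self._items[index]
        self._items[index] = value

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        # O(1): the built-in list already stores its length
        return len(self._items)

    def total(self):
        # O(1): return the stored sum instead of looping over the items
        return self._total


nums = SummedList([5, 1, 2])
nums.append(4)   # stored sum goes from 8 to 12
nums[0] = 10     # stored sum goes from 12 to 17
print(len(nums), nums.total())
```

The trade-off is that every operation that changes the list must also update the stored sum, which is the same bookkeeping the reading describes for the stored length.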