# Lab 1: VM Setup
1. Find your group. You can find your group under Canvas - People - Study Group. Ask your TA/peer mentor if you need help finding your group members.
2. You are free to form groups with others. We recommend groups of 4-7 members. You can sign up for your group under Canvas - People - Study Group Self Sign Up.
3. Get to know your group members, asking each other the following:
2. Click "CREATE PROJECT"
3. Call it "cs320-fa23" and associate it with your account that has the free credits. Sometimes an option will appear to select an organization in which to nest your project. If this happens, select "wisc.edu".
3. Call it "cs320-sp23" and associate it with your account that has the free credits. Sometimes an option will appear to select an organization in which to nest your project. If this happens, select "wisc.edu".
<img src="img/2.png" width=600>
# Lab 10: Classification
1. Install `sklearn`, load and show some images from the MNIST dataset: either download the csv dataset, or look into `sklearn.datasets.fetch_openml` [example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-mnist-py).
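If you go the `fetch_openml` route, here's a minimal sketch (the dataset name `"mnist_784"` and the 28x28 reshape follow the linked example; the first download can take a minute):

```python
# Sketch: load MNIST via sklearn and display the first digit.
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
plt.imshow(X[0].reshape(28, 28), cmap="gray")  # each row is a flattened 28x28 image
plt.title(f"label: {y[0]}")
plt.show()
```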
# Screenshot Requirement
A screenshot that shows sklearn is installed successfully.
# Lab 11: Regression
1. Continue working on P5
2. Related lab documents: [SQL](./SQL.md), [Matrix](./counting-cells.md), [Raster](./raster.md)
# Screenshot Requirement
A screenshot that shows your progress
# SQL Database Queries
SQLite databases contain multiple tables. We can write queries to ask
questions about the data in these tables. One way is by putting our
queries in strings and using `pd.read_sql` to get the results back in
a DataFrame.
We'll give some examples here that will help you get the data you need
for P6. The `INNER JOIN` and `GROUP BY` operations will be especially
useful.
```python
import pandas as pd
import sqlite3
df = pd.DataFrame([
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "apples", "quantity": 3, "price": 1},
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "oranges", "quantity": 4, "price": 0.8},
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "cantaloupe", "quantity": 5, "price": 2},
    {"state": "wi", "city": "milwaukee", "address": "456 State St", "item": "apples", "quantity": 6, "price": 0.9},
    {"state": "wi", "city": "milwaukee", "address": "456 State St.", "item": "oranges", "quantity": 8, "price": 1.2},
])
connection = sqlite3.connect("grocery.db")
df.to_sql("sales", connection, if_exists="replace", index=False)
```
Take a look at the data:
```python
pd.read_sql("SELECT * FROM sales", connection)
```
To review `GROUP BY`, take a look at this query that computes how much
revenue each kind of fruit generates and run it:
```python
pd.read_sql("SELECT item, SUM(quantity*price) AS dollars FROM sales GROUP BY item", connection)
```
Now, try to write a query that computes the revenue at each location.
<details>
<summary>ANSWER</summary>
<code>
pd.read_sql("SELECT state, city, address, SUM(quantity*price) AS dollars FROM sales GROUP BY state, city, address", connection)
</code>
</details>
Notice a problem? The issue is that all address information is
repeated for each location. That wastes space, but much worse, it
opens the possibility for typos leading to results such as this:
<img src="err.png" width=400>
To avoid these issues, it's common in practice to break up such a
table into two smaller tables, perhaps named `locations` and `sales`.
A `location_id` field might make it possible to combine the
information.
```python
df = pd.DataFrame([
    {"location_id": 1, "state": "wi", "city": "madison", "address": "123 Main St."},
    {"location_id": 2, "state": "wi", "city": "milwaukee", "address": "456 State St."},
])
df.to_sql("locations", connection, if_exists="replace", index=False)

df = pd.DataFrame([
    {"location_id": 1, "item": "apples", "quantity": 3, "price": 1},
    {"location_id": 1, "item": "oranges", "quantity": 4, "price": 0.8},
    {"location_id": 1, "item": "cantaloupe", "quantity": 5, "price": 2},
    {"location_id": 2, "item": "apples", "quantity": 6, "price": 0.9},
    {"location_id": 2, "item": "oranges", "quantity": 8, "price": 1.2},
])
df.to_sql("sales", connection, if_exists="replace", index=False)
```
Take a look at each table:
* `pd.read_sql("SELECT * FROM sales", connection)`
* `pd.read_sql("SELECT * FROM locations", connection)`
Note that you *could* figure out the location for each sale in the
`sales` table by using the `location_id` to find that information in
`locations`.
There's an easier way, `INNER JOIN` (there are other kinds of joins
that we won't discuss in CS 320).
Try running this:
```python
pd.read_sql("SELECT * FROM locations INNER JOIN sales", connection)
```
Notice that the `INNER JOIN` is creating a row for every combination
of the 2 rows in `locations` and the 5 rows in `sales`, for a total of
10 result rows. Most of these results are meaningless: many of the
output rows have `location_id` appearing twice, with the two values
being inconsistent.
We need to add an `ON` clause to match up each `sales` row with the
`locations` row that has the same `location_id`. Add `ON
locations.location_id = sales.location_id` to the end of the query:
```python
pd.read_sql("SELECT * FROM locations INNER JOIN sales ON locations.location_id = sales.location_id", connection)
```
The `location_id` was only useful for matching up the rows, so you may
want to drop it in pandas (there's no simple way to do this in SQL):
```python
pd.read_sql("""
SELECT * FROM
locations INNER JOIN sales
ON locations.location_id = sales.location_id""",
connection).drop(columns="location_id")
```
We can also do similar queries as we could before when we only had one
table. The `GROUP BY` will come after the `INNER JOIN`. How much
revenue did each fruit generate?
```python
pd.read_sql("""
SELECT item, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY item""", connection)
```
Now, try to write a query to answer the question: how much revenue was
there at each location?
<details>
<summary>ANSWER (option 1)</summary>
<code>
pd.read_sql("""
SELECT state, city, address, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY state, city, address""", connection)
</code>
</details>
<details>
<summary>ANSWER (option 2)</summary>
<code>
pd.read_sql("""
SELECT state, city, address, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY locations.location_id""", connection)
</code>
</details>
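Once you're done experimenting, it's good practice to close the database connection:

```python
# Release the SQLite connection opened earlier.
connection.close()
```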
# Counting Cells
Run the following:
```python
import numpy as np
a = np.array([
    [0,0,5,8],
    [1,2,4,8],
    [2,4,6,9],
])
```
How many even numbers are in this matrix? What percentage of the
numbers are even? We'll walk you through the solution. Please run
each step (which builds on the previous) to see what's happening.
First step: mod by 2, to get a 0 in every even cell (and a 1 in every odd cell):
```python
a % 2
```
Now, let's do an elementwise comparison to get a True in every place where there is an even number:
```python
a % 2 == 0
```
It will be easier to count matches if we represent True as 1 and False as 0:
```python
(a % 2 == 0).astype(int)
```
How many is that?
```python
(a % 2 == 0).astype(int).sum()
```
And what percent of the total is that?
```python
(a % 2 == 0).astype(int).mean() * 100
```
This may be useful for counting what percentage of an area matches a
given land type in P6.
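For example, if the values in `a` encoded land types, you could count and percentage the cells matching a particular code (8 here, chosen arbitrarily for illustration) with the same trick:

```python
# Same idea with an == comparison instead of a modulo.
(a == 8).astype(int).sum()   # number of cells equal to 8
(a == 8).mean() * 100        # percentage of cells equal to 8
```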
# Geographic Raster Data
In class, we learned geopandas, which is a *vector-based* GIS tool --
that means geo data is represented by vectors of coordinates, which
form polygons and other shapes.
*Raster* data is the other common kind of geo data you'll encounter.
With raster data, you break land into a matrix, with numbers in each
cell telling you something about the land at a given position.
In this part, we'll learn a bit about the `rasterio` module. It will
help us create numpy arrays corresponding to how land is used in a
given WI county (this will be useful for predicting things like a
county's population).
First, install some packages:
```
pip3 install rasterio Pillow
```
P6 includes a `land.zip` dataset. Let's open it (this assumes
your lab notebook is in the `p6` directory -- you may need to modify the path to
`land.zip` if you're elsewhere):
```python
import rasterio
land = rasterio.open("zip://[path_to_mp5]/mp5/land.zip!wi.tif")
```
This is the dataset for all of WI. Let's say we want to only see Dane
County (which contains Madison). We can get this from TIGERweb, a
service run by the US Census Bureau.
1. go to https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/
2. click "TIGERweb/tigerWMS_Census2020"
3. click "Counties (82)"
4. at the bottom, click "Query"
5. in the "Where" box, type `NAME='Dane County'` exactly
6. under "Format", choose "GeoJSON"
7. click "Query (GET)"
8. copy the URL
Paste the URL into the following code snippet:
```python
import geopandas as gpd
url = "????"
dane = gpd.read_file(url)
dane.plot()
```
You should see a rough outline of Dane County.
**NOTE: do not make requests to TIGERweb as part of P6. We have
already done so and saved the results in a geojson file we
provide.**
We can use that outline as a *mask* on the raster data to get a numpy
array of land use. A mask identifies specific cells in a matrix that
matter to us (note that we need to convert our geopandas data to the
same CRS as the rasterio data):
```python
from rasterio.mask import mask
matrix, _ = mask(land, dane.to_crs(land.crs)["geometry"], crop=True)
matrix = matrix[0]
```
Let's visualize the county:
```python
import matplotlib.pyplot as plt
plt.imshow(matrix)
```
It should look like this:
<img src="dane.png" width=400>
Browse the legend here: https://www.mrlc.gov/data/legends/national-land-cover-database-2019-nlcd2019-legend
We see water is encoded as 11. We can highlight all the water regions in Dane County like this:
```python
plt.imshow(matrix == 11)
```
Try filtering the matrix in different ways to see where the following land covers are dominant (see the sketch after this list):
* Deciduous Forest
* Cultivated Crops
* Developed, Low Intensity
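Here's one way to explore these, sketched under the assumption that the NLCD codes from the legend linked above are 41 (Deciduous Forest), 82 (Cultivated Crops), and 22 (Developed, Low Intensity); double-check the legend before relying on them:

```python
# Sketch: highlight each land cover and report what fraction of cells match.
# Note: cells outside the county boundary (filled in during the crop) are
# included in the denominator, so treat the percentages as rough.
covers = {"Deciduous Forest": 41, "Cultivated Crops": 82, "Developed, Low Intensity": 22}
for name, code in covers.items():
    print(f"{name}: {(matrix == code).mean() * 100:.1f}% of cells")
    plt.figure()
    plt.imshow(matrix == code)
    plt.title(name)
```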
# Lab 12
1. Start work on `UserPredictor` of P6
2. Use a LinearRegression to [predict how long](./regression) it will take to run a piece of code
3. Practice [comparing models](./model-comparison)
4. Review [dot products/matrix multiplication](./dot-product-matrix-multiplication)
5. Try to fix the bugs in the [`main.ipynb`](./8-geo/main.ipynb). There is a `solution.ipynb` too, but only peek as a last resort!
# Screenshot Requirement
Submit a screenshot showing your progress on `UserPredictor`.
To provide you with flexibility (and peace of mind), we have set the P5 deadline for April 23. However, we strongly recommend completing P5 before that date and beginning work on P6 early; as an incentive, lab 12 may include a task that encourages you to do so.
# Vector Dot Product and Matrix Multiplication
In lecture, we've talked about what the dot product means to multiply
a vector by a vector. Here, we'll review that, and also learn what it
means to multiply a matrix by a vector, or a matrix by a matrix.
### 1. Dot Product of Two Vectors
Complete the following function so that it computes the dot product of
two vectors:
```python
import numpy as np

def v_v_dot_product(v1, v2):
    assert len(v1) == len(v2)
    total = 0
    for i in range(len(v1)):
        total += ????
    return total

a = np.array([100,10,1])
b = np.array([3,2,0])
v_v_dot_product(a, b) # should be 320
```
<details>
<summary>ANSWER</summary>
<code>v1[i] * v2[i]</code>
</details>
### 2. Matrix - Vector Multiplication
We can multiply a matrix by a vector by doing vector-by-vector
multiplications, taking one row at a time from the matrix for the
first vector. This will give us a vector containing one output value per
row in the matrix.
Complete the following function so that it computes the multiplication of
a matrix and a 1-dimensional vector (vertical):
```python
def m_v_multiplication(m, v):
    output = []
    for row in m:
        assert len(row) == len(v)
        output.append(????)
    return np.array(output)

A = np.array([
    [1,0,3],
    [0,2,3],
])
x = np.array([1,10,100])
m_v_multiplication(A, x) # should be [301, 320]
```
<details>
<summary>ANSWER</summary>
<code>v_v_dot_product(row, v)</code>
</details>
### 3. Matrix - Matrix Multiplication
We can multiply a matrix by a matrix by doing matrix-by-vector
multiplications, taking one column at a time from the second matrix
for the vector. Each of these multiplications gives an output vector
-- arranging these output vectors as columns in an output matrix gives
the result of the matrix multiplication.
Complete the following function so that it computes the product of two matrices:
```python
def m_m_multiplication(m1, m2):
    output_cols = []
    for col in m2.T:
        output_cols.append(????)
    return np.array(output_cols).T

A = np.array([
    [1,0],
    [1,2],
    [1,3],
    [0,5],
    [100,200],
])
B = np.array([
    [1,0,10],
    [0,1,1],
])
m_m_multiplication(A, B)
```
The result should be this:
```
array([[   1,    0,   10],
       [   1,    2,   12],
       [   1,    3,   13],
       [   0,    5,    5],
       [ 100,  200, 1200]])
```
<details>
<summary>ANSWER</summary>
<code>m_v_multiplication(m1, col)</code>
</details>
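As an optional sanity check (not part of the exercise), numpy's `@` operator computes the same product, so you can compare it against your function's output:

```python
# Optional: numpy's built-in matrix multiply should agree with m_m_multiplication.
print(A @ B)
print(np.allclose(m_m_multiplication(A, B), A @ B))  # expect True once ???? is filled in
```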
# Model Comparison
In this lab, we'll compare different polynomial regression models and
pick the one that best explains our dataset.
## Dataset
When learning new machine learning tools, it's often useful to
generate random datasets with some noise in them (instead of using
real data). Then you know the real underlying pattern, and you can
see whether the model detects it.
First, randomly generate 100 x values uniformly between 0 and 10:
```python
import numpy as np
x = np.random.????(0, 10, ????)
x
```
Browse the numpy documentation to look for a function that can
generate random values uniformly in some range. Click it to read
about the parameters, and find an example that generates multiple
values at once:
https://numpy.org/doc/1.16/reference/routines.random.html
You should get something like the following (your random x values will of course differ):
```python
array([2.79687525, 3.79759323, 3.28057227, 0.53394018, 3.02631135,
9.80546091, 9.52734311, 5.39445937, 0.88123164, 9.39220611,
0.14952772, 9.98741116, 5.41985529, 0.53689649, 5.13812755,
6.72324944, 6.85498995, 2.50218211, 2.69041511, 9.72999312,
4.59943722, 8.66264111, 8.6791649 , 8.789668 , 1.97837428,
7.41131163, 6.38631481, 8.01050144, 7.40393371, 8.52159954,
6.86880071, 0.4429817 , 2.63150248, 9.70783847, 8.57701317,
4.08390691, 1.53379304, 3.92925136, 5.59249091, 0.82697436,
2.11395572, 3.45483354, 3.35563161, 7.71499755, 5.7887254 ,
9.57698669, 1.45691284, 8.10710812, 1.51699873, 9.76220787,
4.1302431 , 9.30973542, 6.55166107, 8.31202397, 2.75940007,
0.74598903, 6.87346587, 2.9402988 , 3.47905205, 5.79509849,
6.71840305, 7.42857789, 5.11721878, 9.41966954, 8.46706032,
0.09892478, 6.11903957, 3.95076744, 0.22090436, 8.03670151,
8.36679871, 6.47744917, 9.24849941, 1.56997753, 9.32665206,
2.63553367, 0.42176439, 0.21810782, 6.18061177, 8.28879711,
4.2926099 , 6.50542003, 1.05920583, 4.27601354, 9.65403314,
4.58078682, 2.13464238, 1.11633827, 9.69418261, 6.16784997,
1.45127682, 7.54690907, 4.454097 , 8.32580719, 6.64915113,
9.44550501, 8.50366841, 5.77728997, 9.21509513, 3.05229763])
```
Put your x values in an `x` column of a new DataFrame:
```python
import pandas as pd
df = pd.DataFrame({"x": x})
df
```
Let's say we want the relationship between a y variable and our x
variable to be *y = 2x + 5*. We can add the y column and plot the
relationship like this:
```python
df["y"] = df["x"] * 2 + 5
df.plot.scatter(x="x", y="y")
```
Let's say you want to add some random noise to the relationship. Add
` + np.random.normal(scale=3, size=100)` to the end of the `df["y"] =
...` line, and look at the new scatter plot.
To create the data for the following work, modify the example as follows:
* use *y = 5x - x^2 + 50*
* use 8 for the scale of the noise
It should look roughly like the following:
<img src="data.png">
## sklearn setup
Import the following from sklearn, and make the appropriate call to split `df` into train/test data:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
train, test = ????(df)
len(train), len(test)
```
## 2nd Degree Model
Complete the following to create a 2nd degree regression model
pipeline that relates your y column to your x column and average
explained variance in the `scores_df` DataFrame:
```python
scores_df = pd.DataFrame()
degree = 2
model = Pipeline([
("poly", ????(degree=degree, include_bias=False)),
("model", ????()),
])
scores = cross_val_score(model, train[["????"]], train["????"])
scores_df.at[f"degree {degree}", "score"] = scores.mean()
scores_df.at[f"degree {degree}", "std_dev"] = scores.std()
scores_df
```
Create a bar plot with one bar showing the average explained variance
of the model and the standard deviation of the scores:
```python
scores_df["????"].plot.bar(yerr=scores_df["????"])
```
## N Degree Models
Adapt the above example so that instead of `degree = 2` you loop over
multiple degrees with `for degree in range(????, ????):`. Try all
degrees between 1 and 10.
Plot it as before:
```python
scores_df["score"].plot.bar(yerr=scores_df["std_dev"])
```
You should see something like this:
<img src="compare.png">
# Regression
This method randomly re-arranges the items in a list:
https://docs.python.org/3/library/random.html#random.shuffle. The
longer the list, the longer it takes.
Let's see if we can predict how long it will take to shuffle one
million numbers by (1) measuring how long it takes to shuffle one to
ten thousand numbers, (2) fitting a LinearRegression model to the
time/size measures, and (3) predicting/extrapolating to one million.
Create this table (we'll soon fill in the millisecond column):
```python
import time, random
import pandas as pd
from sklearn.linear_model import LinearRegression
times_df = pd.DataFrame({"length": [i * 1000 for i in range(11)], "ms": None}, dtype=float)
times_df
```
Complete and test the following function so that it uses `time.time()`
to measure how long it takes to do the shuffle, then returns that
amount of time in milliseconds:
```python
def measure_shuffle(list_len):
    nums = list(range(list_len))
    t0 = ????
    random.shuffle(nums)
    t1 = ????
    return ????
```
Now use `measure_shuffle` to fill in the `ms` column from our table
earlier (replace `????` with the column names in `times_df`) and plot
the relationship.
```python
for i in times_df.index:
    length = int(times_df.at[i, "length"])
    times_df.at[i, "ms"] = measure_shuffle(length)

times_df.plot.scatter(x="????", y="????")
```
<img src="regression.png" width=400>
Now train a model on the measured times, and use that to predict how
long it will take to shuffle a million numbers:
```python
lr = LinearRegression()
lr.fit(times_df[[????]], times_df[????])
lr.predict([[1000000]])
```
Call `measure_shuffle` with 1000000 to see how good your prediction
was. When I did this, the model predicted 943.0 milliseconds, but it
actually took 887.6 milliseconds. Not bad, considering we're
extrapolating to 100x larger than our largest measurement!
Note: LinearRegression worked well because `random.shuffle` uses an
O(N) algorithm. Think about what would happen if you used a
LinearRegression to extrapolate the time it takes to do a non-O(N)
piece of work. Or, better, replace `random.shuffle(nums)` with
`nums.sort()`, which has complexity O(N log N), and re-check how
accurate the predictions are (a sketch of this variant follows).
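A minimal sketch of that variant, assuming a `measure_sort` helper modeled on `measure_shuffle` (shuffling first so the sort has real work to do):

```python
import time, random

def measure_sort(list_len):
    nums = list(range(list_len))
    random.shuffle(nums)      # make sure the sort isn't trivially cheap
    t0 = time.time()
    nums.sort()               # O(N log N) work
    t1 = time.time()
    return (t1 - t0) * 1000   # milliseconds

measure_sort(10000)
```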