# Lab 1: VM Setup
1. Find your group. You can find your group under Canvas - People - Study Group. Ask your TA/peer mentor if you need help finding your group members.
2. You are free to form groups with others. We recommend groups of 4-7 members. You can sign up for your group under Canvas - People - Study Group Self Sign Up.
3. Get to know your group members, asking each other the following:
2. Click "CREATE PROJECT"
3. Call it "cs320-fa23" and associate it with your account that has the free credits. Sometimes an option will appear to select an organization in which to nest your project. If this happens, select "wisc.edu".
3. Call it "cs320-sp23" and associate it with your account that has the free credits. Sometimes an option will appear to select an organization in which to nest your project. If this happens, select "wisc.edu".
<img src="img/2.png" width=600>
# Lab 10: Classification
1. Install `sklearn`, load and show some images from the MNIST dataset: either download the csv dataset, or look into `sklearn.datasets.fetch_openml` [example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-mnist-py).
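If you go the `fetch_openml` route, here's a minimal sketch (the dataset name `"mnist_784"` and the 28x28 reshape follow the linked example; the first download can take a minute):

```python
# Sketch: load MNIST via sklearn and display the first digit.
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
plt.imshow(X[0].reshape(28, 28), cmap="gray")  # each row is a flattened 28x28 image
plt.title(f"label: {y[0]}")
plt.show()
```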
# Screenshot Requirement
A screenshot that shows sklearn is installed successfully.
# Lab 11: Regression
1. Continue working on P5
2. Related lab documents: [SQL](./SQL.md), [Matrix](./counting-cells.md), [Raster](./raster.md)
# Screenshot Requirement
A screenshot that shows your progress
# SQL Database Queries
SQLite databases contain multiple tables. We can write queries to ask
questions about the data in these tables. One way is by putting our
queries in strings and using `pd.read_sql` to get the results back in
a DataFrame.
We'll give some examples here that will help you get the data you need
for P6. The `INNER JOIN` and `GROUP BY` operations will be especially
useful.
```python
import pandas as pd
import sqlite3
df = pd.DataFrame([
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "apples", "quantity": 3, "price": 1},
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "oranges", "quantity": 4, "price": 0.8},
    {"state": "wi", "city": "madison", "address": "123 Main St.", "item": "cantaloupe", "quantity": 5, "price": 2},
    {"state": "wi", "city": "milwaukee", "address": "456 State St", "item": "apples", "quantity": 6, "price": 0.9},
    {"state": "wi", "city": "milwaukee", "address": "456 State St.", "item": "oranges", "quantity": 8, "price": 1.2},
])
connection = sqlite3.connect("grocery.db")
df.to_sql("sales", connection, if_exists="replace", index=False)
```
Take a look at the data:
```python
pd.read_sql("SELECT * FROM sales", connection)
```
To review `GROUP BY`, take a look at this query that computes how much
revenue each kind of fruit generates and run it:
```python
pd.read_sql("SELECT item, SUM(quantity*price) AS dollars FROM sales GROUP BY item", connection)
```
Now, try to write a query that computes the revenue at each location.
<details>
<summary>ANSWER</summary>
<code>
pd.read_sql("SELECT state, city, address, SUM(quantity*price) AS dollars FROM sales GROUP BY state, city, address", connection)
</code>
</details>
Notice a problem? The issue is that all address information is
repeated for each location. That wastes space, but much worse, it
opens the possibility for typos leading to results such as this:
<img src="err.png" width=400>
To avoid these issues, it's common in practice to break up such a
table into two smaller tables, perhaps named `locations` and `sales`.
A `location_id` field might make it possible to combine the
information.
```python
df = pd.DataFrame([
    {"location_id": 1, "state": "wi", "city": "madison", "address": "123 Main St."},
    {"location_id": 2, "state": "wi", "city": "milwaukee", "address": "456 State St."},
])
df.to_sql("locations", connection, if_exists="replace", index=False)

df = pd.DataFrame([
    {"location_id": 1, "item": "apples", "quantity": 3, "price": 1},
    {"location_id": 1, "item": "oranges", "quantity": 4, "price": 0.8},
    {"location_id": 1, "item": "cantaloupe", "quantity": 5, "price": 2},
    {"location_id": 2, "item": "apples", "quantity": 6, "price": 0.9},
    {"location_id": 2, "item": "oranges", "quantity": 8, "price": 1.2},
])
df.to_sql("sales", connection, if_exists="replace", index=False)
```
Take a look at each table:
* `pd.read_sql("SELECT * FROM sales", connection)`
* `pd.read_sql("SELECT * FROM locations", connection)`
Note that you *could* figure out the location for each sale in the
`sales` table by using the `location_id` to find that information in
`locations`.
There's an easier way, `INNER JOIN` (there are other kinds of joins
that we won't discuss in CS 320).
Try running this:
```python
pd.read_sql("SELECT * FROM locations INNER JOIN sales", connection)
```
Notice that the `INNER JOIN` is creating a row for every combination
of the 2 rows in `locations` and the 5 rows in `sales`, for a total of
10 result rows. Most of these results are meaningless: many of the
output rows have `location_id` appearing twice, with the two values
being inconsistent.
We need to add an `ON` clause to match up each `sales` row with the
`locations` row that has the same `location_id`. Add `ON
locations.location_id = sales.location_id` to the end of the query:
```python
pd.read_sql("SELECT * FROM locations INNER JOIN sales ON locations.location_id = sales.location_id", connection)
```
The `location_id` was only useful for matching up the rows, so you may
want to drop it in pandas (there's no simple way to do this in SQL):
```python
pd.read_sql("""
SELECT * FROM
locations INNER JOIN sales
ON locations.location_id = sales.location_id""",
connection).drop(columns="location_id")
```
We can also do similar queries as we could before when we only had one
table. The `GROUP BY` will come after the `INNER JOIN`. How much
revenue did each fruit generate?
```python
pd.read_sql("""
SELECT item, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY item""", connection)
```
Now, try to write a query to answer the question: how much revenue was
there at each location?
<details>
<summary>ANSWER (option 1)</summary>
<code>
pd.read_sql("""
SELECT state, city, address, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY state, city, address""", connection)
</code>
</details>
<details>
<summary>ANSWER (option 2)</summary>
<code>
pd.read_sql("""
SELECT state, city, address, SUM(quantity*price) AS dollars
FROM locations INNER JOIN sales
ON locations.location_id = sales.location_id
GROUP BY locations.location_id""", connection)
</code>
</details>
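Once you're done experimenting, it's good practice to close the database connection:

```python
# Release the SQLite connection opened earlier.
connection.close()
```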
# Counting Cells
Run the following:
```python
import numpy as np
a = np.array([
    [0,0,5,8],
    [1,2,4,8],
    [2,4,6,9],
])
```
How many even numbers are in this matrix? What percentage of the
numbers are even? We'll walk you through the solution. Please run
each step (which builds on the previous) to see what's happening.
First step: mod by 2, to get a 0 in every even cell (and a 1 in every odd cell):
```python
a % 2
```
Now, let's do an elementwise comparison to get a True in every place where there is an even number:
```python
a % 2 == 0
```
It will be easier to count matches if we represent True as 1 and False as 0:
```python
(a % 2 == 0).astype(int)
```
How many is that?
```python
(a % 2 == 0).astype(int).sum()
```
And what percent of the total is that?
```python
(a % 2 == 0).astype(int).mean() * 100
```
This may be useful for counting what percentage of an area matches a
given land type in P6.
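For example, if the values in `a` encoded land types, you could count and percentage the cells matching a particular code (8 here, chosen arbitrarily for illustration) with the same trick:

```python
# Same idea with an == comparison instead of a modulo.
(a == 8).astype(int).sum()   # number of cells equal to 8
(a == 8).mean() * 100        # percentage of cells equal to 8
```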
# Geographic Raster Data
In class, we learned geopandas, which is a *vector-based* GIS tool --
that means geo data is represented by vectors of coordinates, which
form polygons and other shapes.
*Raster* data is the other common kind of geo data you'll encounter.
With raster data, you break land into a matrix, with numbers in each
cell telling you something about the land at a given position.
In this part, we'll learn a bit about the `rasterio` module. It will
help us create numpy arrays corresponding to how land is used in a
given WI county (this will be useful for predicting things like a
county's population).
First, install some packages:
```
pip3 install rasterio Pillow
```
P6 includes a `land.zip` dataset. Let's open it (this assumes
your lab notebook is in the `p6` directory -- you may need to modify the path to
`land.zip` if you're elsewhere):
```python
import rasterio
land = rasterio.open("zip://[path_to_mp5]/mp5/land.zip!wi.tif")
```
This is the dataset for all of WI. Let's say we want to only see Dane
County (which contains Madison). We can get this from TIGERweb, a
service run by the US Census Bureau.
1. go to https://tigerweb.geo.census.gov/arcgis/rest/services/TIGERweb/
2. click "TIGERweb/tigerWMS_Census2020"
3. click "Counties (82)"
4. at the bottom, click "Query"
5. in the "Where" box, type `NAME='Dane County'` exactly
6. under "Format", choose "GeoJSON"
7. click "Query (GET)"
8. copy the URL
Paste the URL into the following code snippet:
```python
import geopandas as gpd
url = "????"
dane = gpd.read_file(url)
dane.plot()
```
You should see a rough outline of Dane County.
**NOTE: do not make requests to TIGERweb as part of P6. We have
already done so and saved the results in a geojson file we
provide.**
We can use that outline as a *mask* on the raster data to get a numpy
array of land use. A mask identifies specific cells in a matrix that
matter to us (note that we need to convert our geopandas data to the
same CRS as the rasterio data):
```python
from rasterio.mask import mask
matrix, _ = mask(land, dane.to_crs(land.crs)["geometry"], crop=True)
matrix = matrix[0]
```
Let's visualize the county:
```python
import matplotlib.pyplot as plt
plt.imshow(matrix)
```
It should look like this:
<img src="dane.png" width=400>
Browse the legend here: https://www.mrlc.gov/data/legends/national-land-cover-database-2019-nlcd2019-legend
We see water is encoded as 11. We can highlight all the water regions in Dane County like this:
```python
plt.imshow(matrix == 11)
```
Try filtering the matrix in different ways to see where the following land covers are dominant (see the sketch after this list):
* Deciduous Forest
* Cultivated Crops
* Developed, Low Intensity
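Here's one way to explore these, sketched under the assumption that the NLCD codes from the legend linked above are 41 (Deciduous Forest), 82 (Cultivated Crops), and 22 (Developed, Low Intensity); double-check the legend before relying on them:

```python
# Sketch: highlight each land cover and report what fraction of cells match.
# Note: cells outside the county boundary (filled in during the crop) are
# included in the denominator, so treat the percentages as rough.
covers = {"Deciduous Forest": 41, "Cultivated Crops": 82, "Developed, Low Intensity": 22}
for name, code in covers.items():
    print(f"{name}: {(matrix == code).mean() * 100:.1f}% of cells")
    plt.figure()
    plt.imshow(matrix == code)
    plt.title(name)
```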
# Lab 12
1. Start work on `UserPredictor` of P6
2. Use a LinearRegression to [predict how long](./regression) it will take to run a piece of code
3. Practice [comparing models](./model-comparison)
4. Review [dot products/matrix multiplication](./dot-product-matrix-multiplication)
5. Try to fix the bugs in the [`main.ipynb`](./8-geo/main.ipynb). There is a `solution.ipynb` too, but only peek as a last resort!
# Screenshot Requirement
Submit a screenshot showing your progress on `UserPredictor`.
To provide you with flexibility (and peace of mind), we have set the P5 deadline for April 23. However, we strongly recommend completing P5 before that date and beginning work on P6 early; as an incentive, lab 12 may include a task that encourages you to do so.
# Vector Dot Product and Matrix Multiplication
In lecture, we've talked about what the dot product means to multiply
a vector by a vector. Here, we'll review that, and also learn what it
means to multiply a matrix by a vector, or a matrix by a matrix.
### 1. Dot Product of Two Vectors
Complete the following function so that it computes the dot product of
two vectors:
```python
import numpy as np

def v_v_dot_product(v1, v2):
    assert len(v1) == len(v2)
    total = 0
    for i in range(len(v1)):
        total += ????
    return total

a = np.array([100,10,1])
b = np.array([3,2,0])
v_v_dot_product(a, b) # should be 320
```
<details>
<summary>ANSWER</summary>
<code>v1[i] * v2[i]</code>
</details>
### 2. Matrix - Vector Multiplication
We can multiply a matrix by a vector by doing vector-by-vector
multiplications, taking one row at a time from the matrix for the
first vector. This will give us a vector containing one output value per
row in the matrix.
Complete the following function so that it computes the multiplication of
a matrix and a 1-dimensional vector (vertical):
```python
def m_v_multiplication(m, v):
    output = []
    for row in m:
        assert len(row) == len(v)
        output.append(????)
    return np.array(output)

A = np.array([
    [1,0,3],
    [0,2,3],
])
x = np.array([1,10,100])
m_v_multiplication(A, x) # should be [301, 320]
```
<details>
<summary>ANSWER</summary>
<code>v_v_dot_product(row, v)</code>
</details>
### 3. Matrix - Matrix Multiplication
We can multiply a matrix by a matrix by doing matrix-by-vector
multiplications, taking one column at a time from the second matrix
for the vector. Each of these multiplications gives an output vector
-- arranging these output vectors as columns in an output matrix gives
the result of the matrix multiplication.
Complete the following function so that it computes the product of two matrices:
```python
def m_m_multiplication(m1, m2):
    output_cols = []
    for col in m2.T:
        output_cols.append(????)
    return np.array(output_cols).T

A = np.array([
    [1,0],
    [1,2],
    [1,3],
    [0,5],
    [100,200],
])
B = np.array([
    [1,0,10],
    [0,1,1],
])
m_m_multiplication(A, B)
```
The result should be this:
```
array([[   1,    0,   10],
       [   1,    2,   12],
       [   1,    3,   13],
       [   0,    5,    5],
       [ 100,  200, 1200]])
```
<details>
<summary>ANSWER</summary>
<code>m_v_multiplication(m1, col)</code>
</details>
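As an optional sanity check (not part of the exercise), numpy's `@` operator computes the same product, so you can compare it against your function's output:

```python
# Optional: numpy's built-in matrix multiply should agree with m_m_multiplication.
print(A @ B)
print(np.allclose(m_m_multiplication(A, B), A @ B))  # expect True once ???? is filled in
```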
# Model Comparison
In this lab, we'll compare different polynomial regression models and
pick the one that best explains our dataset.
## Dataset
When learning new machine learning tools, it's often useful to
generate random datasets with some noise in them (instead of using
real data). Then you know the real underlying pattern, and you can
see whether the model detects it.
First, randomly generate 100 x values uniformly between 0 and 10:
```python
import numpy as np
x = np.random.????(0, 10, ????)
x
```
Browse the numpy documentation to look for a function that can
generate random values uniformly in some range. Click it to read
about the parameters, and find an example that generates multiple
values at once:
https://numpy.org/doc/1.16/reference/routines.random.html
You should get something like the following (your random x values will of course differ):
```python
array([2.79687525, 3.79759323, 3.28057227, 0.53394018, 3.02631135,
9.80546091, 9.52734311, 5.39445937, 0.88123164, 9.39220611,
0.14952772, 9.98741116, 5.41985529, 0.53689649, 5.13812755,
6.72324944, 6.85498995, 2.50218211, 2.69041511, 9.72999312,
4.59943722, 8.66264111, 8.6791649 , 8.789668 , 1.97837428,
7.41131163, 6.38631481, 8.01050144, 7.40393371, 8.52159954,
6.86880071, 0.4429817 , 2.63150248, 9.70783847, 8.57701317,
4.08390691, 1.53379304, 3.92925136, 5.59249091, 0.82697436,
2.11395572, 3.45483354, 3.35563161, 7.71499755, 5.7887254 ,
9.57698669, 1.45691284, 8.10710812, 1.51699873, 9.76220787,
4.1302431 , 9.30973542, 6.55166107, 8.31202397, 2.75940007,
0.74598903, 6.87346587, 2.9402988 , 3.47905205, 5.79509849,
6.71840305, 7.42857789, 5.11721878, 9.41966954, 8.46706032,
0.09892478, 6.11903957, 3.95076744, 0.22090436, 8.03670151,
8.36679871, 6.47744917, 9.24849941, 1.56997753, 9.32665206,
2.63553367, 0.42176439, 0.21810782, 6.18061177, 8.28879711,
4.2926099 , 6.50542003, 1.05920583, 4.27601354, 9.65403314,
4.58078682, 2.13464238, 1.11633827, 9.69418261, 6.16784997,
1.45127682, 7.54690907, 4.454097 , 8.32580719, 6.64915113,
9.44550501, 8.50366841, 5.77728997, 9.21509513, 3.05229763])
```
Put your x values in an `x` column of a new DataFrame:
```python
import pandas as pd
df = pd.DataFrame({"x": x})
df
```
Let's say we want the relationship between a y variable and our x
variable to be *y = 2x + 5*. We can add the y column and plot the
relationship like this:
```python
df["y"] = df["x"] * 2 + 5
df.plot.scatter(x="x", y="y")
```
Let's say you want to add some random noise to the relationship. Add
` + np.random.normal(scale=3, size=100)` to the end of the `df["y"] =
...` line, and look at the new scatter plot.
To create the data for the following work, modify the example as follows:
* use *y = 5x - x^2 + 50*
* use 8 for the scale of the noise
It should look roughly like the following:
<img src="data.png">
## sklearn setup
Import the following from sklearn, and make the appropriate call to split `df` into train/test data:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
train, test = ????(df)
len(train), len(test)
```
## 2nd Degree Model
Complete the following to create a 2nd degree regression model
pipeline that relates your y column to your x column and average
explained variance in the `scores_df` DataFrame:
```python
scores_df = pd.DataFrame()
degree = 2
model = Pipeline([
("poly", ????(degree=degree, include_bias=False)),
("model", ????()),
])
scores = cross_val_score(model, train[["????"]], train["????"])
scores_df.at[f"degree {degree}", "score"] = scores.mean()
scores_df.at[f"degree {degree}", "std_dev"] = scores.std()
scores_df
```
Create a bar plot with one bar showing the average explained variance
of the model and the standard deviation of the scores:
```python
scores_df["????"].plot.bar(yerr=scores_df["????"])
```
## N Degree Models
Adapt the above example so that instead of `degree = 2` you loop over
multiple degrees with `for degree in range(????, ????):`. Try all
degrees between 1 and 10.
Plot it as before:
```python
scores_df["score"].plot.bar(yerr=scores_df["std_dev"])
```
You should see something like this:
<img src="compare.png">
# Regression
This method randomly re-arranges the items in a list:
https://docs.python.org/3/library/random.html#random.shuffle. The
longer the list, the longer it takes.
Let's see if we can predict how long it will take to shuffle one
million numbers by (1) measuring how long it takes to shuffle one to
ten thousand numbers, (2) fitting a LinearRegression model to the
time/size measures, and (3) predicting/extrapolating to one million.
Create this table (we'll soon fill in the millisecond column):
```python
import time, random
import pandas as pd
from sklearn.linear_model import LinearRegression
times_df = pd.DataFrame({"length": [i * 1000 for i in range(11)], "ms": None}, dtype=float)
times_df
```
Complete and test the following function so that it uses `time.time()`
to measure how long it takes to do the shuffle, then returns that
amount of time in milliseconds:
```python
def measure_shuffle(list_len):
    nums = list(range(list_len))
    t0 = ????
    random.shuffle(nums)
    t1 = ????
    return ????
```
Now use `measure_shuffle` to fill in the `ms` column from our table
earlier (replace `????` with the column names in `times_df`) and plot
the relationship.
```python
for i in times_df.index:
    length = int(times_df.at[i, "length"])
    times_df.at[i, "ms"] = measure_shuffle(length)

times_df.plot.scatter(x="????", y="????")
```
<img src="regression.png" width=400>
Now train a model on the measured times, and use that to predict how
long it will take to shuffle a million numbers:
```python
lr = LinearRegression()
lr.fit(times_df[[????]], times_df[????])
lr.predict([[1000000]])
```
Call `measure_shuffle` with 1000000 to see how good your prediction
was. When I did this, the model predicted 943.0 milliseconds, but it
actually took 887.6 milliseconds. Not bad, considering we're
extrapolating to 100x larger than our largest measurement!
Note: LinearRegression worked well because `random.shuffle` uses an
O(N) algorithm. Think about what would happen if you used a
LinearRegression to extrapolate the time it takes to do a non-O(N)
piece of work. Or, better, replace `random.shuffle(nums)` with
`nums.sort()`, which has complexity O(N log N), and re-check how
accurate the predictions are (a sketch of this variant follows).
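A minimal sketch of that variant, assuming a `measure_sort` helper modeled on `measure_shuffle` (shuffling first so the sort has real work to do):

```python
import time, random

def measure_sort(list_len):
    nums = list(range(list_len))
    random.shuffle(nums)      # make sure the sort isn't trivially cheap
    t0 = time.time()
    nums.sort()               # O(N log N) work
    t1 = time.time()
    return (t1 - t0) * 1000   # milliseconds

measure_sort(10000)
```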