# Lab 12
1. Try to fix the bugs in the [`main.ipynb`](./8-geo/main.ipynb). There is a `solution.ipynb` too, but only peek as a last resort!
2. Use a LinearRegression to [predict how long](./regression) it will take to run a piece of code
3. Practice [comparing models](./model-comparison)
4. Review [dot products/matrix multiplication](./dot-product-matrix-multiplication)
# Screenshot Requirement
......
# Vector Dot Product and Matrix Multiplication
In lecture, we've talked about what it means to multiply a vector by a
vector (the dot product). Here, we'll review that, and also learn what
it means to multiply a matrix by a vector, or a matrix by a matrix.
### 1. Dot Product of Two Vectors
Complete the following function so that it computes the dot product of
two vectors:
```python
import numpy as np

def v_v_dot_product(v1, v2):
    assert len(v1) == len(v2)
    total = 0
    for i in range(len(v1)):
        total += ????
    return total

a = np.array([100,10,1])
b = np.array([3,2,0])
v_v_dot_product(a, b) # should be 320
```
<details>
<summary>ANSWER</summary>
<code>v1[i] * v2[i]</code>
</details>
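Once your function works, it's worth sanity checking it against numpy's
built-in dot product (for 1-dimensional arrays, `np.dot` and the `@`
operator compute the same thing):

```python
import numpy as np

a = np.array([100,10,1])
b = np.array([3,2,0])

# should match v_v_dot_product(a, b)
np.dot(a, b)   # 320
a @ b          # also 320
```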
### 2. Matrix - Vector Multiplication
We can multiply a matrix by a vector by doing vector-by-vector
multiplications, taking one row at a time from the matrix for the
first vector. This gives us a vector with one output value per row of
the matrix.
Complete the following function so that it computes the product of a
matrix and a 1-dimensional vector (treated as a column vector):
```python
def m_v_multiplication(m, v):
    output = []
    for row in m:
        assert len(row) == len(v)
        output.append(????)
    return np.array(output)

A = np.array([
    [1,0,3],
    [0,2,3],
])
x = np.array([1,10,100])
m_v_multiplication(A, x) # should be [301, 320]
```
<details>
<summary>ANSWER</summary>
<code>v_v_dot_product(row, v)</code>
</details>
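As before, you can check your implementation against numpy's built-in
matrix-vector product:

```python
import numpy as np

A = np.array([
    [1,0,3],
    [0,2,3],
])
x = np.array([1,10,100])

# should match m_v_multiplication(A, x)
A @ x   # array([301, 320])
```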
### 3. Matrix - Matrix Multiplication
We can multiply a matrix by a matrix by doing matrix-by-vector
multiplications, taking one column at a time from the second matrix
for the vector. Each of these multiplications gives an output vector
-- arranging these output vectors as columns in an output matrix gives
the result of the matrix multiplication.
Complete the following function so that it computes the product of
two matrices:
```python
def m_m_multiplication(m1, m2):
    output_cols = []
    for col in m2.T:
        output_cols.append(????)
    return np.array(output_cols).T

A = np.array([
    [1,0],
    [1,2],
    [1,3],
    [0,5],
    [100,200],
])
B = np.array([
    [1,0,10],
    [0,1,1],
])
m_m_multiplication(A, B)
```
The result should be this:
```
array([[   1,    0,   10],
       [   1,    2,   12],
       [   1,    3,   13],
       [   0,    5,    5],
       [ 100,  200, 1200]])
```
<details>
<summary>ANSWER</summary>
<code>m_v_multiplication(m1, col)</code>
</details>
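Again, numpy can do this multiplication directly, which is a handy way
to check your implementation:

```python
import numpy as np

A = np.array([
    [1,0],
    [1,2],
    [1,3],
    [0,5],
    [100,200],
])
B = np.array([
    [1,0,10],
    [0,1,1],
])

# should match m_m_multiplication(A, B)
A @ B
```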
# Model Comparison
In this lab, we'll compare different polynomial regression models and
pick the one that best explains our dataset.
## Dataset
When learning new machine learning tools, it's often useful to
generate random datasets with some noise in them (instead of using
real data). Then you know the real underlying pattern, and you can
see whether the model detects it.
First, randomly generate 100 x values uniformly between 0 and 10:
```python
import numpy as np
x = np.random.????(0, 10, ????)
x
```
Browse the numpy documentation to find a function that can generate
random values uniformly in some range; click it to read about the
parameters, and find an example that generates multiple values at
once:
https://numpy.org/doc/1.16/reference/routines.random.html
You should get something like the following (your random x values will of course differ):
```python
array([2.79687525, 3.79759323, 3.28057227, 0.53394018, 3.02631135,
       9.80546091, 9.52734311, 5.39445937, 0.88123164, 9.39220611,
       0.14952772, 9.98741116, 5.41985529, 0.53689649, 5.13812755,
       6.72324944, 6.85498995, 2.50218211, 2.69041511, 9.72999312,
       4.59943722, 8.66264111, 8.6791649 , 8.789668 , 1.97837428,
       7.41131163, 6.38631481, 8.01050144, 7.40393371, 8.52159954,
       6.86880071, 0.4429817 , 2.63150248, 9.70783847, 8.57701317,
       4.08390691, 1.53379304, 3.92925136, 5.59249091, 0.82697436,
       2.11395572, 3.45483354, 3.35563161, 7.71499755, 5.7887254 ,
       9.57698669, 1.45691284, 8.10710812, 1.51699873, 9.76220787,
       4.1302431 , 9.30973542, 6.55166107, 8.31202397, 2.75940007,
       0.74598903, 6.87346587, 2.9402988 , 3.47905205, 5.79509849,
       6.71840305, 7.42857789, 5.11721878, 9.41966954, 8.46706032,
       0.09892478, 6.11903957, 3.95076744, 0.22090436, 8.03670151,
       8.36679871, 6.47744917, 9.24849941, 1.56997753, 9.32665206,
       2.63553367, 0.42176439, 0.21810782, 6.18061177, 8.28879711,
       4.2926099 , 6.50542003, 1.05920583, 4.27601354, 9.65403314,
       4.58078682, 2.13464238, 1.11633827, 9.69418261, 6.16784997,
       1.45127682, 7.54690907, 4.454097 , 8.32580719, 6.64915113,
       9.44550501, 8.50366841, 5.77728997, 9.21509513, 3.05229763])
```
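If you want your notebook to produce the same "random" x values every
time you re-run it (handy while debugging), you can seed numpy's
generator first. This is optional, and the seed value below is an
arbitrary choice (it won't reproduce the exact values shown above):

```python
import numpy as np

# optional: fix the seed so np.random calls give reproducible results
np.random.seed(320)
```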
Put your x values in an `x` column of a new DataFrame:
```python
import pandas as pd
df = pd.DataFrame({"x": x})
df
```
Let's say we want the relationship between a y variable and our x
variable to be *y = 2x + 5*. We can add the y column and plot the
relationship like this:
```python
df["y"] = df["x"] * 2 + 5
df.plot.scatter(x="x", y="y")
```
Let's say you want to add some random noise to the relationship. Add
` + np.random.normal(scale=3, size=100)` to the end of the `df["y"] =
...` line, and look at the new scatter plot.
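For reference, continuing from the `df` created above, the noisy
version of that line would look something like this:

```python
# y = 2x + 5, plus normally-distributed noise with standard deviation 3
df["y"] = df["x"] * 2 + 5 + np.random.normal(scale=3, size=100)
df.plot.scatter(x="x", y="y")
```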
To create the data for the following work, modify the example as follows:
* use *y = 5x - x^2 + 50*
* use 8 for the scale of the noise
It should look roughly like the following:
<img src="data.png">
## sklearn setup
Import the following from sklearn, and make the appropriate call to split `df` into train/test data:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
train, test = ????(df)
len(train), len(test)
```
## 2nd Degree Model
Complete the following to create a 2nd degree regression model
pipeline that relates your y column to your x column and average
explained variance in the `scores_df` DataFrame:
```python
scores_df = pd.DataFrame()
degree = 2
model = Pipeline([
("poly", ????(degree=degree, include_bias=False)),
("model", ????()),
])
scores = cross_val_score(model, train[["????"]], train["????"])
scores_df.at[f"degree {degree}", "score"] = scores.mean()
scores_df.at[f"degree {degree}", "std_dev"] = scores.std()
scores_df
```
Create a bar plot with one bar showing the average explained variance
of the model, with the standard deviation of the scores as the error bar:
```python
scores_df["????"].plot.bar(yerr=scores_df["????"])
```
## N Degree Models
Adapt the above example so that instead of `degree = 2` you loop over
multiple degrees with `for degree in range(????, ????):`. Try all
degrees between 1 and 10.
Plot it as before:
```python
scores_df["score"].plot.bar(yerr=scores_df["std_dev"])
```
You should see something like this:
<img src="compare.png">
# Regression
The `random.shuffle` method randomly re-arranges the items in a list:
https://docs.python.org/3/library/random.html#random.shuffle. The
longer the list, the longer it takes.
Let's see if we can predict how long it will take to shuffle one
million numbers by (1) measuring how long it takes to shuffle lists of
up to ten thousand numbers, (2) fitting a LinearRegression model to
the time/size measurements, and (3) predicting/extrapolating to one
million.
Create this table (we'll soon fill in the millisecond column):
```python
import time, random
import pandas as pd
from sklearn.linear_model import LinearRegression
times_df = pd.DataFrame({"length": [i * 1000 for i in range(11)], "ms": None}, dtype=float)
times_df
```
Complete and test the following function so that it uses `time.time()`
to measure how long it takes to do the shuffle, then returns that
amount of time in milliseconds:
```python
def measure_shuffle(list_len):
    nums = list(range(list_len))
    t0 = ????
    random.shuffle(nums)
    t1 = ????
    return ????
```
Now use `measure_shuffle` to fill in the `ms` column from our table
earlier (replace `????` with the column names in `times_df`) and plot
the relationship.
```python
for i in times_df.index:
    length = int(times_df.at[i, "length"])
    times_df.at[i, "ms"] = measure_shuffle(length)

times_df.plot.scatter(x="????", y="????")
```
<img src="regression.png" width=400>
Now train a model on the measured times, and use that to predict how
long it will take to shuffle a million numbers:
```python
lr = LinearRegression()
lr.fit(times_df[[????]], times_df[????])
lr.predict([[1000000]])
```
Call `measure_shuffle` with 1000000 to see how good your prediction
was. When I did this, the model predicted 943.0 milliseconds, but it
actually took 887.6 milliseconds. Not bad, considering we're
extrapolating to 100x larger than our largest measurement!
Note: LinearRegression worked well because `random.shuffle` uses an
O(N) algorithm. Think about what would happen if you used a
LinearRegression to extrapolate the time it takes to do a non-O(N)
piece of work. Or, better, replace `random.shuffle(nums)` with
`nums.sort()`, which has complexity O(N log N), and re-check how
accurate the predictions are.
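Here's a minimal sketch of that experiment, assuming your
`measure_shuffle` is already finished; `measure_sort` is just a
hypothetical helper name, and the timings on your machine will differ:

```python
import time, random

def measure_sort(list_len):
    # same idea as measure_shuffle, but timing an O(N log N) sort instead
    nums = list(range(list_len))
    random.shuffle(nums)       # un-sort first so the sort has real work to do
    t0 = time.time()
    nums.sort()
    t1 = time.time()
    return (t1 - t0) * 1000    # milliseconds

measure_sort(1000000)          # compare this against a linear model's prediction
```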