Commit 48651679 authored by Ashwin Maran

add lectures 26, 27, 28

parent 01353f34
Showing 12880 additions and 0 deletions
%% Cell type:markdown id: tags:
## Warmup 0: Importing Pandas!
%% Cell type:code id: tags:
``` python
import pandas as pd
```
%% Cell type:markdown id: tags:
## Warmup 1: Find the mean, median, mode, and standard deviation of the following list of scores
%% Cell type:code id: tags:
``` python
my_scores = [44, 32, 19, 67, 23, 23, 92, 47, 47, 78, 84]
```
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Learning Objectives
- Create a pandas **Series** from a **list** or from a **dict**,
- Use **Series** methods `max`, `min`, `mean`, `median`, `mode`, `quantile`, `value_counts`,
- Extract elements from a **Series** using **Boolean indexing**,
- Access **Series** members using `.loc`, `.iloc`, `.items`, and slicing,
- Perform **Series** element-wise operations
%% Cell type:markdown id: tags:
# Pandas
%% Cell type:markdown id: tags:
**What is Pandas?**
- Pandas is a package of tools for doing Data Science
- Pandas was installed with Anaconda, so it's already on your computer
- [Learn More](https://en.wikipedia.org/wiki/Pandas_(software))
If for some reason you don't have pandas installed, run the following command in Terminal or PowerShell:
<pre>pip install pandas</pre>
%% Cell type:markdown id: tags:
A Pandas Series is like a combination of a list and a dictionary. Every value has a label, called its 'index'. When you don't supply labels, the index defaults to the integer positions 0, 1, 2, ...
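As a minimal sketch of the list/dict comparison, the cell below builds one Series from a list and one from a dict and prints their indexes. The names `from_list` and `from_dict` and the temperature data are made up for illustration.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical data (not from the lecture): three daily high temperatures.
from_list = pd.Series([61, 58, 70])                       # index defaults to 0, 1, 2
from_dict = pd.Series({"Mon": 61, "Tue": 58, "Wed": 70})  # dict keys become the index

print(from_list.index.tolist())   # [0, 1, 2]
print(from_dict.index.tolist())   # ['Mon', 'Tue', 'Wed']
print(from_dict.tolist())         # [61, 58, 70]
print(from_dict["Tue"])           # 58
```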
%% Cell type:markdown id: tags:
## Series from a `list`
%% Cell type:code id: tags:
``` python
scores = pd.Series([44, 32, 19, 67, 23, 23, 92, 47, 47, 78, 84])
scores
```
%% Cell type:markdown id: tags:
A Pandas series acts a lot like a list; you can index and slice.
%% Cell type:code id: tags:
``` python
scores[3]
```
%% Cell type:code id: tags:
``` python
scores[3:6]
```
%% Cell type:markdown id: tags:
### Series calculations: mean, median, mode, quartiles, sd, count
%% Cell type:markdown id: tags:
#### `mean`, `median`, and `std` return the mean, median, and standard deviation
%% Cell type:code id: tags:
``` python
print(scores.mean())
print(scores.median())
print(scores.std())
```
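%% Cell type:markdown id: tags:
One detail worth checking against Warmup 1: `Series.std()` defaults to the *sample* standard deviation (`ddof=1`). The sketch below compares it with Python's `statistics` module; the variable names `sample_sd` and `population_sd` are just for illustration.
%% Cell type:code id: tags:
```python
import statistics
import pandas as pd

my_scores = [44, 32, 19, 67, 23, 23, 92, 47, 47, 78, 84]
scores = pd.Series(my_scores)

# Default: sample standard deviation (divides by n - 1).
sample_sd = scores.std()
# ddof=0 gives the population standard deviation (divides by n).
population_sd = scores.std(ddof=0)

# Each pair should agree up to floating-point rounding.
print(round(sample_sd, 4), round(statistics.stdev(my_scores), 4))
print(round(population_sd, 4), round(statistics.pstdev(my_scores), 4))
```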
%% Cell type:markdown id: tags:
#### There could be multiple modes, so `mode` returns a Series
%% Cell type:code id: tags:
``` python
print(scores.mode())
```
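%% Cell type:markdown id: tags:
In `scores`, both 23 and 47 appear twice, so there are two modes. A small sketch of how to work with the returned Series (the name `modes` is illustrative):
%% Cell type:code id: tags:
```python
import pandas as pd

scores = pd.Series([44, 32, 19, 67, 23, 23, 92, 47, 47, 78, 84])

modes = scores.mode()     # a Series, even when there is only one mode
print(modes.tolist())     # [23, 47]
print(len(modes))         # 2
print(modes.iloc[0])      # 23  (the smallest mode comes first)
```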
%% Cell type:markdown id: tags:
#### `quantile` returns a Series of the numbers at each specified quantile
%% Cell type:code id: tags:
``` python
print(scores.quantile([1.0, 0.75, 0.5, 0.25, 0]))
```
%% Cell type:code id: tags:
``` python
print(scores.quantile([0.9, 0.1]))
```
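%% Cell type:markdown id: tags:
By default, `quantile` interpolates linearly between neighboring sorted values. The sketch below reproduces the 0.25 quantile by hand; the helper names (`ordered`, `pos`, `low`, `frac`, `by_hand`) are illustrative, and the hand calculation only covers quantiles below 1.0.
%% Cell type:code id: tags:
```python
import pandas as pd

scores = pd.Series([44, 32, 19, 67, 23, 23, 92, 47, 47, 78, 84])

ordered = scores.sort_values().tolist()
q = 0.25
pos = q * (len(ordered) - 1)   # fractional position in the sorted list: 2.5
low = int(pos)                 # position just below: 2
frac = pos - low               # how far toward the next value: 0.5
by_hand = ordered[low] + frac * (ordered[low + 1] - ordered[low])

print(by_hand)               # 27.5
print(scores.quantile(q))    # 27.5
```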
%% Cell type:markdown id: tags:
#### `value_counts` creates a series where the index is the data, and the value is its count in the series
%% Cell type:code id: tags:
``` python
ages = pd.Series([18, 19, 20, 20, 20, 17, 18, 24, 25, 35, 22, 20, 21, 21, 20, 23, 23, 19, 19, 19, 20, 21])
ages.value_counts()
```
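%% Cell type:markdown id: tags:
A short sketch of reading the result: the ages become index labels, so `.loc` looks up the count for one age, and `normalize=True` returns fractions instead of counts. The names `counts` and `shares` are illustrative.
%% Cell type:code id: tags:
```python
import pandas as pd

ages = pd.Series([18, 19, 20, 20, 20, 17, 18, 24, 25, 35, 22, 20, 21, 21, 20, 23, 23, 19, 19, 19, 20, 21])

counts = ages.value_counts()                # index: age, value: how often it occurs
shares = ages.value_counts(normalize=True)  # same index, values are fractions of the total

print(counts.loc[20])              # 6
print(counts.idxmax())             # 20  (the most common age)
print(round(shares.loc[20], 3))    # 0.273
```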
%% Cell type:markdown id: tags:
#### A series can be sorted by index or by values
%% Cell type:code id: tags:
``` python
ages.value_counts().sort_index()
```
%% Cell type:code id: tags:
``` python
ages.value_counts().sort_values(ascending=False)
```
%% Cell type:markdown id: tags:
### Plotting
%% Cell type:markdown id: tags:
## Series bar chart
%% Cell type:code id: tags:
``` python
age_plot = ages.value_counts().sort_index().plot.bar(color='lightsalmon')
age_plot.set(xlabel="age", ylabel="count")
```
%% Cell type:markdown id: tags:
# Filtering
%% Cell type:markdown id: tags:
## Example 1: What ages are at least 21?
%% Cell type:code id: tags:
``` python
at_least_21 = ages[ages >= 21]
at_least_21
```
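%% Cell type:markdown id: tags:
It can help to look at the intermediate step. The comparison `ages >= 21` produces a Boolean Series (called `mask` here for illustration), and indexing with it keeps only the rows where the mask is `True`, along with their original index labels.
%% Cell type:code id: tags:
```python
import pandas as pd

ages = pd.Series([18, 19, 20, 20, 20, 17, 18, 24, 25, 35, 22, 20, 21, 21, 20, 23, 23, 19, 19, 19, 20, 21])

mask = ages >= 21          # one True/False per element
print(mask.dtype)          # bool
print(mask.sum())          # 9  (True counts as 1)

at_least_21 = ages[mask]   # keeps rows where the mask is True
print(at_least_21.index.tolist())   # [7, 8, 9, 10, 12, 13, 15, 16, 21]
```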
%% Cell type:markdown id: tags:
## Exercise 1: What ages are exactly 18?
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Advanced Filtering
- `&` means `and`
- `|` means `or`
- `~` means `not`
- we must wrap each comparison in `()`, because `&` and `|` bind more tightly than comparisons like `>=`
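A minimal sketch of the three operators, using a made-up Series `temps` so the exercises below stay unsolved. The last part shows what happens when the parentheses are left out.
%% Cell type:code id: tags:
```python
import pandas as pd

temps = pd.Series([55, 62, 71, 68, 80])   # hypothetical data, not from the lecture

mild = temps[(temps >= 60) & (temps <= 70)]          # and
print(mild.tolist())        # [62, 68]

extreme = temps[(temps < 60) | (temps > 75)]         # or
print(extreme.tolist())     # [55, 80]

not_mild = temps[~((temps >= 60) & (temps <= 70))]   # not
print(not_mild.tolist())    # [55, 71, 80]

# Without parentheses, Python evaluates `60 & temps` first and then
# chains the comparisons, which raises a ValueError.
try:
    temps[temps >= 60 & temps <= 70]
    failed = False
except ValueError:
    failed = True
print(failed)               # True
```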
%% Cell type:markdown id: tags:
## Example 2: What ages are in the range 18 to 20, inclusive?
%% Cell type:code id: tags:
``` python
certain_students = ages[(ages >= 18) & (ages <= 20)]
certain_students
```
%% Cell type:markdown id: tags:
## Exercise 2: What percentage of students are in this age range?
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Exercise 3: What percentage of students are ages 18 OR 21?
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Exercise 4: What percentage of students are NOT 19?
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
#### One more thing....
We can perform an operation on all values in a Series
%% Cell type:markdown id: tags:
## Example 3: Add 1 to everyone's age
%% Cell type:code id: tags:
``` python
ages += 1
ages.value_counts()
```
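%% Cell type:markdown id: tags:
Note that `ages += 1` changes `ages` itself. Writing the operation as an expression returns a new Series and leaves the original alone. A small sketch with made-up quiz data (`points`, `percent`, and `curved` are illustrative names):
%% Cell type:code id: tags:
```python
import pandas as pd

points = pd.Series([12, 18, 15, 20])   # hypothetical quiz points out of 20

percent = points * 100 / 20   # applied to every element
print(percent.tolist())       # [60.0, 90.0, 75.0, 100.0]

curved = points + 2           # new Series; `points` is not modified
print(curved.tolist())        # [14, 20, 17, 22]
print(points.tolist())        # [12, 18, 15, 20]
```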
%% Cell type:markdown id: tags:
# Using a Series to store Pokemon stats
%% Cell type:code id: tags:
``` python
# Modified from https://automatetheboringstuff.com/chapter14/
import csv
def process_csv(filename):
    example_file = open(filename, encoding="utf-8")
    example_reader = csv.reader(example_file)
    example_data = list(example_reader)
    example_file.close()
    return example_data
data = process_csv("pokemon_stats.csv")
header = data[0]
print(len(data))
data = data[1:]
data[15:18]
```
%% Cell type:markdown id: tags:
## Example 4: Create a Series of all the Pokemon names
%% Cell type:code id: tags:
``` python
pokemon_list = [row[1] for row in data]
pokemon_names = pd.Series(pokemon_list)
pokemon_names
```
%% Cell type:markdown id: tags:
## Exercise 5: Create a Series of all the Pokemon HPs
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Exercise 6: Find the most common HP
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Exercise 7: Find how many Pokemon have that most common HP
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Exercise 8: How many Pokemon have HP between 50 and 75 (inclusive)?
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Example 5: What are the names of weak Pokemon (`< 30` HP)?
%% Cell type:code id: tags:
``` python
# `hps` is the Series of HPs you created in Exercise 5
weak_hps_idx = hps[hps < 30].index
pokemon_names[weak_hps_idx]
```
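%% Cell type:markdown id: tags:
This works because both Series were built from the same rows, so they share the same index. A self-contained sketch with made-up stand-in data (the names and HP values below are hypothetical, not taken from `pokemon_stats.csv`):
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical stand-ins for the names and HP Series.
names = pd.Series(["Aqua", "Blaze", "Clover", "Dusk"])
hp = pd.Series([45, 20, 60, 25])

weak_idx = hp[hp < 30].index          # index labels of the weak rows
print(weak_idx.tolist())              # [1, 3]
print(names[weak_idx].tolist())       # ['Blaze', 'Dusk']

# Because the indexes match, the Boolean mask can also filter `names` directly.
print(names[hp < 30].tolist())        # ['Blaze', 'Dusk']
```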
%% Cell type:markdown id: tags:
## Exercise 9: What are the names of the Pokemon from strongest to weakest (using HP)?
%% Cell type:code id: tags:
``` python
# write your code here
```
%% Cell type:markdown id: tags:
## Series from a `dict`
A Series is a cross between a list and a dict, so we can make a series from a dict as well
%% Cell type:code id: tags:
``` python
game1_points = pd.Series({"Chris": 10, "Kiara": 3, "Mikayla": 7, "Ann": 8, "Trish": 6})
print(game1_points)
```
%% Cell type:code id: tags:
``` python
game2_points = pd.Series({"Kiara": 7, "Chris": 3, "Trish": 11, "Mikayla": 2, "Ann": 5})
print(game2_points)
```
%% Cell type:markdown id: tags:
#### Pandas can perform operations on two series by matching up their indices
%% Cell type:code id: tags:
``` python
total = game1_points + game2_points
total
```
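%% Cell type:markdown id: tags:
Matching by index also means a label that appears in only one Series has nothing to pair with, and the result for it is `NaN`. A sketch with made-up data (`round1` and `round2` are hypothetical); `Series.add` with `fill_value=0` treats the missing side as 0 instead.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical scores: "Ann" only played round 1, "Cal" only played round 2.
round1 = pd.Series({"Ann": 4, "Ben": 6})
round2 = pd.Series({"Ben": 1, "Cal": 5})

summed = round1 + round2
print(summed["Ben"])            # 7.0
print(summed.isna().sum())      # 2  (Ann and Cal have no partner value)

filled = round1.add(round2, fill_value=0)   # missing entries count as 0
print(filled["Ann"], filled["Ben"], filled["Cal"])   # 4.0 7.0 5.0
```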
%% Cell type:markdown id: tags:
## Example 6: Who has the most points in total?
%% Cell type:code id: tags:
``` python
print(total.max())
print(total.idxmax())
```
%% Cell type:markdown id: tags:
#### We can use `[]` to index the name
%% Cell type:code id: tags:
``` python
total['Kiara']
```
%% Cell type:markdown id: tags:
#### We can also use `[]` with an integer position, but avoid this: it is deprecated and will not be available in future versions of Pandas
%% Cell type:code id: tags:
``` python
total[2]
```
%% Cell type:markdown id: tags:
#### We can select several elements at once by passing a list of index labels; this is slightly different from slicing
%% Cell type:code id: tags:
``` python
total[["Chris", "Trish"]]
```
%% Cell type:markdown id: tags:
### More plotting:
%% Cell type:code id: tags:
``` python
total_sorted = total.sort_values(ascending=False)
total_sorted
```
%% Cell type:code id: tags:
``` python
ax = total_sorted.plot.bar(color="green", fontsize=16)
ax.set_ylabel("total points", fontsize=16)
```
%% Cell type:markdown id: tags:
## More things to know about Series
Next, we'll get into more ways to access data using `loc` and `iloc`.
%% Cell type:code id: tags:
``` python
game1_points
```
%% Cell type:code id: tags:
``` python
game1_points.iloc[2] # looks up by integer position
```
%% Cell type:code id: tags:
``` python
game1_points.loc["Mikayla"] # looks up by pandas index
```
%% Cell type:code id: tags:
``` python
my_new_series = pd.Series({1: 89, 2: 104, 3: 681}) # this can be tricky!
my_new_series
```
%% Cell type:code id: tags:
``` python
my_new_series.iloc[1] # by integer position
```
%% Cell type:code id: tags:
``` python
my_new_series.loc[1] # by index
```
%% Cell type:code id: tags:
``` python
my_new_series[1] # by index!
```
%% Cell type:code id: tags:
``` python
my_new_series[my_new_series > 100] # ... and also boolean indexing!
```
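%% Cell type:markdown id: tags:
The learning objectives also mention slicing and `.items`. A minimal sketch, reusing the game 1 data under the illustrative name `pts`: slicing with `.iloc` excludes the end position (like a list), slicing with `.loc` includes the end label, and `.items()` yields `(index, value)` pairs (like a dict).
%% Cell type:code id: tags:
```python
import pandas as pd

pts = pd.Series({"Chris": 10, "Kiara": 3, "Mikayla": 7, "Ann": 8, "Trish": 6})

print(pts.iloc[1:3].tolist())            # [3, 7]     positions 1 and 2 only
print(pts.loc["Kiara":"Ann"].tolist())   # [3, 7, 8]  both end labels included

# .items() pairs each index label with its value.
pairs = []
for name, value in pts.items():
    pairs.append((name, value))
print(pairs[0])                          # ('Chris', 10)
```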
%% Cell type:markdown id: tags:
Feel overwhelmed? Do the required reading.
File added
name,year,type,speed,place
alice,2016,tornado,100,o
bob,2016,hurricane,200,p
cindy,2017,tornado,150,o
dan,2018,tornado,300,o
eve,2018,hurricane,250,a
\ No newline at end of file