Commit f690dca5 authored by Anna Meyer

p13

parent c4a6722e
%% Cell type:markdown id:115889c5 tags:
# Lab 13: Analyzing World Data with SQL
In this lab, you will practice how to:
* write SQL queries,
* create your own plots.
%% Cell type:markdown id:daed65a3 tags:
# Segment 1: Setup
### Task 1.1: Import the required modules
We will first import some important modules.
%% Cell type:code id:e59b7bdb tags:
``` python
# it is considered a good coding practice to place all import statements at the top of the notebook
# please place all your import statements in this cell if you need to import any more modules for this project
import sqlite3
import pandas as pd
import matplotlib
import math
import numpy as np # this is *only* for the function get_regression_coeff - do NOT use this module elsewhere
```
%% Cell type:code id:97a3f1e8 tags:
``` python
# this ensures that font.size setting remains uniform
%matplotlib inline
pd.set_option('display.max_colwidth', None)
matplotlib.rcParams["font.size"] = 13 # don't use value > 13! Otherwise your y-axis tick labels will be different.
```
%% Cell type:markdown id:75adca21 tags:
### Task 1.2: Use the `download` function to download `QSranking.json`
Warning: For the lab and the project, do **not** download the dataset `QSranking.json` manually (you **must** write Python code to download this, as in P12). When we run the autograder, this file `QSranking.json` will not be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. The Gradescope autograder will **deduct points** otherwise.
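The "check if the file already exists" requirement can be illustrated with a minimal sketch. This is not the P12 solution: the name `download_if_missing`, the use of `urllib.request`, and the boolean return value are all assumptions made for illustration, and your own `download` function may look quite different.

```python
import os
import urllib.request

def download_if_missing(url, filename):
    """Hypothetical helper: save the contents of 'url' to 'filename',
    but only if 'filename' does not already exist."""
    if os.path.exists(filename):
        # the path is already present, so skip the network request entirely
        return False
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)
    return True
```

Because the existence check comes first, running the cell a second time does not contact the server again.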
%% Cell type:code id:2bb742ed tags:
``` python
# copy the definition of your 'download' function from P12 here - remember to import the necessary modules
```
%% Cell type:code id:fe96e53b tags:
``` python
# use the 'download' function to download the data from the webpage
# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json'
# to the file 'QSranking.json'
download("https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json", "QSranking.json")
```
%% Cell type:markdown id:0023581a tags:
### Task 1.3: Create a database called 'rankings.db' out of 'QSranking.json'
You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database1_notes.ipynb).
%% Cell type:code id:270d8da5 tags:
``` python
# create a database called 'rankings.db' out of 'QSranking.json'
# TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
# TODO: connect to 'rankings.db' and save it to a variable called 'conn'
# write the contents of 'qs_ranking' to the table 'rankings' in the database
# we have done this one for you
qs_ranking.to_sql("rankings", conn, if_exists="replace", index=False)
```
%% Cell type:markdown id:84a77c79 tags:
### Task 1.4: Read all the rows in rankings (the database table)
You'll have to use pandas' `read_sql` function to make a query.
%% Cell type:code id:a300adde tags:
``` python
# compute and store the answer in the variable 'rankings', display its head
# remember to display ONLY the head and NOT the whole DataFrame
# replace the ... with your code
rankings = pd.read_sql("SELECT ... FROM ...", conn)
rankings.head()
```
%% Cell type:code id:3e4d16ee tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert len(rankings) == 1201
assert rankings.iloc[0]["country"] == "United States"
assert rankings.iloc[-1]["institution_name"] == "Wake Forest University"
```
%% Cell type:markdown id:7b09ee5a tags:
# Segment 2: SQL Practice
In practice, we are often interested in writing more specific queries about our data. For example, we might be interested in finding institutions in the *United States*, or data collected in the `year` *2018*, or both. With **SQL**, **WHERE** and **AND** clauses can help filter the data accordingly.
Before proceeding with this segment, it is **recommended** that you **review** the relevant lecture code:
* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database2_notes.ipynb) (Databases part 2)
and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database3_notes.ipynb) (Databases part 3)
%% Cell type:markdown id:9cebe083 tags:
### Task 2.1: Use WHERE to find institutions in the United States
* Write a query to select the rows from the database with *United States* as the `country`.
* Keep only the `institution_name` column.
* Save these institution names to a **list**.
**Hint:** You will need to use **quotes** (`'`) around the **strings** in your query and **backticks** (``` ` ```) around **column names** as in the example below. The **quotes** around strings are always **required**. The **backticks** are only **required** when the column name contains special characters or spaces, but it is a good idea to use them everywhere to be on the safe side.
%% Cell type:code id:64012949 tags:
``` python
# we have done this one for you
us_institutions_df = pd.read_sql("SELECT `institution_name` FROM rankings WHERE `country` = 'United States'", conn)
us_institutions = list(us_institutions_df['institution_name'])
```
%% Cell type:code id:c035f899 tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert "University Of Wisconsin-Madison" in us_institutions
assert "Tampere University" in list(rankings["institution_name"])
assert "Tampere University" not in us_institutions
```
%% Cell type:markdown id:9fe4da4e tags:
### Task 2.2: Add an AND clause to find institutions in the United States with an overall score of at least 70
* Copy your query from Task 2.1.
* Update it to only select rows with `overall_score` of **at least** *70*.
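To see how **AND** combines two conditions, here is a small sketch on a made-up table. The table `toy`, its columns, and its rows are hypothetical and have nothing to do with the real `rankings` data; the sketch uses the standard `sqlite3` module directly, but the same query string could presumably be passed to `pd.read_sql` as in Task 2.1.

```python
import sqlite3

# hypothetical toy table, unrelated to the real 'rankings' data
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy (name TEXT, country TEXT, score REAL)")
toy_conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [("A", "X", 90.0), ("B", "X", 60.0), ("C", "Y", 95.0), ("D", "X", 70.0)],
)

# a row is kept only if BOTH conditions hold; '>=' means "at least"
query = "SELECT `name` FROM toy WHERE `country` = 'X' AND `score` >= 70"
matching_names = sorted(row[0] for row in toy_conn.execute(query))
toy_conn.close()
print(matching_names)
```

Row "B" is dropped because its score is below 70, and row "C" is dropped because it belongs to a different country.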
%% Cell type:code id:12f341ad tags:
``` python
# compute and store the answer in the variable 'good_us_institutions', but do NOT display it
```
%% Cell type:code id:25e2d3cc tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert "Massachusetts Institute Of Technology" in good_us_institutions
assert "University Of Wisconsin-Madison" in good_us_institutions
assert "Wake Forest University" not in good_us_institutions
assert "University of Connecticut" not in good_us_institutions
```
%% Cell type:markdown id:cf715227 tags:
### Task 2.3: Use an ORDER BY clause to display the top 5 institutions by academic reputation in 2019
In addition to **WHERE** and **AND**, the **ORDER BY** keyword helps organize data even further. Much like the `sort_values()` method in `pandas`, the **ORDER BY** clause can be used to organize the result of the query in *increasing* (**ASC**) or *decreasing* (**DESC**) order based on a column's values.
* Write a new query to select rows in rankings where the `year` is *2019*.
* Use **ORDER BY** and **LIMIT** to select the top 5 rows with the **highest** `academic_reputation`.
* Save these institution names to a **list**.
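The combination of **ORDER BY ... DESC** and **LIMIT** can be sketched on a made-up table as follows. The table `toy` and its contents are hypothetical and chosen only so that the result is easy to check by hand.

```python
import sqlite3

# hypothetical toy table, unrelated to the real 'rankings' data
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy (name TEXT, year INTEGER, reputation REAL)")
toy_conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [("A", 2019, 50.0), ("B", 2019, 80.0), ("C", 2018, 99.0), ("D", 2019, 70.0)],
)

# filter first, then sort from highest to lowest, then keep the first 2 rows
query = """
    SELECT `name` FROM toy
    WHERE `year` = 2019
    ORDER BY `reputation` DESC
    LIMIT 2
"""
top_two = [row[0] for row in toy_conn.execute(query)]
toy_conn.close()
print(top_two)
```

Row "C" has the highest reputation overall but is excluded by the **WHERE** clause before any sorting happens.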
%% Cell type:code id:763304e0 tags:
``` python
# compute and store the answer in the variable 'top_5_institutions', then display it
```
%% Cell type:code id:404fa832 tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert len(top_5_institutions) == 5
assert top_5_institutions[0] == "Massachusetts Institute Of Technology"
assert top_5_institutions[-1] == "University Of Cambridge"
```
%% Cell type:markdown id:13e1803b tags:
### Task 2.4: Order by multiple columns
If you print out the resulting dataframe from your query, you might notice that all 5 rows have the same academic reputation. This makes it hard to compare the universities, so we will add some **tiebreaking** rules. If two universities have the same `academic_reputation`, then we should order them by their `citations_per_faculty` (with the **highest** appearing first). You can do this by ordering by multiple columns.
* Copy your query from Task 2.3.
* Update the **ORDER BY** clause to add this tiebreaking behavior.
* Save these institution names to a **list**.
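Tiebreaking with several **ORDER BY** columns can be illustrated on a made-up table. The table `toy` and its column names are hypothetical; the point is only that the second column is consulted when the first column ties.

```python
import sqlite3

# hypothetical toy table, unrelated to the real 'rankings' data
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy (name TEXT, reputation REAL, citations REAL)")
toy_conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [("A", 100.0, 10.0), ("B", 100.0, 30.0), ("C", 90.0, 99.0), ("D", 100.0, 20.0)],
)

# sort by reputation first; rows with equal reputation are then sorted by citations
query = "SELECT `name` FROM toy ORDER BY `reputation` DESC, `citations` DESC"
ordered_names = [row[0] for row in toy_conn.execute(query)]
toy_conn.close()
print(ordered_names)
```

Row "C" stays last despite having the most citations, because the tiebreaking column only matters among rows whose first sort key is equal.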
%% Cell type:code id:26f5a433 tags:
``` python
# compute and store the answer in the variable 'top_5_with_tiebreak', then display it
```
%% Cell type:code id:c5b2382b tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert top_5_with_tiebreak[0] == "University Of California, Berkeley"
assert top_5_with_tiebreak[-1] == "University Of California, Los Angeles"
```
%% Cell type:markdown id:9b991dcf tags:
### Task 2.5: Use the GROUP BY clause and the SUM aggregate function to get the total number of international_students for each country in 2019
The **GROUP BY** keyword groups rows that have the same value. It is often used with aggregate functions, such as **COUNT**, **SUM**, **AVG**, etc. to obtain a summary about groups in the data.
For example, to answer the question "What is the average rank of each country's institutions?", we could **GROUP BY** the `country` and use the **AVG** aggregate function to get the average rank of each country.
* Write a new query that uses **GROUP BY** and **SUM** to get the total number of international students in each country, using **WHERE** to filter by the `year`.
* Save the resulting **DataFrame** with **two** columns: `country` and the **sum** of the `international_students` for that country.
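A minimal sketch of **WHERE** followed by **GROUP BY** and **SUM**, again on a made-up table. The table `toy`, its columns, and its values are hypothetical.

```python
import sqlite3

# hypothetical toy table, unrelated to the real 'rankings' data
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy (country TEXT, year INTEGER, students REAL)")
toy_conn.executemany(
    "INSERT INTO toy VALUES (?, ?, ?)",
    [("X", 2019, 1.5), ("X", 2019, 2.5), ("Y", 2019, 3.0), ("X", 2018, 100.0)],
)

# WHERE drops the 2018 row first; the remaining rows are grouped by country
# and SUM adds up the 'students' values inside each group
query = """
    SELECT `country`, SUM(`students`) FROM toy
    WHERE `year` = 2019
    GROUP BY `country`
"""
totals = dict(toy_conn.execute(query).fetchall())
toy_conn.close()
print(totals)
```

Each country appears exactly once in the result, and the 2018 row does not contribute to the total for "X".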
%% Cell type:code id:f31786c4 tags:
``` python
# compute and store the answer in the variable 'inter_students_by_country', then display its head
```
%% Cell type:code id:9c84f12c tags:
``` python
# run this cell to confirm that your variable has been defined properly
# .iloc[0, 1] picks the first matching row and the second column (the sum) by position
assert math.isclose(inter_students_by_country[inter_students_by_country["country"] == "Japan"].iloc[0, 1], 280.9)
assert math.isclose(inter_students_by_country[inter_students_by_country["country"] == "Australia"].iloc[0, 1], 1895.5)
assert math.isclose(inter_students_by_country[inter_students_by_country["country"] == "United States"].iloc[0, 1], 3675.0)
```
%% Cell type:markdown id:06ecba29 tags:
### Task 2.6: Use the AS keyword to rename the new column from Task 2.5 to total_international_students
Although the dataframe does have a column for the sum of international students for each country, the name of the column looks strange:
```sql
SUM(`international_students`)
```
In SQL, the **AS** keyword allows us to create a simpler alias for the columns we create with our queries to make the resulting **DataFrame** easier to understand.
* Paste your query from Task 2.5 and modify it so the **SUM** column has the name `total_international_students`.
* Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
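The effect of **AS** on the column name can be checked with a short sketch on a made-up table. The table `toy` and the alias `total_students` are hypothetical; with plain `sqlite3`, the column names of a result are available through the cursor's `description` attribute.

```python
import sqlite3

# hypothetical toy table, unrelated to the real 'rankings' data
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy (country TEXT, students REAL)")
toy_conn.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [("X", 1.5), ("X", 2.5), ("Y", 3.0)],
)

# AS gives the aggregated column a readable name in the result
query = """
    SELECT `country`, SUM(`students`) AS `total_students` FROM toy
    GROUP BY `country`
"""
cursor = toy_conn.execute(query)
column_names = [col[0] for col in cursor.description]
rows = sorted(cursor.fetchall())
toy_conn.close()
print(column_names)
```

The values in the result are unchanged by the alias; only the label of the second column differs.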
%% Cell type:code id:3947be0d tags:
``` python
# compute and store the answer in the variable 'inter_students_by_country_renamed', then display its head
```
%% Cell type:code id:9e114959 tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert "total_international_students" in inter_students_by_country_renamed.columns
# .iloc[0] extracts the single matching value from the filtered column
assert math.isclose(inter_students_by_country_renamed[inter_students_by_country_renamed["country"] == "Japan"]["total_international_students"].iloc[0], 280.9)
assert math.isclose(inter_students_by_country_renamed[inter_students_by_country_renamed["country"] == "Australia"]["total_international_students"].iloc[0], 1895.5)
assert math.isclose(inter_students_by_country_renamed[inter_students_by_country_renamed["country"] == "United States"]["total_international_students"].iloc[0], 3675.0)
```
%% Cell type:markdown id:79fdda0c tags:
### Task 2.7: Use the HAVING keyword to only keep countries with more than 1000 international students
In addition to **WHERE**, the **HAVING** keyword is useful for filtering **GROUP BY** queries. Whereas **WHERE** filters individual rows before they are grouped, **HAVING** filters whole groups after the aggregate has been computed.
* Paste your query from Task 2.6 and modify it so that it returns only the countries (`country`) whose `total_international_students` is **more than** *1000*.
* Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
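The difference between filtering rows and filtering groups can be sketched on a made-up table. The table `toy` and its values are hypothetical; the threshold of 1000 is reused only so that the "more than" boundary is visible.

```python
import sqlite3

# hypothetical toy table, unrelated to the real 'rankings' data
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE toy (country TEXT, students REAL)")
toy_conn.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [("X", 600.0), ("X", 500.0), ("Y", 900.0), ("Z", 1000.0)],
)

# HAVING is evaluated after grouping, so it can test the aggregated value;
# no single row exceeds 1000, but the group for 'X' does
query = """
    SELECT `country`, SUM(`students`) AS `total_students` FROM toy
    GROUP BY `country`
    HAVING SUM(`students`) > 1000
"""
big_groups = toy_conn.execute(query).fetchall()
toy_conn.close()
print(big_groups)
```

Group "Z" totals exactly 1000 and is therefore excluded, since the condition is strictly "more than".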
%% Cell type:code id:8bc00cf4 tags:
``` python
# compute and store the answer in the variable 'inter_students_by_country_more_than_1000', then display it
```
%% Cell type:code id:a1c5be56 tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert len(inter_students_by_country_more_than_1000) == 4
assert "Australia" in list(inter_students_by_country_more_than_1000["country"])
assert "Germany" in list(inter_students_by_country_more_than_1000["country"])
assert "United Kingdom" in list(inter_students_by_country_more_than_1000["country"])
assert "United States" in list(inter_students_by_country_more_than_1000["country"])
```
%% Cell type:markdown id:d83309db tags:
# Segment 3: Plotting
SQL provides powerful tools to manipulate and organize data. Now we might be interested in plotting the data to engage in data exploration and visualize our results.
Before starting this segment, it is recommended that you go through the relevant lecture code:
* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb) (Bar and scatter plots) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/sum23/lecture_materials/23_Plotting2) (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)
%% Cell type:markdown id:d27b7c2c tags:
### Task 3.1: Use a bar plot to plot the data from Task 2.7
Your plot should look like this:
<div><img src="attachment:bar_plot.png" width="400"/></div>
Make sure that the plot is labelled exactly as in the image here.
%% Cell type:code id:5e4dc5d2 tags:
``` python
# instead of specifically plotting just the DataFrame 'inter_students_by_country_more_than_1000',
# create a general function to create bar plots
def bar_plot(df, x, y):
    """bar_plot(df, x, y) takes in a DataFrame 'df' and displays
    a bar plot with the column 'x' as the x-axis, and the column
    'y' as the y-axis"""
    pass # replace with your code
    # TODO: set dataframe index to 'x'
    # TODO: use df.plot.bar to plot the data in black with no legend
    # TODO: set x as the x label
    # TODO: set y as the y label
```
%% Cell type:code id:e21ed94a tags:
``` python
# run this cell to plot the data from Task 2.7
# verify that this plot matches exactly with the image shown above
bar_plot(inter_students_by_country_more_than_1000, 'country', 'total_international_students')
```
%% Cell type:markdown id:0adf3bdd tags: %% Cell type:markdown id:0adf3bdd tags:
### Task 3.2: Use a scatter plot to plot the relationship between employer_reputation and academic_reputation in 2019 ### Task 3.2: Use a scatter plot to plot the relationship between employer_reputation and academic_reputation in 2019
Your plot should look like this: Your plot should look like this:
<div><img src="attachment:scatter_plot.png" width="500"/></div> <div><img src="attachment:scatter_plot.png" width="500"/></div>
Make sure that the plot is labelled exactly as in the image here. Make sure that the plot is labelled exactly as in the image here.
%% Cell type:code id:8eb6036d tags: %% Cell type:code id:8eb6036d tags:
``` python ``` python
# create a general function to create scatter plots # create a general function to create scatter plots
def scatter_plot(df, x, y): def scatter_plot(df, x, y):
"""scatter_plot(df, x, y) takes in a DataFrame 'df' and displays """scatter_plot(df, x, y) takes in a DataFrame 'df' and displays
a scatter plot with the column 'x' as the x-axis, and the column a scatter plot with the column 'x' as the x-axis, and the column
'y' as the y-axis""" 'y' as the y-axis"""
pass # replace with your code pass # replace with your code
# TODO: use df.plot.scatter to plot the data in black with no legend # TODO: use df.plot.scatter to plot the data in black with no legend
# TODO: set x as the x label # TODO: set x as the x label
# TODO: set y as the y label # TODO: set y as the y label
``` ```
%% Cell type:markdown id:d77b0f09 tags: %% Cell type:markdown id:d77b0f09 tags:
With the `scatter_plot` function defined, you are ready to create the required plot. With the `scatter_plot` function defined, you are ready to create the required plot.
* Write a SQL query to select rows from the database where the `year` is *2019*. * Write a SQL query to select rows from the database where the `year` is *2019*.
* Save the resulting **DataFrame** with **two** columns: `employer_reputation` and `academic_reputation`. * Save the resulting **DataFrame** with **two** columns: `employer_reputation` and `academic_reputation`.
* Call `scatter_plot`, passing in `employer_reputation` and `academic_reputation` as the `x` and `y` arguments. * Call `scatter_plot`, passing in `employer_reputation` and `academic_reputation` as the `x` and `y` arguments.
%% Cell type:code id:2ef617ff tags:
``` python
# first compute and store the DataFrame
# then create the scatter plot using the DataFrame
# verify that this plot matches exactly with the image shown above
```
%% Cell type:markdown id:d144417b tags:
### Task 3.3: Make a Horizontal Bar plot of average employer_reputation and average faculty_student_score across all years
Your plot should look like this:
<div><img src="attachment:horizontal_bar_plot.png" width="600"/></div>
Make sure that the plot is labelled exactly as in the image here.
%% Cell type:code id:78e21b0b tags:
``` python
# we have done this one for you
def horizontal_bar_plot(df, x):
    """horizontal_bar_plot(df, x) takes in a DataFrame 'df' and displays
    a horizontal bar plot with the column 'x' as the x-axis, and all
    other columns of 'df' on the y-axis"""
    df = df.set_index(x)
    ax = df.plot.barh()
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.9))
```
%% Cell type:markdown id:7cbdaa9f tags:
Use the `horizontal_bar_plot` function to create the required plot.
* Write a SQL query to select `year`, **average** `employer_reputation`, and **average** `faculty_student_score` grouped by `year`.
* Save the resulting **DataFrame** with **three** columns: `year`, the **average** of the `employer_reputation` and the **average** of the `faculty_student_score`.
* Call `horizontal_bar_plot`, passing in `year` as the `x` argument.
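The grouping step can be sketched as follows, again with the standard-library `sqlite3` module and invented rows. The aliases `avg_employer_reputation` and `avg_faculty_student_score` are placeholders chosen for this example; the column names of your DataFrame end up as the legend labels, so choose aliases that match the expected image.

```python
import sqlite3

# in-memory database with invented rows (an assumption for this demo only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE rankings (year INTEGER, employer_reputation REAL, faculty_student_score REAL)"
)
demo_conn.executemany(
    "INSERT INTO rankings VALUES (?, ?, ?)",
    [(2018, 60.0, 80.0), (2018, 40.0, 90.0), (2019, 50.0, 70.0)],
)

# one output row per year; each AVG is computed within its group
query = """
SELECT year,
       AVG(employer_reputation) AS avg_employer_reputation,
       AVG(faculty_student_score) AS avg_faculty_student_score
FROM rankings
GROUP BY year
ORDER BY year
"""
yearly_averages = demo_conn.execute(query).fetchall()
print(yearly_averages)
demo_conn.close()
```

With the rows above, 2018 averages to 50.0 and 85.0, while 2019 has a single row and keeps its values.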
%% Cell type:code id:bc779e0b tags:
``` python
# first compute and store the DataFrame
# then create the horizontal bar plot using the DataFrame
# verify that this plot matches exactly with the image shown above
```
%% Cell type:markdown id:aaeeebe7 tags:
### Task 3.4: Display a Pie Chart of the average overall score of the top 10 countries in descending order
Your plot should look like this:
<div><img src="attachment:pie_plot.png" width="400"/></div>
Make sure that the plot is labelled exactly as in the image here.
%% Cell type:code id:aedb58d2 tags:
``` python
# we have done this one for you
def pie_plot(df, x, y, title=None):
    """pie_plot(df, x, y, title) takes in a DataFrame 'df' and displays
    a pie plot with the column 'x' as the x-axis, the (numeric) column
    'y' as the y-axis, and the 'title' as the title of the plot"""
    df = df.set_index(x)
    ax = df.plot.pie(y=y, legend=False)
    ax.set_ylabel(None)
    ax.set_title(title)
```
%% Cell type:markdown id:805c89c1 tags:
Use the `pie_plot` function to create the required plot.
* Write a SQL query to select the **top** *10* countries based on **average** `overall_score`.
* Save the resulting **DataFrame** with **two** columns: `country`, and the **average** of the `overall_score`.
* Call `pie_plot`, passing in `country` as the `x` argument, and the **average** of the `overall_score` as the `y` argument.
* Your plot must also have the **title** `Countries with top 10 overall scores` as in the image.
**Hint:** If you are having trouble writing the SQL query, take a look at Task 2.3
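A "top N by average" query combines three pieces: `GROUP BY` to form the groups, `ORDER BY ... DESC` to rank them, and `LIMIT` to keep the first N. The sketch below shows the pattern with `LIMIT 2` on invented data; the country labels and the alias `avg_score` are made up for the example.

```python
import sqlite3

# in-memory database with invented rows (an assumption for this demo only)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE rankings (country TEXT, overall_score REAL)")
demo_conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [("A", 90.0), ("A", 80.0), ("B", 95.0), ("C", 60.0), ("C", 70.0)],
)

# average per country, highest average first, keep only the top 2
query = """
SELECT country, AVG(overall_score) AS avg_score
FROM rankings
GROUP BY country
ORDER BY avg_score DESC
LIMIT 2
"""
top_countries = demo_conn.execute(query).fetchall()
print(top_countries)
demo_conn.close()
```

Country C, with an average of 65.0, is ranked last and dropped by the `LIMIT`.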
%% Cell type:code id:777d3b49 tags:
``` python
# first compute and store the DataFrame
# then create the pie plot using the DataFrame
# verify that this plot matches exactly with the image shown above
```
%% Cell type:markdown id:de3777de tags:
### Task 3.5: Fit a regression line to the data from Task 3.2
Your line of best fit should look like this:
<div><img src="attachment:regression_line_plot.png" width="500"/></div>
Make sure that the plot is labelled exactly as in the image here.
%% Cell type:code id:68941bde tags:
``` python
# we have defined this function for you
def get_regression_coeff(df, x, y):
    """get_regression_coeff(df, x, y) takes in a DataFrame 'df' and returns
    the slope (m) and the y-intercept (b) of the line of best fit in the
    plot with the column 'x' as the x-axis, and the column 'y' as the y-axis"""
    # constant column so that the fit includes an intercept term
    # (note: this adds a column named "1" to the DataFrame passed in)
    df["1"] = 1
    # least-squares solution of df[y] ~ m * df[x] + b * 1
    res = np.linalg.lstsq(df[[x, "1"]], df[y], rcond=None)
    coefficients = res[0]
    m = coefficients[0]
    b = coefficients[1]
    return (m, b)
```
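For intuition about what `get_regression_coeff` returns: with a single predictor, the least-squares slope and intercept also have a closed form (slope = covariance of x and y divided by variance of x). The sketch below computes them in plain Python on made-up points. It illustrates the same idea rather than the lab's implementation, which uses `np.linalg.lstsq`, and the helper name `closed_form_fit` is hypothetical.

```python
def closed_form_fit(xs, ys):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((xv - mean_x) * (yv - mean_y) for xv, yv in zip(xs, ys))
    sxx = sum((xv - mean_x) ** 2 for xv in xs)
    m = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return (m, b)

# made-up points that lie exactly on y = 2x + 1
m, b = closed_form_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```

Because the sample points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1; with real, noisy data the line minimizes the sum of squared vertical distances instead.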
%% Cell type:code id:fb427287 tags:
``` python
# you must define this function to compute the best fit line
def get_regression_line(df, x, y):
    """get_regression_line(df, x, y) takes in a DataFrame 'df' and returns
    a DataFrame with an additional column "fit" of the line of best fit in the
    plot with the column 'x' as the x-axis, and the column 'y' as the y-axis"""
    pass # replace with your code
    # TODO: use the 'get_regression_coeff' function to get the slope and
    #       intercept of the line of best fit
    # TODO: save them into variables m and b respectively
    # TODO: create a new column in the dataframe called 'fit', which
    #       is calculated as df['fit'] = m * df[x] + b
    # TODO: return the DataFrame df
```
%% Cell type:code id:0a70404d tags:
``` python
# you must define this function to plot the best fit line on the scatter plot
def regression_line_plot(df, x, y):
    """regression_line_plot(df, x, y) takes in a DataFrame 'df' and displays
    a scatter plot with the column 'x' as the x-axis, and the column
    'y' as the y-axis, as well as the best fit line for the plot"""
    pass # replace with your code
    # TODO: use 'get_regression_line' to get the data for the best fit line.
    # TODO: use df.plot.scatter (not scatter_plot) to plot the x and y columns
    #       of 'df' in black color.
    # TODO: save the return value of df.plot.scatter to a variable called 'ax'
    # TODO: use df.plot.line to plot the fitted line in red,
    #       using ax=ax as a keyword argument.
    #       this ensures that both the scatter plot and line end up on the same plot
    #       pay careful attention to what the 'x' and 'y' arguments ought to be
```
%% Cell type:markdown id:ef4b46de tags:
Now, use the `regression_line_plot` function to create the required plot.
* Call `regression_line_plot` on your data from Task 3.2 to show the correlation between `employer_reputation` and `academic_reputation`.
%% Cell type:code id:065d0ef5 tags:
``` python
# create the scatter plot with the best fit line using the DataFrame from Task 3.2
# verify that this plot matches exactly with the image shown above
```
%% Cell type:markdown id:bdb5cdb7 tags:
### Task 4: Closing the connection
Now that you are done with your database, it is very important to close it.
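To see what closing does, here is a small sketch on a throwaway in-memory database (the name `demo_conn` and the table `t` are made up for the example). Once a `sqlite3` connection is closed, any further attempt to use it raises `sqlite3.ProgrammingError`, so closing belongs at the very end of the notebook, after the last query.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE t (v INTEGER)")
demo_conn.close()

# using a closed connection is an error, so no query may come after close()
try:
    demo_conn.execute("SELECT * FROM t")
    query_failed_after_close = False
except sqlite3.ProgrammingError:
    query_failed_after_close = True
print(query_failed_after_close)
```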
%% Cell type:code id:65557b40 tags:
``` python
# close your connection here
# we have done this one for you
conn.close()
```
%% Cell type:markdown id:0f20a99c tags:
### Congratulations, you are now ready to start P13!
......
%% Cell type:code id:5e33d91e tags:
``` python
import otter
# nb_name should be the name of your notebook without the .ipynb extension
nb_name = "p13"
py_filename = nb_name + ".py"
grader = otter.Notebook(nb_name + ".ipynb")
```
%% Cell type:code id:0611fe14 tags:
``` python
import p13_test
```
%% Cell type:code id:2bcd01a8 tags:
``` python
# PLEASE FILL IN THE DETAILS
# enter none if you don't have a project partner
# you will have to add your partner as a group member on Gradescope even after you fill this
# project: p13
# submitter: NETID1
# partner: NETID2
```
%% Cell type:markdown id:372ed345 tags:
# Project 13: World University Rankings
%% Cell type:markdown id:b30c2df0 tags:
## Learning Objectives:
In this project, you will demonstrate how to:
* query a database using SQL,
* process data using `pandas` **DataFrames**,
* create different types of plots.
Please go through [Lab 13](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/labs/lab13) before working on this project. The lab introduces some useful techniques related to this project.
%% Cell type:markdown id:479785c7 tags:
## Note on Academic Misconduct:
**IMPORTANT**: P12 and P13 are two parts of the same data analysis. You **cannot** switch project partners between these two projects. That is, if you partnered up with someone for P12, you must keep that partnership until the end of P13. Now may be a good time to review [our course policies](https://canvas.wisc.edu/courses/355767/pages/syllabus?module_item_id=6048035).
%% Cell type:markdown id:3e0e04f5 tags:
## Testing your code:
Along with this notebook, you must have downloaded the file `p13_test.py`. If you are curious about how we test your code, you can explore this file, and specifically the value of the variable `expected_json`, to understand the expected answers to the questions.
For answers involving DataFrames, `p13_test.py` compares your tables to those in `p13_expected.html`, so take a moment to open that file in a web browser (from Finder/Explorer).
For answers involving plots, `p13_test.py` can **only** check that the **DataFrames** are correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. Your plots will be **manually graded**, and you will **lose points** if your plot is not visible, or if it is not properly labelled.
**IMPORTANT Warning:** Do **not** download the dataset `QSranking.json` **manually**. Use the `download` function from P12 to download it. When we run the autograder, this file `QSranking.json` will **not** be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. Otherwise, you will **lose** points for **hardcoding**.
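The function to use here is your own `download` from P12. The sketch below only illustrates the "skip the download if the file already exists" check mentioned above. Its name `download_if_missing`, its `(filename, url)` signature, its return strings, and its use of the standard-library `urllib.request` are all assumptions made for this example, not the course's implementation.

```python
import os
import urllib.request

def download_if_missing(filename, url):
    """Hypothetical sketch: fetch 'url' into 'filename' unless the file exists."""
    # checking first avoids re-downloading every time the notebook is re-run
    if os.path.exists(filename):
        return filename + " already exists!"
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(filename, "wb") as f:
        f.write(data)
    return filename + " created!"
```

When the file is already present, the function returns before any network access, which is the behavior the warning above asks for.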
%% Cell type:markdown id:aad1951a tags:
## Project Description:
For your final CS220 project, you're going to continue analyzing world university rankings. However, we will be using a different dataset this time. The data for this project has been extracted from [here](https://www.topuniversities.com/university-rankings/world-university-rankings/2023). Unlike the CWUR rankings we used in P12, the QS rankings dataset has various scores for the universities, and not just the rankings. This makes the QS rankings dataset more suitable for plotting (which you will be doing a lot of!).
In this project, you'll have to dump your DataFrame to a SQLite database. You'll answer questions by doing queries on that database. Often, your answers will be in the form of a plot. Check these carefully, as the tests only verify the data behind each plot, not that the plot looks correct (we will manually deduct points on Gradescope for plotting mistakes).
%% Cell type:markdown id:48aad11e tags:
## Project Requirements:
You **may not** hardcode indices in your code. You **may not** manually download **any** files for this project, unless you are **explicitly** told to do so. For all other files, you **must** use the `download` function to download the files.
**Store** your final answer for each question in the **variable specified for each question**. This step is important because Otter grades your work by comparing the value of this variable against the correct answer.
For some of the questions, we'll ask you to write (then use) a function to compute the answer. If you compute the answer **without** creating the function we ask you to write, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.
Required Functions:
- `bar_plot`
- `scatter_plot`
- `horizontal_bar_plot`
- `pie_plot`
- `get_regression_coeff`
- `get_regression_line`
- `regression_line_plot`
- `download`
In this project, you will also be required to define certain **data structures**. If you do not create these data structures exactly as specified, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.
Required Data Structures:
- `conn`
You **must** write SQL queries to solve the questions in this project, unless you are **explicitly** told otherwise. You will **not get any credit** if you use `pandas` operations to extract data. We will give you **specific** instructions for any questions where `pandas` operations are allowed. In addition, you are also **required** to follow the requirements below:
* You **must** close the connection to `conn` at the end of your notebook.
* Do **not** use **absolute** paths such as `C://ms//cs220//p13`. You may **only** use **relative paths**.
* Do **not** hardcode `/` or `\` in any of your paths. You **must** use `os.path.join` to create paths.
* Do **not** leave irrelevant output or test code that we didn't ask for.
* **Avoid** calling **slow** functions multiple times within a loop.
* Do **not** define multiple functions with the same name or define multiple versions of one function with different names. Just keep the best version.
For more details on what will cause you to lose points during code review and specific requirements, please take a look at the [Grading rubric](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/rubric.md).
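As a quick illustration of the path requirement in the list above, `os.path.join` inserts the separator that is correct for the operating system the code runs on, so no separator needs to be typed by hand. The `data` directory in this sketch is hypothetical; the project does not say the file lives in a subdirectory.

```python
import os

# build a relative path without hardcoding a separator
# ("data" is a made-up directory name used only for this example)
json_path = os.path.join("data", "QSranking.json")
print(json_path)
```

The result is still a relative path, which also satisfies the rule against absolute paths.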
%% Cell type:markdown id:e04f805e tags:
## Questions and Functions:
Let us start by importing all the modules we will need for this project.
%% Cell type:code id:b1363e20 tags:
``` python
# it is considered a good coding practice to place all import statements at the top of the notebook
# please place all your import statements in this cell if you need to import any more modules for this project
```
%% Cell type:markdown id:995a9ea8 tags:
Now, you may copy/paste some of the functions and data structures you defined in Lab 13 and P12, which will be useful for this project.
%% Cell type:code id:a4fab7ea tags:
``` python
# this ensures that font.size setting remains uniform
%matplotlib inline
pd.set_option('display.max_colwidth', None)
matplotlib.rcParams["font.size"] = 13 # don't use value > 13! Otherwise your y-axis tick labels will be different.
```
%% Cell type:code id:e4eac640 tags:
``` python
# copy/paste the definition of the function 'bar_plot' from lab-p13 here
```
%% Cell type:code id:71c71935 tags:
``` python
# copy/paste the definition of the function 'scatter_plot' from lab-p13 here
```
%% Cell type:code id:153b23ad tags:
``` python
# copy/paste the definition of the function 'horizontal_bar_plot' from lab-p13 here
```
%% Cell type:code id:1f6d37df tags:
``` python
# copy/paste the definition of the function 'pie_plot' from lab-p13 here
```
%% Cell type:code id:88255766 tags:
``` python
# copy/paste the definition of the function 'get_regression_coeff' from lab-p13 here
```
%% Cell type:code id:8119a0ec tags:
``` python
# copy/paste the definition of the function 'get_regression_line' from lab-p13 here
```
%% Cell type:code id:13851f7d tags:
``` python
# copy/paste the definition of the function 'regression_line_plot' from lab-p13 here
```
%% Cell type:code id:c12776a3 tags:
``` python
# copy/paste the definition of the function 'download' from p12 here
```
%% Cell type:code id:f4fbd661 tags:
``` python
# use the 'download' function to download the data from the webpage
# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json'
# to the file 'QSranking.json'
```
%% Cell type:markdown id:40f76941 tags:
### Data Structure 1: `conn`
You **must** now create a **database** called `rankings.db` out of `QSranking.json`, connect to it, and save it in a variable called `conn`. You **must** use this connection to the database `rankings.db` to answer the questions that follow.
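To show the underlying idea of turning JSON records into a queryable table, here is a sketch that uses only the standard library. It is a deliberately different technique from what the project asks for: in the project you load the file with `pandas` and write the resulting DataFrame to the database. The two records, the reduced set of columns, and the name `demo_conn` are invented for the example.

```python
import json
import sqlite3

# invented records standing in for the contents of a JSON file
records = json.loads(
    '[{"institution_name": "U1", "overall_score": 90.5},'
    ' {"institution_name": "U2", "overall_score": 71.0}]'
)

# the project connects to 'rankings.db'; an in-memory database is used here
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE rankings (institution_name TEXT, overall_score REAL)")
# named placeholders pull the matching key out of each record
demo_conn.executemany(
    "INSERT INTO rankings VALUES (:institution_name, :overall_score)", records
)
demo_conn.commit()

row_count = demo_conn.execute("SELECT COUNT(*) FROM rankings").fetchone()[0]
print(row_count)
demo_conn.close()
```

Counting the rows right after loading is a cheap sanity check that every record made it into the table.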
%% Cell type:code id:8de4b158 tags: %% Cell type:code id:8de4b158 tags:
``` python ``` python
# create a database called 'rankings.db' out of 'QSranking.json' # create a database called 'rankings.db' out of 'QSranking.json'
# TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function # TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
# TODO: connect to 'rankings.db' and save it to a variable called 'conn' # TODO: connect to 'rankings.db' and save it to a variable called 'conn'
# TODO: write the contents of the DataFrame 'qs_ranking' to the sqlite database # TODO: write the contents of the DataFrame 'qs_ranking' to the sqlite database
``` ```
%% Cell type:code id:9f28e183 tags: %% Cell type:code id:9f28e183 tags:
``` python ``` python
# run this cell and confirm that you have defined the variables correctly # run this cell and confirm that you have defined the variables correctly
pd.read_sql("SELECT * FROM rankings LIMIT 5", conn) pd.read_sql("SELECT * FROM rankings LIMIT 5", conn)
``` ```
%% Cell type:markdown id:d31f5dd9 tags:
**Question 1:** List **all** the statistics of the institution with the `institution_name` *University Of Wisconsin-Madison*.
You **must** display **all** the columns. The rows **must** be in *ascending* order of `year`.
Your output **must** be a **DataFrame** that looks like this:
||**rank**|**year**|**institution_name**|**country**|**academic_reputation**|**employer_reputation**|**faculty_student_score**|**citations_per_faculty**|**international_faculty**|**international_students**|**overall_score**|
|---|---|---|---|---|---|---|---|---|---|---|---|
|**0**|55|2018|University Of Wisconsin-Madison|United States|94.0|62.1|84.0|54.2|53.2|30.9|75.8|
|**1**|53|2019|University Of Wisconsin-Madison|United States|88.5|51.2|87.4|52.6|58.8|30.6|73.2|
|**2**|56|2020|University Of Wisconsin-Madison|United States|87.8|49.7|85.5|50.0|57.2|30.9|71.8|
%% Cell type:code id:8eefb54f tags:
``` python
# compute and store the answer in the variable 'uw_rating', then display it
```
%% Cell type:code id:6a51b275 tags:
``` python
grader.check("q1")
```
%% Cell type:markdown id:587fd6d2 tags:
**Question 2:** What are the **top** *10* institutions in *Japan* which had the **highest** score of `international_students` in the `year` *2020*?
You **must** display the columns `institution_name` and `international_students`. The rows **must** be in *descending* order of `international_students`.
Your output **must** be a **DataFrame** that looks like this:
||**institution_name**|**international_students**|
|---------|------|---------|
|**0**|Waseda University|35.8|
|**1**|Tokyo Institute Of Technology|31.3|
|**2**|University Of Tsukuba|30.4|
|**3**|The University of Tokyo|26.2|
|**4**|Kyushu University|21.5|
|**5**|Nagoya University|21.3|
|**6**|Tohoku University|17.6|
|**7**|Kyoto University|17.5|
|**8**|Hiroshima University|17.1|
|**9**|Tokyo Medical and Dental University|16.7|
%% Cell type:code id:b72f2999 tags:
``` python
# compute and store the answer in the variable 'japan_top_10_inter', then display it
```
%% Cell type:code id:f06aaae0 tags:
``` python
grader.check("q2")
```
%% Cell type:markdown id:341ac4b8 tags:
**Question 3:** What are the **top** *10* institutions in the *United States* which had the **highest** *reputation* in the `year` *2019*?
The `reputation` of an institution is defined as the sum of `academic_reputation` and `employer_reputation`. You **must** display the columns `institution_name` and `reputation`. The rows **must** be in *descending* order of `reputation`. In case the `reputation` is tied, the rows must be in *alphabetical* order of `institution_name`.
Your output **must** be a **DataFrame** that looks like this:
||**institution_name**|**reputation**|
|---------|------|---------|
|**0**|Harvard University|200.0|
|**1**|Massachusetts Institute Of Technology|200.0|
|**2**|Stanford University|200.0|
|**3**|University Of California, Berkeley|199.8|
|**4**|Yale University|199.6|
|**5**|University Of California, Los Angeles|199.1|
|**6**|Columbia University|197.1|
|**7**|Princeton University|196.6|
|**8**|University Of Chicago|190.3|
|**9**|Cornell University|189.2|
**Hint:** You can use mathematical expressions in your **SELECT** clause. For example, if you wish to add the `academic_reputation` and `employer_reputation` for each institution, you could use the following query:
```sql
SELECT (`academic_reputation` + `employer_reputation`) FROM rankings
```
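To see how a computed expression can be named and then used for sorting with a tie-breaker, here is a small sketch on toy data. The table `demo`, its columns `name`, `a`, and `b`, the alias `total`, and the connection `demo_conn` are hypothetical and chosen only for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table; "Alpha" and "Beta" tie on a + b.
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "name": ["Beta", "Alpha", "Gamma"],
    "a": [50.0, 60.0, 40.0],
    "b": [50.0, 40.0, 30.0],
}).to_sql("demo", demo_conn, index=False)

# AS gives the computed expression a column name; ORDER BY can then list
# several sort keys, with later keys only breaking ties in earlier ones.
query = """
    SELECT `name`, (`a` + `b`) AS `total`
    FROM demo
    ORDER BY `total` DESC, `name` ASC
"""
totals = pd.read_sql(query, demo_conn)
demo_conn.close()
```

Here the two rows with a `total` of 100.0 come out in alphabetical order of `name`, followed by the row with the lower total.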
%% Cell type:code id:271b86d7 tags:
``` python
# compute and store the answer in the variable 'us_top_10_rep', then display it
```
%% Cell type:code id:96cacdd4 tags:
``` python
grader.check("q3")
```
%% Cell type:markdown id:21ba8c82 tags:
**Question 4:** What are the **top** *10* countries which had the **most** *institutions* listed in the `year` *2020*?
You **must** display the columns `country` and `num_of_institutions`. The `num_of_institutions` of a country is defined as the number of institutions from that country. The rows **must** be in *descending* order of `num_of_institutions`. In case the `num_of_institutions` is tied, the rows must be in *alphabetical* order of `country`.
**Hint:** You **must** use the `COUNT` SQL function to answer this question.
Your output **must** be a **DataFrame** that looks like this:
||**country**|**num_of_institutions**|
|---------|------|---------|
|**0**|United States|74|
|**1**|United Kingdom|45|
|**2**|Germany|23|
|**3**|Australia|21|
|**4**|Canada|14|
|**5**|China|14|
|**6**|France|14|
|**7**|Japan|14|
|**8**|Netherlands|13|
|**9**|Russia|13|
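The `COUNT` hint above is typically combined with `GROUP BY`, which forms one output row per group. The sketch below shows that pattern on a toy table; the table `demo`, its column `grp`, the alias `n`, and the connection `demo_conn` are assumptions for illustration only.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table with one row per item and a grouping column.
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "grp": ["x", "y", "x", "z", "y", "x"],
}).to_sql("demo", demo_conn, index=False)

# GROUP BY collapses the rows of each group into one output row;
# COUNT(*) reports how many rows fell into that group.
# LIMIT keeps only the first rows after sorting.
query = """
    SELECT `grp`, COUNT(*) AS `n`
    FROM demo
    GROUP BY `grp`
    ORDER BY `n` DESC, `grp` ASC
    LIMIT 2
"""
group_counts = pd.read_sql(query, demo_conn)
demo_conn.close()
```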
%% Cell type:code id:1991dc45 tags:
``` python
# compute and store the answer in the variable 'top_10_countries', then display it
```
%% Cell type:code id:3e878347 tags:
``` python
grader.check("q4")
```
%% Cell type:markdown id:6ef62b90 tags:
**Question 5:** Create a **bar plot** using the data from Question 4 with the `country` on the **x-axis** and the `num_of_institutions` on the **y-axis**.
In addition to the top ten countries, you **must** also aggregate the data for **all** the **other** countries, and represent that number in the column `Other`. You are **allowed** to do this using any combination of SQL queries and pandas operations.
You **must** first compute a **DataFrame** `num_institutions` containing the **country** and the **num_of_institutions** data.
**Hint**: You can use the `append` function of a DataFrame to add a single row to the end of your **DataFrame** from Question 4. You'll also need the keyword argument `ignore_index=True`. For example:
```python
my_new_dataframe = my_dataframe.append({"country": "CS220", "num_of_institutions": 22}, ignore_index=True)
```
will create a *new* **DataFrame** `my_new_dataframe` which contains all the rows from `my_dataframe`, along with the **additional row** which has been appended. You can **ignore** any warnings about `append` being deprecated.
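Note that `DataFrame.append` was removed in pandas 2.0, so on newer installations the call above raises an `AttributeError` rather than a warning. A possible alternative, sketched here on toy data, is `pd.concat`. The DataFrame `counts`, the cutoff `top_n`, and the other variable names are hypothetical; this is one way to add the extra row, not necessarily the approach the course staff intend.

```python
import pandas as pd

# Hypothetical toy counts, already sorted in descending order.
counts = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "num_of_institutions": [5, 3, 2, 1],
})

top_n = 2  # toy cutoff; the lab itself asks for the top 10

# Rows up to the cutoff are kept as they are.
top = counts.iloc[:top_n]

# Everything after the cutoff is summed into a single number.
other_total = counts.iloc[top_n:]["num_of_institutions"].sum()

# Wrap the new row in a one-row DataFrame so it can be concatenated.
other_row = pd.DataFrame([{"country": "Other", "num_of_institutions": other_total}])

# ignore_index=True renumbers the rows 0, 1, 2, ... in the result.
combined = pd.concat([top, other_row], ignore_index=True)
```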
%% Cell type:code id:a0b3223c tags:
``` python
# first compute and store the DataFrame 'num_institutions', then display it
# do NOT plot just yet
# TODO: use a SQL query similar to Question 4 to get the number of institutions of all countries
# (not just the top 10), ordered by the number of institutions, and store in a DataFrame
# TODO: Use pandas to find the sum of the institutions in all countries except the top 10
# TODO: create a new dictionary with the data about the new row that needs to be added
# TODO: properly append this new dictionary to 'num_institutions' and update 'num_institutions'
```
%% Cell type:code id:c95611c9 tags:
``` python
grader.check("q5")
```
%% Cell type:markdown id:51a82c7e tags:
Now, **plot** `num_institutions` as a **bar plot** with the **x-axis** labelled *country* and the **y-axis** labelled *num_of_institutions*.
You **must** use the `bar_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:b7e7e295 tags:
<div><img src="attachment:q5.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:4cd92732 tags:
``` python
# create the bar plot using the DataFrame 'num_institutions' with the x-axis labelled "country"
# and the y-axis labelled "num_of_institutions"
```
%% Cell type:markdown id:6617e42c tags:
**Question 6:** Create a **bar plot** of the **top** *10* countries with the **highest** *total* `overall_score` listed in the `year` *2019*.
The `total_score` of a `country` is defined as the **sum** of `overall_score` of **all** institutions in that `country`. You **must** display the columns `country` and `total_score`. The rows **must** be in *descending* order of `total_score`.
You **must** first compute a **DataFrame** `top_10_total_score` containing the **country** and the **total_score** data.
Your **DataFrame** should look like this:
||**country**|**total_score**|
|---------|------|---------|
|**0**|United States|4298.4|
|**1**|United Kingdom|2539.2|
|**2**|Germany|1098.2|
|**3**|Australia|1093.8|
|**4**|Japan|752.9|
|**5**|China|743.4|
|**6**|Canada|705.3|
|**7**|Netherlands|674.9|
|**8**|South Korea|612.8|
|**9**|France|595.2|
%% Cell type:code id:f7cf3887 tags:
``` python
# compute and store the answer in the variable 'top_10_total_score', then display it
# do NOT plot just yet
```
%% Cell type:code id:64d40c82 tags:
``` python
grader.check("q6")
```
%% Cell type:markdown id:2e7b11bc tags:
Now, **plot** `top_10_total_score` as a **bar plot** with the **x-axis** labelled *country* and the **y-axis** labelled *total_score*.
You **must** use the `bar_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:033d0733 tags:
<div><img src="attachment:q6.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:2192b4e4 tags:
``` python
# create the bar plot using the DataFrame 'top_10_total_score' with the x-axis labelled "country"
# and the y-axis labelled "total_score"
```
%% Cell type:markdown id:88cbb812 tags:
**Question 7:** What are the **top** *10* institutions in the *United States* which had the **highest** *international_score* in the `year` *2020*?
The *international_score* of an institution is defined as the **sum** of `international_faculty` and `international_students` scores of that institution. You **must** display the columns `institution_name` and `international_score`. The rows **must** be in *descending* order of `international_score`.
Your output **must** be a **DataFrame** that looks like this:
||**institution_name**|**international_score**|
|---------|------|---------|
|**0**|Massachusetts Institute Of Technology|194.1|
|**1**|California Institute Of Technology|186.7|
|**2**|Carnegie Mellon University|183.5|
|**3**|Rice University|180.4|
|**4**|Northeastern University|179.1|
|**5**|Stanford University|167.5|
|**6**|Cornell University|166.1|
|**7**|Purdue University|158.2|
|**8**|University Of Rochester|157.9|
|**9**|University Of Chicago|151.2|
%% Cell type:code id:af3589cd tags:
``` python
# compute and store the answer in the variable 'top_10_inter_score', then display it
```
%% Cell type:code id:41ee5bff tags:
``` python
grader.check("q7")
```
%% Cell type:markdown id:4794b1a5 tags:
**Question 8:** Create a **scatter plot** representing the `citations_per_faculty` (on the **x-axis**) against the `overall_score` (on the **y-axis**) of each institution in the `year` *2018*.
You **must** first compute a **DataFrame** `citations_overall` containing the **citations_per_faculty** and the **overall_score** data from the `year` *2018*, of each **institution**.
%% Cell type:code id:92a32a11 tags:
``` python
# first compute and store the DataFrame 'citations_overall', then display its head
# do NOT plot just yet
```
%% Cell type:code id:c9a2b1ba tags:
``` python
grader.check("q8")
```
%% Cell type:markdown id:68165402 tags:
Now, **plot** `citations_overall` as a **scatter plot** with the **x-axis** labelled *citations_per_faculty* and the **y-axis** labelled *overall_score*.
You **must** use the `scatter_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:667b4025 tags:
<div><img src="attachment:q8.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:0e0b8a7d tags:
``` python
# create the scatter plot using the DataFrame 'citations_overall' with the x-axis labelled "citations_per_faculty"
# and the y-axis labelled "overall_score"
```
%% Cell type:markdown id:8ba5ed8c tags:
**Question 9:** Create a **scatter plot** representing the `academic_reputation` (on the **x-axis**) against the `employer_reputation` (on the **y-axis**) of each institution from the *United States* in the `year` *2019*.
You **must** first compute a **DataFrame** `reputations_usa` containing the **academic_reputation** and the **employer_reputation** data from the `year` *2019*, of each **institution** in the `country` *United States*.
%% Cell type:code id:b04f767f tags:
``` python
# first compute and store the DataFrame 'reputations_usa', then display its head
# do NOT plot just yet
```
%% Cell type:code id:05490b0c tags:
``` python
grader.check("q9")
```
%% Cell type:markdown id:5f8fcce5 tags:
Now, **plot** `reputations_usa` as a **scatter plot** with the **x-axis** labelled *academic_reputation* and the **y-axis** labelled *employer_reputation*.
You **must** use the `scatter_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:0295c09c tags:
<div><img src="attachment:q9.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:29894cd8 tags:
``` python
# create the scatter plot using the DataFrame 'reputations_usa' with the x-axis labelled "academic_reputation"
# and the y-axis labelled "employer_reputation"
```
%% Cell type:markdown id:2e739c41 tags:
**Question 10:** Create a **scatter plot** representing the `international_students` (on the **x-axis**) against the `faculty_student_score` (on the **y-axis**) for the **top ranked** institution of **each** `country` in the `year` *2020*.
You **must** first compute a **DataFrame** `top_ranked_inter_faculty` containing the **international_students** and the **faculty_student_score** data from the `year` *2020*, of the **top** ranked **institution** (i.e., the institution with the **least** `rank`) of each **country**.
**Hint:** You can use the `MIN` SQL function to return the least value of a selected column. However, there are a few things to keep in mind while using this function.
* The function must be in **uppercase** (i.e., you must use `MIN`, and **not** `min`).
* The column you are finding the minimum of must be inside backticks (``` ` ```). For example, if you want to find the minimum `rank`, you need to say ```MIN(`rank`)```.
If you do not follow the syntax above, your code will likely fail.
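The `MIN` hint above can be sketched on a toy table as follows. This relies on a SQLite-specific behaviour: when a query uses `MIN` as its only aggregate, the other selected columns take their values from the row that holds the minimum. Other database engines may reject or handle such a query differently. The table `demo`, its columns, and the connection `demo_conn` are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table: two teams, each with two ranked players.
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "team": ["X", "X", "Y", "Y"],
    "player": ["p1", "p2", "p3", "p4"],
    "rank": [5, 2, 7, 9],
}).to_sql("demo", demo_conn, index=False)

# One output row per team. In SQLite, `player` is taken from the same
# row that supplies MIN(`rank`) within each group.
query = """
    SELECT `team`, `player`, MIN(`rank`) AS `best_rank`
    FROM demo
    GROUP BY `team`
    ORDER BY `team` ASC
"""
best_per_team = pd.read_sql(query, demo_conn)
demo_conn.close()
```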
%% Cell type:code id:fa9e1b6f tags:
``` python
# first compute and store the DataFrame 'top_ranked_inter_faculty', then display its head
# do NOT plot just yet
```
%% Cell type:code id:a4831be1 tags:
``` python
grader.check("q10")
```
%% Cell type:markdown id:59b40839 tags:
Now, **plot** `top_ranked_inter_faculty` as a **scatter plot** with the **x-axis** labelled *international_students* and the **y-axis** labelled *faculty_student_score*.
You **must** use the `scatter_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:0ffca4e3 tags:
<div><img src="attachment:q10.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:2f17934b tags:
``` python
# create the scatter plot using the DataFrame 'top_ranked_inter_faculty' with the x-axis labelled "international_students"
# and the y-axis labelled "faculty_student_score"
```
%% Cell type:markdown id:9dab472c tags:
### Correlations:
You can use the `.corr()` method on a **DataFrame** that has **two** columns to get the *correlation* between those two columns.
For example, if we have a **DataFrame** `df` with the two columns `citations_per_faculty` and `overall_score`, `df.corr()` would return
||**citations_per_faculty**|**overall_score**|
|---------|------|---------|
|citations_per_faculty|1.000000|0.574472|
|overall_score|0.574472|1.000000|
You can use `.loc` here to **extract** the *correlation* between the two columns (`0.574472` in this case).
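As a minimal sketch of the extraction step, the toy **DataFrame** below uses made-up numbers and the hypothetical column names `x` and `y` (it is not the lab data). It shows that `.corr()` returns a symmetric matrix indexed by column name, and that `.loc` with a row label and a column label picks out a single value:

```python
import pandas as pd

# toy data with hypothetical column names; these are NOT the lab's values
toy_df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 9]})

corr_matrix = toy_df.corr()           # 2x2 DataFrame indexed by column name
xy_corr = corr_matrix.loc["x", "y"]   # row label "x", column label "y"

print(corr_matrix)
print(xy_corr)
```

Because the matrix is symmetric, `corr_matrix.loc["y", "x"]` gives the same number.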
%% Cell type:markdown id:f09ade4a tags:
**Question 11:** Find the **correlation** between `international_students` and `overall_score` for institutions from the `country` *United Kingdom* that were ranked in the **top** *100* in the `year` *2020*.
Your output **must** be a **float** representing the absolute correlation. The **only** `pandas` operations you are **allowed** to use are: `.corr`, `.loc` and `.iloc`. You **must** use SQL to gather all other data.
%% Cell type:code id:706db815 tags:
``` python
# compute and store the answer in the variable 'uk_inter_score_corr', then display it
```
%% Cell type:code id:ea738710 tags:
``` python
grader.check("q11")
```
%% Cell type:markdown id:314d22d6 tags:
Let us now define a new score called `citations_per_international` as follows:
$$\texttt{citations}\_\texttt{per}\_\texttt{international} = \frac{\texttt{citations}\_\texttt{per}\_\texttt{faculty} \times \texttt{international}\_\texttt{faculty}}{100}.$$
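A derived score like this can be computed directly inside a `SELECT` clause. The sketch below uses a hypothetical in-memory table `demo` with made-up values (not the lab's database) to show the arithmetic, and also that a missing input makes the derived score missing as well:

```python
import sqlite3

# hypothetical in-memory table standing in for the lab's data
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo (name TEXT, citations_per_faculty REAL, international_faculty REAL)"
)
demo_conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [("A", 80.0, 50.0), ("B", 100.0, 100.0), ("C", 60.0, None)],
)

# the derived column is computed row by row and named with AS
rows = demo_conn.execute("""
    SELECT name,
           (citations_per_faculty * international_faculty) / 100 AS citations_per_international
    FROM demo
    ORDER BY name
""").fetchall()
print(rows)
demo_conn.close()
```

Row `C` comes back with `None` because any arithmetic involving `NULL` yields `NULL` in SQL.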
%% Cell type:markdown id:cef190f0 tags:
**Question 12:** Find the **correlation** between `citations_per_international` and `overall_score` for **all** institutions in the `year` *2019*.
Your output **must** be a **float** representing the absolute correlation. The **only** `pandas` operations you are **allowed** to use are: `.corr`, `.loc` and `.iloc`. You **must** use SQL to gather all other data.
%% Cell type:code id:777d001c tags:
``` python
# compute and store the answer in the variable 'cit_per_inter_score_corr', then display it
```
%% Cell type:code id:ee14b0ac tags:
``` python
grader.check("q12")
```
%% Cell type:markdown id:cc72c981 tags:
**Question 13:** What are the **top** *15* countries with the **highest** *total* of `citations_per_international` in the `year` *2019*?
The *total* `citations_per_international` of a `country` is defined as the **sum** of `citations_per_international` scores of **all** institutions in that `country`. You **must** display the columns `country` and `sum_inter_citations`. The rows **must** be in *descending* order of `sum_inter_citations`.
Your output **must** be a **DataFrame** that looks like this:
||**country**|**sum_inter_citations**|
|----|-----------|-----------------------|
|**0**|United States|2623.8207|
|**1**|United Kingdom|2347.1602|
|**2**|Australia|1255.5530|
|**3**|Netherlands|748.4268|
|**4**|Canada|724.5029|
|**5**|Switzerland|561.8790|
|**6**|China|482.2577|
|**7**|Germany|455.5466|
|**8**|Hong Kong|375.3032|
|**9**|New Zealand|327.3357|
|**10**|Sweden|305.3745|
|**11**|Belgium|255.0750|
|**12**|France|198.0860|
|**13**|Denmark|186.4904|
|**14**|Singapore|160.3000|
%% Cell type:code id:14aaad72 tags:
``` python
# compute and store the answer in the variable 'top_cit_per_inter', then display it
```
%% Cell type:code id:b44e985d tags:
``` python
grader.check("q13")
```
%% Cell type:markdown id:59a993ce tags:
**Question 14:** Among the institutions ranked within the **top** *300*, find the **average** `citations_per_international` for **each** `country` in the `year` *2019*.
You **must** display the columns `country` and `avg_inter_citations` representing the **average** of `citations_per_international` for **each** country. The rows **must** be in *descending* order of `avg_inter_citations`.
**Hint:** To find the **average**, you can use `SUM()` and `COUNT()` or you can simply use `AVG()`.
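The two approaches in the hint agree only if `COUNT()` counts the same rows that `AVG()` averages. The sketch below uses a hypothetical table `scores` with made-up values (not the lab data) to compare them. `AVG(score)` and `COUNT(score)` both skip `NULL` values, whereas `COUNT(*)` counts every row:

```python
import sqlite3

# hypothetical in-memory table; one score for country X is missing
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE scores (country TEXT, score REAL)")
demo_conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("X", 10.0), ("X", 20.0), ("X", None), ("Y", 5.0)],
)

rows = demo_conn.execute("""
    SELECT country,
           AVG(score)                AS avg_score,
           SUM(score) / COUNT(score) AS sum_over_count_col,
           SUM(score) / COUNT(*)     AS sum_over_count_star
    FROM scores
    GROUP BY country
    ORDER BY country
""").fetchall()
print(rows)
demo_conn.close()
```

For country `X` the first two columns agree, while dividing by `COUNT(*)` gives a smaller value because the row with the missing score is still counted.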
Your output **must** be a **DataFrame** whose **first ten rows** look like this:
||**country**|**avg_inter_citations**|
|----|-----------|----------------------|
|**0**|Singapore|80.150000|
|**1**|Switzerland|75.497000|
|**2**|Hong Kong|62.550533|
|**3**|Australia|61.362388|
|**4**|Netherlands|56.166733|
|**5**|New Zealand|53.226220|
|**6**|United Kingdom|52.889084|
|**7**|Canada|50.779723|
|**8**|Denmark|46.196200|
|**9**|Norway|46.083300|
%% Cell type:code id:dac3e940 tags:
``` python
# compute and store the answer in the variable 'avg_cit_per_inter', then display it
```
%% Cell type:code id:946bb83c tags:
``` python
grader.check("q14")
```
%% Cell type:markdown id:bfded4bf tags:
**Question 15:** Find the **institution** with the **highest** value of `citations_per_international` for **each** `country` in the `year` *2020*.
Your output **must** be a **DataFrame** with the columns `country`, `institution_name`, and a new column `max_inter_citations` representing the **maximum** value of `citations_per_international` for that country. The rows **must** be in *descending* order of `max_inter_citations`. You **must** **omit** rows where `max_inter_citations` is **missing** by using the clause:
```sql
HAVING `max_inter_citations` IS NOT NULL
```
**Hint:** You can use the `MAX()` function to return the largest value within a group.
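To see how `MAX()`, `GROUP BY` and `HAVING` interact, here is a small sketch on a hypothetical table `inst` with made-up values (not the lab data). It relies on SQLite-specific behavior: when a query uses a single `MAX()` aggregate, the other selected columns take their values from the row that holds the maximum. Other database systems may reject or handle such a query differently.

```python
import sqlite3

# hypothetical in-memory table; country B has no usable score at all
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE inst (country TEXT, name TEXT, score REAL)")
demo_conn.executemany(
    "INSERT INTO inst VALUES (?, ?, ?)",
    [("A", "a1", 1.0), ("A", "a2", 3.0), ("B", "b1", None), ("C", "c1", 2.0)],
)

rows = demo_conn.execute("""
    SELECT country, name, MAX(score) AS max_score
    FROM inst
    GROUP BY country
    HAVING max_score IS NOT NULL
    ORDER BY max_score DESC
""").fetchall()
print(rows)
demo_conn.close()
```

Country `B` disappears from the result because the maximum of a group containing only `NULL` values is itself `NULL`, and `HAVING` filters groups after aggregation.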
Your output **must** be a **DataFrame** whose **first ten rows** look like this:
||**country**|**institution_name**|**max_inter_citations**|
|----|-----------|--------------------|----------------------|
|**0**|United States|Massachusetts Institute Of Technology|99.8000|
|**1**|Switzerland|Ecole Polytechnique Fédérale De Lausanne|98.9000|
|**2**|Netherlands|Eindhoven University Of Technology|95.4493|
|**3**|United Kingdom|London School Of Economics And Political Science|91.1000|
|**4**|Hong Kong|The Hong Kong University Of Science And Technology|89.5000|
|**5**|Singapore|Nanyang Technological University|88.8000|
|**6**|Australia|The University Of Western Australia|88.3000|
|**7**|Belgium|Katholieke Universiteit Leuven|76.7700|
|**8**|New Zealand|University Of Waikato|73.6434|
|**9**|Canada|Western University|72.3240|
%% Cell type:code id:fba4a1c2 tags:
``` python
# compute and store the answer in the variable 'max_cit_per_inter', then display it
```
%% Cell type:code id:9c4db997 tags:
``` python
grader.check("q15")
```
%% Cell type:markdown id:da9cb13f tags:
**Question 16:** Among the institutions ranked within the **top** *50*, create a **horizontal bar plot** representing the **average** of both the `citations_per_faculty` and `international_faculty` scores for **all** institutions in **each** `country` in the `year` *2018*.
You **must** first create a **DataFrame** `country_citations_inter` with **three** columns: `country`, `avg_citations` and `avg_inter_faculty` representing the name, the average value of `citations_per_faculty` and the average value of `international_faculty` for each country respectively.
You **must** ensure that the countries in the **DataFrame** are **ordered** in **increasing** order of the **difference** between `avg_citations` and `avg_inter_faculty`.
%% Cell type:code id:e9e566a5 tags:
``` python
# first compute and store the DataFrame 'country_citations_inter', then display it
# do NOT plot just yet
```
%% Cell type:code id:60d1c6f7 tags:
``` python
grader.check("q16")
```
%% Cell type:markdown id:3e859552 tags:
Now, **plot** `country_citations_inter` as a **horizontal bar plot** with the **x-axis** labelled *country*.
You **must** use the `horizontal_bar_plot` function to plot this data. Verify that the countries are **ordered** in **decreasing** order of the **difference** between `avg_citations` and `avg_inter_faculty`. Verify that the **legend** appears on your plot.
**Hint:** If you want the countries in the plot to be ordered in **decreasing** order of the difference, you will need to make sure that in the DataFrame, they are ordered in the **increasing** order.
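The hint makes sense if the plotting helper draws the first row of the **DataFrame** at the bottom of the chart, which is how `pandas` horizontal bar plots behave. That `horizontal_bar_plot` works this way is an assumption here. Sorting by a difference of two aggregates can be done in the `ORDER BY` clause itself, as in this sketch on a hypothetical table `t` with made-up values (not the lab data):

```python
import sqlite3

# hypothetical in-memory table with two made-up score columns a and b
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE t (country TEXT, a REAL, b REAL)")
demo_conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("P", 90.0, 40.0), ("P", 70.0, 60.0), ("Q", 50.0, 45.0), ("R", 30.0, 60.0)],
)

# increasing order of (average a - average b); the last row has the largest difference
rows = demo_conn.execute("""
    SELECT country, AVG(a) AS avg_a, AVG(b) AS avg_b
    FROM t
    GROUP BY country
    ORDER BY AVG(a) - AVG(b) ASC
""").fetchall()
print(rows)
demo_conn.close()
```

With the rows in this order, a plot that stacks bars from the bottom up would show the largest difference at the top.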
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:fb3e7670 tags:
<div><img src="attachment:q16.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:259af611 tags:
``` python
# create the horizontal bar plot using the DataFrame 'country_citations_inter' with the x-axis labelled "country"
```
%% Cell type:markdown id:1a5d4543 tags:
**Question 17:** Create a **scatter plot** representing the `overall_score` (on the **x-axis**) against the `rank` (on the **y-axis**) for **all** institutions in the `year` *2020*. Additionally, **plot** a **regression line** within the same plot.
You **must** first compute a **DataFrame** containing the **overall_score**, and the **rank** data from the `year` *2020*. You **must** use the `get_regression_line` function to compute the best fit line.
%% Cell type:code id:d51299b8 tags:
``` python
# first compute and store the DataFrame 'overall_rank', then display its head
# do NOT plot just yet
```
%% Cell type:code id:a422be6a tags:
``` python
grader.check("q17")
```
%% Cell type:markdown id:4c062dae tags:
Now, **plot** `overall_rank` as a **scatter plot** with a **regression line** with the **x-axis** labelled *overall_score* and the **y-axis** labelled *rank*.
You **must** use the `regression_line_plot` function to plot this data.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:aee08178 tags:
<div><img src="attachment:q17.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:6c914693 tags:
``` python
# create the scatter plot and the regression line using the DataFrame 'overall_rank' with the x-axis labelled "overall_score"
# and the y-axis labelled "rank"
```
%% Cell type:markdown id:effa2591 tags:
**Food for thought:** Does our linear regression model fit the points well? It looks like the relationship between the `overall_score` and `rank` is **not quite linear**. In fact, a cursory look at the data suggests that the relationship is inverse.
%% Cell type:code id:9f1de243 tags:
``` python
# Food for thought is an entirely OPTIONAL exercise
# you may leave your thoughts here as a comment if you wish to
```
%% Cell type:markdown id:26e4e3c1 tags:
**Question 18:** Create a **scatter plot** representing the **inverse** of the `overall_score` (on the **x-axis**) against the `rank` (on the **y-axis**) for **all** institutions in the `year` *2020*. Additionally, **plot** a **regression line** within the same plot.
The `inverse_overall_score` for each institution is simply defined as `1/overall_score` for that institution. You **must** first compute a **DataFrame** containing the **inverse_overall_score**, and the **rank** data from the `year` *2020*. You **must** use the `get_regression_line` function to compute the best fit line.
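One detail worth knowing when writing `1/overall_score` in SQL: SQLite performs integer division when both operands are integers. Whether this matters in the lab depends on how the score column is stored, which is not stated here. The sketch below uses plain literals rather than the lab's table to show the difference:

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

# integer / integer truncates; making either operand a real number avoids that
rows = demo_conn.execute("SELECT 1 / 4, 1.0 / 4, 1 / 4.0").fetchall()
print(rows)
demo_conn.close()
```

Writing the numerator as `1.0` is a simple way to get a real-valued inverse regardless of the column's type.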
%% Cell type:code id:6c535d83 tags:
``` python
# first compute and store the DataFrame 'inverse_overall_rank', then display its head
# do NOT plot just yet
```
%% Cell type:code id:22a6a736 tags:
``` python
grader.check("q18")
```
%% Cell type:markdown id:e64a0040 tags:
Now, **plot** `inverse_overall_rank` as a **scatter plot** with a **regression line** with the **x-axis** labelled *inverse_overall_score* and the **y-axis** labelled *rank*.
You **must** use the `regression_line_plot` function to plot this data.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:baeb0d40 tags:
<div><img src="attachment:q18.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:dd8efd5b tags:
``` python
# create the scatter plot and the regression line using the DataFrame 'inverse_overall_rank'
# with the x-axis labelled "inverse_overall_score" and the y-axis labelled "rank"
```
%% Cell type:markdown id:9f9f2089 tags:
This seems to be much better! Let us now use this **regression line** to **estimate** the `rank` of an institution given its `overall_score`.
%% Cell type:markdown id:0849a83f tags:
**Question 19:** Use the regression line to **estimate** the `rank` of an institution with an `overall_score` of *72*.
Your output **must** be an **int**. If your **estimate** is a **float**, *round it up* using `math.ceil`.
**Hints:**
1. Call the `get_regression_coeff` function to get the coefficients `m` and `b`.
2. Recall that the equation of a line is `y = m * x + b`. What are `x` and `y` here?
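The arithmetic behind the hints can be sketched with made-up numbers. The coefficients `m` and `b` and the score below are hypothetical placeholders, not values from the lab data, and the sketch assumes the line was fitted with the inverse score on the x-axis, as in the previous plot:

```python
import math

# hypothetical coefficients; in the lab they would come from get_regression_coeff
m, b = 1000.0, -3.2
score = 50                  # hypothetical overall_score, not the one in the question

x = 1 / score               # the line was fitted against the inverse of the score
estimate = m * x + b        # y = m * x + b
rank_estimate = math.ceil(estimate)   # round up to an int

print(rank_estimate)
```

`math.ceil` returns an `int`, so no extra conversion is needed after rounding up.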
%% Cell type:code id:7f3fa177 tags:
``` python
# compute and store the answer in the variable 'rank_score_72', then display it
```
%% Cell type:code id:c1559986 tags:
``` python
grader.check("q19")
```
%% Cell type:markdown id:547f4135 tags:
**Food for thought:** Can you find out the `overall_score` of the university with this `rank` in the `year` *2020*? Does it match your prediction?
%% Cell type:code id:60915e12 tags:
``` python
# Food for thought is an entirely OPTIONAL exercise
# you may leave your thoughts here as a comment if you wish to
```
%% Cell type:markdown id:53ab4005 tags:
**Question 20:** Using the data from Question 5, create a **pie plot** representing the number of institutions from each country.
You **have** already computed a **DataFrame** `num_institutions` (in Question 5) containing the **country**, and the **num_of_institutions** data. Run the following cell just to confirm that the variable has not changed its values since you defined it in Question 5.
%% Cell type:code id:2a86a546 tags:
``` python
grader.check("q20")
```
%% Cell type:markdown id:d95601d7 tags:
Now, **plot** `num_institutions` as a **pie plot** with the **title** *Number of institutions*.
You **must** use the `pie_plot` function to create the **pie plot**. The **colors** do **not** matter, but the plot **must** be titled `Number of institutions`, and **must** be labelled as in the sample output below.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
Your plot should look like this:
%% Cell type:markdown id:76ce5db5 tags:
<div><img src="attachment:q20.png" width="400"/></div>
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:0fdcbe48 tags:
``` python
# create the pie plot using the DataFrame 'num_institutions' titled "Number of institutions"
```
%% Cell type:markdown id:6bce0354 tags:
**Food for thought:** It seems that we'll run out of colors! How can we make it so that **no two neighbors share a color**? You'll probably have to look online.
%% Cell type:code id:7bd4d538 tags:
``` python
# Food for thought is an entirely OPTIONAL exercise
# you may leave your thoughts here as a comment if you wish to
```
%% Cell type:markdown id:936abcda tags: %% Cell type:markdown id:936abcda tags:
### Closing the database connection: ### Closing the database connection:
Now, before you **submit** your notebook, you **must** **close** your connection `conn`. Not doing this might make **Gradescope fail**. Additionally, **delete** the example images provided with plot questions to save space, if your notebook file is too large for submission. You can **delete** any cell by selecting the cell, hitting the `Esc` key once, and then hitting the `d` key **twice**. Now, before you **submit** your notebook, you **must** **close** your connection `conn`. Not doing this might make **Gradescope fail**. Additionally, **delete** the example images provided with plot questions to save space, if your notebook file is too large for submission. You can **delete** any cell by selecting the cell, hitting the `Esc` key once, and then hitting the `d` key **twice**.
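As a minimal illustration of what closing does, the sketch below uses a throwaway in-memory database. The name `demo_conn` and the table `t` are made up for this example and stand in for the lab's `conn`; once a connection is closed, further queries on it raise `sqlite3.ProgrammingError`.

```python
import sqlite3

# `demo_conn` is a hypothetical in-memory database used only for this sketch.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE t (x INTEGER)")
demo_conn.execute("INSERT INTO t VALUES (1)")
demo_conn.commit()  # close() does not commit pending changes for you

# Release the connection once all queries are done.
demo_conn.close()

# Any later use of the closed connection fails.
try:
    demo_conn.execute("SELECT * FROM t")
    closed = False
except sqlite3.ProgrammingError:
    closed = True
```

In the notebook this means every query and plot cell that relies on `conn` has to run before the cell that closes it.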
%% Cell type:code id:9515f232 tags:
``` python
# close your connection here
```
%% Cell type:markdown id:27a5f70c tags:
## Submission
Make sure you have run all cells in your notebook in order before running the following cells, so that all images/graphs appear in the output. The following cells will generate a zip file for you to submit.
**SUBMISSION INSTRUCTIONS**:
1. **Upload** the zipfile to Gradescope.
2. Check **Gradescope otter** results as soon as the auto-grader execution gets completed. Don't worry about the score showing up as -/100.0. You only need to check that the test cases passed.
%% Cell type:code id:9419c771 tags:
``` python
from IPython.display import display, Javascript
display(Javascript('IPython.notebook.save_checkpoint();'))
```
%% Cell type:code id:b54d6127 tags:
``` python
!jupytext --to py p13.ipynb
```
%% Cell type:code id:11da7246 tags:
``` python
p13_test.check_file_size("p13.ipynb")
grader.export(pdf=False, run_tests=True, files=[py_filename])
```
%% Cell type:markdown id:a44ca87a tags: