"### Task 1.3: Create a database called 'rankings.db' out of 'QSRankings.json'\n",
"### Task 1.3: Create a database called 'rankings.db' out of 'QSRankings.json'\n",
"\n",
"\n",
"You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Michael_lecture_notes/32_Database-1) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Gurmail_lecture_notes/32_Database-1)."
"You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database1_notes.ipynb)."
]
]
},
},
{
{
...
@@ -454,7 +453,7 @@
...
@@ -454,7 +453,7 @@
"\n",
"\n",
"Before starting this segment, it is recommended that you go through the relevant lecture code:\n",
"Before starting this segment, it is recommended that you go through the relevant lecture code:\n",
"\n",
"\n",
"* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb) (Bar and scatter plots) and [here]() (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)"
"* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb) (Bar and scatter plots) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/sum23/lecture_materials/23_Plotting2) (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)"
]
]
},
},
{
{
...
...
%% Cell type:markdown id:115889c5 tags:
%% Cell type:markdown id:115889c5 tags:
# Lab 13: Analyzing World Data with SQL
# Lab 13: Analyzing World Data with SQL
In this lab, you will practice how to:
In this lab, you will practice how to:
* write SQL queries,
* write SQL queries,
* create your own plots.
* create your own plots.
%% Cell type:markdown id:daed65a3 tags:
%% Cell type:markdown id:daed65a3 tags:
# Segment 1: Setup
# Segment 1: Setup
### Task 1.1: Import the required modules
### Task 1.1: Import the required modules
We will first import some important modules.
%% Cell type:code id:e59b7bdb tags:
%% Cell type:code id:e59b7bdb tags:
``` python
# it is considered a good coding practice to place all import statements at the top of the notebook
# please place all your import statements in this cell if you need to import any more modules for this project
import sqlite3
import pandas as pd
import matplotlib
import math
import numpy as np # this is *only* for the function get_regression_coeff - do NOT use this module elsewhere
```
%% Cell type:code id:97a3f1e8 tags:
%% Cell type:code id:97a3f1e8 tags:
``` python
# this ensures that font.size setting remains uniform
%matplotlib inline
pd.set_option('display.max_colwidth', None)
matplotlib.rcParams["font.size"] = 13 # don't use value > 13! Otherwise your y-axis tick labels will be different.
```
%% Cell type:markdown id:75adca21 tags:
%% Cell type:markdown id:75adca21 tags:
### Task 1.2: Use the `download` function to download `QSranking.json`
### Task 1.2: Use the `download` function to download `QSranking.json`
Warning: For the lab and the project, do **not** download the dataset `QSranking.json` manually (you **must** write Python code to download this, as in P12). When we run the autograder, this file `QSranking.json` will not be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. The Gradescope autograder will **deduct points** otherwise.
Warning: For the lab and the project, do **not** download the dataset `QSranking.json` manually (you **must** write Python code to download this, as in P12). When we run the autograder, this file `QSranking.json` will not be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. The Gradescope autograder will **deduct points** otherwise.
%% Cell type:code id:2bb742ed tags:
%% Cell type:code id:2bb742ed tags:
``` python
``` python
# copy the definition of your 'download' function from P12 here - remember to import the necessary modules
# copy the definition of your 'download' function from P12 here - remember to import the necessary modules
```
```
%% Cell type:code id:fe96e53b tags:
%% Cell type:code id:fe96e53b tags:
``` python
``` python
# use the 'download' function to download the data from the webpage
# use the 'download' function to download the data from the webpage
### Task 1.3: Create a database called 'rankings.db' out of 'QSranking.json'
You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Michael_lecture_notes/32_Database-1) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Gurmail_lecture_notes/32_Database-1).
You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database1_notes.ipynb).
%% Cell type:code id:270d8da5 tags:
%% Cell type:code id:270d8da5 tags:
``` python
``` python
# create a database called 'rankings.db' out of 'QSranking.json'
# create a database called 'rankings.db' out of 'QSranking.json'
# TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
# TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
# TODO: connect to 'rankings.db' and save it to a variable called 'conn'
# TODO: connect to 'rankings.db' and save it to a variable called 'conn'
# write the contents of 'qs_ranking' to the table 'rankings' in the database
# write the contents of 'qs_ranking' to the table 'rankings' in the database
In practice, we are often interested in writing more specific queries about our data. For example, we might be interested in finding institutions in the *United States*, or data collected in the `year` *2018*, or both. With **SQL**, **WHERE** and **AND** clauses can help filter the data accordingly.
Before proceeding with this segment, it is **recommended** that you **review** the relevant lecture code:
* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database2_notes.ipynb) (Databases part 2) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database3_notes.ipynb) (Databases part 3)
%% Cell type:markdown id:9cebe083 tags:
%% Cell type:markdown id:9cebe083 tags:
### Task 2.1: Use WHERE to find institutions in the United States
### Task 2.1: Use WHERE to find institutions in the United States
* Write a query to select the rows from the database with *United States* as the `country`.
* Write a query to select the rows from the database with *United States* as the `country`.
* Keep only the `institution_name` column.
* Keep only the `institution_name` column.
* Save these institution names to a **list**.
* Save these institution names to a **list**.
**Hint:** You will need to use **quotes** (`'`) around the **strings** in your query and **backticks** (``` ` ```) around **column names** as in the example below. The **quotes** are always **required** around strings. The **backticks** are only **required** when the column name contains special characters or spaces, but even otherwise, it is a good idea to use them to be on the safe side.
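To make the quoting rules concrete, here is a minimal sketch on a made-up table, using only Python's built-in `sqlite3` module. The table `city stats`, its columns, and its rows are hypothetical and are not part of the rankings data. The same query string could instead be passed to `pd.read_sql`.

```python
import sqlite3

# hypothetical toy table, invented purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE `city stats` (`city name` TEXT, `region` TEXT, `year` INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO `city stats` VALUES (?, ?, ?)",
    [("Alpha", "North", 2018), ("Beta", "South", 2018), ("Gamma", "North", 2019)],
)

# backticks go around identifiers (needed here because the names contain spaces),
# single quotes go around the string literal 'North'
query = (
    "SELECT `city name` FROM `city stats` "
    "WHERE `region` = 'North' AND `year` = 2018"
)
north_2018 = [row[0] for row in demo_conn.execute(query)]
print(north_2018)
demo_conn.close()
```

Only the row that satisfies **both** conditions is returned, which is what **AND** is for.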
%% Cell type:code id:64012949 tags:
%% Cell type:code id:64012949 tags:
``` python
# we have done this one for you
us_institutions_df = pd.read_sql("SELECT `institution_name` FROM rankings WHERE `country` = 'United States'", conn)
assert "University of Connecticut" not in good_us_institutions
```
%% Cell type:markdown id:cf715227 tags:
%% Cell type:markdown id:cf715227 tags:
### Task 2.3: Use an ORDER BY clause to display the top 5 institutions by academic reputation in 2019
### Task 2.3: Use an ORDER BY clause to display the top 5 institutions by academic reputation in 2019
In addition to **WHERE** and **AND**, the **ORDER BY** keyword helps organize data even further. Much like the `sort_values()` function in `pandas`, the **ORDER BY** clause can be used to organize the result of the query in *increasing* (**ASC**) or *decreasing* (**DESC**) order based on a column's values.
In addition to **WHERE** and **AND**, the **ORDER BY** keyword helps organize data even further. Much like the `sort_values()` function in `pandas`, the **ORDER BY** clause can be used to organize the result of the query in *increasing* (**ASC**) or *decreasing* (**DESC**) order based on a column's values.
* Write a new query to select rows in rankings where the `year` is *2019*.
* Write a new query to select rows in rankings where the `year` is *2019*.
* Use **ORDER BY** and **LIMIT** to select the top 5 rows with the **highest** `academic_reputation`.
* Save these institution names to a **list**.
* Save these institution names to a **list**.
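As a rough illustration of how **ORDER BY** and **LIMIT** combine, the sketch below runs on a small made-up table with Python's built-in `sqlite3` module. The table `products` and its columns are hypothetical; your own query will use the `rankings` table instead.

```python
import sqlite3

# hypothetical toy table, invented purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE products (name TEXT, price REAL, year INTEGER)")
demo_conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("pen", 2.0, 2019),
        ("book", 12.5, 2019),
        ("lamp", 30.0, 2019),
        ("desk", 150.0, 2018),
        ("mug", 8.0, 2019),
    ],
)

# keep only 2019 rows, sort by price from highest to lowest, keep the first 2
query = (
    "SELECT `name` FROM products "
    "WHERE `year` = 2019 "
    "ORDER BY `price` DESC "
    "LIMIT 2"
)
top_2_products = [row[0] for row in demo_conn.execute(query)]
print(top_2_products)
demo_conn.close()
```

Note that the most expensive row overall ("desk") is excluded because the **WHERE** filter is applied before the sorting.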
%% Cell type:code id:763304e0 tags:
%% Cell type:code id:763304e0 tags:
``` python
``` python
# compute and store the answer in the variable 'top_5_institutions', then display it
# compute and store the answer in the variable 'top_5_institutions', then display it
```
```
%% Cell type:code id:404fa832 tags:
%% Cell type:code id:404fa832 tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert len(top_5_institutions) == 5
assert top_5_institutions[0] == "Massachusetts Institute Of Technology"
assert top_5_institutions[-1] == "University Of Cambridge"
```
%% Cell type:markdown id:13e1803b tags:
%% Cell type:markdown id:13e1803b tags:
### Task 2.4: Order by multiple columns
### Task 2.4: Order by multiple columns
If you print out the resulting dataframe from your query, you might notice that all 5 rows have the same academic reputation. This makes it hard to compare the universities, so we will add some **tiebreaking** rules. If two universities have the same `academic_reputation`, then we should order them by their `citations_per_faculty` (with the **highest** appearing first). You can do this by ordering by multiple columns.
If you print out the resulting dataframe from your query, you might notice that all 5 rows have the same academic reputation. This makes it hard to compare the universities, so we will add some **tiebreaking** rules. If two universities have the same `academic_reputation`, then we should order them by their `citations_per_faculty` (with the **highest** appearing first). You can do this by ordering by multiple columns.
* Copy your query from Task 2.3.
* Copy your query from Task 2.3.
* Update the **ORDER BY** clause to add this tiebreaking behavior.
* Update the **ORDER BY** clause to add this tiebreaking behavior.
* Save these institution names to a **list**.
* Save these institution names to a **list**.
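Ordering by multiple columns can be sketched as follows on a made-up table (the table `players` and its columns are hypothetical). The second column in the **ORDER BY** clause is only consulted when two rows tie on the first.

```python
import sqlite3

# hypothetical toy table, invented purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE players (name TEXT, score INTEGER, bonus INTEGER)")
demo_conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ann", 100, 5), ("Bob", 100, 9), ("Cy", 90, 20), ("Dee", 100, 7)],
)

# sort by score first (highest first); ties on score are broken by bonus (highest first)
query = (
    "SELECT `name` FROM players "
    "ORDER BY `score` DESC, `bonus` DESC "
    "LIMIT 3"
)
top_3_players = [row[0] for row in demo_conn.execute(query)]
print(top_3_players)
demo_conn.close()
```

"Cy" has the largest `bonus` but does not make the top 3, because `bonus` matters only among rows with equal `score`.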
%% Cell type:code id:26f5a433 tags:
%% Cell type:code id:26f5a433 tags:
``` python
``` python
# compute and store the answer in the variable 'top_5_with_tiebreak', then display it
# compute and store the answer in the variable 'top_5_with_tiebreak', then display it
```
```
%% Cell type:code id:c5b2382b tags:
%% Cell type:code id:c5b2382b tags:
``` python
# run this cell to confirm that your variable has been defined properly
assert top_5_with_tiebreak[0] == "University Of California, Berkeley"
assert top_5_with_tiebreak[-1] == "University Of California, Los Angeles"
```
%% Cell type:markdown id:9b991dcf tags:
%% Cell type:markdown id:9b991dcf tags:
### Task 2.5: Use GROUP BY clause and SUM aggregate function to get the total number of international_students for each country in 2019
### Task 2.5: Use GROUP BY clause and SUM aggregate function to get the total number of international_students for each country in 2019
The **GROUP BY** keyword groups rows that have the same value. It is often used with aggregate functions, such as **COUNT**, **SUM**, **AVG**, etc. to obtain a summary about groups in the data.
The **GROUP BY** keyword groups rows that have the same value. It is often used with aggregate functions, such as **COUNT**, **SUM**, **AVG**, etc. to obtain a summary about groups in the data.
For example, to answer the question "What is the average rank of each country's institutions?", we could **GROUP BY** the `country` and use the **AVG** aggregate function to get the average rank of each country.
For example, to answer the question "What is the average rank of each country's institutions?", we could **GROUP BY** the `country` and use the **AVG** aggregate function to get the average rank of each country.
* Write a new query that uses **GROUP BY** and **SUM** to get the total number of international students in each country, using **WHERE** to filter by the `year`.
* Write a new query that uses **GROUP BY** and **SUM** to get the total number of international students in each country, using **WHERE** to filter by the `year`.
* Save the resulting **DataFrame** with **two** columns: `country` and the **sum** of the `international_students` for that country.
* Save the resulting **DataFrame** with **two** columns: `country` and the **sum** of the `international_students` for that country.
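A minimal sketch of **GROUP BY** with **SUM**, again on a made-up table using the built-in `sqlite3` module. The table `sales` and its columns are hypothetical; the **ORDER BY** is added here only so that the output order is predictable.

```python
import sqlite3

# hypothetical toy table, invented purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, units INTEGER)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2019, 10), ("East", 2019, 5), ("West", 2019, 7), ("West", 2018, 100)],
)

# WHERE keeps only the 2019 rows; GROUP BY then forms one group per region,
# and SUM adds up the units inside each group
query = (
    "SELECT `region`, SUM(`units`) FROM sales "
    "WHERE `year` = 2019 "
    "GROUP BY `region` "
    "ORDER BY `region`"
)
units_by_region = demo_conn.execute(query).fetchall()
print(units_by_region)
demo_conn.close()
```

The 2018 row for "West" does not contribute to the total because it is filtered out before the grouping happens.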
%% Cell type:code id:f31786c4 tags:
%% Cell type:code id:f31786c4 tags:
``` python
``` python
# compute and store the answer in the variable 'inter_students_by_country', then display its head
# compute and store the answer in the variable 'inter_students_by_country', then display its head
```
```
%% Cell type:code id:9c84f12c tags:
%% Cell type:code id:9c84f12c tags:
``` python
``` python
# run this cell to confirm that your variable has been defined properly
# run this cell to confirm that your variable has been defined properly
### Task 2.6: Use the AS keyword to rename the new column from Task 2.5 to total_international_students
### Task 2.6: Use the AS keyword to rename the new column from Task 2.5 to total_international_students
Although the dataframe does have a column for the sum of international students for each country, the name of the column looks strange:
Although the dataframe does have a column for the sum of international students for each country, the name of the column looks strange:
```sql
```sql
SUM(`international_students`)
SUM(`international_students`)
```
```
In SQL, the **AS** keyword allows us to create a simpler alias for the columns we create with our queries to make the resulting **DataFrame** easier to understand.
* Paste your query from Task 2.5 and modify it so the **SUM** column has the name `total_international_students`.
* Paste your query from Task 2.5 and modify it so the **SUM** column has the name `total_international_students`.
* Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
* Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
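The effect of **AS** on the column name can be seen in this small sketch, which uses a made-up table (`sales` and its columns are hypothetical). It reads the column names from the cursor's `description` attribute; with `pd.read_sql`, the same names would become the **DataFrame** column labels.

```python
import sqlite3

# hypothetical toy table, invented purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10), ("East", 5), ("West", 7)],
)

# AS gives the aggregated column a readable alias
query = (
    "SELECT `region`, SUM(`units`) AS `total_units` FROM sales "
    "GROUP BY `region` "
    "ORDER BY `region`"
)
cursor = demo_conn.execute(query)
column_names = [col[0] for col in cursor.description]
totals = cursor.fetchall()
print(column_names)
print(totals)
demo_conn.close()
```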
%% Cell type:code id:3947be0d tags:
%% Cell type:code id:3947be0d tags:
``` python
``` python
# compute and store the answer in the variable 'inter_students_by_country_renamed', then display its head
# compute and store the answer in the variable 'inter_students_by_country_renamed', then display its head
```
```
%% Cell type:code id:9e114959 tags:
%% Cell type:code id:9e114959 tags:
``` python
``` python
# run this cell to confirm that your variable has been defined properly
# run this cell to confirm that your variable has been defined properly
### Task 2.7: Use the HAVING keyword to only keep countries with more than 1000 international students
### Task 2.7: Use the HAVING keyword to only keep countries with more than 1000 international students
In addition to **WHERE**, the **HAVING** keyword is useful for filtering **GROUP BY** queries. Whereas **WHERE** filters individual rows before they are grouped, **HAVING** filters the groups after aggregation.
* Paste your query from Task 2.6 and modify it so that it only returns countries (`country`) and `total_international_students` with **more than** *1000* international students.
* Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
* Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
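The difference between **WHERE** and **HAVING** can be sketched on a made-up table as follows (the table `sales`, its columns, and the threshold of 10 are hypothetical). The **HAVING** condition is written against the aggregate itself here, which is one safe way to express it.

```python
import sqlite3

# hypothetical toy table, invented purely for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, units INTEGER)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", 2019, 10),
        ("East", 2019, 5),
        ("West", 2019, 7),
        ("North", 2019, 20),
        ("West", 2018, 100),
    ],
)

# WHERE removes individual rows (non-2019) before grouping;
# HAVING removes whole groups whose total is not above 10
query = (
    "SELECT `region`, SUM(`units`) AS `total_units` FROM sales "
    "WHERE `year` = 2019 "
    "GROUP BY `region` "
    "HAVING SUM(`units`) > 10 "
    "ORDER BY `region`"
)
large_regions = demo_conn.execute(query).fetchall()
print(large_regions)
demo_conn.close()
```

"West" is dropped even though its 2018 row is large, because that row was already removed by **WHERE** and the remaining 2019 total is only 7.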
%% Cell type:code id:8bc00cf4 tags:
%% Cell type:code id:8bc00cf4 tags:
``` python
``` python
# compute and store the answer in the variable 'inter_students_by_country_more_than_1000', then display it
# compute and store the answer in the variable 'inter_students_by_country_more_than_1000', then display it
```
```
%% Cell type:code id:a1c5be56 tags:
%% Cell type:code id:a1c5be56 tags:
``` python
``` python
# run this cell to confirm that your variable has been defined properly
# run this cell to confirm that your variable has been defined properly
SQL provides powerful tools to manipulate and organize data. Now we might be interested in plotting the data to engage in data exploration and visualize our results.
SQL provides powerful tools to manipulate and organize data. Now we might be interested in plotting the data to engage in data exploration and visualize our results.
Before starting this segment, it is recommended that you go through the relevant lecture code:
Before starting this segment, it is recommended that you go through the relevant lecture code:
* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb) (Bar and scatter plots) and [here]() (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)
* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb) (Bar and scatter plots) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/sum23/lecture_materials/23_Plotting2) (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)
%% Cell type:markdown id:d27b7c2c tags:
%% Cell type:markdown id:d27b7c2c tags:
### Task 3.1: Use a bar plot to plot the data from Task 2.7
### Task 3.1: Use a bar plot to plot the data from Task 2.7
Use the `horizontal_bar_plot` function to create the required plot.
Use the `horizontal_bar_plot` function to create the required plot.
* Write a SQL query to select `year`, **average** `employer_reputation`, and **average** `faculty_student_score` grouped by `year`.
* Save the resulting **DataFrame** with **three** columns: `year`, the **average** of the `employer_reputation` and the **average** of the `faculty_student_score`.
* Save the resulting **DataFrame** with **three** columns: `year`, the **average** of the `employer_reputation` and the **average** of the `faculty_student_score`.
* Call `horizontal_bar_plot`, passing in `year` as the `x` argument.
* Call `horizontal_bar_plot`, passing in `year` as the `x` argument.
%% Cell type:code id:bc779e0b tags:
%% Cell type:code id:bc779e0b tags:
``` python
``` python
# first compute and store the DataFrame
# first compute and store the DataFrame
# then create the horizontal bar plot using the DataFrame
# then create the horizontal bar plot using the DataFrame
# verify that this plot matches exactly with the image shown above
# verify that this plot matches exactly with the image shown above
```
```
%% Cell type:markdown id:aaeeebe7 tags:
%% Cell type:markdown id:aaeeebe7 tags:
### Task 3.4 Display a Pie Chart of the average overall score of the top 10 countries in descending order
### Task 3.4 Display a Pie Chart of the average overall score of the top 10 countries in descending order
# nb_name should be the name of your notebook without the .ipynb extension
# nb_name should be the name of your notebook without the .ipynb extension
nb_name = "p13"
py_filename = nb_name + ".py"
grader = otter.Notebook(nb_name + ".ipynb")
```
```
%% Cell type:code id:0611fe14 tags:
%% Cell type:code id:0611fe14 tags:
``` python
import p13_test
```
%% Cell type:code id:2bcd01a8 tags:
%% Cell type:code id:2bcd01a8 tags:
``` python
``` python
# PLEASE FILL IN THE DETAILS
# PLEASE FILL IN THE DETAILS
# enter none if you don't have a project partner
# enter none if you don't have a project partner
# you will have to add your partner as a group member on Gradescope even after you fill this
# you will have to add your partner as a group member on Gradescope even after you fill this
# project: p13
# project: p13
# submitter: NETID1
# submitter: NETID1
# partner: NETID2
# partner: NETID2
```
```
%% Cell type:markdown id:372ed345 tags:
%% Cell type:markdown id:372ed345 tags:
# Project 13: World University Rankings
# Project 13: World University Rankings
%% Cell type:markdown id:b30c2df0 tags:
%% Cell type:markdown id:b30c2df0 tags:
## Learning Objectives:
## Learning Objectives:
In this project, you will demonstrate how to:
In this project, you will demonstrate how to:
* query a database using SQL,
* query a database using SQL,
* process data using `pandas` **DataFrames**,
* create different types of plots.
* create different types of plots.
Please go through [Lab 13](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/labs/lab13) before working on this project. The lab introduces some useful techniques related to this project.
Please go through [Lab 13](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/labs/lab13) before working on this project. The lab introduces some useful techniques related to this project.
%% Cell type:markdown id:479785c7 tags:
%% Cell type:markdown id:479785c7 tags:
## Note on Academic Misconduct:
## Note on Academic Misconduct:
**IMPORTANT**: P12 and P13 are two parts of the same data analysis. You **cannot** switch project partners between these two projects. That is, if you partnered up with someone for P12, you must keep that partnership until the end of P13. Now may be a good time to review [our course policies](https://canvas.wisc.edu/courses/355767/pages/syllabus?module_item_id=6048035).
%% Cell type:markdown id:3e0e04f5 tags:
%% Cell type:markdown id:3e0e04f5 tags:
## Testing your code:
## Testing your code:
Along with this notebook, you must have downloaded the file `p13_test.py`. If you are curious about how we test your code, you can explore this file, and specifically the value of the variable `expected_json`, to understand the expected answers to the questions.
Along with this notebook, you must have downloaded the file `p13_test.py`. If you are curious about how we test your code, you can explore this file, and specifically the value of the variable `expected_json`, to understand the expected answers to the questions.
For answers involving DataFrames, `p13_test.py` compares your tables to those in `p13_expected.html`, so take a moment to open that file on a web browser (from Finder/Explorer).
For answers involving DataFrames, `p13_test.py` compares your tables to those in `p13_expected.html`, so take a moment to open that file on a web browser (from Finder/Explorer).
For answers involving plots, `p13_test.py` can **only** check that the **DataFrames** are correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. Your plots will be **manually graded**, and you will **lose points** if your plot is not visible, or if it is not properly labelled.
For answers involving plots, `p13_test.py` can **only** check that the **DataFrames** are correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. Your plots will be **manually graded**, and you will **lose points** if your plot is not visible, or if it is not properly labelled.
**IMPORTANT Warning:** Do **not** download the dataset `QSranking.json` **manually**. Use the `download` function from P12 to download it. When we run the autograder, this file `QSranking.json` will **not** be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. Otherwise, you will **lose** points for **hardcoding**.
%% Cell type:markdown id:aad1951a tags:
%% Cell type:markdown id:aad1951a tags:
## Project Description:
## Project Description:
For your final CS220 project, you're going to continue analyzing world university rankings. However, we will be using a different dataset this time. The data for this project has been extracted from [here](https://www.topuniversities.com/university-rankings/world-university-rankings/2023). Unlike the CWUR rankings we used in P12, the QS rankings dataset has various scores for the universities, and not just the rankings. This makes the QS rankings dataset more suitable for plotting (which you will be doing a lot of!).
For your final CS220 project, you're going to continue analyzing world university rankings. However, we will be using a different dataset this time. The data for this project has been extracted from [here](https://www.topuniversities.com/university-rankings/world-university-rankings/2023). Unlike the CWUR rankings we used in P12, the QS rankings dataset has various scores for the universities, and not just the rankings. This makes the QS rankings dataset more suitable for plotting (which you will be doing a lot of!).
In this project, you'll have to dump your DataFrame to a SQLite database. You'll answer questions by doing queries on that database. Often, your answers will be in the form of a plot. Check these carefully, as the tests only verify that a plot has been created, not that it looks correct (we will manually deduct points on Gradescope for plotting mistakes).
%% Cell type:markdown id:48aad11e tags:
## Project Requirements:
You **may not** hardcode indices in your code. You **may not** manually download **any** files for this project, unless you are **explicitly** told to do so. For all other files, you **must** use the `download` function to download the files.
**Store** your final answer for each question in the **variable specified for each question**. This step is important because Otter grades your work by comparing the value of this variable against the correct answer.
For some of the questions, we'll ask you to write (then use) a function to compute the answer. If you compute the answer **without** creating the function we ask you to write, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.
Required Functions:
- `bar_plot`
- `scatter_plot`
- `horizontal_bar_plot`
- `pie_plot`
- `get_regression_coeff`
- `get_regression_line`
- `regression_line_plot`
- `download`
In this project, you will also be required to define certain **data structures**. If you do not create these data structures exactly as specified, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.
Required Data Structures:
- `conn`
You **must** write SQL queries to solve the questions in this project, unless you are **explicitly** told otherwise. You will **not get any credit** if you use `pandas` operations to extract data. We will give you **specific** instructions for any questions where `pandas` operations are allowed. In addition, you are also **required** to follow the requirements below:
* You **must** close the connection to `conn` at the end of your notebook.
* Do **not** use **absolute** paths such as `C://ms//cs220//p13`. You may **only** use **relative paths**.
* Do **not** hardcode `//` or `\` in any of your paths. You **must** use `os.path.join` to create paths.
* Do **not** leave irrelevant output or test code that we didn't ask for.
* **Avoid** calling **slow** functions multiple times within a loop.
* Do **not** define multiple functions with the same name or define multiple versions of one function with different names. Just keep the best version.
For more details on what will cause you to lose points during code review and specific requirements, please take a look at the [Grading rubric](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/rubric.md).
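As a rough illustration of two of the requirements above (building paths with `os.path.join` and closing the connection), here is a minimal sketch. The `data` subdirectory and the `demo` table are hypothetical names used only for this example, and an in-memory database stands in for the real file.

```python
import os
import sqlite3

# Build a relative path from its parts instead of hardcoding a separator.
# 'data' is a hypothetical subdirectory used only for illustration.
db_path = os.path.join("data", "rankings.db")

# An in-memory database stands in for the real file in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (x INTEGER)")
conn.close()  # release the connection once all queries are done

# Any query after close() fails, which is a quick way to confirm it is closed.
try:
    conn.execute("SELECT 1")
    is_closed = False
except sqlite3.ProgrammingError:
    is_closed = True
```

Because the separator is chosen by `os.path.join`, the same code produces a valid relative path on Windows, macOS, and Linux.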
%% Cell type:markdown id:e04f805e tags:
## Questions and Functions:
Let us start by importing all the modules we will need for this project.
%% Cell type:code id:b1363e20 tags:
``` python
# it is considered a good coding practice to place all import statements at the top of the notebook
# please place all your import statements in this cell if you need to import any more modules for this project
```
%% Cell type:markdown id:995a9ea8 tags:
Now, you may copy/paste some of the functions and data structures you defined in Lab 13 and P12, which will be useful for this project.
%% Cell type:code id:a4fab7ea tags:
``` python
# this ensures that font.size setting remains uniform
%matplotlib inline
pd.set_option('display.max_colwidth', None)
matplotlib.rcParams["font.size"] = 13 # don't use value > 13! Otherwise your y-axis tick labels will be different.
```
%% Cell type:code id:e4eac640 tags:
``` python
# copy/paste the definition of the function 'bar_plot' from lab-p13 here
```
%% Cell type:code id:71c71935 tags:
``` python
# copy/paste the definition of the function 'scatter_plot' from lab-p13 here
```
%% Cell type:code id:153b23ad tags:
``` python
# copy/paste the definition of the function 'horizontal_bar_plot' from lab-p13 here
```
%% Cell type:code id:1f6d37df tags:
``` python
# copy/paste the definition of the function 'pie_plot' from lab-p13 here
```
%% Cell type:code id:88255766 tags:
``` python
# copy/paste the definition of the function 'get_regression_coeff' from lab-p13 here
```
%% Cell type:code id:8119a0ec tags:
``` python
# copy/paste the definition of the function 'get_regression_line' from lab-p13 here
```
%% Cell type:code id:13851f7d tags:
``` python
# copy/paste the definition of the function 'regression_line_plot' from lab-p13 here
```
%% Cell type:code id:c12776a3 tags:
``` python
# copy/paste the definition of the function 'download' from p12 here
```
%% Cell type:code id:f4fbd661 tags:
``` python
# use the 'download' function to download the data from the webpage
```
You **must** now create a **database** called `rankings.db` out of `QSranking.json`, connect to it, and save it in a variable called `conn`. You **must** use this connection to the database `rankings.db` to answer the questions that follow.
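The general pattern of writing a DataFrame to SQLite can be sketched on toy data as follows. This is only an illustration under stated assumptions: the rows are made up, `toy` and `toy_conn` are hypothetical names, and an in-memory database is used in place of a file on disk.

```python
import sqlite3
import pandas as pd

# Toy data standing in for the contents of the JSON file (values are made up).
toy = pd.DataFrame({
    "institution_name": ["A University", "B College"],
    "year": [2018, 2019],
    "overall_score": [90.5, 80.0],
})

# The project connects to a file on disk; ":memory:" keeps this sketch self-contained.
toy_conn = sqlite3.connect(":memory:")

# if_exists="replace" overwrites the table when the cell is re-run;
# index=False keeps the DataFrame's row labels out of the table.
toy.to_sql("rankings", toy_conn, if_exists="replace", index=False)

# Read a few rows back to confirm the table exists.
preview = pd.read_sql("SELECT * FROM rankings LIMIT 5", toy_conn)
toy_conn.close()
```

Passing `if_exists="replace"` is one reasonable choice here, since without it `to_sql` raises an error when the notebook is re-run and the table already exists.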
%% Cell type:code id:8de4b158 tags:
``` python
# create a database called 'rankings.db' out of 'QSranking.json'
# TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
# TODO: connect to 'rankings.db' and save it to a variable called 'conn'
# TODO: write the contents of the DataFrame 'qs_ranking' to the sqlite database
```
%% Cell type:code id:9f28e183 tags:
``` python
# run this cell and confirm that you have defined the variables correctly
pd.read_sql("SELECT * FROM rankings LIMIT 5", conn)
```
%% Cell type:markdown id:d31f5dd9 tags:
**Question 1:** List **all** the statistics of the institution with the `institution_name` *University Of Wisconsin-Madison*.
You **must** display **all** the columns. The rows **must** be in *ascending* order of `year`.
Your output **must** be a **DataFrame** that looks like this:
|**0**|55|2018|University Of Wisconsin-Madison|United States|94.0|62.1|84.0|54.2|53.2|30.9|75.8|
|**1**|53|2019|University Of Wisconsin-Madison|United States|88.5|51.2|87.4|52.6|58.8|30.6|73.2|
|**2**|56|2020|University Of Wisconsin-Madison|United States|87.8|49.7|85.5|50.0|57.2|30.9|71.8|
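Filtering rows with `WHERE` and sorting them with `ORDER BY` can be sketched on a toy table like this. The rows, the reduced set of columns, and the `demo` connection are made up for illustration; in the project you would pass a query of this shape to `pd.read_sql` together with `conn`.

```python
import sqlite3

# A tiny stand-in table with made-up rows and only three columns.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE rankings (institution_name TEXT, year INTEGER, overall_score REAL)"
)
demo.executemany(
    "INSERT INTO rankings VALUES (?, ?, ?)",
    [
        ("A University", 2020, 71.0),
        ("B College", 2018, 60.0),
        ("A University", 2018, 75.0),
        ("A University", 2019, 73.0),
    ],
)

# SELECT * keeps every column; WHERE keeps one institution;
# ORDER BY ... ASC sorts the remaining rows from earliest to latest year.
query = """
    SELECT *
    FROM rankings
    WHERE institution_name = 'A University'
    ORDER BY year ASC
"""
rows = demo.execute(query).fetchall()
demo.close()
```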
%% Cell type:code id:8eefb54f tags:
``` python
# compute and store the answer in the variable 'uw_rating', then display it
```
%% Cell type:code id:6a51b275 tags:
``` python
grader.check("q1")
```
%% Cell type:markdown id:587fd6d2 tags:
**Question 2:** What are the **top** *10* institutions in *Japan* which had the **highest** score of `international_students` in the `year` *2020*?
You **must** display the columns `institution_name` and `international_students`. The rows **must** be in *descending* order of `international_students`.
Your output **must** be a **DataFrame** that looks like this:
``` python
# compute and store the answer in the variable 'japan_top_10_inter', then display it
```
%% Cell type:code id:f06aaae0 tags:
``` python
grader.check("q2")
```
%% Cell type:markdown id:341ac4b8 tags:
**Question 3:** What are the **top** *10* institutions in the *United States* which had the **highest** *reputation* in the `year` *2019*?
The `reputation` of an institution is defined as the sum of `academic_reputation` and `employer_reputation`. You **must** display the columns `institution_name` and `reputation`. The rows **must** be in *descending* order of `reputation`. In case the `reputation` is tied, the rows must be in *alphabetical* order of `institution_name`.
Your output **must** be a **DataFrame** that looks like this:
||**institution_name**|**reputation**|
|---------|------|---------|
|**0**|Harvard University|200.0|
|**1**|Massachusetts Institute Of Technology|200.0|
|**2**|Stanford University|200.0|
|**3**|University Of California, Berkeley|199.8|
|**4**|Yale University|199.6|
|**5**|University Of California, Los Angeles|199.1|
|**6**|Columbia University|197.1|
|**7**|Princeton University|196.6|
|**8**|University Of Chicago|190.3|
|**9**|Cornell University|189.2|
**Hint:** You can use mathematical expressions in your **SELECT** clause. For example, if you wish to add the `academic_reputation` and `employer_reputation` for each institution, you could use the following query:
```sql
SELECT (`academic_reputation` + `employer_reputation`) FROM rankings
```
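To name a computed column and break ties in the ordering, an expression can be given an alias with `AS` and `ORDER BY` can list more than one key. The sketch below shows this on a toy table; the rows and the `demo` connection are made up for illustration and are not the project data.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE rankings "
    "(institution_name TEXT, academic_reputation REAL, employer_reputation REAL)"
)
demo.executemany(
    "INSERT INTO rankings VALUES (?, ?, ?)",
    [
        ("B College", 100.0, 100.0),
        ("A University", 100.0, 100.0),
        ("C Institute", 90.0, 95.5),
        ("D School", 50.0, 60.0),
    ],
)

# AS names the computed column; the second ORDER BY key is only consulted
# when two rows have the same 'reputation'; LIMIT keeps the first 3 rows.
query = """
    SELECT institution_name,
           (academic_reputation + employer_reputation) AS reputation
    FROM rankings
    ORDER BY reputation DESC, institution_name ASC
    LIMIT 3
"""
rows = demo.execute(query).fetchall()
demo.close()
```

Here the two institutions tied at 200.0 come out in alphabetical order because of the second sort key.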
%% Cell type:code id:271b86d7 tags:
``` python
# compute and store the answer in the variable 'us_top_10_rep', then display it
```
%% Cell type:code id:96cacdd4 tags:
``` python
grader.check("q3")
```
%% Cell type:markdown id:21ba8c82 tags:
**Question 4:** What are the **top** *10* countries which had the **most** *institutions* listed in the `year` *2020*?
You **must** display the columns `country` and `num_of_institutions`. The `num_of_institutions` of a country is defined as the number of institutions from that country. The rows **must** be in *descending* order of `num_of_institutions`. In case the `num_of_institutions` is tied, the rows must be in *alphabetical* order of `country`.
**Hint:** You **must** use the `COUNT` SQL function to answer this question.
Your output **must** be a **DataFrame** that looks like this:
||**country**|**num_of_institutions**|
|---------|------|---------|
|**0**|United States|74|
|**1**|United Kingdom|45|
|**2**|Germany|23|
|**3**|Australia|21|
|**4**|Canada|14|
|**5**|China|14|
|**6**|France|14|
|**7**|Japan|14|
|**8**|Netherlands|13|
|**9**|Russia|13|
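Counting rows per group is done by combining `COUNT` with `GROUP BY`. The following sketch uses a toy table with made-up country names; `demo` is a hypothetical connection used only here.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE rankings (institution_name TEXT, country TEXT)")
demo.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [
        ("U1", "Xland"), ("U2", "Xland"), ("U3", "Xland"),
        ("U4", "Zland"), ("U5", "Zland"),
        ("U6", "Yland"), ("U7", "Yland"),
        ("U8", "Wland"),
    ],
)

# GROUP BY forms one group per country and COUNT(*) counts the rows in each group.
# Ties in the count are broken alphabetically by country.
query = """
    SELECT country, COUNT(*) AS num_of_institutions
    FROM rankings
    GROUP BY country
    ORDER BY num_of_institutions DESC, country ASC
    LIMIT 3
"""
rows = demo.execute(query).fetchall()
demo.close()
```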
%% Cell type:code id:1991dc45 tags:
``` python
# compute and store the answer in the variable 'top_10_countries', then display it
```
%% Cell type:code id:3e878347 tags:
``` python
grader.check("q4")
```
%% Cell type:markdown id:6ef62b90 tags:
**Question 5:** Create a **bar plot** using the data from Question 4 with the `country` on the **x-axis** and the `num_of_institutions` on the **y-axis**.
In addition to the top ten countries, you **must** also aggregate the data for **all** the **other** countries, and represent that number in the column `Other`. You are **allowed** to do this using any combination of SQL queries and pandas operations.
You **must** first compute a **DataFrame** `num_institutions` containing the **country**, and the **num_of_institutions** data.
**Hint**: You can use the `append` function of a DataFrame to add a single row to the end of your **DataFrame** from Question 4. You'll also need the keyword argument `ignore_index=True`. For example, calling `append` on `my_dataframe` with the new row and `ignore_index=True`, and storing the result in `my_new_dataframe`, will create a *new* **DataFrame** `my_new_dataframe` which contains all the rows from `my_dataframe`, along with the **additional row** which has been appended. You can **ignore** any warnings about `append` being deprecated.
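Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on a newer installation the call fails instead of warning. A minimal sketch of an equivalent using `pd.concat` is shown below, assuming a toy `my_dataframe` with made-up values; `new_row` is a hypothetical name for the dictionary describing the extra row.

```python
import pandas as pd

# Toy stand-in for the DataFrame from Question 4 (values are made up).
my_dataframe = pd.DataFrame(
    {"country": ["Xland", "Yland"], "num_of_institutions": [3, 2]}
)

# The extra row, described as a dictionary mapping column name to value.
new_row = {"country": "Other", "num_of_institutions": 5}

# Wrap the dictionary in a one-row DataFrame, then concatenate.
# ignore_index=True renumbers the rows 0, 1, 2, ... in the result.
my_new_dataframe = pd.concat(
    [my_dataframe, pd.DataFrame([new_row])], ignore_index=True
)
```

As with `append`, the original `my_dataframe` is left unchanged and a new DataFrame is returned.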
%% Cell type:code id:a0b3223c tags:
``` python
# first compute and store the DataFrame 'num_institutions', then display it
# do NOT plot just yet
# TODO: use a SQL query similar to Question 4 to get the number of institutions of all countries
# (not just the top 10), ordered by the number of institutions, and store in a DataFrame
# TODO: Use pandas to find the sum of the institutions in all countries except the top 10
# TODO: create a new dictionary with the data about the new row that needs to be added
# TODO: properly append this new dictionary to 'num_institutions' and update 'num_institutions'
```
%% Cell type:code id:c95611c9 tags:
``` python
grader.check("q5")
```
%% Cell type:markdown id:51a82c7e tags:
Now, **plot** `num_institutions` as a **bar plot** with the **x-axis** labelled *country* and the **y-axis** labelled *num_of_institutions*.
You **must** use the `bar_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:4cd92732 tags:
``` python
# create the bar plot using the DataFrame 'num_institutions' with the x-axis labelled "country"
# and the y-axis labelled "num_of_institutions"
```
%% Cell type:markdown id:6617e42c tags:
**Question 6:** Create a **bar plot** of the **top** *10* countries with the **highest** *total* `overall_score` listed in the `year` *2019*.
The `total_score` of a `country` is defined as the **sum** of `overall_score` of **all** institutions in that `country`. You **must** display the columns `country` and `total_score`. The rows **must** be in *descending* order of `total_score`.
You **must** first compute a **DataFrame** `top_10_total_score` containing the **country**, and the **total_score** data.
Your **DataFrame** should look like this:
||**country**|**total_score**|
|---------|------|---------|
|**0**|United States|4298.4|
|**1**|United Kingdom|2539.2|
|**2**|Germany|1098.2|
|**3**|Australia|1093.8|
|**4**|Japan|752.9|
|**5**|China|743.4|
|**6**|Canada|705.3|
|**7**|Netherlands|674.9|
|**8**|South Korea|612.8|
|**9**|France|595.2|
%% Cell type:code id:f7cf3887 tags:
``` python
# compute and store the answer in the variable 'top_10_total_score', then display it
# do NOT plot just yet
```
%% Cell type:code id:64d40c82 tags:
``` python
grader.check("q6")
```
%% Cell type:markdown id:2e7b11bc tags:
Now, **plot** `top_10_total_score` as a **bar plot** with the **x-axis** labelled *country* and the **y-axis** labelled *total_score*.
You **must** use the `bar_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:2192b4e4 tags:
``` python
# create the bar plot using the DataFrame 'top_10_total_score' with the x-axis labelled "country"
# and the y-axis labelled "total_score"
```
%% Cell type:markdown id:88cbb812 tags:
**Question 7:** What are the **top** *10* institutions in the *United States* which had the **highest** *international_score* in the `year` *2020*?
The *international_score* of an institution is defined as the **sum** of `international_faculty` and `international_students` scores of that institution. You **must** display the columns `institution_name` and `international_score`. The rows **must** be in *descending* order of `international_score`.
Your output **must** be a **DataFrame** that looks like this:
||**institution_name**|**international_score**|
|---------|------|---------|
|**0**|Massachusetts Institute Of Technology|194.1|
|**1**|California Institute Of Technology|186.7|
|**2**|Carnegie Mellon University|183.5|
|**3**|Rice University|180.4|
|**4**|Northeastern University|179.1|
|**5**|Stanford University|167.5|
|**6**|Cornell University|166.1|
|**7**|Purdue University|158.2|
|**8**|University Of Rochester|157.9|
|**9**|University Of Chicago|151.2|
%% Cell type:code id:af3589cd tags:
``` python
# compute and store the answer in the variable 'top_10_inter_score', then display it
```
%% Cell type:code id:41ee5bff tags:
``` python
grader.check("q7")
```
%% Cell type:markdown id:4794b1a5 tags:
**Question 8:** Create a **scatter plot** representing the `citations_per_faculty` (on the **x-axis**) against the `overall_score` (on the **y-axis**) of each institution in the `year` *2018*.
You **must** first compute a **DataFrame** `citations_overall` containing the **citations_per_faculty**, and the **overall_score** data from the `year` *2018*, of each **institution**.
%% Cell type:code id:92a32a11 tags:
``` python
# first compute and store the DataFrame 'citations_overall', then display its head
# do NOT plot just yet
```
%% Cell type:code id:c9a2b1ba tags:
``` python
grader.check("q8")
```
%% Cell type:markdown id:68165402 tags:
Now, **plot** `citations_overall` as a **scatter plot** with the **x-axis** labelled *citations_per_faculty* and the **y-axis** labelled *overall_score*.
You **must** use the `scatter_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:0e0b8a7d tags:
``` python
# create the scatter plot using the DataFrame 'citations_overall' with the x-axis labelled "citations_per_faculty"
# and the y-axis labelled "overall_score"
```
%% Cell type:markdown id:8ba5ed8c tags:
%% Cell type:markdown id:8ba5ed8c tags:
**Question 9:** Create a **scatter plot** representing the `academic_reputation` (on the **x-axis**) against the `employer_reputation` (on the **y-axis**) of each institution from the *United States* in the `year` *2019*.
**Question 9:** Create a **scatter plot** representing the `academic_reputation` (on the **x-axis**) against the `employer_reputation` (on the **y-axis**) of each institution from the *United States* in the `year` *2019*.
You **must** first compute a **DataFrame** `reputations_usa` containing the **academic_reputation**, and the **employer_reputation** data from the `year` *2019*, of each **institution** in the `country` *United States*.
%% Cell type:code id:b04f767f tags:
``` python
# first compute and store the DataFrame 'reputations_usa', then display its head
# do NOT plot just yet
```
%% Cell type:code id:05490b0c tags:
``` python
grader.check("q9")
```
%% Cell type:markdown id:5f8fcce5 tags:
Now, **plot** `reputations_usa` as a **scatter plot** with the **x-axis** labelled *academic_reputation* and the **y-axis** labelled *employer_reputation*.
You **must** use the `scatter_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:29894cd8 tags:
``` python
# create the scatter plot using the DataFrame 'reputations_usa' with the x-axis labelled "academic_reputation"
# and the y-axis labelled "employer_reputation"
```
%% Cell type:markdown id:2e739c41 tags:
**Question 10:** Create a **scatter plot** representing the `international_students` (on the **x-axis**) against the `faculty_student_score` (on the **y-axis**) for the **top ranked** institution of **each** `country` in the `year` *2020*.
You **must** first compute a **DataFrame** `top_ranked_inter_faculty` containing the **international_students**, and the **faculty_student_score** data from the `year` *2020*, of the **top** ranked **institution** (i.e., the institution with the **least** `rank`) of each **country**.
**Hint:** You can use the `MIN` SQL function to return the least value of a selected column. However, there are a few things to keep in mind while using this function.
* The function must be in **uppercase** (i.e., you must use `MIN`, and **not** `min`).
* The column you are finding the minimum of must be inside backticks (``` ` ```). For example, if you want to find the minimum `rank`, you need to say ```MIN(`rank`)```.
If you do not follow the syntax above, your code will likely fail.
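As a rough illustration of the syntax described in the hint, the sketch below runs `MIN` on a small made-up table. The table `runners` and its columns are hypothetical and are **not** part of `rankings.db`. It relies on documented SQLite behaviour: in a query that uses `MIN`, the other plain columns in the `SELECT` are taken from the row that holds the minimum.

```python
import sqlite3

# hypothetical toy data, NOT the lab's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runners (`team` TEXT, `name` TEXT, `rank` INTEGER)")
conn.executemany(
    "INSERT INTO runners VALUES (?, ?, ?)",
    [("red", "Ana", 3), ("red", "Bo", 1), ("blue", "Cy", 2), ("blue", "Di", 5)],
)

# one row per team: the runner with the least `rank` on that team
query = """
SELECT `team`, `name`, MIN(`rank`) AS `best_rank`
FROM runners
GROUP BY `team`
ORDER BY `best_rank`
"""
best_per_team = conn.execute(query).fetchall()
conn.close()
print(best_per_team)
```

The same query string could be handed to whichever helper you use to build a **DataFrame** from SQL; plain `sqlite3` is used here only to keep the sketch self-contained.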
%% Cell type:code id:fa9e1b6f tags:
``` python
# first compute and store the DataFrame 'top_ranked_inter_faculty', then display its head
# do NOT plot just yet
```
%% Cell type:code id:a4831be1 tags:
``` python
grader.check("q10")
```
%% Cell type:markdown id:59b40839 tags:
Now, **plot** `top_ranked_inter_faculty` as a **scatter plot** with the **x-axis** labelled *international_students* and the **y-axis** labelled *faculty_student_score*.
You **must** use the `scatter_plot` function to create the plot.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:2f17934b tags:
``` python
# create the scatter plot using the DataFrame 'top_ranked_inter_faculty' with the x-axis labelled "international_students"
# and the y-axis labelled "faculty_student_score"
```
%% Cell type:markdown id:9dab472c tags:
### Correlations:
You can use the `.corr()` method on a **DataFrame** that has **two** columns to get the *correlation* between those two columns.
For example, if we have a **DataFrame** `df` with the two columns `citations_per_faculty` and `overall_score`, `df.corr()` would return
||**citations_per_faculty**|**overall_score**|
|---------|------|---------|
|citations_per_faculty|1.000000|0.574472|
|overall_score|0.574472|1.000000|
You can use `.loc` here to **extract** the *correlation* between the two columns (`0.574472` in this case).
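A minimal sketch of that extraction, using made-up numbers rather than the rankings data (the column names `hours_studied` and `exam_score` are placeholders chosen for illustration):

```python
import pandas as pd

# made-up numbers, only to show the shape of the result
df = pd.DataFrame({
    "hours_studied": [1.0, 2.0, 3.0, 4.0],
    "exam_score": [52.0, 61.0, 70.0, 79.0],
})

corr_table = df.corr()  # 2 x 2 DataFrame indexed by the column names

# .loc takes the row label first, then the column label
hours_score_corr = corr_table.loc["hours_studied", "exam_score"]
print(hours_score_corr)
```

Because the correlation table is symmetric, swapping the two labels in `.loc` gives the same value.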
%% Cell type:markdown id:f09ade4a tags:
**Question 11:** Find the **correlation** between `international_students` and `overall_score` for institutions from the `country` *United Kingdom* that were ranked in the **top** *100* in the `year` *2020*.
Your output **must** be a **float** representing the absolute correlations. The **only** `pandas` operations you are **allowed** to use are: `.corr`, `.loc` and `.iloc`. You **must** use SQL to gather all other data.
%% Cell type:code id:706db815 tags:
``` python
# compute and store the answer in the variable 'uk_inter_score_corr', then display it
```
%% Cell type:code id:ea738710 tags:
``` python
grader.check("q11")
```
%% Cell type:markdown id:314d22d6 tags:
Let us now define a new score called `citations_per_international` as follows:
**Question 12:** Find the **correlation** between `citations_per_international` and `overall_score` for **all** institutions in the `year` *2019*.
Your output **must** be a **float** representing the absolute correlations. The **only** `pandas` operations you are **allowed** to use are: `.corr`, `.loc` and `.iloc`. You **must** use SQL to gather all other data.
%% Cell type:code id:777d001c tags:
``` python
# compute and store the answer in the variable 'cit_per_inter_score_corr', then display it
```
%% Cell type:code id:ee14b0ac tags:
``` python
grader.check("q12")
```
%% Cell type:markdown id:cc72c981 tags:
**Question 13:** What are the **top** *15* countries with the **highest** *total* of `citations_per_international` in the `year` *2019*?
The *total* `citations_per_international` of a `country` is defined as the **sum** of `citations_per_international` scores of **all** institutions in that `country`. You **must** display the columns `country` and `sum_inter_citations`. The rows **must** be in *descending* order of `sum_inter_citations`.
Your output **must** be a **DataFrame** that looks like this:
||**country**|**sum_inter_citations**|
|----|-----------|-----------------------|
|**0**|United States|2623.8207|
|**1**|United Kingdom|2347.1602|
|**2**|Australia|1255.5530|
|**3**|Netherlands|748.4268|
|**4**|Canada|724.5029|
|**5**|Switzerland|561.8790|
|**6**|China|482.2577|
|**7**|Germany|455.5466|
|**8**|Hong Kong|375.3032|
|**9**|New Zealand|327.3357|
|**10**|Sweden|305.3745|
|**11**|Belgium|255.0750|
|**12**|France|198.0860|
|**13**|Denmark|186.4904|
|**14**|Singapore|160.3000|
%% Cell type:code id:14aaad72 tags:
``` python
# compute and store the answer in the variable 'top_cit_per_inter', then display it
```
%% Cell type:code id:b44e985d tags:
``` python
grader.check("q13")
```
%% Cell type:markdown id:59a993ce tags:
**Question 14:** Among the institutions ranked within the **top** *300*, find the **average** `citations_per_international` for **each** `country` in the `year` *2019*.
You **must** display the columns `country` and `avg_inter_citations` representing the **average** of `citations_per_international` for **each** country. The rows **must** be in *descending* order of `avg_inter_citations`.
**Hint:** To find the **average**, you can use `SUM()` and `COUNT()` or you can simply use `AVG()`.
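To make the two options in the hint concrete, here is a small sketch on a made-up table (`sales`, `region` and `amount` are hypothetical names, not columns of `rankings.db`). It computes the per-group average both ways so the results can be compared.

```python
import sqlite3

# hypothetical toy data, NOT the lab's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (`region` TEXT, `amount` REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 20.0), ("south", 5.0), ("south", None)],
)

# COUNT(`amount`) skips missing values, just like AVG(`amount`) does;
# COUNT(*) would also count the row whose amount is missing
query = """
SELECT `region`,
       SUM(`amount`) / COUNT(`amount`) AS `avg_by_hand`,
       AVG(`amount`) AS `avg_builtin`
FROM sales
GROUP BY `region`
ORDER BY `avg_builtin` DESC
"""
averages = conn.execute(query).fetchall()
conn.close()
print(averages)
```

The two columns agree here because both ignore the missing `amount`; `AVG()` is usually the simpler choice.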
Your output **must** be a **DataFrame** whose **first ten rows** look like this:
||**country**|**avg_inter_citations**|
|----|-----------|----------------------|
|**0**|Singapore|80.150000|
|**1**|Switzerland|75.497000|
|**2**|Hong Kong|62.550533|
|**3**|Australia|61.362388|
|**4**|Netherlands|56.166733|
|**5**|New Zealand|53.226220|
|**6**|United Kingdom|52.889084|
|**7**|Canada|50.779723|
|**8**|Denmark|46.196200|
|**9**|Norway|46.083300|
%% Cell type:code id:dac3e940 tags:
``` python
# compute and store the answer in the variable 'avg_cit_per_inter', then display it
```
%% Cell type:code id:946bb83c tags:
``` python
grader.check("q14")
```
%% Cell type:markdown id:bfded4bf tags:
**Question 15:** Find the **institution** with the **highest** value of `citations_per_international` for **each** `country` in the `year` *2020*.
Your output **must** be a **DataFrame** with the columns `country`, `institution_name`, and a new column `max_inter_citations` representing the **maximum** value of `citations_per_international` for that country. The rows **must** be in *descending* order of `max_inter_citations`. You **must** **omit** rows where `max_inter_citations` is **missing** by using the clause:
```sql
HAVING `max_inter_citations` IS NOT NULL
```
**Hint:** You can use the `MAX()` function to return the largest value within a group.
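The sketch below shows how `MAX()`, `GROUP BY` and a `HAVING ... IS NOT NULL` clause fit together on a made-up table (`readings`, `city`, `station` and `temp` are hypothetical names). It assumes SQLite, which lets `HAVING` refer to a column alias and fills the other plain columns from the row that holds the maximum.

```python
import sqlite3

# hypothetical toy data, NOT the lab's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (`city` TEXT, `station` TEXT, `temp` REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("Oslo", "A", 12.5), ("Oslo", "B", 14.0),
        ("Lima", "C", 21.0), ("Lima", "D", None),
        ("Baku", "E", None),  # every value is missing, so MAX is NULL
    ],
)

query = """
SELECT `city`, `station`, MAX(`temp`) AS `max_temp`
FROM readings
GROUP BY `city`
HAVING `max_temp` IS NOT NULL
ORDER BY `max_temp` DESC
"""
warmest = conn.execute(query).fetchall()
conn.close()
print(warmest)
```

The group whose values are all missing ("Baku") is dropped by the `HAVING` clause, while a group with only some missing values ("Lima") is kept.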
Your output **must** be a **DataFrame** whose **first ten rows** look like this:
# compute and store the answer in the variable 'max_cit_per_inter', then display it
```
%% Cell type:code id:9c4db997 tags:
``` python
grader.check("q15")
```
%% Cell type:markdown id:da9cb13f tags:
**Question 16:** Among the institutions ranked within the **top** *50*, create a **horizontal bar plot** representing the **average** of both the `citations_per_faculty` and `international_faculty` scores for **all** institutions in **each** `country` in the `year` *2018*.
You **must** first create a **DataFrame** `country_citations_inter` with **three** columns: `country`, `avg_citations` and `avg_inter_faculty` representing the name, the average value of `citations_per_faculty` and the average value of `international_faculty` for each country respectively.
You **must** ensure that the countries in the **DataFrame** are **ordered** in **increasing** order of the **difference** between `avg_citations` and `avg_inter_faculty`.
%% Cell type:code id:e9e566a5 tags:
``` python
# first compute and store the DataFrame 'country_citations_inter', then display it
# do NOT plot just yet
```
%% Cell type:code id:60d1c6f7 tags:
``` python
grader.check("q16")
```
%% Cell type:markdown id:3e859552 tags:
Now, **plot** `country_citations_inter` as a **horizontal bar plot** with the **x-axis** labelled *country*.
Then, you **must** use the `horizontal_bar_plot` function to plot this data. Verify that the countries are **ordered** in **decreasing** order of the **difference** between `avg_citations` and `avg_inter_faculty`. Verify that the **legend** appears on your plot.
**Hint:** If you want the countries in the plot to be ordered in **decreasing** order of the difference, you will need to make sure that in the DataFrame, they are ordered in the **increasing** order.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:259af611 tags:
``` python
# create the horizontal bar plot using the DataFrame 'country_citations_inter' with the x-axis labelled "country"
```
%% Cell type:markdown id:1a5d4543 tags:
**Question 17:** Create a **scatter plot** representing the `overall_score` (on the **x-axis**) against the `rank` (on the **y-axis**) for **all** institutions in the `year` *2020*. Additionally, **plot** a **regression line** within the same plot.
You **must** first compute a **DataFrame** containing the **overall_score**, and the **rank** data from the `year` *2020*. You **must** use the `get_regression_line` function to compute the best fit line.
%% Cell type:code id:d51299b8 tags:
``` python
# first compute and store the DataFrame 'overall_rank', then display its head
# do NOT plot just yet
```
%% Cell type:code id:a422be6a tags:
``` python
grader.check("q17")
```
%% Cell type:markdown id:4c062dae tags:
Now, **plot** `overall_rank` as a **scatter plot** with a **regression line**, with the **x-axis** labelled *overall_score* and the **y-axis** labelled *rank*.
You **must** use the `regression_line_plot` function to plot this data.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:6c914693 tags:
``` python
# create the scatter plot and the regression line using the DataFrame 'overall_rank' with the x-axis labelled "overall_score"
# and the y-axis labelled "rank"
```
%% Cell type:markdown id:effa2591 tags:
**Food for thought:** Does our linear regression model fit the points well? It looks like the relationship between the `overall_score` and `rank` is **not quite linear**. In fact, a cursory look at the data suggests that the relationship is inverse.
%% Cell type:code id:9f1de243 tags:
``` python
# Food for thought is an entirely OPTIONAL exercise
# you may leave your thoughts here as a comment if you wish to
```
%% Cell type:markdown id:26e4e3c1 tags:
**Question 18:** Create a **scatter plot** representing the **inverse** of the `overall_score` (on the **x-axis**) against the `rank` (on the **y-axis**) for **all** institutions in the `year` *2020*. Additionally, **plot** a **regression line** within the same plot.
The `inverse_overall_score` for each institution is simply defined as `1/overall_score` for that institution. You **must** first compute a **DataFrame** containing the **inverse_overall_score**, and the **rank** data from the `year` *2020*. You **must** use the `get_regression_line` function to compute the best fit line.
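A computed column such as an inverse can be written directly in the `SELECT` clause and named with `AS`. The sketch below uses a made-up table (`items`, `weight` and `quantity` are hypothetical names) and also shows one SQLite pitfall worth knowing about: dividing two integers truncates the result. Whether this matters for the lab depends on how `overall_score` is stored, which is not shown here.

```python
import sqlite3

# hypothetical toy data, NOT the lab's database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (`name` TEXT, `weight` REAL, `quantity` INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("bolt", 4.0, 4), ("nut", 2.0, 2)],
)

query = """
SELECT `name`,
       1 / `weight` AS `inverse_weight`,       -- REAL column: ordinary division
       1 / `quantity` AS `truncated`,          -- INTEGER column: integer division
       1.0 / `quantity` AS `inverse_quantity`  -- 1.0 forces floating-point division
FROM items
ORDER BY `name`
"""
inverses = conn.execute(query).fetchall()
conn.close()
print(inverses)
```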
%% Cell type:code id:6c535d83 tags:
``` python
# first compute and store the DataFrame 'inverse_overall_rank', then display its head
# do NOT plot just yet
```
%% Cell type:code id:22a6a736 tags:
``` python
grader.check("q18")
```
%% Cell type:markdown id:e64a0040 tags:
Now, **plot** `inverse_overall_rank` as a **scatter plot** with a **regression line**, with the **x-axis** labelled *inverse_overall_score* and the **y-axis** labelled *rank*.
You **must** use the `regression_line_plot` function to plot this data.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:dd8efd5b tags:
``` python
# create the scatter plot and the regression line using the DataFrame 'inverse_overall_rank'
# with the x-axis labelled "inverse_overall_score" and the y-axis labelled "rank"
```
%% Cell type:markdown id:9f9f2089 tags:
This seems to be much better! Let us now use this **regression line** to **estimate** the `rank` of an institution given its `overall_score`.
%% Cell type:markdown id:0849a83f tags:
**Question 19:** Use the regression line to **estimate** the `rank` of an institution with an `overall_score` of *72*.
Your output **must** be an **int**. If your **estimate** is a **float**, *round it up* using `math.ceil`.
**Hints:**
1. Call the `get_regression_coeff` function to get the coefficients `m` and `b`.
2. Recall that the equation of a line is `y = m * x + b`. What are `x` and `y` here?
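The hints can be sketched on made-up data as follows. The helper `fit_line` is a hypothetical stand-in written for this illustration; it is **not** the lab's `get_regression_coeff`, whose exact arguments are not shown here. The numbers are invented so that the second list depends linearly on the inverse of the first.

```python
import math

def fit_line(xs, ys):
    """least-squares slope m and intercept b (stand-in, not the lab's helper)"""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - m * mean_x
    return m, b

# made-up data in which the position depends on 1/score
scores = [100, 50, 25, 20]
positions = [10, 25, 55, 70]
inverse_scores = [1 / s for s in scores]

m, b = fit_line(inverse_scores, positions)

# the line was fitted on 1/score, so the same transformation is applied before predicting
new_score = 40
estimate = math.ceil(m * (1 / new_score) + b)
print(estimate)
```

Whatever transformation was applied to the x-values before fitting has to be applied to the new input as well; otherwise `m` and `b` are being used on a different scale from the one they were computed on.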
%% Cell type:code id:7f3fa177 tags:
``` python
# compute and store the answer in the variable 'rank_score_72', then display it
```
%% Cell type:code id:c1559986 tags:
``` python
grader.check("q19")
```
%% Cell type:markdown id:547f4135 tags:
**Food for thought:** Can you find out the `overall_score` of the university with this `rank` in the `year` *2020*? Does it match your prediction?
%% Cell type:code id:60915e12 tags:
``` python
# Food for thought is an entirely OPTIONAL exercise
# you may leave your thoughts here as a comment if you wish to
```
%% Cell type:markdown id:53ab4005 tags:
**Question 20:** Using the data from Question 5, create a **pie plot** representing the number of institutions from each country.
You **have** already computed a **DataFrame** `num_institutions` (in Question 5) containing the **country**, and the **num_of_institutions** data. Run the following cell just to confirm that the variable has not changed its values since you defined it in Question 5.
%% Cell type:code id:2a86a546 tags:
``` python
grader.check("q20")
```
%% Cell type:markdown id:d95601d7 tags:
Now, **plot** `num_institutions` as a **pie plot** with the **title** *Number of institutions*.
You **must** use the `pie_plot` function to create the **pie plot**. The **colors** do **not** matter, but the plot **must** be titled `Number of institutions`, and **must** be labelled as in the sample output below.
**Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether it is correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
<center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
%% Cell type:code id:0fdcbe48 tags:
``` python
# create the pie plot using the DataFrame 'num_institutions' titled "Number of institutions"
```
%% Cell type:markdown id:6bce0354 tags:
**Food for thought:** It seems that we'll run out of colors! How can we make it so that **no two neighbors share a color**? You'll probably have to look online.
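One possible approach, sketched here under stated assumptions, is to cycle through a small palette and patch the one place where the cycle can collide: the last wedge, which sits next to the first because the pie is circular. `assign_wedge_colors` is a hypothetical helper that is not part of the lab, and it assumes the palette holds at least three distinct colors.

```python
def assign_wedge_colors(num_wedges, palette):
    """Return one color per wedge so that adjacent wedges never match.

    Assumes 'palette' contains at least three distinct colors.
    """
    if len(palette) < 3:
        raise ValueError("need at least 3 distinct colors")
    # cycle through the palette: consecutive wedges get different colors
    colors = [palette[i % len(palette)] for i in range(num_wedges)]
    # the last wedge also touches the first one; if they collide, use the
    # second palette color, which differs from both of its neighbors
    if num_wedges > 1 and colors[-1] == colors[0]:
        colors[-1] = palette[1]
    return colors

example_colors = assign_wedge_colors(7, ["red", "green", "blue"])
print(example_colors)
```

Matplotlib's `pie` accepts a `colors` list, so a list like this could be passed along if the plotting call you use forwards it. Whether `pie_plot` does so depends on how it is defined earlier in the notebook.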
%% Cell type:code id:7bd4d538 tags:
``` python
# Food for thought is an entirely OPTIONAL exercise
# you may leave your thoughts here as a comment if you wish to
```
%% Cell type:markdown id:936abcda tags:
### Closing the database connection:
Now, before you **submit** your notebook, you **must** **close** your connection `conn`. Not doing this might make **Gradescope fail**. Additionally, **delete** the example images provided with plot questions to save space, if your notebook file is too large for submission. You can **delete** any cell by selecting the cell, hitting the `Esc` key once, and then hitting the `d` key **twice**.
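As a minimal sketch of what closing does, assuming `conn` was created with Python's `sqlite3` module: the example below uses a throwaway in-memory database and a hypothetical variable `demo_conn` so that it does not touch `rankings.db` or your own `conn`.

```python
import sqlite3

# an in-memory database stands in for 'rankings.db' in this sketch
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (x INTEGER)")
demo_conn.close()

# once closed, any further use of the connection raises an error
try:
    demo_conn.execute("SELECT * FROM demo")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
print(still_open)
```

Because a closed connection cannot be queried again, close it only after every query in the notebook has run.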
%% Cell type:code id:9515f232 tags:
``` python
# close your connection here
```
%% Cell type:markdown id:27a5f70c tags:
## Submission
Make sure you have run all cells in your notebook in order before running the following cells, so that all images/graphs appear in the output. The following cells will generate a zip file for you to submit.
**SUBMISSION INSTRUCTIONS**:
1. **Upload** the zip file to Gradescope.
2. Check the **Gradescope otter** results as soon as the autograder finishes running. Don't worry about the score showing up as -/100.0. You only need to check that the test cases passed.