p13

Commit f690dca5 authored 1 year ago by Anna Meyer
--- a/sum23/labs/lab13/practice.ipynb
+++ b/sum23/labs/lab13/practice.ipynb
@@ -83,10 +83,9 @@
   "outputs": [],
   "source": [
    "# use the 'download' function to download the data from the webpage\n",
-    "# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/QSranking.json'\n",
+    "# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json'\n",
    "# to the file 'QSranking.json'\n",
-    "\n",
+    "\n"
-    "download(\"https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/QSranking.json\", \"QSranking.json\")"
   ]
  },
  {
@@ -96,7 +95,7 @@
   "source": [
    "### Task 1.3: Create a database called 'rankings.db' out of 'QSranking.json'\n",
    "\n",
-    "You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Michael_lecture_notes/32_Database-1) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Gurmail_lecture_notes/32_Database-1)."
+    "You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database1_notes.ipynb)."
   ]
  },
  {
@@ -454,7 +453,7 @@
    "\n",
    "Before starting this segment, it is recommended that you go through the relevant lecture code:\n",
    "\n",
-    "* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb)  (Bar and scatter plots) and [here]() (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)"
+    "* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb)  (Bar and scatter plots) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/sum23/lecture_materials/23_Plotting2) (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)"
   ]
  },
  {

 %% Cell type:markdown id:115889c5 tags:
 # Lab 13: Analyzing World Data with SQL
 In this lab, you will practice how to:
 * write SQL queries,
 * create your own plots.
 %% Cell type:markdown id:daed65a3 tags:
 # Segment 1: Setup
 ### Task 1.1: Import the required modules
 We will first import some important modules.
 %% Cell type:code id:e59b7bdb tags:
 ``` python
 # it is considered a good coding practice to place all import statements at the top of the notebook
 # please place all your import statements in this cell if you need to import any more modules for this project
 import sqlite3
 import pandas as pd
 import matplotlib
 import math
 import numpy as np # this is *only* for the function get_regression_coeff - do NOT use this module elsewhere
 ```
 %% Cell type:code id:97a3f1e8 tags:
 ``` python
 # this ensures that font.size setting remains uniform
 %matplotlib inline
 pd.set_option('display.max_colwidth', None)
 matplotlib.rcParams["font.size"] = 13 # don't use value > 13! Otherwise your y-axis tick labels will be different.
 ```
 %% Cell type:markdown id:75adca21 tags:
 ### Task 1.2: Use the `download` function to download `QSranking.json`
 Warning: For the lab and the project, do **not** download the dataset `QSranking.json` manually (you **must** write Python code to download this, as in P12). When we run the autograder, this file `QSranking.json` will not be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. The Gradescope autograder will **deduct points** otherwise.
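The file-existence check described above can be sketched as follows. This is a minimal illustration, not the P12 reference solution: it assumes the standard-library `urllib.request` for the HTTP request, and the returned status strings are hypothetical. Note also that the URL in this lab uses `/-/raw/` rather than `/-/blob/`, presumably because the `blob` address serves GitLab's HTML viewer page instead of the JSON file itself.

```python
import os
import urllib.request


def download(url, filename):
    """Save the contents of 'url' to 'filename', unless the file already exists."""
    # skip the network request entirely if a previous run already saved the file
    if os.path.exists(filename):
        return filename + " already exists"  # hypothetical status message
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # write the raw bytes exactly as received
    with open(filename, "wb") as f:
        f.write(data)
    return filename + " created"  # hypothetical status message
```

Because the existence check comes first, running the notebook a second time does not fetch the file again.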
 %% Cell type:code id:2bb742ed tags:
 ``` python
 # copy the definition of your 'download' function from P12 here - remember to import the necessary modules
 ```
 %% Cell type:code id:fe96e53b tags:
 ``` python
 # use the 'download' function to download the data from the webpage
-# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/QSranking.json'
+# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json'
 # to the file 'QSranking.json'
-download("https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/QSranking.json", "QSranking.json")
 ```
 %% Cell type:markdown id:0023581a tags:
 ### Task 1.3: Create a database called 'rankings.db' out of 'QSranking.json'
-You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Michael_lecture_notes/32_Database-1) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/s23/Gurmail_lecture_notes/32_Database-1).
+You can review the relevant lecture code [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database1_notes.ipynb).
 %% Cell type:code id:270d8da5 tags:
 ``` python
 # create a database called 'rankings.db' out of 'QSranking.json'
 # TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
 # TODO: connect to 'rankings.db' and save it to a variable called 'conn'
 # write the contents of 'qs_ranking' to the table 'rankings' in the database
 # we have done this one for you
 qs_ranking.to_sql("rankings", conn, if_exists="replace", index=False)
 ```
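To see roughly what writing records into a table with `if_exists="replace"` amounts to, here is a sketch that uses only the standard library instead of `pandas`. The two records, the column types, and the name `conn_demo` are made up for illustration; the real lab uses `read_json` and `to_sql` as described above.

```python
import json
import sqlite3

# hypothetical records standing in for the contents of 'QSranking.json'
raw = """[
  {"institution_name": "A University", "country": "Freedonia", "overall_score": 91.5},
  {"institution_name": "B College", "country": "Sylvania", "overall_score": 64.0}
]"""
records = json.loads(raw)

conn_demo = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
# mimic if_exists="replace": drop any old copy of the table, then recreate it
conn_demo.execute("DROP TABLE IF EXISTS rankings")
conn_demo.execute(
    "CREATE TABLE rankings (institution_name TEXT, country TEXT, overall_score REAL)"
)
# named placeholders pull the matching key out of each record
conn_demo.executemany(
    "INSERT INTO rankings VALUES (:institution_name, :country, :overall_score)",
    records,
)
rows = conn_demo.execute("SELECT * FROM rankings").fetchall()
conn_demo.close()
```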
 %% Cell type:markdown id:84a77c79 tags:
 ### Task 1.4: Read all the rows in rankings (the database table)
 You'll have to use pandas's `read_sql` function to make a query.
 %% Cell type:code id:a300adde tags:
 ``` python
 # compute and store the answer in the variable 'rankings', display its head
 # remember to display ONLY the head and NOT the whole DataFrame
 # replace the ... with your code
 rankings = pd.read_sql("SELECT ... FROM ...", conn)
 rankings.head()
 ```
 %% Cell type:code id:3e4d16ee tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert len(rankings) == 1201
 assert rankings.iloc[0]["country"] == "United States"
 assert rankings.iloc[-1]["institution_name"] == "Wake Forest University"
 ```
 %% Cell type:markdown id:7b09ee5a tags:
 # Segment 2: SQL Practice
 In practice, we are often interested in writing more specific queries about our data. For example, we might be interested in finding institutions in the *United States*, or data collected in the `year` *2018*, or both. With **SQL**, **WHERE** and **AND** clauses can help filter the data accordingly.
 Before proceeding with this segment, it is **recommended** that you **review** the relevant lecture code:
 * [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database2_notes.ipynb) (Databases part 2)
 and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/20_Databases/lec_20_database3_notes.ipynb) (Databases part 3)
 %% Cell type:markdown id:9cebe083 tags:
 ### Task 2.1: Use WHERE to find institutions in the United States
 * Write a query to select the rows from the database with *United States* as the `country`.
 * Keep only the `institution_name` column.
 * Save these institution names to a **list**.
 **Hint:** You will need to use **quotes** (`'`) around the **strings** in your query and **backticks** (``` ` ```) around **column names** as in the example below. The **quotes** and **backticks** are only **required** when the string or column name contains special characters or spaces. But even otherwise, it is a good idea to use them to be on the safe side.
 %% Cell type:code id:64012949 tags:
 ``` python
 # we have done this one for you
 us_institutions_df = pd.read_sql("SELECT `institution_name` FROM rankings WHERE `country` = 'United States'", conn)
 us_institutions = list(us_institutions_df['institution_name'])
 ```
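The quoting rule matters most when a name really does contain a space. The sketch below uses a made-up table `schools` with a column called `school name` (both hypothetical) on an in-memory SQLite database, so it runs without the lab data.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
# the column name contains a space, so it has to be wrapped in backticks
conn_demo.execute("CREATE TABLE schools (`school name` TEXT, `country` TEXT)")
conn_demo.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [("North College", "United States"), ("South Institute", "New Zealand")],
)
# single quotes mark the string literal, which also contains a space
query = "SELECT `school name` FROM schools WHERE `country` = 'New Zealand'"
names = [row[0] for row in conn_demo.execute(query)]
conn_demo.close()
```

The same query string could be passed to `pd.read_sql` in place of the cursor loop.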
 %% Cell type:code id:c035f899 tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert "University Of Wisconsin-Madison" in us_institutions
 assert "Tampere University" in list(rankings["institution_name"])
 assert "Tampere University" not in us_institutions
 ```
 %% Cell type:markdown id:9fe4da4e tags:
 ### Task 2.2: Add an AND clause to find institutions in the United States with at least 70 overall score
 * Copy your query from Task 2.1.
 * Update it to only select rows with `overall_score` of **at least** *70*.
 %% Cell type:code id:12f341ad tags:
 ``` python
 # compute and store the answer in the variable 'good_us_institutions', but do NOT display it
 ```
 %% Cell type:code id:25e2d3cc tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert "Massachusetts Institute Of Technology" in good_us_institutions
 assert "University Of Wisconsin-Madison" in good_us_institutions
 assert "Wake Forest University" not in good_us_institutions
 assert "University of Connecticut" not in good_us_institutions
 ```
 %% Cell type:markdown id:cf715227 tags:
 ### Task 2.3: Use an ORDER BY clause to display the top 5 institutions by academic reputation in 2019
 In addition to **WHERE** and **AND**, the **ORDER BY** keyword helps organize data even further. Much like the `sort_values()` function in `pandas`, the **ORDER BY** clause can be used to organize the result of the query in *increasing* (**ASC**) or *decreasing* (**DESC**) order based on a column's values.
 * Write a new query to select rows in rankings where the `year` is *2019*.
 * Use **ORDER BY** and **LIMIT** to select the top 5 rows with the **highest** `academic_reputation`.
 * Save these institution names to a **list**.
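As a small illustration of combining **WHERE**, **ORDER BY**, and **LIMIT**, the sketch below uses a toy table `runs` with invented rows rather than the rankings data. All table and column names here are hypothetical.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE runs (runner TEXT, year INTEGER, distance REAL)")
conn_demo.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [("Ada", 2019, 12.5), ("Bo", 2019, 30.0), ("Cy", 2018, 42.2), ("Di", 2019, 21.1)],
)
# keep only 2019 rows, sort them from longest to shortest, then keep the first two
query = "SELECT runner FROM runs WHERE year = 2019 ORDER BY distance DESC LIMIT 2"
top_2 = [row[0] for row in conn_demo.execute(query)]
conn_demo.close()
```

Cy has the longest distance overall but is excluded by the **WHERE** clause before any sorting takes place.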
 %% Cell type:code id:763304e0 tags:
 ``` python
 # compute and store the answer in the variable 'top_5_institutions', then display it
 ```
 %% Cell type:code id:404fa832 tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert len(top_5_institutions) == 5
 assert top_5_institutions[0] == "Massachusetts Institute Of Technology"
 assert top_5_institutions[-1] == "University Of Cambridge"
 ```
 %% Cell type:markdown id:13e1803b tags:
 ### Task 2.4: Order by multiple columns
 If you print out the resulting dataframe from your query, you might notice that all 5 rows have the same academic reputation. This makes it hard to compare the universities, so we will add some **tiebreaking** rules. If two universities have the same `academic_reputation`, then we should order them by their `citations_per_faculty` (with the **highest** appearing first). You can do this by ordering by multiple columns.
 * Copy your query from Task 2.3.
 * Update the **ORDER BY** clause to add this tiebreaking behavior.
 * Save these institution names to a **list**.
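Ordering by several columns can be sketched on a toy table as well. The table `scores` and its columns are invented for illustration; the second column in the **ORDER BY** list is consulted only when the first one ties.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE scores (name TEXT, grade INTEGER, bonus REAL)")
conn_demo.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Ada", 90, 3.0), ("Bo", 90, 7.5), ("Cy", 85, 9.9), ("Di", 90, 5.0)],
)
# sort by grade first; rows with equal grades are then sorted by bonus
query = "SELECT name FROM scores ORDER BY grade DESC, bonus DESC"
ordered = [row[0] for row in conn_demo.execute(query)]
conn_demo.close()
```

Cy has the largest bonus but still comes last, because the bonus only breaks ties among equal grades.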
 %% Cell type:code id:26f5a433 tags:
 ``` python
 # compute and store the answer in the variable 'top_5_with_tiebreak', then display it
 ```
 %% Cell type:code id:c5b2382b tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert top_5_with_tiebreak[0] == "University Of California, Berkeley"
 assert top_5_with_tiebreak[-1] == "University Of California, Los Angeles"
 ```
 %% Cell type:markdown id:9b991dcf tags:
 ### Task 2.5: Use GROUP BY clause and SUM aggregate function to get the total number of international_students for each country in 2019
 The **GROUP BY** keyword groups rows that have the same value. It is often used with aggregate functions, such as **COUNT**, **SUM**, **AVG**, etc. to obtain a summary about groups in the data.
 For example, to answer the question "What is the average rank of each country's institutions?", we could **GROUP BY** the `country` and use the **AVG** aggregate function to get the average rank of each country.
 * Write a new query that uses **GROUP BY** and **SUM** to get the total number of international students in each country, using **WHERE** to filter by the `year`.
 * Save the resulting **DataFrame** with **two** columns: `country` and the **sum** of the `international_students` for that country.
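A minimal sketch of **GROUP BY** with **SUM**, again on invented data: the table `sales` and its columns are hypothetical and unrelated to the rankings dataset.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn_demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", 2019, 10.0), ("East", 2019, 5.5), ("West", 2019, 7.0), ("West", 2018, 100.0)],
)
# WHERE drops the 2018 row first; the remaining rows are summed per region
query = "SELECT region, SUM(amount) FROM sales WHERE year = 2019 GROUP BY region"
totals = dict(conn_demo.execute(query).fetchall())
conn_demo.close()
```

Each row of the result has two values, the group key and its sum, which is why `dict` can turn the result into a lookup table.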
 %% Cell type:code id:f31786c4 tags:
 ``` python
 # compute and store the answer in the variable 'inter_students_by_country', then display its head
 ```
 %% Cell type:code id:9c84f12c tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert math.isclose(inter_students_by_country[inter_students_by_country["country"] == "Japan"].iloc[0, 1], 280.9)
 assert math.isclose(inter_students_by_country[inter_students_by_country["country"] == "Australia"].iloc[0, 1], 1895.5)
 assert math.isclose(inter_students_by_country[inter_students_by_country["country"] == "United States"].iloc[0, 1], 3675.0)
 ```
 %% Cell type:markdown id:06ecba29 tags:
 ### Task 2.6: Use the AS keyword to rename the new column from Task 2.5 to total_international_students
 Although the dataframe does have a column for the sum of international students for each country, the name of the column looks strange:
 ```sql
 SUM(`international_students`)
 ```
 In SQL, the **AS** keyword allows us to create a simpler alias for the columns we create with our queries to make the resulting **DataFrame** easier to understand.
 * Paste your query from Task 2.5 and modify it so the **SUM** column has the name `total_international_students`.
 * Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
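The effect of **AS** can be checked by looking at the column names a query reports. The sketch below uses a hypothetical `sales` table and reads the names from the cursor's `description` attribute; with `pd.read_sql`, the same names would become the DataFrame's column labels.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn_demo.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 10.0), ("East", 5.5), ("West", 7.0)],
)

# without an alias, the aggregate column gets a name derived from the expression
cur = conn_demo.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
default_names = [col[0] for col in cur.description]

# with AS, the aggregate column takes the alias as its name
cur = conn_demo.execute(
    "SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region"
)
aliased_names = [col[0] for col in cur.description]
conn_demo.close()
```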
 %% Cell type:code id:3947be0d tags:
 ``` python
 # compute and store the answer in the variable 'inter_students_by_country_renamed', then display its head
 ```
 %% Cell type:code id:9e114959 tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert "total_international_students" in inter_students_by_country_renamed.columns
 assert math.isclose(inter_students_by_country_renamed[inter_students_by_country_renamed["country"] == "Japan"]["total_international_students"].iloc[0], 280.9)
 assert math.isclose(inter_students_by_country_renamed[inter_students_by_country_renamed["country"] == "Australia"]["total_international_students"].iloc[0], 1895.5)
 assert math.isclose(inter_students_by_country_renamed[inter_students_by_country_renamed["country"] == "United States"]["total_international_students"].iloc[0], 3675.0)
 ```
 %% Cell type:markdown id:79fdda0c tags:
 ### Task 2.7: Use the HAVING keyword to only keep countries with more than 1000 international students
 In addition to **WHERE**, the **HAVING** keyword is useful for filtering **GROUP BY** queries. Whereas **WHERE** filters individual rows before they are grouped, **HAVING** filters the groups after the aggregate has been computed.
 * Paste your query from Task 2.6 and modify it so that it only returns countries (`country`) and `total_international_students` with **more than** *1000* international students.
 * Save the resulting **DataFrame** with **two** columns: `country` and `total_international_students`.
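The difference between the two filters shows up clearly when both appear in one query. In this sketch (hypothetical `sales` table, invented rows), **WHERE** removes a single row before grouping, and **HAVING** then removes a whole group based on its total.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn_demo.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", 2019, 10.0),
        ("East", 2019, 5.5),
        ("West", 2019, 7.0),
        ("West", 2018, 100.0),  # removed by WHERE, so it never reaches West's total
        ("North", 2019, 30.0),
    ],
)
query = """
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    WHERE year = 2019
    GROUP BY region
    HAVING total_amount > 10
    ORDER BY region
"""
big_regions = conn_demo.execute(query).fetchall()
conn_demo.close()
```

West is dropped even though one of its rows has the largest amount in the table: that row is filtered out by **WHERE**, leaving a 2019 total of 7.0, which fails the **HAVING** condition.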
 %% Cell type:code id:8bc00cf4 tags:
 ``` python
 # compute and store the answer in the variable 'inter_students_by_country_more_than_1000', then display it
 ```
 %% Cell type:code id:a1c5be56 tags:
 ``` python
 # run this cell to confirm that your variable has been defined properly
 assert len(inter_students_by_country_more_than_1000) == 4
 assert "Australia" in list(inter_students_by_country_more_than_1000["country"])
 assert "Germany" in list(inter_students_by_country_more_than_1000["country"])
 assert "United Kingdom" in list(inter_students_by_country_more_than_1000["country"])
 assert "United States" in list(inter_students_by_country_more_than_1000["country"])
 ```
 %% Cell type:markdown id:d83309db tags:
 # Segment 3: Plotting
 SQL provides powerful tools to manipulate and organize data. Now we might be interested in plotting the data to engage in data exploration and visualize our results.
 Before starting this segment, it is recommended that you go through the relevant lecture code:
-* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb)  (Bar and scatter plots) and [here]() (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)
+* [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/lecture_materials/22_Plotting1/lec_22_plotting1_bar_plots_notes.ipynb)  (Bar and scatter plots) and [here](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/tree/main/sum23/lecture_materials/23_Plotting2) (Line plots - this is what we will talk about in the Wednesday 8/9 lecture)
 %% Cell type:markdown id:d27b7c2c tags:
 ### Task 3.1: Use a bar plot to plot the data from Task 2.7
 Your plot should look like this:
 <div><img src="attachment:bar_plot.png" width="400"/></div>
 Make sure that the plot is labelled exactly as in the image here.
 %% Cell type:code id:5e4dc5d2 tags:
 ``` python
 # instead of specifically plotting just the DataFrame 'inter_students_by_country_more_than_1000',
 # create a general function to create bar plots
 def bar_plot(df, x, y):
    """bar_plot(df, x, y) takes in a DataFrame 'df' and displays
    a bar plot with the column 'x' as the x-axis, and the column
    'y' as the y-axis"""
    pass # replace with your code
    # TODO: set dataframe index to 'x'
    # TODO: use df.plot.bar to plot the data in black with no legend
    # TODO: set x as the x label
    # TODO: set y as the y label
 ```
 %% Cell type:code id:e21ed94a tags:
 ``` python
 # run this cell to plot the data from Task 2.7
 # verify that this plot matches exactly with the image shown above
 bar_plot(inter_students_by_country_more_than_1000, 'country', 'total_international_students')
 ```
 %% Cell type:markdown id:0adf3bdd tags:
 ### Task 3.2: Use a scatter plot to plot the relationship between employer_reputation and academic_reputation in 2019
 Your plot should look like this:
 <div><img src="attachment:scatter_plot.png" width="500"/></div>
 Make sure that the plot is labelled exactly as in the image here.
 %% Cell type:code id:8eb6036d tags:
 ``` python
 # create a general function to create scatter plots
 def scatter_plot(df, x, y):
    """scatter_plot(df, x, y) takes in a DataFrame 'df' and displays
    a scatter plot with the column 'x' as the x-axis, and the column
    'y' as the y-axis"""
    pass # replace with your code
    # TODO: use df.plot.scatter to plot the data in black with no legend
    # TODO: set x as the x label
    # TODO: set y as the y label
 ```
 %% Cell type:markdown id:d77b0f09 tags:
 With the `scatter_plot` function defined, you are ready to create the required plot.
 * Write a SQL query to select rows from the database where the `year` is *2019*.
 * Save the resulting **DataFrame** with **two** columns: `employer_reputation` and `academic_reputation`.
 * Call `scatter_plot`, passing in `employer_reputation` and `academic_reputation` as the `x` and `y` arguments.
 %% Cell type:code id:2ef617ff tags:
 ``` python
 # first compute and store the DataFrame
 # then create the scatter plot using the DataFrame
 # verify that this plot matches exactly with the image shown above
 ```
 %% Cell type:markdown id:d144417b tags:
 ### Task 3.3: Make a Horizontal Bar plot of average employer_reputation and average faculty_student_score across all years
 Your plot should look like this:
 <div><img src="attachment:horizontal_bar_plot.png" width="600"/></div>
 Make sure that the plot is labelled exactly as in the image here.
 %% Cell type:code id:78e21b0b tags:
 ``` python
 # we have done this one for you
 def horizontal_bar_plot(df, x):
    """horizontal_bar_plot(df, x) takes in a DataFrame 'df' and displays
    a horizontal bar plot with the column 'x' as the x-axis, and all
    other columns of 'df' on the y-axis"""
    df = df.set_index(x)
    ax = df.plot.barh()
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.9))
 ```
 %% Cell type:markdown id:7cbdaa9f tags:
 Use the `horizontal_bar_plot` function to create the required plot.
 * Write a SQL query to select `year`, **average** `employer_reputation`, and **average** `faculty_student_score` grouped by `year`.
 * Save the resulting **DataFrame** with **three** columns: `year`, the **average** of the `employer_reputation` and the **average** of the `faculty_student_score`.
 * Call `horizontal_bar_plot`, passing in `year` as the `x` argument.
 %% Cell type:code id:bc779e0b tags:
 ``` python
 # first compute and store the DataFrame
 # then create the horizontal bar plot using the DataFrame
 # verify that this plot matches exactly with the image shown above
 ```
 %% Cell type:markdown id:aaeeebe7 tags:
 ### Task 3.4: Display a Pie Chart of the average overall score of the top 10 countries in descending order
 Your plot should look like this:
 <div><img src="attachment:pie_plot.png" width="400"/></div>
 Make sure that the plot is labelled exactly as in the image here.
 %% Cell type:code id:aedb58d2 tags:
 ``` python
 # we have done this one for you
 def pie_plot(df, x, y, title=None):
    """pie_plot(df, x, y, title) takes in a DataFrame 'df' and displays
    a pie plot with the column 'x' as the slice labels, the (numeric) column
    'y' as the slice sizes, and the 'title' as the title of the plot"""
    df = df.set_index(x)
    ax = df.plot.pie(y=y, legend=False)
    ax.set_ylabel(None)
    ax.set_title(title)
 ```
 %% Cell type:markdown id:805c89c1 tags:
 Use the `pie_plot` function to create the required plot.
 * Write a SQL query to select the **top** *10* countries based on **average** `overall_score`.
 * Save the resulting **DataFrame** with **two** columns: `country`, and the **average** of the `overall_score`.
 * Call `pie_plot`, passing in `country` as the `x` argument, and the **average** of the `overall_score` as the `y` argument.
 * Your plot must also have the **title** `Countries with top 10 overall scores` as in the image.
 **Hint:** If you are having trouble writing the SQL query, take a look at Task 2.3.
 %% Cell type:code id:777d3b49 tags:
 ``` python
 # first compute and store the DataFrame
 # then create the pie plot using the DataFrame
 # verify that this plot matches exactly with the image shown above
 ```
 %% Cell type:markdown id:de3777de tags:
 ### Task 3.5: Fit a regression line to the data from Task 3.2
 Your line of best fit should look like this:
 <div><img src="attachment:regression_line_plot.png" width="500"/></div>
 Make sure that the plot is labelled exactly as in the image here.
 %% Cell type:code id:68941bde tags:
 ``` python
 # we have defined this function for you
 def get_regression_coeff(df, x, y):
    """get_regression_coeff(df, x, y) takes in a DataFrame 'df' and returns
    the slope (m) and the y-intercept (b) of the line of best fit in the
    plot with the column 'x' as the x-axis, and the column 'y' as the y-axis"""
    df["1"] = 1
    res = np.linalg.lstsq(df[[x, "1"]], df[y], rcond=None)
    coefficients = res[0]
    m = coefficients[0]
    b = coefficients[1]
    return (m, b)
 ```
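For intuition about what the slope and intercept mean, the sketch below computes a line of best fit with the textbook closed-form formulas in plain Python. This is a swapped-in technique, not the `np.linalg.lstsq` call used above; for one `x` column plus an intercept, both approaches should yield the same line up to floating-point rounding. The function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # the best-fit line always passes through the point (mean_x, mean_y)
    b = mean_y - m * mean_x
    return (m, b)


xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # these points lie exactly on y = 2x + 1
m, b = fit_line(xs, ys)
fit = [m * x + b for x in xs]  # fitted value for each x, like a 'fit' column
```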
 %% Cell type:code id:fb427287 tags:
 ``` python
 # you must define this function to compute the best fit line
 def get_regression_line(df, x, y):
    """get_regression_line(df, x, y) takes in a DataFrame 'df' and returns
    a DataFrame with an additional column "fit" of the line of best fit in the
    plot with the column 'x' as the x-axis, and the column 'y' as the y-axis"""
    pass # replace with your code
    # TODO: use the 'get_regression_coeff' function to get the slope and
    #       intercept of the line of best fit
    # TODO: save them into variables m and b respectively
    # TODO: create a new column in the dataframe called 'fit', which is
    #       calculated as df['fit'] = m * df[x] + b
    # TODO: return the DataFrame df
 ```
 %% Cell type:code id:0a70404d tags:
 ``` python
 # you must define this function to plot the best fit line on the scatter plot
 def regression_line_plot(df, x, y):
    """regression_line_plot(df, x, y) takes in a DataFrame 'df' and displays
    a scatter plot with the column 'x' as the x-axis, and the column
    'y' as the y-axis, as well as the best fit line for the plot"""
    pass # replace with your code
    # TODO: use 'get_regression_line' to get the data for the best fit line.
    # TODO: use df.plot.scatter (not scatter_plot) to plot the x and y columns
    #       of 'df' in black color.
    # TODO: save the return value of df.plot.scatter to a variable called 'ax'
    # TODO: use df.plot.line to plot the fitted line in red,
    #       using ax=ax as a keyword argument.
    #       this ensures that both the scatter plot and line end up on the same plot
    #       pay careful attention to what the 'x' and 'y' arguments ought to be
 ```
 %% Cell type:markdown id:ef4b46de tags:
 Now, use the `regression_line_plot` function to create the required plot.
 * Call `regression_line_plot` on your data from Task 3.2 to show the correlation between `employer_reputation` and `academic_reputation`.
 %% Cell type:code id:065d0ef5 tags:
 ``` python
 # create the scatter plot with the best fit line using the DataFrame from Task 3.2
 # verify that this plot matches exactly with the image shown above
 ```
 %% Cell type:markdown id:bdb5cdb7 tags:
 ### Task 4: Closing the connection
 Now that you are done with your database, it is very important to close it.
 %% Cell type:code id:65557b40 tags:
 ``` python
 # close your connection here
 # we have done this one for you
 conn.close()
 ```
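One optional way to make sure a connection is closed even if a query raises an error is the standard-library `contextlib.closing` helper. This is only a sketch of an alternative pattern on a throwaway in-memory database; the lab and project still expect an explicit `conn.close()` as shown above.

```python
import sqlite3
from contextlib import closing

# closing(...) calls conn_demo.close() when the 'with' block ends
with closing(sqlite3.connect(":memory:")) as conn_demo:
    value = conn_demo.execute("SELECT 1 + 1").fetchone()[0]

# any further use of the closed connection raises sqlite3.ProgrammingError
try:
    conn_demo.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False
```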
 %% Cell type:markdown id:0f20a99c tags:
 ### Congratulations, you are now ready to start P13!

--- a/sum23/projects/p13/p13.ipynb
+++ b/sum23/projects/p13/p13.ipynb
@@ -324,7 +324,7 @@
   "outputs": [],
   "source": [
    "# use the 'download' function to download the data from the webpage\n",
-    "# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/QSranking.json'\n",
+    "# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json'\n",
    "# to the file 'QSranking.json'\n"
   ]
  },

 %% Cell type:code id:5e33d91e tags:
 ``` python
 import otter
 # nb_name should be the name of your notebook without the .ipynb extension
 nb_name = "p13"
 py_filename = nb_name + ".py"
 grader = otter.Notebook(nb_name + ".ipynb")
 ```
 %% Cell type:code id:0611fe14 tags:
 ``` python
 import p13_test
 ```
 %% Cell type:code id:2bcd01a8 tags:
 ``` python
 # PLEASE FILL IN THE DETAILS
 # enter none if you don't have a project partner
 # you will have to add your partner as a group member on Gradescope even after you fill this
 # project: p13
 # submitter: NETID1
 # partner: NETID2
 ```
 %% Cell type:markdown id:372ed345 tags:
 # Project 13: World University Rankings
 %% Cell type:markdown id:b30c2df0 tags:
 ## Learning Objectives:
 In this project, you will demonstrate how to:
 * query a database using SQL,
 * process data using `pandas` **DataFrames**,
 * create different types of plots.
 Please go through [Lab 13](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/labs/lab13) before working on this project. The lab introduces some useful techniques related to this project.
 %% Cell type:markdown id:479785c7 tags:
 ## Note on Academic Misconduct:
 **IMPORTANT**: P12 and P13 are two parts of the same data analysis. You **cannot** switch project partners between these two projects. That is, if you partnered with someone for P12, you must keep that partnership until the end of P13. Now may be a good time to review [our course policies](https://canvas.wisc.edu/courses/355767/pages/syllabus?module_item_id=6048035).
 %% Cell type:markdown id:3e0e04f5 tags:
 ## Testing your code:
 Along with this notebook, you must have downloaded the file `p13_test.py`. If you are curious about how we test your code, you can explore this file, and specifically the value of the variable `expected_json`, to understand the expected answers to the questions.
 For answers involving DataFrames, `p13_test.py` compares your tables to those in `p13_expected.html`, so take a moment to open that file in a web browser (from Finder/Explorer).
 For answers involving plots, `p13_test.py` can **only** check that the **DataFrames** are correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. Your plots will be **manually graded**, and you will **lose points** if your plot is not visible, or if it is not properly labelled.
 **IMPORTANT Warning:** Do **not** download the dataset `QSranking.json` **manually**. Use the `download` function from P12 to download it. When we run the autograder, this file `QSranking.json` will **not** be in the directory. So, unless your `p13.ipynb` downloads this file, you will get a **zero score** on the project. Also, make sure your `download` function includes code to check if the file already exists. Otherwise, you will **lose** points for **hardcoding**.
 %% Cell type:markdown id:aad1951a tags:
 ## Project Description:
 For your final CS220 project, you're going to continue analyzing world university rankings. However, we will be using a different dataset this time. The data for this project has been extracted from [here](https://www.topuniversities.com/university-rankings/world-university-rankings/2023). Unlike the CWUR rankings we used in P12, the QS rankings dataset has various scores for the universities, and not just the rankings. This makes the QS rankings dataset more suitable for plotting (which you will be doing a lot of!).
 In this project, you'll have to dump your DataFrame to a SQLite database. You'll answer questions by doing queries on that database. Often, your answers will be in the form of a plot. Check these carefully, as the tests only verify that a plot has been created, not that it looks correct (we will manually deduct points on Gradescope for plotting mistakes).
 %% Cell type:markdown id:48aad11e tags:
 ## Project Requirements:
 You **may not** hardcode indices in your code. You **may not** manually download **any** files for this project, unless you are **explicitly** told to do so. For all other files, you **must** use the `download` function to download the files.
 **Store** your final answer for each question in the **variable specified for each question**. This step is important because Otter grades your work by comparing the value of this variable against the correct answer.
 For some of the questions, we'll ask you to write (then use) a function to compute the answer. If you compute the answer **without** creating the function we ask you to write, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.
 Required Functions:
 - `bar_plot`
 - `scatter_plot`
 - `horizontal_bar_plot`
 - `pie_plot`
 - `get_regression_coeff`
 - `get_regression_line`
 - `regression_line_plot`
 - `download`
 In this project, you will also be required to define certain **data structures**. If you do not create these data structures exactly as specified, we'll **manually deduct** points from your autograder score on Gradescope, even if the way you did it produced the correct answer.
 Required Data Structures:
 - `conn`
 You **must** write SQL queries to solve the questions in this project, unless you are **explicitly** told otherwise. You will **not get any credit** if you use `pandas` operations to extract data. We will give you **specific** instructions for any questions where `pandas` operations are allowed. In addition, you are also **required** to follow the requirements below:
 * You **must** close the connection to `conn` at the end of your notebook.
 * Do **not** use **absolute** paths such as `C://ms//cs220//p13`. You may **only** use **relative paths**.
 * Do **not** hardcode `//` or `\` in any of your paths. You **must** use `os.path.join` to create paths.
 * Do **not** leave irrelevant output or test code that we didn't ask for.
 * **Avoid** calling **slow** functions multiple times within a loop.
 * Do **not** define multiple functions with the same name or define multiple versions of one function with different names. Just keep the best version.
 For more details on what will cause you to lose points during code review and specific requirements, please take a look at the [Grading rubric](https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/rubric.md).
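
As a minimal sketch of the path requirements above, the snippet below builds a relative path with `os.path.join`. The folder name `data` and the file name `example.json` are hypothetical and are not part of this project.

```python
import os

# Hypothetical folder and file names, used only for illustration.
data_dir = "data"
file_name = "example.json"

# os.path.join inserts the separator that is correct for the current
# operating system, so no slash is typed by hand.
relative_path = os.path.join(data_dir, file_name)
print(relative_path)

# A relative path does not start from a drive or from the filesystem root.
print(os.path.isabs(relative_path))
```

Because the separator is chosen at run time, the same notebook should work unchanged on Windows, macOS, and Linux.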
 %% Cell type:markdown id:e04f805e tags:
 ## Questions and Functions:
 Let us start by importing all the modules we will need for this project.
 %% Cell type:code id:b1363e20 tags:
 ``` python
 # it is considered a good coding practice to place all import statements at the top of the notebook
 # please place all your import statements in this cell if you need to import any more modules for this project
 ```
 %% Cell type:markdown id:995a9ea8 tags:
 Now, you may copy/paste some of the functions and data structures you defined in Lab 13 and P12, which will be useful for this project.
 %% Cell type:code id:a4fab7ea tags:
 ``` python
 # this ensures that font.size setting remains uniform
 %matplotlib inline
 pd.set_option('display.max_colwidth', None)
 matplotlib.rcParams["font.size"] = 13 # don't use value > 13! Otherwise your y-axis tick labels will be different.
 ```
 %% Cell type:code id:e4eac640 tags:
 ``` python
 # copy/paste the definition of the function 'bar_plot' from lab-p13 here
 ```
 %% Cell type:code id:71c71935 tags:
 ``` python
 # copy/paste the definition of the function 'scatter_plot' from lab-p13 here
 ```
 %% Cell type:code id:153b23ad tags:
 ``` python
 # copy/paste the definition of the function 'horizontal_bar_plot' from lab-p13 here
 ```
 %% Cell type:code id:1f6d37df tags:
 ``` python
 # copy/paste the definition of the function 'pie_plot' from lab-p13 here
 ```
 %% Cell type:code id:88255766 tags:
 ``` python
 # copy/paste the definition of the function 'get_regression_coeff' from lab-p13 here
 ```
 %% Cell type:code id:8119a0ec tags:
 ``` python
 # copy/paste the definition of the function 'get_regression_line' from lab-p13 here
 ```
 %% Cell type:code id:13851f7d tags:
 ``` python
 # copy/paste the definition of the function 'regression_line_plot' from lab-p13 here
 ```
 %% Cell type:code id:c12776a3 tags:
 ``` python
 # copy/paste the definition of the function 'download' from p12 here
 ```
 %% Cell type:code id:f4fbd661 tags:
 ``` python
 # use the 'download' function to download the data from the webpage
-# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/blob/main/sum23/projects/p13/QSranking.json'
+# 'https://git.doit.wisc.edu/cdis/cs/courses/cs220/cs220-lecture-material/-/raw/main/sum23/projects/p13/QSranking.json'
 # to the file 'QSranking.json'
 ```
 %% Cell type:markdown id:40f76941 tags:
 ### Data Structure 1: `conn`
 You **must** now create a **database** called `rankings.db` out of `QSranking.json`, connect to it, and save the connection in a variable called `conn`. You **must** use this connection to the database `rankings.db` to answer the questions that follow.
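
If you want to recall how a DataFrame ends up in a SQLite database, here is a small sketch on toy data. The names `toy_df`, `toy_conn` and the table `toy` are made up for this illustration, and the database lives in memory rather than in a file. Whether you need `index=False` for this project is an assumption you should check against the expected output.

```python
import sqlite3
import pandas as pd

# Toy data standing in for a real file; the column names are made up.
toy_df = pd.DataFrame({"name": ["A", "B", "C"], "score": [90.5, 72.0, 88.1]})

# ":memory:" creates a temporary database in RAM instead of a file on disk.
toy_conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table called "toy"; index=False keeps the
# pandas index from being stored as an extra column.
toy_df.to_sql("toy", toy_conn, if_exists="replace", index=False)

# Read the table back with a query to confirm the round trip.
result = pd.read_sql("SELECT * FROM toy ORDER BY score DESC", toy_conn)
print(result)

toy_conn.close()
```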
 %% Cell type:code id:8de4b158 tags:
 ``` python
 # create a database called 'rankings.db' out of 'QSranking.json'
 # TODO: load the data from 'QSranking.json' into a variable called 'qs_ranking' using pandas' 'read_json' function
 # TODO: connect to 'rankings.db' and save it to a variable called 'conn'
 # TODO: write the contents of the DataFrame 'qs_ranking' to the sqlite database
 ```
 %% Cell type:code id:9f28e183 tags:
 ``` python
 # run this cell and confirm that you have defined the variables correctly
 pd.read_sql("SELECT * FROM rankings LIMIT 5", conn)
 ```
 %% Cell type:markdown id:d31f5dd9 tags:
 **Question 1:** List **all** the statistics of the institution with the `institution_name` *University Of Wisconsin-Madison*.
 You **must** display **all** the columns. The rows **must** be in *ascending* order of `year`.
 Your output **must** be a **DataFrame** that looks like this:
 ||**rank**|**year**|**institution_name**|**country**|**academic_reputation**|**employer_reputation**|**faculty_student_score**|**citations_per_faculty**|**international_faculty**|**international_students**|**overall_score**|
 |---|---|---|---|---|---|---|---|---|---|---|---|
 |**0**|55|2018|University Of Wisconsin-Madison|United States|94.0|62.1|84.0|54.2|53.2|30.9|75.8|
 |**1**|53|2019|University Of Wisconsin-Madison|United States|88.5|51.2|87.4|52.6|58.8|30.6|73.2|
 |**2**|56|2020|University Of Wisconsin-Madison|United States|87.8|49.7|85.5|50.0|57.2|30.9|71.8|
 %% Cell type:code id:8eefb54f tags:
 ``` python
 # compute and store the answer in the variable 'uw_rating', then display it
 ```
 %% Cell type:code id:6a51b275 tags:
 ``` python
 grader.check("q1")
 ```
 %% Cell type:markdown id:587fd6d2 tags:
 **Question 2:** What are the **top** *10* institutions in *Japan* which had the **highest** score of `international_students` in the `year` *2020*?
 You **must** display the columns `institution_name` and `international_students`. The rows **must** be in *descending* order of `international_students`.
 Your output **must** be a **DataFrame** that looks like this:
 ||**institution_name**|**international_students**|
 |---------|------|---------|
 |**0**|Waseda University|35.8|
 |**1**|Tokyo Institute Of Technology|31.3|
 |**2**|University Of Tsukuba|30.4|
 |**3**|The University of Tokyo|26.2|
 |**4**|Kyushu University|21.5|
 |**5**|Nagoya University|21.3|
 |**6**|Tohoku University|17.6|
 |**7**|Kyoto University|17.5|
 |**8**|Hiroshima University|17.1|
 |**9**|Tokyo Medical and Dental University|16.7|
 %% Cell type:code id:b72f2999 tags:
 ``` python
 # compute and store the answer in the variable 'japan_top_10_inter', then display it
 ```
 %% Cell type:code id:f06aaae0 tags:
 ``` python
 grader.check("q2")
 ```
 %% Cell type:markdown id:341ac4b8 tags:
 **Question 3:** What are the **top** *10* institutions in the *United States* which had the **highest** *reputation* in the `year` *2019*?
 The `reputation` of an institution is defined as the sum of `academic_reputation` and `employer_reputation`. You **must** display the columns `institution_name` and `reputation`. The rows **must** be in *descending* order of `reputation`. In case the `reputation` is tied, the rows must be in *alphabetical* order of `institution_name`.
 Your output **must** be a **DataFrame** that looks like this:
 ||**institution_name**|**reputation**|
 |---------|------|---------|
 |**0**|Harvard University|200.0|
 |**1**|Massachusetts Institute Of Technology|200.0|
 |**2**|Stanford University|200.0|
 |**3**|University Of California, Berkeley|199.8|
 |**4**|Yale University|199.6|
 |**5**|University Of California, Los Angeles|199.1|
 |**6**|Columbia University|197.1|
 |**7**|Princeton University|196.6|
 |**8**|University Of Chicago|190.3|
 |**9**|Cornell University|189.2|
 **Hint:** You can use mathematical expressions in your **SELECT** clause. For example, if you wish to add the `academic_reputation` and `employer_reputation` for each institution, you could use the following query:
 ```sql
 SELECT (`academic_reputation` + `employer_reputation`) FROM rankings
 ```
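
The hint above can be extended with an alias and a tie-breaking sort. The sketch below uses a made-up table `demo` with hypothetical columns `item`, `a` and `b`; it is not the query for this question.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "item": ["pear", "apple", "fig"],
    "a": [3.0, 4.0, 1.0],
    "b": [2.0, 1.0, 1.5],
}).to_sql("demo", conn_demo, index=False)

# AS names the computed column; the second ORDER BY key is only used
# to decide the order of rows whose `total` is tied.
query = """
SELECT `item`, (`a` + `b`) AS `total`
FROM demo
ORDER BY `total` DESC, `item` ASC
LIMIT 2
"""
totals = pd.read_sql(query, conn_demo)
print(totals)

conn_demo.close()
```

Here `pear` and `apple` both have a total of 5.0, so the alphabetical tie-break places `apple` first.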
 %% Cell type:code id:271b86d7 tags:
 ``` python
 # compute and store the answer in the variable 'us_top_10_rep', then display it
 ```
 %% Cell type:code id:96cacdd4 tags:
 ``` python
 grader.check("q3")
 ```
 %% Cell type:markdown id:21ba8c82 tags:
 **Question 4:** What are the **top** *10* countries which had the **most** *institutions* listed in the `year` *2020*?
 You **must** display the columns `country` and `num_of_institutions`. The `num_of_institutions` of a country is defined as the number of institutions from that country. The rows **must** be in *descending* order of `num_of_institutions`. In case the `num_of_institutions` is tied, the rows must be in *alphabetical* order of `country`.
 **Hint:** You **must** use the `COUNT` SQL function to answer this question.
 Your output **must** be a **DataFrame** that looks like this:
 ||**country**|**num_of_institutions**|
 |---------|------|---------|
 |**0**|United States|74|
 |**1**|United Kingdom|45|
 |**2**|Germany|23|
 |**3**|Australia|21|
 |**4**|Canada|14|
 |**5**|China|14|
 |**6**|France|14|
 |**7**|Japan|14|
 |**8**|Netherlands|13|
 |**9**|Russia|13|
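
As a reminder of how `COUNT` works together with `GROUP BY`, here is a sketch on a made-up table `shops` with hypothetical columns `city` and `shop`.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Rome", "Lima", "Oslo"],
    "shop": ["s1", "s2", "s3", "s4", "s5", "s6"],
}).to_sql("shops", conn_demo, index=False)

# GROUP BY makes one output row per city; COUNT(*) counts the rows
# that fell into each group.
query = """
SELECT `city`, COUNT(*) AS `num_shops`
FROM shops
GROUP BY `city`
ORDER BY `num_shops` DESC, `city` ASC
"""
counts = pd.read_sql(query, conn_demo)
print(counts)

conn_demo.close()
```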
 %% Cell type:code id:1991dc45 tags:
 ``` python
 # compute and store the answer in the variable 'top_10_countries', then display it
 ```
 %% Cell type:code id:3e878347 tags:
 ``` python
 grader.check("q4")
 ```
 %% Cell type:markdown id:6ef62b90 tags:
 **Question 5:** Create a **bar plot** using the data from Question 4 with the `country` on the **x-axis** and the `num_of_institutions` on the **y-axis**.
 In addition to the top ten countries, you **must** also aggregate the data for **all** the **other** countries, and represent that number in the column `Other`. You are **allowed** to do this using any combination of SQL queries and pandas operations.
 You **must** first compute a **DataFrame** `num_institutions` containing the **country**, and the **num_of_institutions** data.
 **Hint**: You can use the `append` function of a DataFrame to add a single row to the end of your **DataFrame** from Question 4. You'll also need the keyword argument `ignore_index=True`. For example:
 ```python
 my_new_dataframe = my_dataframe.append({"country": "CS220", "num_of_institutions": 22}, ignore_index=True)
 ```
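 will create a *new* **DataFrame** `my_new_dataframe` which contains all the rows from `my_dataframe`, along with the **additional row** which has been appended. You can **ignore** any warnings about `append` being deprecated. Note, however, that `append` was removed entirely in pandas 2.0, so this hint only applies to older versions of pandas.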
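
If your installed pandas is version 2.0 or later, `DataFrame.append` no longer exists. One possible alternative, sketched here on a toy DataFrame, is `pd.concat`. The two starting rows are made up; only the appended row mirrors the example above.

```python
import pandas as pd

# Toy starting DataFrame with made-up rows.
my_dataframe = pd.DataFrame({"country": ["X", "Y"], "num_of_institutions": [5, 3]})
new_row = {"country": "CS220", "num_of_institutions": 22}

# Wrap the dictionary in a one-row DataFrame, then stack the two
# DataFrames; ignore_index=True renumbers the rows 0, 1, 2, ...
my_new_dataframe = pd.concat(
    [my_dataframe, pd.DataFrame([new_row])], ignore_index=True
)
print(my_new_dataframe)
```

As with `append`, the original `my_dataframe` is left unchanged and a new DataFrame is returned.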
 %% Cell type:code id:a0b3223c tags:
 ``` python
 # first compute and store the DataFrame 'num_institutions', then display it
 # do NOT plot just yet
 # TODO: use a SQL query similar to Question 4 to get the number of institutions of all countries
 #       (not just the top 10), ordered by the number of institutions, and store in a DataFrame
 # TODO: Use pandas to find the sum of the institutions in all countries except the top 10
 # TODO: create a new dictionary with the data about the new row that needs to be added
 # TODO: properly append this new dictionary to 'num_institutions' and update 'num_institutions'
 ```
 %% Cell type:code id:c95611c9 tags:
 ``` python
 grader.check("q5")
 ```
 %% Cell type:markdown id:51a82c7e tags:
 Now, **plot** `num_institutions` as a **bar plot** with the **x-axis** labelled *country* and the **y-axis** labelled *num_of_institutions*.
 You **must** use the `bar_plot` function to create the plot.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:b7e7e295 tags:
 <div><img src="attachment:q5.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:4cd92732 tags:
 ``` python
 # create the bar plot using the DataFrame 'num_institutions' with the x-axis labelled "country"
 # and the y-axis labelled "num_of_institutions"
 ```
 %% Cell type:markdown id:6617e42c tags:
 **Question 6:** Create a **bar plot** of the **top** *10* countries with the **highest** *total* `overall_score` listed in the `year` *2019*.
 The `total_score` of a `country` is defined as the **sum** of `overall_score` of **all** institutions in that `country`. You **must** display the columns `country` and `total_score`. The rows **must** be in *descending* order of `total_score`.
 You **must** first compute a **DataFrame** `top_10_total_score` containing the **country**, and the **total_score** data.
 Your **DataFrame** should look like this:
 ||**country**|**total_score**|
 |---------|------|---------|
 |**0**|United States|4298.4|
 |**1**|United Kingdom|2539.2|
 |**2**|Germany|1098.2|
 |**3**|Australia|1093.8|
 |**4**|Japan|752.9|
 |**5**|China|743.4|
 |**6**|Canada|705.3|
 |**7**|Netherlands|674.9|
 |**8**|South Korea|612.8|
 |**9**|France|595.2|
 %% Cell type:code id:f7cf3887 tags:
 ``` python
 # compute and store the answer in the variable 'top_10_total_score', then display it
 # do NOT plot just yet
 ```
 %% Cell type:code id:64d40c82 tags:
 ``` python
 grader.check("q6")
 ```
 %% Cell type:markdown id:2e7b11bc tags:
 Now, **plot** `top_10_total_score` as a **bar plot** with the **x-axis** labelled *country* and the **y-axis** labelled *total_score*.
 You **must** use the `bar_plot` function to create the plot.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:033d0733 tags:
 <div><img src="attachment:q6.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:2192b4e4 tags:
 ``` python
 # create the bar plot using the DataFrame 'top_10_total_score' with the x-axis labelled "country"
 # and the y-axis labelled "total_score"
 ```
 %% Cell type:markdown id:88cbb812 tags:
 **Question 7:** What are the **top** *10* institutions in the *United States* which had the **highest** *international_score* in the `year` *2020*?
 The *international_score* of an institution is defined as the **sum** of `international_faculty` and `international_students` scores of that institution. You **must** display the columns `institution_name` and `international_score`. The rows **must** be in *descending* order of `international_score`.
 Your output **must** be a **DataFrame** that looks like this:
 ||**institution_name**|**international_score**|
 |---------|------|---------|
 |**0**|Massachusetts Institute Of Technology|194.1|
 |**1**|California Institute Of Technology|186.7|
 |**2**|Carnegie Mellon University|183.5|
 |**3**|Rice University|180.4|
 |**4**|Northeastern University|179.1|
 |**5**|Stanford University|167.5|
 |**6**|Cornell University|166.1|
 |**7**|Purdue University|158.2|
 |**8**|University Of Rochester|157.9|
 |**9**|University Of Chicago|151.2|
 %% Cell type:code id:af3589cd tags:
 ``` python
 # compute and store the answer in the variable 'top_10_inter_score', then display it
 ```
 %% Cell type:code id:41ee5bff tags:
 ``` python
 grader.check("q7")
 ```
 %% Cell type:markdown id:4794b1a5 tags:
 **Question 8:** Create a **scatter plot** representing the `citations_per_faculty` (on the **x-axis**) against the `overall_score` (on the **y-axis**) of each institution in the `year` *2018*.
 You **must** first compute a **DataFrame** `citations_overall` containing the **citations_per_faculty**, and the **overall_score** data from the `year` *2018*, of each **institution**.
 %% Cell type:code id:92a32a11 tags:
 ``` python
 # first compute and store the DataFrame 'citations_overall', then display its head
 # do NOT plot just yet
 ```
 %% Cell type:code id:c9a2b1ba tags:
 ``` python
 grader.check("q8")
 ```
 %% Cell type:markdown id:68165402 tags:
 Now, **plot** `citations_overall` as a **scatter plot** with the **x-axis** labelled *citations_per_faculty* and the **y-axis** labelled *overall_score*.
 You **must** use the `scatter_plot` function to create the plot.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:667b4025 tags:
 <div><img src="attachment:q8.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:0e0b8a7d tags:
 ``` python
 # create the scatter plot using the DataFrame 'citations_overall' with the x-axis labelled "citations_per_faculty"
 # and the y-axis labelled "overall_score"
 ```
 %% Cell type:markdown id:8ba5ed8c tags:
 **Question 9:** Create a **scatter plot** representing the `academic_reputation` (on the **x-axis**) against the `employer_reputation` (on the **y-axis**) of each institution from the *United States* in the `year` *2019*.
 You **must** first compute a **DataFrame** `reputations_usa` containing the **academic_reputation**, and the **employer_reputation** data from the `year` *2019*, of each **institution** in the `country` *United States*.
 %% Cell type:code id:b04f767f tags:
 ``` python
 # first compute and store the DataFrame 'reputations_usa', then display its head
 # do NOT plot just yet
 ```
 %% Cell type:code id:05490b0c tags:
 ``` python
 grader.check("q9")
 ```
 %% Cell type:markdown id:5f8fcce5 tags:
 Now, **plot** `reputations_usa` as a **scatter plot** with the **x-axis** labelled *academic_reputation* and the **y-axis** labelled *employer_reputation*.
 You **must** use the `scatter_plot` function to create the plot.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:0295c09c tags:
 <div><img src="attachment:q9.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:29894cd8 tags:
 ``` python
 # create the scatter plot using the DataFrame 'reputations_usa' with the x-axis labelled "academic_reputation"
 # and the y-axis labelled "employer_reputation"
 ```
 %% Cell type:markdown id:2e739c41 tags:
 **Question 10:** Create a **scatter plot** representing the `international_students` (on the **x-axis**) against the `faculty_student_score` (on the **y-axis**) for the **top ranked** institution of **each** `country` in the `year` *2020*.
 You **must** first compute a **DataFrame** `top_ranked_inter_faculty` containing the **international_students**, and the **faculty_student_score** data from the `year` *2020*, of the **top** ranked **institution** (i.e., the institution with the **least** `rank`) of each **country**.
 **Hint:** You can use the `MIN` SQL function to return the least value of a selected column. However, there are a few things to keep in mind while using this function.
 * The function must be in **uppercase** (i.e., you must use `MIN`, and **not** `min`).
 * The column you are finding the minimum of must be inside backticks (``` ` ```). For example, if you want to find the minimum `rank`, you need to say ```MIN(`rank`)```.
 If you do not follow the syntax above, your code will likely fail.
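
To see how `MIN` behaves together with `GROUP BY` in SQLite, consider this sketch on a made-up table `race` with hypothetical columns `team`, `runner` and `rank`. In SQLite, when a query contains a single `MIN` (or `MAX`) aggregate, the other selected columns take their values from the row that holds the minimum (or maximum). Other database systems may not behave this way.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "team": ["red", "red", "blue", "blue"],
    "runner": ["Ann", "Bo", "Cy", "Di"],
    "rank": [4, 2, 1, 3],
}).to_sql("race", conn_demo, index=False)

# One row per team; `runner` comes from the row with the smallest rank.
query = """
SELECT `team`, `runner`, MIN(`rank`) AS `best_rank`
FROM race
GROUP BY `team`
ORDER BY `best_rank` ASC
"""
best = pd.read_sql(query, conn_demo)
print(best)

conn_demo.close()
```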
 %% Cell type:code id:fa9e1b6f tags:
 ``` python
 # first compute and store the DataFrame 'top_ranked_inter_faculty', then display its head
 # do NOT plot just yet
 ```
 %% Cell type:code id:a4831be1 tags:
 ``` python
 grader.check("q10")
 ```
 %% Cell type:markdown id:59b40839 tags:
 Now, **plot** `top_ranked_inter_faculty` as a **scatter plot** with the **x-axis** labelled *international_students* and the **y-axis** labelled *faculty_student_score*.
 You **must** use the `scatter_plot` function to create the plot.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:0ffca4e3 tags:
 <div><img src="attachment:q10.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:2f17934b tags:
 ``` python
 # create the scatter plot using the DataFrame 'top_ranked_inter_faculty' with the x-axis labelled "international_students"
 # and the y-axis labelled "faculty_student_score"
 ```
 %% Cell type:markdown id:9dab472c tags:
 ### Correlations:
 You can use the `.corr()` method on a **DataFrame** that has **two** columns to get the *correlation* between those two columns.
 For example, if we have a **DataFrame** `df` with the two columns `citations_per_faculty` and `overall_score`, `df.corr()` would return
 ||**citations_per_faculty**|**overall_score**|
 |---------|------|---------|
 |citations_per_faculty|1.000000|0.574472|
 |overall_score|0.574472|1.000000|
 You can use `.loc` here to **extract** the *correlation* between the two columns (`0.574472` in this case).
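
A minimal sketch of that extraction, using a toy DataFrame with hypothetical columns `x` and `y`:

```python
import pandas as pd

# Toy data in which y grows almost linearly with x.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.1, 5.9, 8.2]})

# .corr() returns a square DataFrame whose row and column labels
# are the column names of df.
corr_matrix = df.corr()

# .loc[row_label, column_label] picks out a single cell.
xy_corr = corr_matrix.loc["x", "y"]
print(xy_corr)
```

The matrix is symmetric, so `corr_matrix.loc["y", "x"]` gives the same value.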
 %% Cell type:markdown id:f09ade4a tags:
 **Question 11:** Find the **correlation** between `international_students` and `overall_score` for institutions from the `country` *United Kingdom* that were ranked in the **top** *100* in the `year` *2020*.
 Your output **must** be a **float** representing the absolute correlations. The **only** `pandas` operations you are **allowed** to use are: `.corr`, `.loc` and `.iloc`. You **must** use SQL to gather all other data.
 %% Cell type:code id:706db815 tags:
 ``` python
 # compute and store the answer in the variable 'uk_inter_score_corr', then display it
 ```
 %% Cell type:code id:ea738710 tags:
 ``` python
 grader.check("q11")
 ```
 %% Cell type:markdown id:314d22d6 tags:
 Let us now define a new score called `citations_per_international` as follows:
 $$\texttt{citations}\_\texttt{per}\_\texttt{international} = \frac{\texttt{citations}\_\texttt{per}\_\texttt{faculty} \times \texttt{international}\_\texttt{faculty}}{100}.$$
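
The formula above can be written directly as an expression in a `SELECT` clause. The sketch below applies it to a made-up table `toy` with three fictional institutions; the values are invented for illustration. Notice that a missing input produces a missing result.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "name": ["U1", "U2", "U3"],
    "citations_per_faculty": [80.0, 50.0, 90.0],
    "international_faculty": [50.0, 100.0, None],  # U3 has no value
}).to_sql("toy", conn_demo, index=False)

query = """
SELECT `name`,
       (`citations_per_faculty` * `international_faculty`) / 100
           AS `citations_per_international`
FROM toy
"""
scores = pd.read_sql(query, conn_demo)
print(scores)  # U1: 80 * 50 / 100 = 40.0, U2: 50 * 100 / 100 = 50.0

conn_demo.close()
```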
 %% Cell type:markdown id:cef190f0 tags:
 **Question 12:** Find the **correlation** between `citations_per_international` and `overall_score` for **all** institutions in the `year` *2019*.
 Your output **must** be a **float** representing the absolute correlations. The **only** `pandas` operations you are **allowed** to use are: `.corr`, `.loc` and `.iloc`. You **must** use SQL to gather all other data.
 %% Cell type:code id:777d001c tags:
 ``` python
 # compute and store the answer in the variable 'cit_per_inter_score_corr', then display it
 ```
 %% Cell type:code id:ee14b0ac tags:
 ``` python
 grader.check("q12")
 ```
 %% Cell type:markdown id:cc72c981 tags:
 **Question 13:** What are the **top** *15* countries with the **highest** *total* of `citations_per_international` in the `year` *2019*?
 The *total* `citations_per_international` of a `country` is defined as the **sum** of `citations_per_international` scores of **all** institutions in that `country`. You **must** display the columns `country` and `sum_inter_citations`. The rows **must** be in *descending* order of `sum_inter_citations`.
 Your output **must** be a **DataFrame** that looks like this:
 ||**country**|**sum_inter_citations**|
 |----|-----------|-----------------------|
 |**0**|United States|2623.8207|
 |**1**|United Kingdom|2347.1602|
 |**2**|Australia|1255.5530|
 |**3**|Netherlands|748.4268|
 |**4**|Canada|724.5029|
 |**5**|Switzerland|561.8790|
 |**6**|China|482.2577|
 |**7**|Germany|455.5466|
 |**8**|Hong Kong|375.3032|
 |**9**|New Zealand|327.3357|
 |**10**|Sweden|305.3745|
 |**11**|Belgium|255.0750|
 |**12**|France|198.0860|
 |**13**|Denmark|186.4904|
 |**14**|Singapore|160.3000|
 %% Cell type:code id:14aaad72 tags:
 ``` python
 # compute and store the answer in the variable 'top_cit_per_inter', then display it
 ```
 %% Cell type:code id:b44e985d tags:
 ``` python
 grader.check("q13")
 ```
 %% Cell type:markdown id:59a993ce tags:
 **Question 14:** Among the institutions ranked within the **top** *300*, find the **average** `citations_per_international` for **each** `country` in the `year` *2019*.
 You **must** display the columns `country` and `avg_inter_citations` representing the **average** of `citations_per_international` for **each** country. The rows **must** be in *descending* order of `avg_inter_citations`.
 **Hint:** To find the **average**, you can use `SUM()` and `COUNT()` or you can simply use `AVG()`.
 Your output **must** be a **DataFrame** whose **first ten rows** look like this:
 ||**country**|**avg_inter_citations**|
 |----|-----------|----------------------|
 |**0**|Singapore|80.150000|
 |**1**|Switzerland|75.497000|
 |**2**|Hong Kong|62.550533|
 |**3**|Australia|61.362388|
 |**4**|Netherlands|56.166733|
 |**5**|New Zealand|53.226220|
 |**6**|United Kingdom|52.889084|
 |**7**|Canada|50.779723|
 |**8**|Denmark|46.196200|
 |**9**|Norway|46.083300|
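
The two approaches from the hint can be compared on a toy table. This sketch uses a made-up table `toy` with hypothetical columns `grp` and `val`. It also shows one detail worth knowing: `AVG` skips missing values, so the manual version matches it only if you divide by `COUNT` of the same column rather than by `COUNT(*)`.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "grp": ["a", "a", "a", "b", "b"],
    "val": [10.0, 20.0, None, 5.0, 7.0],  # one missing value in group a
}).to_sql("toy", conn_demo, index=False)

query = """
SELECT `grp`,
       AVG(`val`) AS `avg_val`,
       SUM(`val`) / COUNT(`val`) AS `manual_avg`,
       COUNT(*) AS `num_rows`
FROM toy
GROUP BY `grp`
ORDER BY `avg_val` DESC
"""
averages = pd.read_sql(query, conn_demo)
print(averages)

conn_demo.close()
```

Group `a` has three rows but only two values, so both averages are 15.0 rather than 10.0.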
 %% Cell type:code id:dac3e940 tags:
 ``` python
 # compute and store the answer in the variable 'avg_cit_per_inter', then display it
 ```
 %% Cell type:code id:946bb83c tags:
 ``` python
 grader.check("q14")
 ```
 %% Cell type:markdown id:bfded4bf tags:
 **Question 15:** Find the **institution** with the **highest** value of `citations_per_international` for **each** `country` in the `year` *2020*.
 Your output **must** be a **DataFrame** with the columns `country`, `institution_name`, and a new column `max_inter_citations` representing the **maximum** value of `citations_per_international` for that country. The rows **must** be in *descending* order of `max_inter_citations`. You **must** **omit** rows where `max_inter_citations` is **missing** by using the clause:
 ```sql
 HAVING `max_inter_citations` IS NOT NULL
 ```
 **Hint:** You can use the `MAX()` function to return the largest value within a group.
 Your output **must** be a **DataFrame** whose **first ten rows** look like this:
 ||**country**|**institution_name**|**max_inter_citations**|
 |----|-----------|--------------------|----------------------|
 |**0**|United States|Massachusetts Institute Of Technology|99.8000|
 |**1**|Switzerland|Ecole Polytechnique Fédérale De Lausanne|98.9000|
 |**2**|Netherlands|Eindhoven University Of Technology|95.4493|
 |**3**|United Kingdom|London School Of Economics And Political Science|91.1000|
 |**4**|Hong Kong|The Hong Kong University Of Science And Technology|89.5000|
 |**5**|Singapore|Nanyang Technological University|88.8000|
 |**6**|Australia|The University Of Western Australia|88.3000|
 |**7**|Belgium|Katholieke Universiteit Leuven|76.7700|
 |**8**|New Zealand|University Of Waikato|73.6434|
 |**9**|Canada|Western University|72.3240|
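
Here is a sketch of how `MAX`, `GROUP BY` and the `HAVING` clause interact, on a made-up table `toy` with hypothetical columns `country`, `name` and `score`. Country `Y` has no scores at all, so its maximum is missing and the `HAVING` clause drops it.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "country": ["X", "X", "Y", "Y", "Z"],
    "name": ["x1", "x2", "y1", "y2", "z1"],
    "score": [3.0, 9.0, None, None, 5.0],
}).to_sql("toy", conn_demo, index=False)

# HAVING filters whole groups after aggregation, unlike WHERE,
# which filters individual rows before grouping.
query = """
SELECT `country`, `name`, MAX(`score`) AS `max_score`
FROM toy
GROUP BY `country`
HAVING `max_score` IS NOT NULL
ORDER BY `max_score` DESC
"""
top = pd.read_sql(query, conn_demo)
print(top)

conn_demo.close()
```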
 %% Cell type:code id:fba4a1c2 tags:
 ``` python
 # compute and store the answer in the variable 'max_cit_per_inter', then display it
 ```
 %% Cell type:code id:9c4db997 tags:
 ``` python
 grader.check("q15")
 ```
 %% Cell type:markdown id:da9cb13f tags:
 **Question 16:** Among the institutions ranked within the **top** *50*, create a **horizontal bar plot** representing the **average** of both the `citations_per_faculty` and `international_faculty` scores for **all** institutions in **each** `country` in the `year` *2018*.
 You **must** first create a **DataFrame** `country_citations_inter` with **three** columns: `country`, `avg_citations` and `avg_inter_faculty` representing the name, the average value of `citations_per_faculty` and the average value of `international_faculty` for each country respectively.
 You **must** ensure that the countries in the **DataFrame** are **ordered** in **increasing** order of the **difference** between `avg_citations` and `avg_inter_faculty`.
 %% Cell type:code id:e9e566a5 tags:
 ``` python
 # first compute and store the DataFrame 'country_citations_inter', then display it
 # do NOT plot just yet
 ```
 %% Cell type:code id:60d1c6f7 tags:
 ``` python
 grader.check("q16")
 ```
 %% Cell type:markdown id:3e859552 tags:
 Now, **plot** `country_citations_inter` as a **horizontal bar plot** with the **x-axis** labelled *country*.
 Then, you **must** use the `horizontal_bar_plot` function to plot this data. Verify that the countries are **ordered** in **decreasing** order of the **difference** between `avg_citations` and `avg_inter_faculty`. Verify that the **legend** appears on your plot.
 **Hint:** If you want the countries in the plot to be ordered in **decreasing** order of the difference, you will need to make sure that in the DataFrame, they are ordered in the **increasing** order.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:fb3e7670 tags:
 <div><img src="attachment:q16.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:259af611 tags:
 ``` python
 # create the horizontal bar plot using the DataFrame 'country_citations_inter' with the x-axis labelled "country"
 ```
 %% Cell type:markdown id:1a5d4543 tags:
 **Question 17:** Create a **scatter plot** representing the `overall_score` (on the **x-axis**) against the `rank` (on the **y-axis**) for **all** institutions in the `year` *2020*. Additionally, **plot** a **regression line** within the same plot.
 You **must** first compute a **DataFrame** containing the **overall_score**, and the **rank** data from the `year` *2020*. You **must** use the `get_regression_line` function to compute the best fit line.
 %% Cell type:code id:d51299b8 tags:
 ``` python
 # first compute and store the DataFrame 'overall_rank', then display its head
 # do NOT plot just yet
 ```
 %% Cell type:code id:a422be6a tags:
 ``` python
 grader.check("q17")
 ```
 %% Cell type:markdown id:4c062dae tags:
 Now, **plot** `overall_rank` as a **scatter plot** with a **regression line** with the **x-axis** labelled *overall_score* and the **y-axis** labelled *rank*.
 You **must** use the `regression_line_plot` function to plot this data.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:aee08178 tags:
 <div><img src="attachment:q17.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:6c914693 tags:
 ``` python
 # create the scatter plot and the regression line using the DataFrame 'overall_rank' with the x-axis labelled "overall_score"
 # and the y-axis labelled "rank"
 ```
 %% Cell type:markdown id:effa2591 tags:
 **Food for thought:** Does our linear regression model fit the points well? It looks like the relationship between the `overall_score` and `rank` is **not quite linear**. Indeed, a cursory look at the data suggests that the relationship is inverse.
 %% Cell type:code id:9f1de243 tags:
 ``` python
 # Food for thought is an entirely OPTIONAL exercise
 # you may leave your thoughts here as a comment if you wish to
 ```
 %% Cell type:markdown id:26e4e3c1 tags:
 **Question 18:** Create a **scatter plot** representing the **inverse** of the `overall_score` (on the **x-axis**) against the `rank` (on the **y-axis**) for **all** institutions in the `year` *2020*. Additionally, **plot** a **regression line** within the same plot.
 The `inverse_overall_score` for each institution is simply defined as `1/overall_score` for that institution. You **must** first compute a **DataFrame** containing the **inverse_overall_score**, and the **rank** data from the `year` *2020*. You **must** use the `get_regression_line` function to compute the best fit line.
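
As a rough illustration of the transformation described above, the sketch below inverts a few made-up scores and fits a least-squares line in plain Python. The data values and the helper name `fit_line` are hypothetical and are not part of the project; in your answer you must still use the regression helpers defined earlier in this notebook.

```python
def fit_line(xs, ys):
    """Return (m, b) of the least-squares line y = m * x + b (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# made-up scores and ranks, NOT taken from the QS rankings data
toy_scores = [100, 50, 25, 20]
toy_ranks = [1, 11, 31, 41]

# the inverse score is simply 1 / overall_score
toy_inverse_scores = [1 / s for s in toy_scores]

m, b = fit_line(toy_inverse_scores, toy_ranks)
print(m, b)
```

The toy ranks were chosen to lie on a straight line in the inverse score, so the fit recovers a slope close to 1000 and an intercept close to -9. Real data will not line up this neatly.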
 %% Cell type:code id:6c535d83 tags:
 ``` python
 # first compute and store the DataFrame 'inverse_overall_rank', then display its head
 # do NOT plot just yet
 ```
 %% Cell type:code id:22a6a736 tags:
 ``` python
 grader.check("q18")
 ```
 %% Cell type:markdown id:e64a0040 tags:
 Now, **plot** `inverse_overall_rank` as a **scatter plot** with a **regression line**, with the **x-axis** labelled *inverse_overall_score* and the **y-axis** labelled *rank*.
 You **must** use the `regression_line_plot` function to plot this data.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the axes are correctly labelled. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:baeb0d40 tags:
 <div><img src="attachment:q18.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:dd8efd5b tags:
 ``` python
 # create the scatter plot and the regression line using the DataFrame 'inverse_overall_rank'
 # with the x-axis labelled "inverse_overall_score" and the y-axis labelled "rank"
 ```
 %% Cell type:markdown id:9f9f2089 tags:
 This seems to be much better! Let us now use this **regression line** to **estimate** the `rank` of an institution given its `overall_score`.
 %% Cell type:markdown id:0849a83f tags:
 **Question 19:** Use the regression line to **estimate** the `rank` of an institution with an `overall_score` of *72*.
 Your output **must** be an **int**. If your **estimate** is a **float**, *round it up* using `math.ceil`.
 **Hints:**
 1. Call the `get_regression_coeff` function to get the coefficients `m` and `b`.
 2. Recall that the equation of a line is `y = m * x + b`. What are `x` and `y` here?
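
The hints above can be sketched with made-up numbers. The coefficients and the score below are hypothetical, and the function name `estimate_rank` is only for illustration. The sketch assumes the line was fitted against the inverse score, as in Question 18.

```python
import math


def estimate_rank(m, b, overall_score):
    """Estimate a rank from a line fitted on 1/overall_score (illustrative only)."""
    x = 1 / overall_score   # the regression input is the INVERSE score
    y = m * x + b           # equation of the line
    return math.ceil(y)     # round up, so the result is an int


# hypothetical coefficients, NOT the ones you will get from the real data
toy_m = 1000.0
toy_b = -9.5
toy_estimate = estimate_rank(toy_m, toy_b, 40)
print(toy_estimate)
```

Note that `math.ceil` returns an **int**, so no extra conversion is needed.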
 %% Cell type:code id:7f3fa177 tags:
 ``` python
 # compute and store the answer in the variable 'rank_score_72', then display it
 ```
 %% Cell type:code id:c1559986 tags:
 ``` python
 grader.check("q19")
 ```
 %% Cell type:markdown id:547f4135 tags:
 **Food for thought:** Can you find out the `overall_score` of the university with this `rank` in the `year` *2020*? Does it match your prediction?
 %% Cell type:code id:60915e12 tags:
 ``` python
 # Food for thought is an entirely OPTIONAL exercise
 # you may leave your thoughts here as a comment if you wish to
 ```
 %% Cell type:markdown id:53ab4005 tags:
 **Question 20:** Using the data from Question 5, create a **pie plot** representing the number of institutions from each country.
 You **have** already computed a **DataFrame** `num_institutions` (in Question 5) containing the **country**, and the **num_of_institutions** data. Run the following cell just to confirm that the variable has not changed its values since you defined it in Question 5.
 %% Cell type:code id:2a86a546 tags:
 ``` python
 grader.check("q20")
 ```
 %% Cell type:markdown id:d95601d7 tags:
 Now, **plot** `num_institutions` as a **pie plot** with the **title** *Number of institutions*.
 You **must** use the `pie_plot` function to create the **pie plot**. The **colors** do **not** matter, but the plot **must** be titled `Number of institutions`, and **must** be labelled as in the sample output below.
 **Important Warning:** `p13_test.py` can check that the **DataFrame** is correct, but it **cannot** check if your plot appears on the screen, or whether the title and labels are correct. If your plot is not visible, or if it is not properly labelled, the Gradescope autograder will **deduct points**.
 Your plot should look like this:
 %% Cell type:markdown id:76ce5db5 tags:
 <div><img src="attachment:q20.png" width="400"/></div>
 <center> <b>Delete</b> this cell before you submit the notebook to reduce the size of your file.</center>
 %% Cell type:code id:0fdcbe48 tags:
 ``` python
 # create the pie plot using the DataFrame 'num_institutions' titled "Number of institutions"
 ```
 %% Cell type:markdown id:6bce0354 tags:
 **Food for thought:** It seems that we'll run out of colors! How can we make it so that **no two neighbors share a color**? You'll probably have to look online.
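
One possible approach, sketched below in plain Python, is to cycle through a small palette and then repair the single clash that can occur where the last wedge meets the first. The function name `assign_wedge_colors` and the palette are hypothetical. matplotlib's pie function accepts a `colors` list, but whether the project's `pie_plot` helper can pass one through is an assumption you would need to check.

```python
def assign_wedge_colors(num_wedges, palette):
    """Give each wedge a color so that neighboring wedges differ (illustrative only).

    Assumes the palette holds at least three distinct colors.
    """
    if len(palette) < 3:
        raise ValueError("need at least three distinct colors")
    # cycling through the palette keeps consecutive wedges different
    colors = [palette[i % len(palette)] for i in range(num_wedges)]
    # a pie is circular, so the last wedge also touches the first one
    if num_wedges > 2 and colors[-1] == colors[0]:
        # pick any color that differs from both neighbors of the last wedge
        colors[-1] = next(c for c in palette if c not in (colors[0], colors[-2]))
    return colors


toy_palette = ["tab:blue", "tab:orange", "tab:green"]
print(assign_wedge_colors(7, toy_palette))
```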
 %% Cell type:code id:7bd4d538 tags:
 ``` python
 # Food for thought is an entirely OPTIONAL exercise
 # you may leave your thoughts here as a comment if you wish to
 ```
 %% Cell type:markdown id:936abcda tags:
 ### Closing the database connection:
 Before you **submit** your notebook, you **must** **close** your connection `conn`. Failing to do so might cause **Gradescope to fail**. Additionally, if your notebook file is too large for submission, **delete** the example images provided with the plot questions to save space. You can **delete** any cell by selecting the cell, hitting the `Esc` key once, and then hitting the `d` key **twice**.
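
A minimal sketch of closing a connection, assuming `conn` is a `sqlite3` connection. The example uses a throwaway in-memory database with the hypothetical names `demo_conn` and `demo`, so it does not touch `rankings.db`.

```python
import sqlite3

# throwaway in-memory database, used only for this illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (x INTEGER)")

# close the connection once you are done with all your queries
demo_conn.close()

# any further use of a closed connection raises sqlite3.ProgrammingError
try:
    demo_conn.execute("SELECT * FROM demo")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(still_open)
```

This is also why the connection should be closed only after all of your queries have run.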
 %% Cell type:code id:9515f232 tags:
 ``` python
 # close your connection here
 ```
 %% Cell type:markdown id:27a5f70c tags:
 ## Submission
 Make sure you have run all cells in your notebook in order before running the following cells, so that all images/graphs appear in the output. The following cells will generate a zip file for you to submit.
 **SUBMISSION INSTRUCTIONS**:
 1. **Upload** the zipfile to Gradescope.
 2. Check the **Gradescope otter** results as soon as the autograder finishes running. Don't worry about the score showing up as -/100.0. You only need to check that the test cases passed.
 %% Cell type:code id:9419c771 tags:
 ``` python
 from IPython.display import display, Javascript
 display(Javascript('IPython.notebook.save_checkpoint();'))
 ```
 %% Cell type:code id:b54d6127 tags:
 ``` python
 !jupytext --to py p13.ipynb
 ```
 %% Cell type:code id:11da7246 tags:
 ``` python
 p13_test.check_file_size("p13.ipynb")
 grader.export(pdf=False, run_tests=True, files=[py_filename])
 ```
 %% Cell type:markdown id:a44ca87a tags: