diff --git a/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation.ipynb b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..01977fb1f4ff63c7c57d52a91272fd60a7f72978 --- /dev/null +++ b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation.ipynb @@ -0,0 +1,4560 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Announcements - Wednesday, December 6\n", + "* Download ALL files for today's lecture\n", + "* Q10 Released tonight at 5 pm\n", + "* <b>If you have any problem with P8-P11 grades, please send me (Gurmail.Singh@wisc.edu) an email by December 11.</b>\n", + "* Late days may not be used on P13\n", + "* If you have questions, it is almost always faster to \n", + " * Post on Piazza\n", + " * Go to [office hours](https://sites.google.com/wisc.edu/cs220-oh-f23/home?pli=1) \n", + "### Conflict Form\n", + " * [Final - December 19, 7:45 am](https://cs220.cs.wisc.edu/f23/surveys.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RHvDCo4fhXBx" + }, + "source": [ + "# Lecture 37 Pandas 3: Data Transformation\n", + "* Data transformation is the process of changing the format, structure, or values of data. 
\n", + "* Often needed during data cleaning and sometimes during data analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yoLGptrqhbBo" + }, + "source": [ + "# Today's Learning Objectives: \n", + "\n", + "* Set a column as the index of a pandas `DataFrame`\n", + "* Identify, drop, or fill missing values (`np.NaN`) using pandas `isna`, `dropna`, and `fillna`\n", + "* Apply transformations to a `DataFrame`:\n", + " * Use `apply` on pandas `Series` to apply a transformation function\n", + " * Use `replace` to replace all target values in pandas `Series` and `DataFrame` rows / columns\n", + "* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`\n", + "* Convert `groupby` examples to SQL\n", + "* Solve the same question using SQL and pandas `DataFrame` manipulations:\n", + " * filtering, grouping, and aggregation / summarization" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "CeWtFirwteFY" + }, + "outputs": [], + "source": [ + "# known import statements\n", + "import pandas as pd\n", + "import sqlite3 as sql # note that we are renaming to sql\n", + "import os\n", + "\n", + "# new import statement\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgnTeNRIswsm" + }, + "source": [ + "# The dataset: Spotify songs\n", + "Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.\n", + "\n", + "If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 1: Establish a connection to the spotify.db database" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 232 + }, + "id": "8y9scvgCnTHl", + "outputId": "c72388f8-576c-4cf2-ef51-352cd11b6c92" + }, + "outputs": [], + "source": [ + "# open up the spotify database\n", + "db_pathname = \"spotify.db\"\n", + "assert os.path.exists(db_pathname)\n", + "conn = sql.connect(db_pathname)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def qry(sql):\n", + " return pd.read_sql(sql, conn)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 2: Identify the table name(s) inside the database" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "ybTqbDSOnR2f", + "outputId": "8dcc943b-9382-4abb-ef78-6c6d56ad89eb" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>type</th>\n", + " <th>name</th>\n", + " <th>tbl_name</th>\n", + " <th>rootpage</th>\n", + " <th>sql</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>table</td>\n", + " <td>spotify</td>\n", + " <td>spotify</td>\n", + " <td>1527</td>\n", + " <td>CREATE TABLE spotify(\\nid TEXT PRIMARY KEY,\\nt...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>index</td>\n", + " 
<td>sqlite_autoindex_spotify_1</td>\n", + " <td>spotify</td>\n", + " <td>1528</td>\n", + " <td>None</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " type name tbl_name rootpage \\\n", + "0 table spotify spotify 1527 \n", + "1 index sqlite_autoindex_spotify_1 spotify 1528 \n", + "\n", + " sql \n", + "0 CREATE TABLE spotify(\\nid TEXT PRIMARY KEY,\\nt... \n", + "1 None " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = qry(\"SELECT * from sqlite_master\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 3: Use pandas lookup expression to extract the \"sql\" column and display the full query using .iloc lookup" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CREATE TABLE spotify(\n", + "id TEXT PRIMARY KEY,\n", + "title BLOB,\n", + "song_name BLOB, \n", + "genre TEXT,\n", + "duration_ms INTEGER, \n", + "key INTEGER, \n", + "mode INTEGER, \n", + "time_signature INTEGER, \n", + "tempo REAL,\n", + "acousticness REAL, \n", + "danceability REAL, \n", + "energy REAL, \n", + "instrumentalness REAL, \n", + "liveness REAL, \n", + "loudness REAL, \n", + "speechiness REAL, \n", + "valence REAL)\n" + ] + } + ], + "source": [ + "print(df[\"sql\"].iloc[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 4: Store the data inside `spotify` table inside a variable called `df`" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 632 + }, + "id": "txAH9OIjnoQv", + "outputId": "ac9152ba-32df-4fb2-d4e0-a97f50fe58fb" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", 
+ " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>id</th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>7pgJBLVz5VmnL7uGHmRj6p</td>\n", + " <td></td>\n", + " <td>Pathology</td>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>115.080</td>\n", + " <td>0.401000</td>\n", + " <td>0.719</td>\n", + " <td>0.493</td>\n", + " <td>0.000000</td>\n", + " <td>0.1180</td>\n", + " <td>-7.230</td>\n", + " <td>0.0794</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>0vSWgAlfpye0WCGeNmuNhy</td>\n", + " <td></td>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " <td>98821</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>218.050</td>\n", + " <td>0.013800</td>\n", + " <td>0.850</td>\n", + " <td>0.893</td>\n", + " <td>0.000004</td>\n", + " <td>0.3720</td>\n", + " <td>-4.783</td>\n", + " <td>0.0623</td>\n", + " <td>0.0391</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>7EL7ifncK2PWFYThJjzR25</td>\n", + " <td></td>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>189.938</td>\n", + " <td>0.187000</td>\n", + " <td>0.864</td>\n", + " 
<td>0.365</td>\n", + " <td>0.000000</td>\n", + " <td>0.1160</td>\n", + " <td>-10.219</td>\n", + " <td>0.0655</td>\n", + " <td>0.0478</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>1umsRbM7L4ju7rn9aU8Ju6</td>\n", + " <td></td>\n", + " <td>Sacrifice</td>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>139.990</td>\n", + " <td>0.145000</td>\n", + " <td>0.767</td>\n", + " <td>0.576</td>\n", + " <td>0.000003</td>\n", + " <td>0.0968</td>\n", + " <td>-9.683</td>\n", + " <td>0.2560</td>\n", + " <td>0.1870</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>4SKqOHKYU5pgHr5UiVKiQN</td>\n", + " <td></td>\n", + " <td>Backpack</td>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>128.014</td>\n", + " <td>0.007700</td>\n", + " <td>0.765</td>\n", + " <td>0.726</td>\n", + " <td>0.000000</td>\n", + " <td>0.6190</td>\n", + " <td>-5.580</td>\n", + " <td>0.1910</td>\n", + " <td>0.2700</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35872</th>\n", + " <td>46bXU7Sgj7104ZoXxzz9tM</td>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>150.013</td>\n", + " <td>0.031500</td>\n", + " <td>0.528</td>\n", + " <td>0.693</td>\n", + " <td>0.000345</td>\n", + " <td>0.1210</td>\n", + " <td>-5.148</td>\n", + " <td>0.0304</td>\n", + " <td>0.3940</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35873</th>\n", + " 
<td>0he2ViGMUO3ajKTxLOfWVT</td>\n", + " <td>Greatest Hardstyle Playlist</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>149.928</td>\n", + " <td>0.022500</td>\n", + " <td>0.517</td>\n", + " <td>0.768</td>\n", + " <td>0.000018</td>\n", + " <td>0.2050</td>\n", + " <td>-7.922</td>\n", + " <td>0.0479</td>\n", + " <td>0.3830</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35874</th>\n", + " <td>72DAt9Lbpy9EUS29OzQLob</td>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>234823</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>154.935</td>\n", + " <td>0.026000</td>\n", + " <td>0.361</td>\n", + " <td>0.821</td>\n", + " <td>0.000242</td>\n", + " <td>0.3850</td>\n", + " <td>-3.102</td>\n", + " <td>0.0505</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35875</th>\n", + " <td>6HXgExFVuE1c3cq9QjFCcU</td>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>150.042</td>\n", + " <td>0.000551</td>\n", + " <td>0.477</td>\n", + " <td>0.921</td>\n", + " <td>0.029600</td>\n", + " <td>0.0575</td>\n", + " <td>-4.777</td>\n", + " <td>0.0392</td>\n", + " <td>0.4880</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35876</th>\n", + " <td>6MAAMZImxcvYhRnxDLTufD</td>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " <td>9</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>155.047</td>\n", + " <td>0.001890</td>\n", + " <td>0.529</td>\n", + " <td>0.945</td>\n", + " <td>0.000055</td>\n", + " <td>0.4140</td>\n", + " <td>-5.862</td>\n", + " <td>0.0615</td>\n", + " <td>0.1340</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>35877 rows × 17 columns</p>\n", + "</div>" + ], + "text/plain": [ + " id title song_name \\\n", + 
"0 7pgJBLVz5VmnL7uGHmRj6p Pathology \n", + "1 0vSWgAlfpye0WCGeNmuNhy Symbiote \n", + "2 7EL7ifncK2PWFYThJjzR25 BRAINFOOD \n", + "3 1umsRbM7L4ju7rn9aU8Ju6 Sacrifice \n", + "4 4SKqOHKYU5pgHr5UiVKiQN Backpack \n", + "... ... ... ... \n", + "35872 46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle \n", + "35873 0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist \n", + "35874 72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 \n", + "35875 6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle \n", + "35876 6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 \n", + "\n", + " genre duration_ms key mode time_signature tempo \\\n", + "0 Dark Trap 224427 8 1 4 115.080 \n", + "1 Dark Trap 98821 5 1 4 218.050 \n", + "2 Dark Trap 101172 8 1 4 189.938 \n", + "3 Dark Trap 96062 10 0 4 139.990 \n", + "4 Dark Trap 135079 5 1 4 128.014 \n", + "... ... ... ... ... ... ... \n", + "35872 hardstyle 269208 4 1 4 150.013 \n", + "35873 hardstyle 210112 0 0 4 149.928 \n", + "35874 hardstyle 234823 8 1 4 154.935 \n", + "35875 hardstyle 323200 6 0 4 150.042 \n", + "35876 hardstyle 162161 9 1 4 155.047 \n", + "\n", + " acousticness danceability energy instrumentalness liveness \\\n", + "0 0.401000 0.719 0.493 0.000000 0.1180 \n", + "1 0.013800 0.850 0.893 0.000004 0.3720 \n", + "2 0.187000 0.864 0.365 0.000000 0.1160 \n", + "3 0.145000 0.767 0.576 0.000003 0.0968 \n", + "4 0.007700 0.765 0.726 0.000000 0.6190 \n", + "... ... ... ... ... ... \n", + "35872 0.031500 0.528 0.693 0.000345 0.1210 \n", + "35873 0.022500 0.517 0.768 0.000018 0.2050 \n", + "35874 0.026000 0.361 0.821 0.000242 0.3850 \n", + "35875 0.000551 0.477 0.921 0.029600 0.0575 \n", + "35876 0.001890 0.529 0.945 0.000055 0.4140 \n", + "\n", + " loudness speechiness valence \n", + "0 -7.230 0.0794 0.1240 \n", + "1 -4.783 0.0623 0.0391 \n", + "2 -10.219 0.0655 0.0478 \n", + "3 -9.683 0.2560 0.1870 \n", + "4 -5.580 0.1910 0.2700 \n", + "... ... ... ... 
\n", + "35872 -5.148 0.0304 0.3940 \n", + "35873 -7.922 0.0479 0.3830 \n", + "35874 -3.102 0.0505 0.1240 \n", + "35875 -4.777 0.0392 0.4880 \n", + "35876 -5.862 0.0615 0.1340 \n", + "\n", + "[35877 rows x 17 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = qry(\"SELECT * FROM spotify\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting a column as row indices for the `DataFrame`\n", + "\n", + "- Syntax: `df.set_index(\"<COLUMN>\")`\n", + "- Returns a new DataFrame object instance reference.\n", + "- WARNING: executing this twice will result in `KeyError` being thrown. Once you set a column as row index, it will no longer be a column within the `DataFrame`. If you tried this, go back and execute the above cell and update `df` once more and then execute the below cell exactly once." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " </tr>\n", + " <tr>\n", + " <th>id</th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " 
<th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>7pgJBLVz5VmnL7uGHmRj6p</th>\n", + " <td></td>\n", + " <td>Pathology</td>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>115.080</td>\n", + " <td>0.401000</td>\n", + " <td>0.719</td>\n", + " <td>0.493</td>\n", + " <td>0.000000</td>\n", + " <td>0.1180</td>\n", + " <td>-7.230</td>\n", + " <td>0.0794</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0vSWgAlfpye0WCGeNmuNhy</th>\n", + " <td></td>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " <td>98821</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>218.050</td>\n", + " <td>0.013800</td>\n", + " <td>0.850</td>\n", + " <td>0.893</td>\n", + " <td>0.000004</td>\n", + " <td>0.3720</td>\n", + " <td>-4.783</td>\n", + " <td>0.0623</td>\n", + " <td>0.0391</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7EL7ifncK2PWFYThJjzR25</th>\n", + " <td></td>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>189.938</td>\n", + " <td>0.187000</td>\n", + " <td>0.864</td>\n", + " <td>0.365</td>\n", + " <td>0.000000</td>\n", + " <td>0.1160</td>\n", + " <td>-10.219</td>\n", + " <td>0.0655</td>\n", + " <td>0.0478</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1umsRbM7L4ju7rn9aU8Ju6</th>\n", + " <td></td>\n", + " <td>Sacrifice</td>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>139.990</td>\n", + " <td>0.145000</td>\n", + " <td>0.767</td>\n", + " <td>0.576</td>\n", + " <td>0.000003</td>\n", + " <td>0.0968</td>\n", + " <td>-9.683</td>\n", + " <td>0.2560</td>\n", + " 
<td>0.1870</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4SKqOHKYU5pgHr5UiVKiQN</th>\n", + " <td></td>\n", + " <td>Backpack</td>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>128.014</td>\n", + " <td>0.007700</td>\n", + " <td>0.765</td>\n", + " <td>0.726</td>\n", + " <td>0.000000</td>\n", + " <td>0.6190</td>\n", + " <td>-5.580</td>\n", + " <td>0.1910</td>\n", + " <td>0.2700</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>46bXU7Sgj7104ZoXxzz9tM</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>150.013</td>\n", + " <td>0.031500</td>\n", + " <td>0.528</td>\n", + " <td>0.693</td>\n", + " <td>0.000345</td>\n", + " <td>0.1210</td>\n", + " <td>-5.148</td>\n", + " <td>0.0304</td>\n", + " <td>0.3940</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0he2ViGMUO3ajKTxLOfWVT</th>\n", + " <td>Greatest Hardstyle Playlist</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>149.928</td>\n", + " <td>0.022500</td>\n", + " <td>0.517</td>\n", + " <td>0.768</td>\n", + " <td>0.000018</td>\n", + " <td>0.2050</td>\n", + " <td>-7.922</td>\n", + " <td>0.0479</td>\n", + " <td>0.3830</td>\n", + " </tr>\n", + " <tr>\n", + " <th>72DAt9Lbpy9EUS29OzQLob</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>234823</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " 
<td>154.935</td>\n", + " <td>0.026000</td>\n", + " <td>0.361</td>\n", + " <td>0.821</td>\n", + " <td>0.000242</td>\n", + " <td>0.3850</td>\n", + " <td>-3.102</td>\n", + " <td>0.0505</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6HXgExFVuE1c3cq9QjFCcU</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>150.042</td>\n", + " <td>0.000551</td>\n", + " <td>0.477</td>\n", + " <td>0.921</td>\n", + " <td>0.029600</td>\n", + " <td>0.0575</td>\n", + " <td>-4.777</td>\n", + " <td>0.0392</td>\n", + " <td>0.4880</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6MAAMZImxcvYhRnxDLTufD</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " <td>9</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>155.047</td>\n", + " <td>0.001890</td>\n", + " <td>0.529</td>\n", + " <td>0.945</td>\n", + " <td>0.000055</td>\n", + " <td>0.4140</td>\n", + " <td>-5.862</td>\n", + " <td>0.0615</td>\n", + " <td>0.1340</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>35877 rows × 16 columns</p>\n", + "</div>" + ], + "text/plain": [ + " title song_name genre \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p Pathology Dark Trap \n", + "0vSWgAlfpye0WCGeNmuNhy Symbiote Dark Trap \n", + "7EL7ifncK2PWFYThJjzR25 BRAINFOOD Dark Trap \n", + "1umsRbM7L4ju7rn9aU8Ju6 Sacrifice Dark Trap \n", + "4SKqOHKYU5pgHr5UiVKiQN Backpack Dark Trap \n", + "... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle hardstyle \n", + "0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist hardstyle \n", + "72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 hardstyle \n", + "6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle hardstyle \n", + "6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 hardstyle \n", + "\n", + " duration_ms key mode time_signature tempo \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 224427 8 1 4 115.080 \n", + "0vSWgAlfpye0WCGeNmuNhy 98821 5 1 4 218.050 \n", + "7EL7ifncK2PWFYThJjzR25 101172 8 1 4 189.938 \n", + "1umsRbM7L4ju7rn9aU8Ju6 96062 10 0 4 139.990 \n", + "4SKqOHKYU5pgHr5UiVKiQN 135079 5 1 4 128.014 \n", + "... ... ... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013 \n", + "0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928 \n", + "72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935 \n", + "6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042 \n", + "6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047 \n", + "\n", + " acousticness danceability energy instrumentalness \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.401000 0.719 0.493 0.000000 \n", + "0vSWgAlfpye0WCGeNmuNhy 0.013800 0.850 0.893 0.000004 \n", + "7EL7ifncK2PWFYThJjzR25 0.187000 0.864 0.365 0.000000 \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.145000 0.767 0.576 0.000003 \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.007700 0.765 0.726 0.000000 \n", + "... ... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018 \n", + "72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600 \n", + "6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055 \n", + "\n", + " liveness loudness speechiness valence \n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.1180 -7.230 0.0794 0.1240 \n", + "0vSWgAlfpye0WCGeNmuNhy 0.3720 -4.783 0.0623 0.0391 \n", + "7EL7ifncK2PWFYThJjzR25 0.1160 -10.219 0.0655 0.0478 \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.0968 -9.683 0.2560 0.1870 \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.6190 -5.580 0.1910 0.2700 \n", + "... ... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.3940 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.3830 \n", + "72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.1240 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.4880 \n", + "6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.1340 \n", + "\n", + "[35877 rows x 16 columns]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Set the id column as row indices\n", + "df = df.set_index(\"id\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Not a Number\n", + "\n", + "- `np.NaN` is the floating point representation of Not a Number\n", + "- You do not need to know / learn the details about the `numpy` package \n", + "\n", + "### Replacing / modifying values within the `DataFrame`\n", + "\n", + "Syntax: `df.replace(<TARGET>, <REPLACE>)`\n", + "- Your target can be `str`, `int`, `float`, `None` (there are other possibilities, but those are too advanced for this course)\n", + "- Returns a new DataFrame object instance reference.\n", + "\n", + "Let's now replace the missing values (empty strings) with `np.NaN`" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + 
"text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " </tr>\n", + " <tr>\n", + " <th>id</th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>7pgJBLVz5VmnL7uGHmRj6p</th>\n", + " <td>NaN</td>\n", + " <td>Pathology</td>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>115.080</td>\n", + " <td>0.4010</td>\n", + " <td>0.719</td>\n", + " <td>0.493</td>\n", + " <td>0.000000</td>\n", + " <td>0.1180</td>\n", + " <td>-7.230</td>\n", + " <td>0.0794</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0vSWgAlfpye0WCGeNmuNhy</th>\n", + " <td>NaN</td>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " <td>98821</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>218.050</td>\n", + " <td>0.0138</td>\n", + " <td>0.850</td>\n", + " <td>0.893</td>\n", + " 
<td>0.000004</td>\n", + " <td>0.3720</td>\n", + " <td>-4.783</td>\n", + " <td>0.0623</td>\n", + " <td>0.0391</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7EL7ifncK2PWFYThJjzR25</th>\n", + " <td>NaN</td>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>189.938</td>\n", + " <td>0.1870</td>\n", + " <td>0.864</td>\n", + " <td>0.365</td>\n", + " <td>0.000000</td>\n", + " <td>0.1160</td>\n", + " <td>-10.219</td>\n", + " <td>0.0655</td>\n", + " <td>0.0478</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1umsRbM7L4ju7rn9aU8Ju6</th>\n", + " <td>NaN</td>\n", + " <td>Sacrifice</td>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>139.990</td>\n", + " <td>0.1450</td>\n", + " <td>0.767</td>\n", + " <td>0.576</td>\n", + " <td>0.000003</td>\n", + " <td>0.0968</td>\n", + " <td>-9.683</td>\n", + " <td>0.2560</td>\n", + " <td>0.1870</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4SKqOHKYU5pgHr5UiVKiQN</th>\n", + " <td>NaN</td>\n", + " <td>Backpack</td>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>128.014</td>\n", + " <td>0.0077</td>\n", + " <td>0.765</td>\n", + " <td>0.726</td>\n", + " <td>0.000000</td>\n", + " <td>0.6190</td>\n", + " <td>-5.580</td>\n", + " <td>0.1910</td>\n", + " <td>0.2700</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3uE1swbcRp5BrO64UNy6Ex</th>\n", + " <td>NaN</td>\n", + " <td>TakingOutTheTrash</td>\n", + " <td>Dark Trap</td>\n", + " <td>192833</td>\n", + " <td>11</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>120.004</td>\n", + " <td>0.1720</td>\n", + " <td>0.814</td>\n", + " <td>0.575</td>\n", + " <td>0.000291</td>\n", + " <td>0.1090</td>\n", + " <td>-9.635</td>\n", + " <td>0.0635</td>\n", + " <td>0.2880</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3KJrwOuqiEwHq6QTreZT61</th>\n", + " <td>NaN</td>\n", + " <td>Io sono 
qui</td>\n", + " <td>Dark Trap</td>\n", + " <td>180880</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>128.066</td>\n", + " <td>0.0987</td>\n", + " <td>0.812</td>\n", + " <td>0.813</td>\n", + " <td>0.000150</td>\n", + " <td>0.0758</td>\n", + " <td>-5.583</td>\n", + " <td>0.0984</td>\n", + " <td>0.3480</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4QhUXx4ON40TIBrZIlnIke</th>\n", + " <td>NaN</td>\n", + " <td>Murder</td>\n", + " <td>Dark Trap</td>\n", + " <td>186261</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>114.956</td>\n", + " <td>0.0343</td>\n", + " <td>0.602</td>\n", + " <td>0.578</td>\n", + " <td>0.000000</td>\n", + " <td>0.1640</td>\n", + " <td>-5.610</td>\n", + " <td>0.0283</td>\n", + " <td>0.1560</td>\n", + " </tr>\n", + " <tr>\n", + " <th>09320vyX4qHd4GjHIpy5w0</th>\n", + " <td>NaN</td>\n", + " <td>High 'N Mighty</td>\n", + " <td>Dark Trap</td>\n", + " <td>124676</td>\n", + " <td>7</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>111.958</td>\n", + " <td>0.1120</td>\n", + " <td>0.876</td>\n", + " <td>0.768</td>\n", + " <td>0.000012</td>\n", + " <td>0.2830</td>\n", + " <td>-6.606</td>\n", + " <td>0.2010</td>\n", + " <td>0.7200</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6xEnbXM1us9fDJy2LC0lru</th>\n", + " <td>NaN</td>\n", + " <td>Bang Ya Fucking Head</td>\n", + " <td>Dark Trap</td>\n", + " <td>154929</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>125.013</td>\n", + " <td>0.0525</td>\n", + " <td>0.690</td>\n", + " <td>0.760</td>\n", + " <td>0.000000</td>\n", + " <td>0.1340</td>\n", + " <td>-5.431</td>\n", + " <td>0.0895</td>\n", + " <td>0.0797</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " title song_name genre duration_ms \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p NaN Pathology Dark Trap 224427 \n", + "0vSWgAlfpye0WCGeNmuNhy NaN Symbiote Dark Trap 98821 \n", + "7EL7ifncK2PWFYThJjzR25 NaN BRAINFOOD Dark Trap 101172 \n", + 
"1umsRbM7L4ju7rn9aU8Ju6 NaN Sacrifice Dark Trap 96062 \n", + "4SKqOHKYU5pgHr5UiVKiQN NaN Backpack Dark Trap 135079 \n", + "3uE1swbcRp5BrO64UNy6Ex NaN TakingOutTheTrash Dark Trap 192833 \n", + "3KJrwOuqiEwHq6QTreZT61 NaN Io sono qui Dark Trap 180880 \n", + "4QhUXx4ON40TIBrZIlnIke NaN Murder Dark Trap 186261 \n", + "09320vyX4qHd4GjHIpy5w0 NaN High 'N Mighty Dark Trap 124676 \n", + "6xEnbXM1us9fDJy2LC0lru NaN Bang Ya Fucking Head Dark Trap 154929 \n", + "\n", + " key mode time_signature tempo acousticness \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 8 1 4 115.080 0.4010 \n", + "0vSWgAlfpye0WCGeNmuNhy 5 1 4 218.050 0.0138 \n", + "7EL7ifncK2PWFYThJjzR25 8 1 4 189.938 0.1870 \n", + "1umsRbM7L4ju7rn9aU8Ju6 10 0 4 139.990 0.1450 \n", + "4SKqOHKYU5pgHr5UiVKiQN 5 1 4 128.014 0.0077 \n", + "3uE1swbcRp5BrO64UNy6Ex 11 1 4 120.004 0.1720 \n", + "3KJrwOuqiEwHq6QTreZT61 10 0 4 128.066 0.0987 \n", + "4QhUXx4ON40TIBrZIlnIke 0 1 4 114.956 0.0343 \n", + "09320vyX4qHd4GjHIpy5w0 7 1 5 111.958 0.1120 \n", + "6xEnbXM1us9fDJy2LC0lru 1 1 4 125.013 0.0525 \n", + "\n", + " danceability energy instrumentalness liveness \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.719 0.493 0.000000 0.1180 \n", + "0vSWgAlfpye0WCGeNmuNhy 0.850 0.893 0.000004 0.3720 \n", + "7EL7ifncK2PWFYThJjzR25 0.864 0.365 0.000000 0.1160 \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.767 0.576 0.000003 0.0968 \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.765 0.726 0.000000 0.6190 \n", + "3uE1swbcRp5BrO64UNy6Ex 0.814 0.575 0.000291 0.1090 \n", + "3KJrwOuqiEwHq6QTreZT61 0.812 0.813 0.000150 0.0758 \n", + "4QhUXx4ON40TIBrZIlnIke 0.602 0.578 0.000000 0.1640 \n", + "09320vyX4qHd4GjHIpy5w0 0.876 0.768 0.000012 0.2830 \n", + "6xEnbXM1us9fDJy2LC0lru 0.690 0.760 0.000000 0.1340 \n", + "\n", + " loudness speechiness valence \n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p -7.230 0.0794 0.1240 \n", + "0vSWgAlfpye0WCGeNmuNhy -4.783 0.0623 0.0391 \n", + "7EL7ifncK2PWFYThJjzR25 -10.219 0.0655 0.0478 \n", + "1umsRbM7L4ju7rn9aU8Ju6 -9.683 0.2560 0.1870 \n", + 
"4SKqOHKYU5pgHr5UiVKiQN -5.580 0.1910 0.2700 \n", + "3uE1swbcRp5BrO64UNy6Ex -9.635 0.0635 0.2880 \n", + "3KJrwOuqiEwHq6QTreZT61 -5.583 0.0984 0.3480 \n", + "4QhUXx4ON40TIBrZIlnIke -5.610 0.0283 0.1560 \n", + "09320vyX4qHd4GjHIpy5w0 -6.606 0.2010 0.7200 \n", + "6xEnbXM1us9fDJy2LC0lru -5.431 0.0895 0.0797 " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = df.replace(\"\", np.NaN)\n", + "df.head(10) # title is the album name" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Checking for missing values\n", + "\n", + "Syntax: `Series.isna()`\n", + "- Returns a boolean Series\n", + "\n", + "Let's check if any of the \"song_name\"(s) are missing" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JqzSwG5PEZRq", + "outputId": "05529a3d-4a5c-4654-fe05-d04b2c10ae6c" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "id\n", + "7pgJBLVz5VmnL7uGHmRj6p False\n", + "0vSWgAlfpye0WCGeNmuNhy False\n", + "7EL7ifncK2PWFYThJjzR25 False\n", + "1umsRbM7L4ju7rn9aU8Ju6 False\n", + "4SKqOHKYU5pgHr5UiVKiQN False\n", + " ... \n", + "46bXU7Sgj7104ZoXxzz9tM True\n", + "0he2ViGMUO3ajKTxLOfWVT True\n", + "72DAt9Lbpy9EUS29OzQLob True\n", + "6HXgExFVuE1c3cq9QjFCcU True\n", + "6MAAMZImxcvYhRnxDLTufD True\n", + "Name: song_name, Length: 35877, dtype: bool" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"song_name\"].isna()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Review: `Pandas.Series.value_counts()`\n", + "- Returns a new `Series` with unique values from the original `Series` as keys and the count of those unique values as values. 
\n", + "- The returned `Series` is sorted in descending order of counts" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uCLDr8EIGMeJ", + "outputId": "241d6181-d525-4019-a8f2-689939b2ab33" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "False 18342\n", + "True 17535\n", + "Name: song_name, dtype: int64" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# count the number of missing values for song name\n", + "df[\"song_name\"].isna().value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Missing value manipulation\n", + "Syntax: `df.fillna(<REPLACE>)`\n", + "- Returns a new `DataFrame` (or a new `Series` when called on a column); the original object is not modified." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pJ2CIqq9HWvN", + "outputId": "2895e862-18e5-4742-9750-31b130aae668" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " 
</tr>\n", + " <tr>\n", + " <th>id</th>\n", + 
<th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>7pgJBLVz5VmnL7uGHmRj6p</th>\n", + " <td>NaN</td>\n", + " <td>Pathology</td>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>115.080</td>\n", + " <td>0.401000</td>\n", + " <td>0.719</td>\n", + " <td>0.493</td>\n", + " <td>0.000000</td>\n", + " <td>0.1180</td>\n", + " <td>-7.230</td>\n", + " <td>0.0794</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0vSWgAlfpye0WCGeNmuNhy</th>\n", + " <td>NaN</td>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " <td>98821</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>218.050</td>\n", + " <td>0.013800</td>\n", + " <td>0.850</td>\n", + " <td>0.893</td>\n", + " <td>0.000004</td>\n", + " <td>0.3720</td>\n", + " <td>-4.783</td>\n", + " <td>0.0623</td>\n", + " <td>0.0391</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7EL7ifncK2PWFYThJjzR25</th>\n", + " <td>NaN</td>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>189.938</td>\n", + " <td>0.187000</td>\n", + " <td>0.864</td>\n", + " <td>0.365</td>\n", + " <td>0.000000</td>\n", + " <td>0.1160</td>\n", + " <td>-10.219</td>\n", + " <td>0.0655</td>\n", + " <td>0.0478</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1umsRbM7L4ju7rn9aU8Ju6</th>\n", + " <td>NaN</td>\n", + " <td>Sacrifice</td>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>139.990</td>\n", + " <td>0.145000</td>\n", + " <td>0.767</td>\n", + " <td>0.576</td>\n", + " <td>0.000003</td>\n", + " 
<td>0.0968</td>\n", + " <td>-9.683</td>\n", + " <td>0.2560</td>\n", + " <td>0.1870</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4SKqOHKYU5pgHr5UiVKiQN</th>\n", + " <td>NaN</td>\n", + " <td>Backpack</td>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>128.014</td>\n", + " <td>0.007700</td>\n", + " <td>0.765</td>\n", + " <td>0.726</td>\n", + " <td>0.000000</td>\n", + " <td>0.6190</td>\n", + " <td>-5.580</td>\n", + " <td>0.1910</td>\n", + " <td>0.2700</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>46bXU7Sgj7104ZoXxzz9tM</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>150.013</td>\n", + " <td>0.031500</td>\n", + " <td>0.528</td>\n", + " <td>0.693</td>\n", + " <td>0.000345</td>\n", + " <td>0.1210</td>\n", + " <td>-5.148</td>\n", + " <td>0.0304</td>\n", + " <td>0.3940</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0he2ViGMUO3ajKTxLOfWVT</th>\n", + " <td>Greatest Hardstyle Playlist</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>149.928</td>\n", + " <td>0.022500</td>\n", + " <td>0.517</td>\n", + " <td>0.768</td>\n", + " <td>0.000018</td>\n", + " <td>0.2050</td>\n", + " <td>-7.922</td>\n", + " <td>0.0479</td>\n", + " <td>0.3830</td>\n", + " </tr>\n", + " <tr>\n", + " <th>72DAt9Lbpy9EUS29OzQLob</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td>No Song Name</td>\n", + " 
<td>hardstyle</td>\n", + " <td>234823</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>154.935</td>\n", + " <td>0.026000</td>\n", + " <td>0.361</td>\n", + " <td>0.821</td>\n", + " <td>0.000242</td>\n", + " <td>0.3850</td>\n", + " <td>-3.102</td>\n", + " <td>0.0505</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6HXgExFVuE1c3cq9QjFCcU</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>150.042</td>\n", + " <td>0.000551</td>\n", + " <td>0.477</td>\n", + " <td>0.921</td>\n", + " <td>0.029600</td>\n", + " <td>0.0575</td>\n", + " <td>-4.777</td>\n", + " <td>0.0392</td>\n", + " <td>0.4880</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6MAAMZImxcvYhRnxDLTufD</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " <td>9</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>155.047</td>\n", + " <td>0.001890</td>\n", + " <td>0.529</td>\n", + " <td>0.945</td>\n", + " <td>0.000055</td>\n", + " <td>0.4140</td>\n", + " <td>-5.862</td>\n", + " <td>0.0615</td>\n", + " <td>0.1340</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>35877 rows × 16 columns</p>\n", + "</div>" + ], + "text/plain": [ + " title song_name genre \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p NaN Pathology Dark Trap \n", + "0vSWgAlfpye0WCGeNmuNhy NaN Symbiote Dark Trap \n", + "7EL7ifncK2PWFYThJjzR25 NaN BRAINFOOD Dark Trap \n", + "1umsRbM7L4ju7rn9aU8Ju6 NaN Sacrifice Dark Trap \n", + "4SKqOHKYU5pgHr5UiVKiQN NaN Backpack Dark Trap \n", + "... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle No Song Name hardstyle \n", + "0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist No Song Name hardstyle \n", + "72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 No Song Name hardstyle \n", + "6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle No Song Name hardstyle \n", + "6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 No Song Name hardstyle \n", + "\n", + " duration_ms key mode time_signature tempo \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 224427 8 1 4 115.080 \n", + "0vSWgAlfpye0WCGeNmuNhy 98821 5 1 4 218.050 \n", + "7EL7ifncK2PWFYThJjzR25 101172 8 1 4 189.938 \n", + "1umsRbM7L4ju7rn9aU8Ju6 96062 10 0 4 139.990 \n", + "4SKqOHKYU5pgHr5UiVKiQN 135079 5 1 4 128.014 \n", + "... ... ... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013 \n", + "0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928 \n", + "72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935 \n", + "6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042 \n", + "6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047 \n", + "\n", + " acousticness danceability energy instrumentalness \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.401000 0.719 0.493 0.000000 \n", + "0vSWgAlfpye0WCGeNmuNhy 0.013800 0.850 0.893 0.000004 \n", + "7EL7ifncK2PWFYThJjzR25 0.187000 0.864 0.365 0.000000 \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.145000 0.767 0.576 0.000003 \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.007700 0.765 0.726 0.000000 \n", + "... ... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018 \n", + "72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600 \n", + "6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055 \n", + "\n", + " liveness loudness speechiness valence \n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.1180 -7.230 0.0794 0.1240 \n", + "0vSWgAlfpye0WCGeNmuNhy 0.3720 -4.783 0.0623 0.0391 \n", + "7EL7ifncK2PWFYThJjzR25 0.1160 -10.219 0.0655 0.0478 \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.0968 -9.683 0.2560 0.1870 \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.6190 -5.580 0.1910 0.2700 \n", + "... ... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.3940 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.3830 \n", + "72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.1240 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.4880 \n", + "6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.1340 \n", + "\n", + "[35877 rows x 16 columns]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# use .fillna to replace missing values\n", + "df[\"song_name\"].fillna(\"No Song Name\")\n", + "\n", + "# to replace the original DataFrame's column, you need to explicitly update that object instance\n", + "df[\"song_name\"] = df[\"song_name\"].fillna(\"No Song Name\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dropping missing values\n", + "Syntax: `df.dropna()`\n", + "- Returns a new DataFrame object instance reference." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 145 + }, + "id": "O_1ZeHG8N-rB", + "outputId": "3b112da2-2b3c-4fb8-c7ae-dc2f2127856d" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " </tr>\n", + " <tr>\n", + " <th>id</th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>5LzAV6KfjN8VhWCedeygfY</th>\n", + " <td>Dirtybird Players</td>\n", + " <td>No Song Name</td>\n", + " <td>techhouse</td>\n", + " <td>197499</td>\n", + " <td>7</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>127.997</td>\n", + " <td>0.000957</td>\n", + " <td>0.806</td>\n", + " <td>0.950</td>\n", + " <td>0.920000</td>\n", + " <td>0.1130</td>\n", + " <td>-6.782</td>\n", + " <td>0.0811</td>\n", + " <td>0.580</td>\n", + " </tr>\n", + " <tr>\n", + " 
<th>3TsCb6ueD678XBJDiRrvhr</th>\n", + " <td>tech house</td>\n", + " <td>No Song Name</td>\n", + " <td>techhouse</td>\n", + " <td>206000</td>\n", + " <td>10</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>124.994</td>\n", + " <td>0.062300</td>\n", + " <td>0.729</td>\n", + " <td>0.978</td>\n", + " <td>0.908000</td>\n", + " <td>0.0353</td>\n", + " <td>-6.645</td>\n", + " <td>0.0420</td>\n", + " <td>0.778</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6Y0Fy2buEis7bEOlG0QET1</th>\n", + " <td>Tech House Bangerz</td>\n", + " <td>No Song Name</td>\n", + " <td>techhouse</td>\n", + " <td>199839</td>\n", + " <td>4</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>124.006</td>\n", + " <td>0.019100</td>\n", + " <td>0.724</td>\n", + " <td>0.792</td>\n", + " <td>0.812000</td>\n", + " <td>0.1080</td>\n", + " <td>-8.555</td>\n", + " <td>0.0405</td>\n", + " <td>0.346</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4EJI2XGViSQp6WscLKgYDD</th>\n", + " <td>tech house</td>\n", + " <td>No Song Name</td>\n", + " <td>techhouse</td>\n", + " <td>173861</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>125.031</td>\n", + " <td>0.053000</td>\n", + " <td>0.700</td>\n", + " <td>0.898</td>\n", + " <td>0.418000</td>\n", + " <td>0.5740</td>\n", + " <td>-6.099</td>\n", + " <td>0.2570</td>\n", + " <td>0.791</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4x6VzOQTLIrkkCWcDPh5Y0</th>\n", + " <td>blanc | Tech House</td>\n", + " <td>No Song Name</td>\n", + " <td>techhouse</td>\n", + " <td>394960</td>\n", + " <td>8</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>127.029</td>\n", + " <td>0.000301</td>\n", + " <td>0.803</td>\n", + " <td>0.919</td>\n", + " <td>0.926000</td>\n", + " <td>0.1020</td>\n", + " <td>-8.667</td>\n", + " <td>0.0702</td>\n", + " <td>0.754</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " 
<td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>46bXU7Sgj7104ZoXxzz9tM</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>150.013</td>\n", + " <td>0.031500</td>\n", + " <td>0.528</td>\n", + " <td>0.693</td>\n", + " <td>0.000345</td>\n", + " <td>0.1210</td>\n", + " <td>-5.148</td>\n", + " <td>0.0304</td>\n", + " <td>0.394</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0he2ViGMUO3ajKTxLOfWVT</th>\n", + " <td>Greatest Hardstyle Playlist</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>149.928</td>\n", + " <td>0.022500</td>\n", + " <td>0.517</td>\n", + " <td>0.768</td>\n", + " <td>0.000018</td>\n", + " <td>0.2050</td>\n", + " <td>-7.922</td>\n", + " <td>0.0479</td>\n", + " <td>0.383</td>\n", + " </tr>\n", + " <tr>\n", + " <th>72DAt9Lbpy9EUS29OzQLob</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>234823</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>154.935</td>\n", + " <td>0.026000</td>\n", + " <td>0.361</td>\n", + " <td>0.821</td>\n", + " <td>0.000242</td>\n", + " <td>0.3850</td>\n", + " <td>-3.102</td>\n", + " <td>0.0505</td>\n", + " <td>0.124</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6HXgExFVuE1c3cq9QjFCcU</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>150.042</td>\n", + " <td>0.000551</td>\n", + " <td>0.477</td>\n", + " <td>0.921</td>\n", + " <td>0.029600</td>\n", + " <td>0.0575</td>\n", + " <td>-4.777</td>\n", + " 
<td>0.0392</td>\n", + " <td>0.488</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6MAAMZImxcvYhRnxDLTufD</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " <td>9</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>155.047</td>\n", + " <td>0.001890</td>\n", + " <td>0.529</td>\n", + " <td>0.945</td>\n", + " <td>0.000055</td>\n", + " <td>0.4140</td>\n", + " <td>-5.862</td>\n", + " <td>0.0615</td>\n", + " <td>0.134</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>17529 rows × 16 columns</p>\n", + "</div>" + ], + "text/plain": [ + " title song_name genre \\\n", + "id \n", + "5LzAV6KfjN8VhWCedeygfY Dirtybird Players No Song Name techhouse \n", + "3TsCb6ueD678XBJDiRrvhr tech house No Song Name techhouse \n", + "6Y0Fy2buEis7bEOlG0QET1 Tech House Bangerz No Song Name techhouse \n", + "4EJI2XGViSQp6WscLKgYDD tech house No Song Name techhouse \n", + "4x6VzOQTLIrkkCWcDPh5Y0 blanc | Tech House No Song Name techhouse \n", + "... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle No Song Name hardstyle \n", + "0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist No Song Name hardstyle \n", + "72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 No Song Name hardstyle \n", + "6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle No Song Name hardstyle \n", + "6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 No Song Name hardstyle \n", + "\n", + " duration_ms key mode time_signature tempo \\\n", + "id \n", + "5LzAV6KfjN8VhWCedeygfY 197499 7 1 4 127.997 \n", + "3TsCb6ueD678XBJDiRrvhr 206000 10 1 4 124.994 \n", + "6Y0Fy2buEis7bEOlG0QET1 199839 4 0 4 124.006 \n", + "4EJI2XGViSQp6WscLKgYDD 173861 8 1 4 125.031 \n", + "4x6VzOQTLIrkkCWcDPh5Y0 394960 8 0 4 127.029 \n", + "... ... ... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013 \n", + "0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928 \n", + "72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935 \n", + "6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042 \n", + "6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047 \n", + "\n", + " acousticness danceability energy instrumentalness \\\n", + "id \n", + "5LzAV6KfjN8VhWCedeygfY 0.000957 0.806 0.950 0.920000 \n", + "3TsCb6ueD678XBJDiRrvhr 0.062300 0.729 0.978 0.908000 \n", + "6Y0Fy2buEis7bEOlG0QET1 0.019100 0.724 0.792 0.812000 \n", + "4EJI2XGViSQp6WscLKgYDD 0.053000 0.700 0.898 0.418000 \n", + "4x6VzOQTLIrkkCWcDPh5Y0 0.000301 0.803 0.919 0.926000 \n", + "... ... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018 \n", + "72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600 \n", + "6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055 \n", + "\n", + " liveness loudness speechiness valence \n", + "id \n", + "5LzAV6KfjN8VhWCedeygfY 0.1130 -6.782 0.0811 0.580 \n", + "3TsCb6ueD678XBJDiRrvhr 0.0353 -6.645 0.0420 0.778 \n", + "6Y0Fy2buEis7bEOlG0QET1 0.1080 -8.555 0.0405 0.346 \n", + "4EJI2XGViSQp6WscLKgYDD 0.5740 -6.099 0.2570 0.791 \n", + "4x6VzOQTLIrkkCWcDPh5Y0 0.1020 -8.667 0.0702 0.754 \n", + "... ... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.394 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.383 \n", + "72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.124 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.488 \n", + "6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.134 \n", + "\n", + "[17529 rows x 16 columns]" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# .dropna drops every row that contains at least one NaN\n", + "df.dropna()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ggttXEqUbI_E" + }, + "source": [ + "### Review: `pandas.Series.apply(...)`\n", + "Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`\n", + "- Applies the input function to every element of the `Series`.\n", + "- Returns a new `Series` object instance reference.\n", + "\n", + "Let's apply a transformation function to the `mode` column `Series`:\n", + "- mode = 1 means major modality (sounds happy)\n", + "- mode = 0 means minor modality (sounds sad)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "def replace_mode(m): \n", + " if m == 1: \n", + " return \"major\"\n", + " else: \n", + " return \"minor\"" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "id\n", + "7pgJBLVz5VmnL7uGHmRj6p major\n", + "0vSWgAlfpye0WCGeNmuNhy major\n", + "7EL7ifncK2PWFYThJjzR25 major\n", + "1umsRbM7L4ju7rn9aU8Ju6 minor\n", + "4SKqOHKYU5pgHr5UiVKiQN major\n", + " ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM major\n", + "0he2ViGMUO3ajKTxLOfWVT minor\n", + "72DAt9Lbpy9EUS29OzQLob major\n", + "6HXgExFVuE1c3cq9QjFCcU minor\n", + "6MAAMZImxcvYhRnxDLTufD major\n", + "Name: mode, Length: 35877, dtype: object" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"mode\"].apply(replace_mode)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `lambda`\n", + "\n", + "Let's write a `lambda` function instead of the `replace_mode` function" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9AJ3p-_TarnN", + "outputId": "a087df5d-2002-417c-e99c-5e6fc8ea9809" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "id\n", + "7pgJBLVz5VmnL7uGHmRj6p major\n", + "0vSWgAlfpye0WCGeNmuNhy major\n", + "7EL7ifncK2PWFYThJjzR25 major\n", + "1umsRbM7L4ju7rn9aU8Ju6 minor\n", + "4SKqOHKYU5pgHr5UiVKiQN major\n", + " ... \n", + "46bXU7Sgj7104ZoXxzz9tM major\n", + "0he2ViGMUO3ajKTxLOfWVT minor\n", + "72DAt9Lbpy9EUS29OzQLob major\n", + "6HXgExFVuE1c3cq9QjFCcU minor\n", + "6MAAMZImxcvYhRnxDLTufD major\n", + "Name: mode, Length: 35877, dtype: object" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"mode\"].apply(lambda m: \"major\" if m == 1 else \"minor\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Typically transformed columns are added as new columns within the DataFrame.\n", + "Let's add a new `modified_mode` column." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " <th>modified_mode</th>\n", + " </tr>\n", + " <tr>\n", + " <th>id</th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>7pgJBLVz5VmnL7uGHmRj6p</th>\n", + " <td>NaN</td>\n", + " <td>Pathology</td>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>115.080</td>\n", + " <td>0.401000</td>\n", + " <td>0.719</td>\n", + " <td>0.493</td>\n", + " <td>0.000000</td>\n", + " <td>0.1180</td>\n", + " <td>-7.230</td>\n", + " <td>0.0794</td>\n", + " <td>0.1240</td>\n", + " <td>major</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0vSWgAlfpye0WCGeNmuNhy</th>\n", + " <td>NaN</td>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " 
<td>98821</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>218.050</td>\n", + " <td>0.013800</td>\n", + " <td>0.850</td>\n", + " <td>0.893</td>\n", + " <td>0.000004</td>\n", + " <td>0.3720</td>\n", + " <td>-4.783</td>\n", + " <td>0.0623</td>\n", + " <td>0.0391</td>\n", + " <td>major</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7EL7ifncK2PWFYThJjzR25</th>\n", + " <td>NaN</td>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>189.938</td>\n", + " <td>0.187000</td>\n", + " <td>0.864</td>\n", + " <td>0.365</td>\n", + " <td>0.000000</td>\n", + " <td>0.1160</td>\n", + " <td>-10.219</td>\n", + " <td>0.0655</td>\n", + " <td>0.0478</td>\n", + " <td>major</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1umsRbM7L4ju7rn9aU8Ju6</th>\n", + " <td>NaN</td>\n", + " <td>Sacrifice</td>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>139.990</td>\n", + " <td>0.145000</td>\n", + " <td>0.767</td>\n", + " <td>0.576</td>\n", + " <td>0.000003</td>\n", + " <td>0.0968</td>\n", + " <td>-9.683</td>\n", + " <td>0.2560</td>\n", + " <td>0.1870</td>\n", + " <td>minor</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4SKqOHKYU5pgHr5UiVKiQN</th>\n", + " <td>NaN</td>\n", + " <td>Backpack</td>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>128.014</td>\n", + " <td>0.007700</td>\n", + " <td>0.765</td>\n", + " <td>0.726</td>\n", + " <td>0.000000</td>\n", + " <td>0.6190</td>\n", + " <td>-5.580</td>\n", + " <td>0.1910</td>\n", + " <td>0.2700</td>\n", + " <td>major</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " 
<td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>46bXU7Sgj7104ZoXxzz9tM</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>150.013</td>\n", + " <td>0.031500</td>\n", + " <td>0.528</td>\n", + " <td>0.693</td>\n", + " <td>0.000345</td>\n", + " <td>0.1210</td>\n", + " <td>-5.148</td>\n", + " <td>0.0304</td>\n", + " <td>0.3940</td>\n", + " <td>major</td>\n", + " </tr>\n", + " <tr>\n", + " <th>0he2ViGMUO3ajKTxLOfWVT</th>\n", + " <td>Greatest Hardstyle Playlist</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>149.928</td>\n", + " <td>0.022500</td>\n", + " <td>0.517</td>\n", + " <td>0.768</td>\n", + " <td>0.000018</td>\n", + " <td>0.2050</td>\n", + " <td>-7.922</td>\n", + " <td>0.0479</td>\n", + " <td>0.3830</td>\n", + " <td>minor</td>\n", + " </tr>\n", + " <tr>\n", + " <th>72DAt9Lbpy9EUS29OzQLob</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>234823</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>154.935</td>\n", + " <td>0.026000</td>\n", + " <td>0.361</td>\n", + " <td>0.821</td>\n", + " <td>0.000242</td>\n", + " <td>0.3850</td>\n", + " <td>-3.102</td>\n", + " <td>0.0505</td>\n", + " <td>0.1240</td>\n", + " <td>major</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6HXgExFVuE1c3cq9QjFCcU</th>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>150.042</td>\n", + " <td>0.000551</td>\n", + " <td>0.477</td>\n", + " <td>0.921</td>\n", + " <td>0.029600</td>\n", + " <td>0.0575</td>\n", + " <td>-4.777</td>\n", + " 
<td>0.0392</td>\n", + " <td>0.4880</td>\n", + " <td>minor</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6MAAMZImxcvYhRnxDLTufD</th>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td>No Song Name</td>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " <td>9</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>155.047</td>\n", + " <td>0.001890</td>\n", + " <td>0.529</td>\n", + " <td>0.945</td>\n", + " <td>0.000055</td>\n", + " <td>0.4140</td>\n", + " <td>-5.862</td>\n", + " <td>0.0615</td>\n", + " <td>0.1340</td>\n", + " <td>major</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>35877 rows × 17 columns</p>\n", + "</div>" + ], + "text/plain": [ + " title song_name genre \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p NaN Pathology Dark Trap \n", + "0vSWgAlfpye0WCGeNmuNhy NaN Symbiote Dark Trap \n", + "7EL7ifncK2PWFYThJjzR25 NaN BRAINFOOD Dark Trap \n", + "1umsRbM7L4ju7rn9aU8Ju6 NaN Sacrifice Dark Trap \n", + "4SKqOHKYU5pgHr5UiVKiQN NaN Backpack Dark Trap \n", + "... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle No Song Name hardstyle \n", + "0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist No Song Name hardstyle \n", + "72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 No Song Name hardstyle \n", + "6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle No Song Name hardstyle \n", + "6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 No Song Name hardstyle \n", + "\n", + " duration_ms key mode time_signature tempo \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 224427 8 1 4 115.080 \n", + "0vSWgAlfpye0WCGeNmuNhy 98821 5 1 4 218.050 \n", + "7EL7ifncK2PWFYThJjzR25 101172 8 1 4 189.938 \n", + "1umsRbM7L4ju7rn9aU8Ju6 96062 10 0 4 139.990 \n", + "4SKqOHKYU5pgHr5UiVKiQN 135079 5 1 4 128.014 \n", + "... ... ... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM 269208 4 1 4 150.013 \n", + "0he2ViGMUO3ajKTxLOfWVT 210112 0 0 4 149.928 \n", + "72DAt9Lbpy9EUS29OzQLob 234823 8 1 4 154.935 \n", + "6HXgExFVuE1c3cq9QjFCcU 323200 6 0 4 150.042 \n", + "6MAAMZImxcvYhRnxDLTufD 162161 9 1 4 155.047 \n", + "\n", + " acousticness danceability energy instrumentalness \\\n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.401000 0.719 0.493 0.000000 \n", + "0vSWgAlfpye0WCGeNmuNhy 0.013800 0.850 0.893 0.000004 \n", + "7EL7ifncK2PWFYThJjzR25 0.187000 0.864 0.365 0.000000 \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.145000 0.767 0.576 0.000003 \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.007700 0.765 0.726 0.000000 \n", + "... ... ... ... ... \n", + "46bXU7Sgj7104ZoXxzz9tM 0.031500 0.528 0.693 0.000345 \n", + "0he2ViGMUO3ajKTxLOfWVT 0.022500 0.517 0.768 0.000018 \n", + "72DAt9Lbpy9EUS29OzQLob 0.026000 0.361 0.821 0.000242 \n", + "6HXgExFVuE1c3cq9QjFCcU 0.000551 0.477 0.921 0.029600 \n", + "6MAAMZImxcvYhRnxDLTufD 0.001890 0.529 0.945 0.000055 \n", + "\n", + " liveness loudness speechiness valence modified_mode \n", + "id \n", + "7pgJBLVz5VmnL7uGHmRj6p 0.1180 -7.230 0.0794 0.1240 major \n", + "0vSWgAlfpye0WCGeNmuNhy 0.3720 -4.783 0.0623 0.0391 major \n", + "7EL7ifncK2PWFYThJjzR25 0.1160 -10.219 0.0655 0.0478 major \n", + "1umsRbM7L4ju7rn9aU8Ju6 0.0968 -9.683 0.2560 0.1870 minor \n", + "4SKqOHKYU5pgHr5UiVKiQN 0.6190 -5.580 0.1910 0.2700 major \n", + "... ... ... ... ... ... 
\n", + "46bXU7Sgj7104ZoXxzz9tM 0.1210 -5.148 0.0304 0.3940 major \n", + "0he2ViGMUO3ajKTxLOfWVT 0.2050 -7.922 0.0479 0.3830 minor \n", + "72DAt9Lbpy9EUS29OzQLob 0.3850 -3.102 0.0505 0.1240 major \n", + "6HXgExFVuE1c3cq9QjFCcU 0.0575 -4.777 0.0392 0.4880 minor \n", + "6MAAMZImxcvYhRnxDLTufD 0.4140 -5.862 0.0615 0.1340 major \n", + "\n", + "[35877 rows x 17 columns]" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"modified_mode\"] = df[\"mode\"].apply(lambda m: \"major\" if m == 1 else \"minor\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Let's go back to the original table from the SQL database" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "ZoiyUleiyhMg" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>id</th>\n", + " <th>title</th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " <th>key</th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th>tempo</th>\n", + " <th>acousticness</th>\n", + " <th>danceability</th>\n", + " <th>energy</th>\n", + " <th>instrumentalness</th>\n", + " <th>liveness</th>\n", + " <th>loudness</th>\n", + " <th>speechiness</th>\n", + " <th>valence</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>7pgJBLVz5VmnL7uGHmRj6p</td>\n", + " <td></td>\n", + " <td>Pathology</td>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " 
<td>4</td>\n", + " <td>115.080</td>\n", + " <td>0.401000</td>\n", + " <td>0.719</td>\n", + " <td>0.493</td>\n", + " <td>0.000000</td>\n", + " <td>0.1180</td>\n", + " <td>-7.230</td>\n", + " <td>0.0794</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>0vSWgAlfpye0WCGeNmuNhy</td>\n", + " <td></td>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " <td>98821</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>218.050</td>\n", + " <td>0.013800</td>\n", + " <td>0.850</td>\n", + " <td>0.893</td>\n", + " <td>0.000004</td>\n", + " <td>0.3720</td>\n", + " <td>-4.783</td>\n", + " <td>0.0623</td>\n", + " <td>0.0391</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>7EL7ifncK2PWFYThJjzR25</td>\n", + " <td></td>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>189.938</td>\n", + " <td>0.187000</td>\n", + " <td>0.864</td>\n", + " <td>0.365</td>\n", + " <td>0.000000</td>\n", + " <td>0.1160</td>\n", + " <td>-10.219</td>\n", + " <td>0.0655</td>\n", + " <td>0.0478</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>1umsRbM7L4ju7rn9aU8Ju6</td>\n", + " <td></td>\n", + " <td>Sacrifice</td>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>139.990</td>\n", + " <td>0.145000</td>\n", + " <td>0.767</td>\n", + " <td>0.576</td>\n", + " <td>0.000003</td>\n", + " <td>0.0968</td>\n", + " <td>-9.683</td>\n", + " <td>0.2560</td>\n", + " <td>0.1870</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>4SKqOHKYU5pgHr5UiVKiQN</td>\n", + " <td></td>\n", + " <td>Backpack</td>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " <td>5</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>128.014</td>\n", + " <td>0.007700</td>\n", + " <td>0.765</td>\n", + " <td>0.726</td>\n", + " <td>0.000000</td>\n", + " <td>0.6190</td>\n", + " 
<td>-5.580</td>\n", + " <td>0.1910</td>\n", + " <td>0.2700</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35872</th>\n", + " <td>46bXU7Sgj7104ZoXxzz9tM</td>\n", + " <td>Euphoric Hardstyle</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>150.013</td>\n", + " <td>0.031500</td>\n", + " <td>0.528</td>\n", + " <td>0.693</td>\n", + " <td>0.000345</td>\n", + " <td>0.1210</td>\n", + " <td>-5.148</td>\n", + " <td>0.0304</td>\n", + " <td>0.3940</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35873</th>\n", + " <td>0he2ViGMUO3ajKTxLOfWVT</td>\n", + " <td>Greatest Hardstyle Playlist</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>149.928</td>\n", + " <td>0.022500</td>\n", + " <td>0.517</td>\n", + " <td>0.768</td>\n", + " <td>0.000018</td>\n", + " <td>0.2050</td>\n", + " <td>-7.922</td>\n", + " <td>0.0479</td>\n", + " <td>0.3830</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35874</th>\n", + " <td>72DAt9Lbpy9EUS29OzQLob</td>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>234823</td>\n", + " <td>8</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>154.935</td>\n", + " <td>0.026000</td>\n", + " <td>0.361</td>\n", + " <td>0.821</td>\n", + " <td>0.000242</td>\n", + " <td>0.3850</td>\n", + " <td>-3.102</td>\n", + " <td>0.0505</td>\n", + " <td>0.1240</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35875</th>\n", + " <td>6HXgExFVuE1c3cq9QjFCcU</td>\n", + " 
<td>Euphoric Hardstyle</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>150.042</td>\n", + " <td>0.000551</td>\n", + " <td>0.477</td>\n", + " <td>0.921</td>\n", + " <td>0.029600</td>\n", + " <td>0.0575</td>\n", + " <td>-4.777</td>\n", + " <td>0.0392</td>\n", + " <td>0.4880</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35876</th>\n", + " <td>6MAAMZImxcvYhRnxDLTufD</td>\n", + " <td>Best of Hardstyle 2020</td>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " <td>9</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>155.047</td>\n", + " <td>0.001890</td>\n", + " <td>0.529</td>\n", + " <td>0.945</td>\n", + " <td>0.000055</td>\n", + " <td>0.4140</td>\n", + " <td>-5.862</td>\n", + " <td>0.0615</td>\n", + " <td>0.1340</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>35877 rows × 17 columns</p>\n", + "</div>" + ], + "text/plain": [ + " id title song_name \\\n", + "0 7pgJBLVz5VmnL7uGHmRj6p Pathology \n", + "1 0vSWgAlfpye0WCGeNmuNhy Symbiote \n", + "2 7EL7ifncK2PWFYThJjzR25 BRAINFOOD \n", + "3 1umsRbM7L4ju7rn9aU8Ju6 Sacrifice \n", + "4 4SKqOHKYU5pgHr5UiVKiQN Backpack \n", + "... ... ... ... \n", + "35872 46bXU7Sgj7104ZoXxzz9tM Euphoric Hardstyle \n", + "35873 0he2ViGMUO3ajKTxLOfWVT Greatest Hardstyle Playlist \n", + "35874 72DAt9Lbpy9EUS29OzQLob Best of Hardstyle 2020 \n", + "35875 6HXgExFVuE1c3cq9QjFCcU Euphoric Hardstyle \n", + "35876 6MAAMZImxcvYhRnxDLTufD Best of Hardstyle 2020 \n", + "\n", + " genre duration_ms key mode time_signature tempo \\\n", + "0 Dark Trap 224427 8 1 4 115.080 \n", + "1 Dark Trap 98821 5 1 4 218.050 \n", + "2 Dark Trap 101172 8 1 4 189.938 \n", + "3 Dark Trap 96062 10 0 4 139.990 \n", + "4 Dark Trap 135079 5 1 4 128.014 \n", + "... ... ... ... ... ... ... 
\n", + "35872 hardstyle 269208 4 1 4 150.013 \n", + "35873 hardstyle 210112 0 0 4 149.928 \n", + "35874 hardstyle 234823 8 1 4 154.935 \n", + "35875 hardstyle 323200 6 0 4 150.042 \n", + "35876 hardstyle 162161 9 1 4 155.047 \n", + "\n", + " acousticness danceability energy instrumentalness liveness \\\n", + "0 0.401000 0.719 0.493 0.000000 0.1180 \n", + "1 0.013800 0.850 0.893 0.000004 0.3720 \n", + "2 0.187000 0.864 0.365 0.000000 0.1160 \n", + "3 0.145000 0.767 0.576 0.000003 0.0968 \n", + "4 0.007700 0.765 0.726 0.000000 0.6190 \n", + "... ... ... ... ... ... \n", + "35872 0.031500 0.528 0.693 0.000345 0.1210 \n", + "35873 0.022500 0.517 0.768 0.000018 0.2050 \n", + "35874 0.026000 0.361 0.821 0.000242 0.3850 \n", + "35875 0.000551 0.477 0.921 0.029600 0.0575 \n", + "35876 0.001890 0.529 0.945 0.000055 0.4140 \n", + "\n", + " loudness speechiness valence \n", + "0 -7.230 0.0794 0.1240 \n", + "1 -4.783 0.0623 0.0391 \n", + "2 -10.219 0.0655 0.0478 \n", + "3 -9.683 0.2560 0.1870 \n", + "4 -5.580 0.1910 0.2700 \n", + "... ... ... ... \n", + "35872 -5.148 0.0304 0.3940 \n", + "35873 -7.922 0.0479 0.3830 \n", + "35874 -3.102 0.0505 0.1240 \n", + "35875 -4.777 0.0392 0.4880 \n", + "35876 -5.862 0.0615 0.1340 \n", + "\n", + "[35877 rows x 17 columns]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = qry(\"SELECT * FROM spotify\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extract just the \"genre\" and \"duration_ms\" columns from `df`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>genre</th>\n", + " <th>duration_ms</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>Dark Trap</td>\n", + " <td>224427</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>Dark Trap</td>\n", + " <td>98821</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>Dark Trap</td>\n", + " <td>101172</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>Dark Trap</td>\n", + " <td>96062</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>Dark Trap</td>\n", + " <td>135079</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35872</th>\n", + " <td>hardstyle</td>\n", + " <td>269208</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35873</th>\n", + " <td>hardstyle</td>\n", + " <td>210112</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35874</th>\n", + " <td>hardstyle</td>\n", + " <td>234823</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35875</th>\n", + " <td>hardstyle</td>\n", + " <td>323200</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35876</th>\n", + " <td>hardstyle</td>\n", + " <td>162161</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>35877 rows × 2 columns</p>\n", + "</div>" + ], + "text/plain": [ + " genre duration_ms\n", + "0 Dark Trap 224427\n", + "1 Dark Trap 98821\n", + "2 Dark Trap 101172\n", + "3 Dark Trap 96062\n", + "4 Dark Trap 135079\n", + "... ... 
...\n", + "35872 hardstyle 269208\n", + "35873 hardstyle 210112\n", + "35874 hardstyle 234823\n", + "35875 hardstyle 323200\n", + "35876 hardstyle 162161\n", + "\n", + "[35877 rows x 2 columns]" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `pandas.DataFrame.groupby(...)`\n", + "\n", + "Syntax: `DataFrame.groupby(<COLUMN>)`\n", + "- Returns a `DataFrameGroupBy` object, not a `DataFrame`\n", + "- Call an aggregation method (for example, `mean`) on the returned object to get results" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 551 + }, + "id": "trRMgGMysdkb", + "outputId": "d02098c3-7722-4505-c599-5897bb8ace19" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbc472bad90>" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[[\"genre\", \"duration_ms\"]].groupby(\"genre\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the average duration for each genre ordered based on decreasing order of averages?\n", + "#### v1: using `df` (`pandas`) to answer the question" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>duration_ms</th>\n", + " </tr>\n", + " <tr>\n", + 
" <th>genre</th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>Dark Trap</th>\n", + " <td>196059.938997</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Emo</th>\n", + " <td>218370.989519</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Hiphop</th>\n", + " <td>227885.028411</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Pop</th>\n", + " <td>211558.052980</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Rap</th>\n", + " <td>200816.798836</td>\n", + " </tr>\n", + " <tr>\n", + " <th>RnB</th>\n", + " <td>225628.556955</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Trap Metal</th>\n", + " <td>145940.519467</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Underground Rap</th>\n", + " <td>175506.191224</td>\n", + " </tr>\n", + " <tr>\n", + " <th>dnb</th>\n", + " <td>288860.138811</td>\n", + " </tr>\n", + " <tr>\n", + " <th>hardstyle</th>\n", + " <td>232828.626542</td>\n", + " </tr>\n", + " <tr>\n", + " <th>psytrance</th>\n", + " <td>445770.492075</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techhouse</th>\n", + " <td>298395.587596</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techno</th>\n", + " <td>399123.187453</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trance</th>\n", + " <td>288729.366262</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trap</th>\n", + " <td>225149.277731</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " duration_ms\n", + "genre \n", + "Dark Trap 196059.938997\n", + "Emo 218370.989519\n", + "Hiphop 227885.028411\n", + "Pop 211558.052980\n", + "Rap 200816.798836\n", + "RnB 225628.556955\n", + "Trap Metal 145940.519467\n", + "Underground Rap 175506.191224\n", + "dnb 288860.138811\n", + "hardstyle 232828.626542\n", + "psytrance 445770.492075\n", + "techhouse 298395.587596\n", + "techno 399123.187453\n", + "trance 288729.366262\n", + "trap 225149.277731" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[[\"genre\", 
\"duration_ms\"]].groupby(\"genre\").mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>duration_ms</th>\n", + " </tr>\n", + " <tr>\n", + " <th>genre</th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>psytrance</th>\n", + " <td>445770.492075</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techno</th>\n", + " <td>399123.187453</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techhouse</th>\n", + " <td>298395.587596</td>\n", + " </tr>\n", + " <tr>\n", + " <th>dnb</th>\n", + " <td>288860.138811</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trance</th>\n", + " <td>288729.366262</td>\n", + " </tr>\n", + " <tr>\n", + " <th>hardstyle</th>\n", + " <td>232828.626542</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Hiphop</th>\n", + " <td>227885.028411</td>\n", + " </tr>\n", + " <tr>\n", + " <th>RnB</th>\n", + " <td>225628.556955</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trap</th>\n", + " <td>225149.277731</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Emo</th>\n", + " <td>218370.989519</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Pop</th>\n", + " <td>211558.052980</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Rap</th>\n", + " <td>200816.798836</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Dark Trap</th>\n", + " <td>196059.938997</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Underground Rap</th>\n", + " <td>175506.191224</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Trap Metal</th>\n", + " <td>145940.519467</td>\n", + " </tr>\n", + " </tbody>\n", + 
"</table>\n", + "</div>" + ], + "text/plain": [ + " duration_ms\n", + "genre \n", + "psytrance 445770.492075\n", + "techno 399123.187453\n", + "techhouse 298395.587596\n", + "dnb 288860.138811\n", + "trance 288729.366262\n", + "hardstyle 232828.626542\n", + "Hiphop 227885.028411\n", + "RnB 225628.556955\n", + "trap 225149.277731\n", + "Emo 218370.989519\n", + "Pop 211558.052980\n", + "Rap 200816.798836\n", + "Dark Trap 196059.938997\n", + "Underground Rap 175506.191224\n", + "Trap Metal 145940.519467" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[[\"genre\", \"duration_ms\"]].groupby(\"genre\").mean().sort_values(by = \"duration_ms\", ascending = False)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One way to check whether `groupby` works would be to use `value_counts` on the same column `Series`." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Underground Rap 4330\n", + "Dark Trap 3590\n", + "Hiphop 3027\n", + "trance 2804\n", + "psytrance 2650\n", + "techno 2646\n", + "dnb 2507\n", + "trap 2362\n", + "hardstyle 2351\n", + "techhouse 2209\n", + "RnB 1905\n", + "Trap Metal 1875\n", + "Emo 1622\n", + "Rap 1546\n", + "Pop 453\n", + "Name: genre, dtype: int64" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df[\"genre\"].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the average duration for each genre ordered based on decreasing order of averages?\n", + "#### v2: using SQL query to answer the question" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 551 + }, + "id": "89hMTXCKxWG8", + "outputId": "5737da11-aa8a-4ed0-9b05-cd379b28904b" + }, + "outputs": [ + { + "data": { + 
"text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>avg_duration</th>\n", + " </tr>\n", + " <tr>\n", + " <th>genre</th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>psytrance</th>\n", + " <td>445770.492075</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techno</th>\n", + " <td>399123.187453</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techhouse</th>\n", + " <td>298395.587596</td>\n", + " </tr>\n", + " <tr>\n", + " <th>dnb</th>\n", + " <td>288860.138811</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trance</th>\n", + " <td>288729.366262</td>\n", + " </tr>\n", + " <tr>\n", + " <th>hardstyle</th>\n", + " <td>232828.626542</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Hiphop</th>\n", + " <td>227885.028411</td>\n", + " </tr>\n", + " <tr>\n", + " <th>RnB</th>\n", + " <td>225628.556955</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trap</th>\n", + " <td>225149.277731</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Emo</th>\n", + " <td>218370.989519</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Pop</th>\n", + " <td>211558.052980</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Rap</th>\n", + " <td>200816.798836</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Dark Trap</th>\n", + " <td>196059.938997</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Underground Rap</th>\n", + " <td>175506.191224</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Trap Metal</th>\n", + " <td>145940.519467</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " avg_duration\n", + "genre \n", + "psytrance 445770.492075\n", + "techno 399123.187453\n", + "techhouse 
298395.587596\n", + "dnb 288860.138811\n", + "trance 288729.366262\n", + "hardstyle 232828.626542\n", + "Hiphop 227885.028411\n", + "RnB 225628.556955\n", + "trap 225149.277731\n", + "Emo 218370.989519\n", + "Pop 211558.052980\n", + "Rap 200816.798836\n", + "Dark Trap 196059.938997\n", + "Underground Rap 175506.191224\n", + "Trap Metal 145940.519467" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# SQL query equivalent to the pandas expression above\n", + "avg_duration_per_genre = qry(\"\"\"\n", + "SELECT genre, AVG(duration_ms) as avg_duration\n", + "FROM spotify \n", + "GROUP BY genre\n", + "ORDER BY avg_duration DESC\n", + "\"\"\")\n", + "\n", + "# To make the SQL query output match the df.groupby output, set genre as the index\n", + "avg_duration_per_genre = avg_duration_per_genre.set_index(\"genre\")\n", + "avg_duration_per_genre" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "12ZdqYoIy_8U" + }, + "source": [ + "### What is the average speechiness for each mode, time signature pair?\n", + "#### v1: pandas" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + }, + "id": "fVejD2KPyveX", + "outputId": "fe5c8fda-29a2-4f1a-8ff4-de9ad2a3cde0" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th></th>\n", + " <th>speechiness</th>\n", + " </tr>\n", + " <tr>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " 
<tr>\n", + " <th rowspan=\"4\" valign=\"top\">0</th>\n", + " <th>1</th>\n", + " <td>0.181224</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>0.121837</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>0.126688</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5</th>\n", + " <td>0.204890</td>\n", + " </tr>\n", + " <tr>\n", + " <th rowspan=\"4\" valign=\"top\">1</th>\n", + " <th>1</th>\n", + " <td>0.173138</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>0.129512</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>0.139170</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5</th>\n", + " <td>0.220177</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " speechiness\n", + "mode time_signature \n", + "0 1 0.181224\n", + " 3 0.121837\n", + " 4 0.126688\n", + " 5 0.204890\n", + "1 1 0.173138\n", + " 3 0.129512\n", + " 4 0.139170\n", + " 5 0.220177" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# use a list to indicate all the columns you want to groupby \n", + "df[[\"mode\", \"time_signature\", \"speechiness\"]].groupby([\"mode\", \"time_signature\"]).mean()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "ImYEuOMox-ps", + "outputId": "2674dabd-3ff7-4099-fdc3-54e5ba0e2628" + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>mode</th>\n", + " <th>time_signature</th>\n", + " 
<th>avg_speechiness</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>0.181224</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>0.121837</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>0.126688</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>0</td>\n", + " <td>5</td>\n", + " <td>0.204890</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0.173138</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5</th>\n", + " <td>1</td>\n", + " <td>3</td>\n", + " <td>0.129512</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6</th>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>0.139170</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7</th>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>0.220177</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " mode time_signature avg_speechiness\n", + "0 0 1 0.181224\n", + "1 0 3 0.121837\n", + "2 0 4 0.126688\n", + "3 0 5 0.204890\n", + "4 1 1 0.173138\n", + "5 1 3 0.129512\n", + "6 1 4 0.139170\n", + "7 1 5 0.220177" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# SQL query equivalent to the pandas expression above\n", + "qry(\"\"\"\n", + "SELECT mode, time_signature, AVG(speechiness) as avg_speechiness\n", + "FROM spotify \n", + "GROUP BY mode, time_signature\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sEDc5zGu0bc9" + }, + "source": [ + "### Self-practice" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Which songs have a tempo greater than 150 and what are their genres?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>18</th>\n", + " <td>FunnyToSeeYouHere</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>19</th>\n", + " <td>Killer</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>20</th>\n", + " <td>608</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35871</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35872</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35874</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35875</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35876</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>13753 rows × 2 columns</p>\n", + "</div>" + ], + "text/plain": [ + " song_name genre\n", + "1 Symbiote Dark Trap\n", + "2 BRAINFOOD Dark Trap\n", + "18 FunnyToSeeYouHere Dark Trap\n", + "19 Killer Dark Trap\n", + "20 608 Dark Trap\n", + "... ... 
...\n", + "35871 hardstyle\n", + "35872 hardstyle\n", + "35874 hardstyle\n", + "35875 hardstyle\n", + "35876 hardstyle\n", + "\n", + "[13753 rows x 2 columns]" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v1: pandas\n", + "fast_songs = df[df[\"tempo\"] > 150]\n", + "fast_songs[[\"song_name\", \"genre\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>song_name</th>\n", + " <th>genre</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>Symbiote</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>BRAINFOOD</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>FunnyToSeeYouHere</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>Killer</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>608</td>\n", + " <td>Dark Trap</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13748</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13749</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13750</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13751</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " <tr>\n", 
+ " <th>13752</th>\n", + " <td></td>\n", + " <td>hardstyle</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>13753 rows × 2 columns</p>\n", + "</div>" + ], + "text/plain": [ + " song_name genre\n", + "0 Symbiote Dark Trap\n", + "1 BRAINFOOD Dark Trap\n", + "2 FunnyToSeeYouHere Dark Trap\n", + "3 Killer Dark Trap\n", + "4 608 Dark Trap\n", + "... ... ...\n", + "13748 hardstyle\n", + "13749 hardstyle\n", + "13750 hardstyle\n", + "13751 hardstyle\n", + "13752 hardstyle\n", + "\n", + "[13753 rows x 2 columns]" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v2: SQL\n", + "\n", + "qry(\"\"\"\n", + "SELECT song_name, genre\n", + "FROM spotify\n", + "WHERE tempo > 150\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the sum of danceability and liveness for \"Hiphop\" genre songs?" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "15321 0.8416\n", + "15322 0.9201\n", + "15323 0.8580\n", + "15324 0.8240\n", + "15325 0.9348\n", + " ... \n", + "18343 0.6690\n", + "18344 0.5370\n", + "18345 0.8850\n", + "18346 0.8770\n", + "18347 0.8703\n", + "Length: 3027, dtype: float64" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v1: pandas\n", + "hiphop_songs = df[df[\"genre\"] == \"Hiphop\"]\n", + "hiphop_songs[\"danceability\"] + hiphop_songs[\"liveness\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 0.8416\n", + "1 0.9201\n", + "2 0.8580\n", + "3 0.8240\n", + "4 0.9348\n", + " ... 
\n", + "3022 0.6690\n", + "3023 0.5370\n", + "3024 0.8850\n", + "3025 0.8770\n", + "3026 0.8703\n", + "Name: song_score, Length: 3027, dtype: float64" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v2: SQL\n", + "hiphop_songs = qry(\"\"\"\n", + "SELECT danceability + liveness as song_score\n", + "FROM spotify\n", + "WHERE genre = 'Hiphop'\n", + "\"\"\")\n", + "hiphop_songs[\"song_score\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Find all song_name ordered by ascending order of duration_ms. Eliminate songs which don't have a song_name" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "songs_by_duration = list(df.sort_values(by = \"duration_ms\")[\"song_name\"])\n", + "# [song for song in songs_by_duration if song != \"\"] # uncomment to see the output" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "songs_by_duration = qry(\"\"\"\n", + "SELECT song_name\n", + "FROM spotify\n", + "WHERE song_name != ''\n", + "ORDER BY duration_ms\n", + "\"\"\")\n", + "songs_by_duration = list(songs_by_duration[\"song_name\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How many distinct \"genre\"s are there in the dataset?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['trance',\n", + " 'techno',\n", + " 'dnb',\n", + " 'Trap Metal',\n", + " 'RnB',\n", + " 'Pop',\n", + " 'psytrance',\n", + " 'techhouse',\n", + " 'trap',\n", + " 'Dark Trap',\n", + " 'Emo',\n", + " 'Underground Rap',\n", + " 'Rap',\n", + " 'Hiphop',\n", + " 'hardstyle']" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v1: pandas\n", + "list(set(list(df[\"genre\"])))" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['Dark Trap',\n", + " 'Underground Rap',\n", + " 'Trap Metal',\n", + " 'Emo',\n", + " 'Rap',\n", + " 'RnB',\n", + " 'Pop',\n", + " 'Hiphop',\n", + " 'techhouse',\n", + " 'techno',\n", + " 'trance',\n", + " 'psytrance',\n", + " 'trap',\n", + " 'dnb',\n", + " 'hardstyle']" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v2: SQL\n", + "genres = qry(\"\"\"\n", + "SELECT DISTINCT genre\n", + "FROM spotify\n", + "\"\"\")\n", + "list(genres[\"genre\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Considering only songs with energy greater than 0.5, what is the maximum energy for each \"genre\" with song count greater than 2000?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "genre\n", + "Dark Trap 0.998\n", + "Emo 0.995\n", + "Hiphop 0.978\n", + "Pop 0.977\n", + "Rap 0.980\n", + "RnB 0.974\n", + "Trap Metal 0.999\n", + "Underground Rap 0.997\n", + "dnb 0.999\n", + "hardstyle 0.999\n", + "psytrance 0.999\n", + "techhouse 0.999\n", + "techno 1.000\n", + "trance 1.000\n", + "trap 1.000\n", + "Name: energy, dtype: float64" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v1: pandas\n", + "high_energy_songs = df[df[\"energy\"] > 0.5]\n", + "genre_groups = high_energy_songs[[\"genre\", \"energy\"]].groupby(\"genre\")\n", + "max_energy = genre_groups.max()\n", + "max_energy[\"energy\"]" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>energy</th>\n", + " <th>energy_max</th>\n", + " </tr>\n", + " <tr>\n", + " <th>genre</th>\n", + " <th></th>\n", + " <th></th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>Dark Trap</th>\n", + " <td>2757</td>\n", + " <td>0.998</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Hiphop</th>\n", + " <td>2497</td>\n", + " <td>0.978</td>\n", + " </tr>\n", + " <tr>\n", + " <th>Underground Rap</th>\n", + " <td>3420</td>\n", + " <td>0.997</td>\n", + " </tr>\n", + " <tr>\n", + " <th>dnb</th>\n", + " <td>2496</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>hardstyle</th>\n", + " <td>2345</td>\n", + 
" <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>psytrance</th>\n", + " <td>2642</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techhouse</th>\n", + " <td>2164</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>techno</th>\n", + " <td>2534</td>\n", + " <td>1.000</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trance</th>\n", + " <td>2786</td>\n", + " <td>1.000</td>\n", + " </tr>\n", + " <tr>\n", + " <th>trap</th>\n", + " <td>2346</td>\n", + " <td>1.000</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " energy energy_max\n", + "genre \n", + "Dark Trap 2757 0.998\n", + "Hiphop 2497 0.978\n", + "Underground Rap 3420 0.997\n", + "dnb 2496 0.999\n", + "hardstyle 2345 0.999\n", + "psytrance 2642 0.999\n", + "techhouse 2164 0.999\n", + "techno 2534 1.000\n", + "trance 2786 1.000\n", + "trap 2346 1.000" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "genre_counts = genre_groups.count()\n", + "genre_counts[\"energy_max\"] = max_energy[\"energy\"]\n", + "filtered_genre_counts = genre_counts[genre_counts[\"energy\"] > 2000]\n", + "filtered_genre_counts" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>genre</th>\n", + " <th>song_count</th>\n", + " <th>energy_max</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>Dark Trap</td>\n", + " <td>2757</td>\n", + " <td>0.998</td>\n", + " </tr>\n", + " 
<tr>\n", + " <th>1</th>\n", + " <td>Hiphop</td>\n", + " <td>2497</td>\n", + " <td>0.978</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>Underground Rap</td>\n", + " <td>3420</td>\n", + " <td>0.997</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>dnb</td>\n", + " <td>2496</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>hardstyle</td>\n", + " <td>2345</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5</th>\n", + " <td>psytrance</td>\n", + " <td>2642</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6</th>\n", + " <td>techhouse</td>\n", + " <td>2164</td>\n", + " <td>0.999</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7</th>\n", + " <td>techno</td>\n", + " <td>2534</td>\n", + " <td>1.000</td>\n", + " </tr>\n", + " <tr>\n", + " <th>8</th>\n", + " <td>trance</td>\n", + " <td>2786</td>\n", + " <td>1.000</td>\n", + " </tr>\n", + " <tr>\n", + " <th>9</th>\n", + " <td>trap</td>\n", + " <td>2346</td>\n", + " <td>1.000</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " genre song_count energy_max\n", + "0 Dark Trap 2757 0.998\n", + "1 Hiphop 2497 0.978\n", + "2 Underground Rap 3420 0.997\n", + "3 dnb 2496 0.999\n", + "4 hardstyle 2345 0.999\n", + "5 psytrance 2642 0.999\n", + "6 techhouse 2164 0.999\n", + "7 techno 2534 1.000\n", + "8 trance 2786 1.000\n", + "9 trap 2346 1.000" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# v2: SQL\n", + "qry(\"\"\"\n", + "SELECT genre, COUNT(*) as song_count, MAX(\"energy\") as energy_max\n", + "FROM spotify\n", + "WHERE energy > 0.5\n", + "GROUP BY genre\n", + "HAVING song_count > 2000\n", + "\"\"\")" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "# Close the database connection here\n", + "conn.close()" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + 
"kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation_template_Gurmail_lec1.ipynb b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation_template_Gurmail_lec1.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..95dcf3a346b7bba48cd85a92613f73ad24c99f0a --- /dev/null +++ b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation_template_Gurmail_lec1.ipynb @@ -0,0 +1,810 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Announcements - Wednesday, December 6\n", + "* Download ALL files for today's lecture\n", + "* Q10 Released tonight at 5 pm\n", + "* <b>If you have any problem with P8-P11 grades, please send me (Gurmail.Singh@wisc.edu) an email by December 11.</b>\n", + "* Late days may not be used on P13\n", + "* If you have questions, it is almost always faster to \n", + " * Post on Piazza\n", + " * Go to [office hours](https://sites.google.com/wisc.edu/cs220-oh-f23/home?pli=1) \n", + "### Conflict Form\n", + " * [Final - December 19, 7:45 am](https://cs220.cs.wisc.edu/f23/surveys.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RHvDCo4fhXBx" + }, + "source": [ + "# Lecture 37 Pandas 3: Data Transformation\n", + "* Data transformation is the process of changing the format, structure, or values of data. 
\n", + "* Often needed during data cleaning and sometimes during data analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yoLGptrqhbBo" + }, + "source": [ + "# Today's Learning Objectives: \n", + "\n", + "* Setting column as index for pandas `DataFrame`\n", + "* Identify, drop, or fill missing values (`np.NaN`) using Pandas `isna`, `dropna`, and `fillna`\n", + "* Applying transformations to `DataFrame`:\n", + " * Use `apply` on pandas `Series` to apply a transformation function\n", + " * Use `replace` to replace all target values in Pandas `Series` and `DataFrame` rows / columns\n", + "* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`\n", + "* Convert .groupby examples to SQL\n", + "* Solving the same question using SQL and pandas `DataFrame` manipulations:\n", + " * filtering, grouping, and aggregation / summarization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CeWtFirwteFY" + }, + "outputs": [], + "source": [ + "# known import statements\n", + "import pandas as pd\n", + "import sqlite3 as sql # note that we are renaming to sql\n", + "import os\n", + "\n", + "# new import statement\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgnTeNRIswsm" + }, + "source": [ + "# The dataset: Spotify songs\n", + "Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.\n", + "\n", + "If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 1: Establish a connection to the spotify.db database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 232 + }, + "id": "8y9scvgCnTHl", + "outputId": "c72388f8-576c-4cf2-ef51-352cd11b6c92" + }, + "outputs": [], + "source": [ + "# open up the spotify database\n", + "db_pathname = \"spotify.db\"\n", + "assert ???\n", + "conn = sql.connect(db_pathname)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def qry(sql):\n", + " return pd.read_sql(sql, conn)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 2: Identify the table name(s) inside the database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "ybTqbDSOnR2f", + "outputId": "8dcc943b-9382-4abb-ef78-6c6d56ad89eb" + }, + "outputs": [], + "source": [ + "df = qry(\"\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 3: Use pandas lookup expression to extract the \"sql\" column and display the full query using .iloc lookup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 4: Store the data inside `spotify` table inside a variable called `df`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 632 + }, + "id": "txAH9OIjnoQv", + "outputId": "ac9152ba-32df-4fb2-d4e0-a97f50fe58fb" + }, + "outputs": [], + "source": [ + "df = qry(\"\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting a 
column as row indices for the `DataFrame`\n", + "\n", + "- Syntax: `df.set_index(\"<COLUMN>\")`\n", + "- Returns a new DataFrame object instance reference.\n", + "- WARNING: executing this twice will raise a `KeyError`. Once you set a column as row index, it will no longer be a column within the `DataFrame`. If you tried this, go back and execute the above cell and update `df` once more and then execute the below cell exactly once." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the id column as row indices\n", + "df = \n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Not a Number\n", + "\n", + "- `np.NaN` is the floating point representation of Not a Number\n", + "- You do not need to know / learn the details about the `numpy` package \n", + "\n", + "### Replacing / modifying values within the `DataFrame`\n", + "\n", + "Syntax: `df.replace(<TARGET>, <REPLACE>)`\n", + "- Your target can be `str`, `int`, `float`, `None` (there are other possibilities, but those are too advanced for this course)\n", + "- Returns a new DataFrame object instance reference.\n", + "\n", + "Let's now replace the missing values (empty strings) with `np.NaN`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = \n", + "df.head(10) # title is the album name" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Checking for missing values\n", + "\n", + "Syntax: `Series.isna()`\n", + "- Returns a boolean Series\n", + "\n", + "Let's check if any of the \"song_name\"(s) are missing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JqzSwG5PEZRq", + "outputId": "05529a3d-4a5c-4654-fe05-d04b2c10ae6c" + }, + "outputs": [], + "source": [ + "df[\"song_name\"]" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### Review: `Pandas.Series.value_counts()`\n", + "- Returns a new `Series` with unique values from the original `Series` as keys and the count of those unique values as values. \n", + "- Return value `Series` is ordered using descending order of counts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uCLDr8EIGMeJ", + "outputId": "241d6181-d525-4019-a8f2-689939b2ab33" + }, + "outputs": [], + "source": [ + "# count the number of missing values for song name\n", + "df[\"song_name\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Missing value manipulation\n", + "Syntax: `df.fillna(<REPLACE>)`\n", + "- Returns a new DataFrame object instance reference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pJ2CIqq9HWvN", + "outputId": "2895e862-18e5-4742-9750-31b130aae668" + }, + "outputs": [], + "source": [ + "# use .fillna to replace missing values\n", + "df[\"song_name\"]\n", + "\n", + "# to replace the original DataFrame's column, you need to explicitly update that object instance\n", + "# TODO: uncomment the below lines and update the code\n", + "#df[\"song_name\"] = ???\n", + "#df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dropping missing values\n", + "Syntax: `df.dropna()`\n", + "- Returns a new DataFrame object instance reference." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 145 + }, + "id": "O_1ZeHG8N-rB", + "outputId": "3b112da2-2b3c-4fb8-c7ae-dc2f2127856d" + }, + "outputs": [], + "source": [ + "# .dropna will drop all rows that contain NaN in them\n", + "df.dropna()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ggttXEqUbI_E" + }, + "source": [ + "### Review: `Pandas.Series.apply(...)`\n", + "Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`\n", + "- applies input function to every element of the Series.\n", + "- Returns a new `Series` object instance reference.\n", + "\n", + "Let's apply transformation function to `mode` column `Series`:\n", + "- mode = 1 means major modality (sounds happy)\n", + "- mode = 0 means minor modality (sounds sad)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def replace_mode(m): \n", + " if m == 1: \n", + " return \"major\"\n", + " else: \n", + " return \"minor\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"mode\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `lambda`\n", + "\n", + "Let's write a `lambda` function instead of the `replace_mode` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9AJ3p-_TarnN", + "outputId": "a087df5d-2002-417c-e99c-5e6fc8ea9809" + }, + "outputs": [], + "source": [ + "df[\"mode\"].apply(???)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Typically transformed columns are added as new columns within the DataFrame.\n", + "Let's add a new `modified_mode` column." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"modified_mode\"] = df[\"mode\"].apply(lambda m: \"major\" if m == 1 else \"minor\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Let's go back to the original table from the SQL database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZoiyUleiyhMg" + }, + "outputs": [], + "source": [ + "df = qry(\"SELECT * FROM spotify\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extract just the \"genre\" and \"duration_ms\" columns from `df`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[???]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `Pandas.DataFrame.groupby(...)`\n", + "\n", + "Syntax: `DataFrame.groupby(<COLUMN>)`\n", + "- Returns a `groupby` object instance reference\n", + "- Need to apply aggregation methods to use the return value of `groupby`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 551 + }, + "id": "trRMgGMysdkb", + "outputId": "d02098c3-7722-4505-c599-5897bb8ace19" + }, + "outputs": [], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the average duration for each genre ordered based on decreasing order of averages?\n", + "#### v1: using `df` (`pandas`) to answer the question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, 
+ "source": [ + "One way to check whether `groupby` works would be to use `value_counts` on the same column `Series`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"genre\"].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the average duration for each genre ordered based on decreasing order of averages?\n", + "#### v2: using SQL query to answer the question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 551 + }, + "id": "89hMTXCKxWG8", + "outputId": "5737da11-aa8a-4ed0-9b05-cd379b28904b" + }, + "outputs": [], + "source": [ + "# SQL equivalent query of the above Pandas query\n", + "avg_duration_per_genre = qry(\"\"\"\n", + "\n", + "\"\"\")\n", + "\n", + "# How can we make the SQL query output exactly the same as the df.groupby output?\n", + "avg_duration_per_genre = avg_duration_per_genre.set_index(\"genre\")\n", + "avg_duration_per_genre" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "12ZdqYoIy_8U" + }, + "source": [ + "### What is the average speechiness for each mode, time signature pair?\n", + "#### v1: pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + }, + "id": "fVejD2KPyveX", + "outputId": "fe5c8fda-29a2-4f1a-8ff4-de9ad2a3cde0" + }, + "outputs": [], + "source": [ + "# use a list to indicate all the columns you want to groupby \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "ImYEuOMox-ps", + "outputId": "2674dabd-3ff7-4099-fdc3-54e5ba0e2628" + }, + "outputs": [], + "source": [ + "# SQL equivalent query of the above Pandas query\n", + "qry(\"\"\"\n", + "\n", + "\"\"\")" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "sEDc5zGu0bc9" + }, + "source": [ + "### Self-practice" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Which songs have a tempo greater than 150 and what are their genre?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "fast_songs = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "\n", + "qry(\"\"\"\n", + "\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the sum of danceability and liveness for \"Hiphop\" genre songs?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "hiphop_songs = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "hiphop_songs = qry(\"\"\"\n", + "\n", + "\"\"\")\n", + "hiphop_songs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Find all song_name ordered by ascending order of duration_ms. Eliminate songs which don't have a song_name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "songs_by_duration = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2\n", + "songs_by_duration = qry(\"\"\"\n", + "\n", + "\"\"\")\n", + "songs_by_duration" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How many distinct \"genre\"s are there in the dataset?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "genres = qry(\"\"\"\n", + "\n", + "\"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Considering only songs with energy greater than 0.5, what is the maximum energy for each \"genre\" with song count greater than 2000?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "genre_groups = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "high_energy_songs = ???\n", + "genre_groups = ???\n", + "max_energy = ???\n", + "max_energy[\"energy\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "genre_counts = ???\n", + "genre_counts[\"energy_max\"] = max_energy[\"energy\"]\n", + "filtered_genre_counts = ???\n", + "filtered_genre_counts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "qry(\"\"\"\n", + "\n", + "\"\"\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Close the database connection here\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git 
a/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation_template_Gurmail_lec2.ipynb b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation_template_Gurmail_lec2.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..95dcf3a346b7bba48cd85a92613f73ad24c99f0a --- /dev/null +++ b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/lec_37_pandas3_data_transformation_template_Gurmail_lec2.ipynb @@ -0,0 +1,810 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Announcements - Wednesday, December 6\n", + "* Download ALL files for today's lecture\n", + "* Q10 Released tonight at 5 pm\n", + "* <b>If you have any problem with P8-P11 grades, please send me (Gurmail.Singh@wisc.edu) an email by December 11.</b>\n", + "* Late days may not be used on P13\n", + "* If you have questions, it is almost always faster to \n", + " * Post on Piazza\n", + " * Go to [office hours](https://sites.google.com/wisc.edu/cs220-oh-f23/home?pli=1) \n", + "### Conflict Form\n", + " * [Final - December 19, 7:45 am](https://cs220.cs.wisc.edu/f23/surveys.html)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RHvDCo4fhXBx" + }, + "source": [ + "# Lecture 37 Pandas 3: Data Transformation\n", + "* Data transformation is the process of changing the format, structure, or values of data. 
\n", + "* Often needed during data cleaning and sometimes during data analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yoLGptrqhbBo" + }, + "source": [ + "# Today's Learning Objectives: \n", + "\n", + "* Setting column as index for pandas `DataFrame`\n", + "* Identify, drop, or fill missing values (`np.NaN`) using Pandas `isna`, `dropna`, and `fillna`\n", + "* Applying transformations to `DataFrame`:\n", + " * Use `apply` on pandas `Series` to apply a transformation function\n", + " * Use `replace` to replace all target values in Pandas `Series` and `DataFrame` rows / columns\n", + "* Filter, aggregate, group, and summarize information in a `DataFrame` with `groupby`\n", + "* Convert .groupby examples to SQL\n", + "* Solving the same question using SQL and pandas `DataFrame` manipulations:\n", + " * filtering, grouping, and aggregation / summarization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CeWtFirwteFY" + }, + "outputs": [], + "source": [ + "# known import statements\n", + "import pandas as pd\n", + "import sqlite3 as sql # note that we are renaming to sql\n", + "import os\n", + "\n", + "# new import statement\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FgnTeNRIswsm" + }, + "source": [ + "# The dataset: Spotify songs\n", + "Adapted from https://www.kaggle.com/datasets/mrmorj/dataset-of-songs-in-spotify.\n", + "\n", + "If you are interested in digging deeper into this dataset, here's a [blog post](https://medium.com/@boplantinga/what-do-spotifys-audio-features-tell-us-about-this-year-s-eurovision-song-contest-66ad188e112a) that explains each column in detail. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 1: Establish a connection to the spotify.db database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 232 + }, + "id": "8y9scvgCnTHl", + "outputId": "c72388f8-576c-4cf2-ef51-352cd11b6c92" + }, + "outputs": [], + "source": [ + "# open up the spotify database\n", + "db_pathname = \"spotify.db\"\n", + "assert ???\n", + "conn = sql.connect(db_pathname)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def qry(sql):\n", + " return pd.read_sql(sql, conn)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 2: Identify the table name(s) inside the database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "ybTqbDSOnR2f", + "outputId": "8dcc943b-9382-4abb-ef78-6c6d56ad89eb" + }, + "outputs": [], + "source": [ + "df = qry(\"\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 3: Use pandas lookup expression to extract the \"sql\" column and display the full query using .iloc lookup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### WARMUP 4: Store the data inside `spotify` table inside a variable called `df`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 632 + }, + "id": "txAH9OIjnoQv", + "outputId": "ac9152ba-32df-4fb2-d4e0-a97f50fe58fb" + }, + "outputs": [], + "source": [ + "df = qry(\"\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Setting a 
column as row indices for the `DataFrame`\n", + "\n", + "- Syntax: `df.set_index(\"<COLUMN>\")`\n", + "- Returns a new DataFrame object instance reference.\n", + "- WARNING: executing this twice will result in `KeyError` being thrown. Once you set a column as the row index, it will no longer be a column within the `DataFrame`. If you tried this, go back and execute the above cell and update `df` once more and then execute the below cell exactly once." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Set the id column as row indices\n", + "df = \n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Not a Number\n", + "\n", + "- `np.NaN` is the floating point representation of Not a Number\n", + "- You do not need to know / learn the details about the `numpy` package \n", + "\n", + "### Replacing / modifying values within the `DataFrame`\n", + "\n", + "Syntax: `df.replace(<TARGET>, <REPLACE>)`\n", + "- Your target can be `str`, `int`, `float`, `None` (there are other possibilities, but those are too advanced for this course)\n", + "- Returns a new DataFrame object instance reference.\n", + "\n", + "Let's now replace the missing values (empty strings) with `np.NaN`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = \n", + "df.head(10) # title is the album name" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Checking for missing values\n", + "\n", + "Syntax: `Series.isna()`\n", + "- Returns a boolean Series\n", + "\n", + "Let's check if any of the \"song_name\"(s) are missing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "JqzSwG5PEZRq", + "outputId": "05529a3d-4a5c-4654-fe05-d04b2c10ae6c" + }, + "outputs": [], + "source": [ + "df[\"song_name\"]" + ] + }, + { + "cell_type": 
"markdown", + "metadata": {}, + "source": [ + "### Review: `Pandas.Series.value_counts()`\n", + "- Returns a new `Series` with unique values from the original `Series` as keys and the count of those unique values as values. \n", + "- Return value `Series` is ordered using descending order of counts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uCLDr8EIGMeJ", + "outputId": "241d6181-d525-4019-a8f2-689939b2ab33" + }, + "outputs": [], + "source": [ + "# count the number of missing values for song name\n", + "df[\"song_name\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Missing value manipulation\n", + "Syntax: `df.fillna(<REPLACE>)`\n", + "- Returns a new DataFrame object instance reference." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "pJ2CIqq9HWvN", + "outputId": "2895e862-18e5-4742-9750-31b130aae668" + }, + "outputs": [], + "source": [ + "# use .fillna to replace missing values\n", + "df[\"song_name\"]\n", + "\n", + "# to replace the original DataFrame's column, you need to explicitly update that object instance\n", + "# TODO: uncomment the below lines and update the code\n", + "#df[\"song_name\"] = ???\n", + "#df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dropping missing values\n", + "Syntax: `df.dropna()`\n", + "- Returns a new DataFrame object instance reference." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 145 + }, + "id": "O_1ZeHG8N-rB", + "outputId": "3b112da2-2b3c-4fb8-c7ae-dc2f2127856d" + }, + "outputs": [], + "source": [ + "# .dropna will drop all rows that contain NaN in them\n", + "df.dropna()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ggttXEqUbI_E" + }, + "source": [ + "### Review: `Pandas.Series.apply(...)`\n", + "Syntax: `Series.apply(<FUNCTION OBJECT REFERENCE>)`\n", + "- applies input function to every element of the Series.\n", + "- Returns a new `Series` object instance reference.\n", + "\n", + "Let's apply transformation function to `mode` column `Series`:\n", + "- mode = 1 means major modality (sounds happy)\n", + "- mode = 0 means minor modality (sounds sad)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def replace_mode(m): \n", + " if m == 1: \n", + " return \"major\"\n", + " else: \n", + " return \"minor\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"mode\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `lambda`\n", + "\n", + "Let's write a `lambda` function instead of the `replace_mode` function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9AJ3p-_TarnN", + "outputId": "a087df5d-2002-417c-e99c-5e6fc8ea9809" + }, + "outputs": [], + "source": [ + "df[\"mode\"].apply(???)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Typically transformed columns are added as new columns within the DataFrame.\n", + "Let's add a new `modified_mode` column." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"modified_mode\"] = df[\"mode\"].apply(lambda m: \"major\" if m == 1 else \"minor\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Let's go back to the original table from the SQL database" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZoiyUleiyhMg" + }, + "outputs": [], + "source": [ + "df = qry(\"SELECT * FROM spotify\")\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Extract just the \"genre\" and \"duration_ms\" columns from `df`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[???]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### `Pandas.DataFrame.groupby(...)`\n", + "\n", + "Syntax: `DataFrame.groupby(<COLUMN>)`\n", + "- Returns a `groupby` object instance reference\n", + "- Need to apply aggregation methods to use the return value of `groupby`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 551 + }, + "id": "trRMgGMysdkb", + "outputId": "d02098c3-7722-4505-c599-5897bb8ace19" + }, + "outputs": [], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the average duration for each genre ordered based on decreasing order of averages?\n", + "#### v1: using `df` (`pandas`) to answer the question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[[\"genre\", \"duration_ms\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, 
+ "source": [ + "One way to check whether `groupby` works would be to use `value_counts` on the same column `Series`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df[\"genre\"].value_counts()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the average duration for each genre ordered based on decreasing order of averages?\n", + "#### v2: using SQL query to answer the question" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 551 + }, + "id": "89hMTXCKxWG8", + "outputId": "5737da11-aa8a-4ed0-9b05-cd379b28904b" + }, + "outputs": [], + "source": [ + "# SQL equivalent query of the above Pandas query\n", + "avg_duration_per_genre = qry(\"\"\"\n", + "\n", + "\"\"\")\n", + "\n", + "# How can we make the SQL query output exactly the same as df.groupby?\n", + "avg_duration_per_genre = avg_duration_per_genre.set_index(\"genre\")\n", + "avg_duration_per_genre" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "12ZdqYoIy_8U" + }, + "source": [ + "### What is the average speechiness for each mode, time signature pair?\n", + "#### v1: pandas" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 332 + }, + "id": "fVejD2KPyveX", + "outputId": "fe5c8fda-29a2-4f1a-8ff4-de9ad2a3cde0" + }, + "outputs": [], + "source": [ + "# use a list to indicate all the columns you want to groupby \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 300 + }, + "id": "ImYEuOMox-ps", + "outputId": "2674dabd-3ff7-4099-fdc3-54e5ba0e2628" + }, + "outputs": [], + "source": [ + "# SQL equivalent query of the above Pandas query\n", + "qry(\"\"\"\n", + "\n", + "\"\"\")" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "sEDc5zGu0bc9" + }, + "source": [ + "### Self-practice" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Which songs have a tempo greater than 150 and what are their genre?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "fast_songs = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "\n", + "qry(\"\"\"\n", + "\n", + "\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is the sum of danceability and liveness for \"Hiphop\" genre songs?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "hiphop_songs = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "hiphop_songs = qry(\"\"\"\n", + "\n", + "\"\"\")\n", + "hiphop_songs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Find all song_name ordered by ascending order of duration_ms. Eliminate songs which don't have a song_name" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "songs_by_duration = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2\n", + "songs_by_duration = qry(\"\"\"\n", + "\n", + "\"\"\")\n", + "songs_by_duration" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How many distinct \"genre\"s are there in the dataset?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "genres = qry(\"\"\"\n", + "\n", + "\"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Considering only songs with energy greater than 0.5, what is the maximum energy for each \"genre\" with song count greater than 2000?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "genre_groups = " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v1: pandas\n", + "high_energy_songs = ???\n", + "genre_groups = ???\n", + "max_energy = ???\n", + "max_energy[\"energy\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "genre_counts = ???\n", + "genre_counts[\"energy_max\"] = max_energy[\"energy\"]\n", + "filtered_genre_counts = ???\n", + "filtered_genre_counts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# v2: SQL\n", + "qry(\"\"\"\n", + "\n", + "\"\"\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Close the database connection here\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git 
a/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/spotify.db b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/spotify.db new file mode 100644 index 0000000000000000000000000000000000000000..a0e53761991a54fc8804d2b98bcc34ac4d99b70f Binary files /dev/null and b/f23/Gurmail_Lecture_Notes/37N_Advanced_pandas_topics/spotify.db differ