In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("practice.ipynb")

In [None]:
import practice_test

# Lab-P10:  File Handling and Namedtuples

## Learning Objectives:

In this lab, you will practice how to...
* use the `os` module to handle files,
* load data in json files,
* combine data from different files to create data structures,
* create named tuples,
* use `try/except` to handle malformed data.

## Note on Academic Misconduct:

**IMPORTANT**: P10 and P11 are two parts of the same data analysis. You **cannot** switch project partners between these two projects. That is, if you partner up with someone for Lab-P10 and P10, you have to work on Lab-P11 and P11 with the **same partner**.

You may do these lab exercises only with your project partner; you are not allowed to start working on Lab-P10 with one person, then do the project with a different partner.  Now may be a good time to review [our course policies](https://cs220.cs.wisc.edu/s23/syllabus.html).

## Setup:

Before proceeding much further, download `small_data.zip` and extract it to a directory on your
computer (using [Mac directions](http://osxdaily.com/2017/11/05/how-open-zip-file-mac/) or
[Windows directions](https://support.microsoft.com/en-us/help/4028088/windows-zip-and-unzip-files)).

You need to make sure that the project files are stored in the following structure:

```
+-- practice.ipynb
+-- practice_test.py
+-- small_data
|   +-- .DS_Store
|   +-- .ipynb_checkpoints
|   +-- mapping_1.json
|   +-- mapping_2.json
|   +-- mapping_3.json
|   +-- planets_1.csv
|   +-- planets_2.csv
|   +-- planets_3.csv
|   +-- stars_1.csv
|   +-- stars_2.csv
|   +-- stars_3.csv
```

Make sure that the files inside `small_data.zip` are inside the `small_data` directory.

## Introduction:

In P10 and P11, we will be studying stars and planets outside our Solar System using this dataset from the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PSCompPars). We will use Python to ask some interesting questions about the laws of the universe and explore the habitability of other planets in our universe.

In Lab-P10, you will work with a small subset of the full dataset. You can find these files inside `small_data.zip`. The full dataset used in P10 and P11 is stored in the same format, so you can then use this code to parse the dataset in P10 and P11.

## The Data:

You can open each of the files inside the `small_data` directory using Microsoft Excel or some other Spreadsheet viewing software to see how the data is stored. For example, these are the contents of the file `stars_1.csv`:

|Name|Spectral Type|Stellar Effective Temperature [K]|Stellar Radius [Solar Radius]|Stellar Mass [Solar mass]|Stellar Luminosity [log(Solar)]|Stellar Surface Gravity [log10(cm/s**2)]|Stellar Age [Gyr]|
|----|-------------|---------------------------------|-----------------------------|-------------------------|-------------------------------|----------------------------------------|-----------------|
|55 Cnc|G8V|5172.00|0.94|0.91|-0.197|4.43|10.200|
|DMPP-1|F8 V|6196.00|1.26|1.21|0.320|4.41|2.010|
|GJ 876|M2.5V|3271.00|0.30|0.32|-1.907|4.87|1.000|

As you might have already guessed, this file contains data on a number of *stars* outside our solar system along with some important statistics about these stars. The columns here are as follows:

- `Name`: The name given to the star by the International Astronomical Union,
- `Spectral Type`: The Spectral Classification of the star as per the Morgan–Keenan (MK) system,
- `Stellar Effective Temperature [K]`: The temperature of a black body (in units of Kelvin) that would emit the observed radiation of the star,
- `Stellar Radius [Solar Radius]`: The radius of the star (in units of the radius of the Sun),
- `Stellar Mass [Solar mass]`: The mass of the star (in units of the mass of the Sun),
- `Stellar Luminosity [log(Solar)]`: The total amount of energy radiated by the star each second (represented as the logarithm of its ratio to the energy radiated by the Sun each second),
- `Stellar Surface Gravity [log10(cm/s**2)]`: The acceleration due to the gravity of the Star at its surface (represented by the logarithm of the acceleration measured in centimeter per second squared),
- `Stellar Age [Gyr]`: The total age of the star (in units of Giga years, i.e., billions of years).

The two other files, `stars_2.csv` and `stars_3.csv`, store similar data in the same format. At this stage, it is alright if you do not understand what these columns mean - they will be explained to you when they become necessary (in P10 and P11).

On the other hand, here are the contents of the file `planets_1.csv`:

|Planet Name|Discovery Method|Discovery Year|Controversial Flag|Orbital Period [days]|Planet Radius [Earth Radius]|Planet Mass [Earth Mass]|Orbit Semi-Major Axis [au]|Eccentricity|Equilibrium Temperature [K]|Insolation Flux [Earth Flux]|
|-----------|----------------|--------------|------------------|---------------------|----------------------------|------------------------|---------------------------|------------|---------------------------|----------------------------|
|55 Cnc b|Radial Velocity|1996|0|14.65160000|13.900|263.97850|0.113400|0.000000|700||
|55 Cnc c|Radial Velocity|2004|0|44.39890000|8.510|54.47380|0.237300|0.030000|||
|DMPP-1 b|Radial Velocity|2019|0|18.57000000|5.290|24.27000|0.146200|0.083000|877||
|GJ 876 b|Radial Velocity|1998|0|61.11660000|13.300|723.22350|0.208317|0.032400|||
|GJ 876 c|Radial Velocity|2000|0|30.08810000|14.000|226.98460|0.129590|0.255910|||


This file contains data on a number of *planets* outside our solar system along with some important statistics about these planets. The columns here are as follows:

- `Planet Name`: The name given to the planet by the International Astronomical Union,
- `Discovery Method`: The method by which the planet was discovered,
- `Discovery Year`: The year in which the planet was discovered,
- `Controversial Flag`: Indicates whether the status of the discovered object as a planet was disputed at the time of discovery, 
- `Orbital Period [days]`: The amount of time (in units of days) it takes for the planet to complete one orbit around its star,
- `Planet Radius [Earth Radius]`: The radius of the planet (in units of the radius of the Earth),
- `Planet Mass [Earth Mass]`: The mass of the planet (in units of the mass of the Earth),
- `Orbit Semi-Major Axis [au]`: The semi-major axis of the planet's elliptical orbit around its host star (in units of Astronomical Units),
- `Eccentricity`: The eccentricity of the planet's orbit around its host star,
- `Equilibrium Temperature [K]`: The temperature of the planet (in units of Kelvin) if it were a black body heated only by its host star,
- `Insolation Flux [Earth Flux]`:  The amount of radiation the planet received from its host star per unit of area (in units of the Insolation Flux of the Earth from the Sun).

The two other files, `planets_2.csv` and `planets_3.csv`, store similar data in the same format.


Finally, if you take a look at `mapping_1.json` (you can open json files using any Text Editor), you will see that the file looks like this:

```
{"55 Cnc b": "55 Cnc", "55 Cnc c": "55 Cnc", "DMPP-1 b": "DMPP-1", "GJ 876 b": "GJ 876", "GJ 876 c": "GJ 876"}
```

This file contains a *mapping* from each *planet* in `planets_1.csv` to the *star* in `stars_1.csv` that the planet orbits. Similarly, `mapping_2.json` contains a *mapping* from each *planet* in `planets_2.csv` to the *star* in `stars_2.csv` that the planet orbits, and `mapping_3.json` contains a *mapping* from each *planet* in `planets_3.csv` to the *star* in `stars_3.csv` that the planet orbits.
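
As a minimal sketch of how such a mapping can be used once it is loaded, the snippet below parses the same JSON text from a string with `json.loads` (instead of reading the file) and looks up the host star of one planet. The variable names `mapping_text`, `mapping`, and `host` are illustrative assumptions and are not used elsewhere in the lab.

```python
import json

# the JSON text shown above, copied into a string for illustration
mapping_text = '{"55 Cnc b": "55 Cnc", "55 Cnc c": "55 Cnc", "DMPP-1 b": "DMPP-1", "GJ 876 b": "GJ 876", "GJ 876 c": "GJ 876"}'

# json.loads parses a JSON string into a Python dictionary
mapping = json.loads(mapping_text)

# the planet name is the key, and the name of its host star is the value
host = mapping["GJ 876 b"]
print(host)  # GJ 876
```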

## Questions and Functions:

Let us start by importing all the modules we will need for this project.

In [None]:
# it is considered a good coding practice to place all import statements at the top of the notebook
# place all your import statements in this cell if you need to import any more modules for this project

# we have imported these modules for you
import os
from collections import namedtuple
import csv
import json

## Segment 2: File handling with the `os` module

In this segment, you will learn how to use the `os` module effectively.

**Question 1.1**: List **all** the files and directories in the directory `small_data` using the `os.listdir` function.

Your output **must** be a **list** of **strings**. The order does **not** matter.

In [None]:
# we have done this one for you

all_files = os.listdir('small_data')

all_files

In [None]:
grader.check("q1-1")

**Important Warning:** That appeared to work just fine, but you should be **very careful** when using the `os` module. You might have noticed that there are files and directories in the list returned by `os.listdir` that **begin** with the character `"."` (specifically in this case, the file `".DS_Store"` and the directory `".ipynb_checkpoints"`). Such files and directories are used by some operating systems to store metadata. These files are not actually a part of your dataset, and must be **ignored**. 

When you are processing the files in any directory, you **must** always **ignore** files that begin with the character `"."`, as they are not part of your dataset. You **must** do this every time you use `os.listdir`.
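
The idea can be sketched on a hard-coded list, so that it runs without any directory on disk. The list `names` below is a hypothetical stand-in for the output of `os.listdir`, and `visible` is an illustrative variable name.

```python
# a hypothetical listing, similar to what os.listdir might return
names = [".DS_Store", "stars_1.csv", ".ipynb_checkpoints", "mapping_1.json"]

# keep only the names that do not begin with "."
visible = []
for name in names:
    if not name.startswith("."):
        visible.append(name)

print(visible)  # ['stars_1.csv', 'mapping_1.json']
```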

**Question 1.2**: List **all** the files and directories in the directory `small_data` that do **not** **start with** the character `"."`.

Your output **must** be a **list** of **strings**. The order does **not** matter.

In [None]:
# compute and store the answer in the variable 'actual_files', then display it


In [None]:
grader.check("q1-2")

**Important Warning:** You are not done yet. Look at the order in which the files in the **list** `actual_files` are stored. The **ordering** of the files in the **list** returned by `os.listdir` **depends on the operating system**. This means that if you run this code on a **different OS**, the files might be sorted in a **different order**. This makes `os.listdir` a little dangerous because you could index it, and it will always work the same way on your computer, but will **behave differently on another computer**. To avoid these issues, you should make sure that you always **sort** the output of `os.listdir` before you use it. This will ensure that the ordering remains consistent across all operating systems.

When you are processing the files in any directory, you **must** always **sort** the output of `os.listdir` first. You **must** do this every time you use `os.listdir`.
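
A small sketch of why sorting helps, using two hypothetical listings of the same names in different orders (as two operating systems might return them). The names `listing_a` and `listing_b` are assumptions made for illustration.

```python
# the same names in two different (hypothetical) orders
listing_a = ["stars_1.csv", "mapping_1.json", "planets_1.csv"]
listing_b = ["planets_1.csv", "stars_1.csv", "mapping_1.json"]

# sorted() returns a new list and leaves the original unchanged
sorted_a = sorted(listing_a)
sorted_b = sorted(listing_b)
print(sorted_a == sorted_b)  # True

# reverse=True sorts in descending order
descending = sorted(listing_b, reverse=True)
print(descending)  # ['stars_1.csv', 'planets_1.csv', 'mapping_1.json']
```

After sorting, indexing gives the same element no matter which order the names originally arrived in.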

**Question 2**: List **all** the files and directories in the directory `small_data` that do **not** **start with** the character `"."`, sorted in **reverse alphabetical order**.

Your output **must** be a **list** of **strings**, sorted in **reverse alphabetical** order.

In [None]:
# compute and store the answer in the variable 'files_in_small_data', then display it


In [None]:
grader.check("q2")

**Important Warning:** Every time you use `os.listdir`, you **must** **ignore** files and directories that start with `"."`, and also **sort** the **list** returned by the function, before you do anything else. Otherwise, you are likely to write code that **works on your computer**, but **crashes on other computers**. Such errors are hard to debug, and you **must** be very careful.

**Question 3.1**: What is the **path** of the file `stars_1.csv` in the directory `small_data`?

You are **allowed** to 'hardcode' the strings `'small_data'` and `'stars_1.csv'` to answer this question.

**Warnings:**

1. You **must not** hardcode the **absolute path** of any file in your code. For instance, the **absolute path** of this file `stars_1.csv` could be: `C:\Users\ms\cs220\lab-p10\small_data\stars_1.csv`. However, if you hardcode this path in your code, it will **only work on your computer**. In this case, since the notebook `practice.ipynb` is stored in the path `C:\Users\ms\cs220\lab-p10`, the **relative path** of the file is `small_data\stars_1.csv`, and this is the path that **must** be used, if you want your code to work on all computers.
2. You **must not** hardcode either the character `"\"` or the character `"/"` in your paths. If you do so, your code will **crash** when it runs on a **different operating system**. You **must** use the `os.path.join` function to create paths.
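
To illustrate the second warning, here is a short sketch showing that `os.path.join` inserts whichever separator the current operating system uses, which the `os` module exposes as `os.sep`. The explicit concatenation with `os.sep` appears only to make the comparison visible; it is not a recommended way to build paths.

```python
import os

# os.path.join inserts the separator used by the current operating system
path = os.path.join("small_data", "stars_1.csv")

# os.sep is "\\" on Windows and "/" on macOS and Linux
uses_os_separator = (path == "small_data" + os.sep + "stars_1.csv")
print(uses_os_separator)  # True

# os.path.basename recovers the filename without mentioning the separator
filename = os.path.basename(path)
print(filename)  # stars_1.csv
```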

In [None]:
# we have done this one for you

stars_1_path = os.path.join("small_data", "stars_1.csv")

stars_1_path

In [None]:
grader.check("q3-1")

**Question 3.2**: List the **paths** of **all** the files in the directory `small_data`.

Your output **must** be a **list** of **strings**. You must **ignore** files that **start with** the character `"."`, and your output **must** be sorted in **reverse alphabetical order**.

You are **allowed** to "hardcode" the name of the directory `small_data` to answer this question.

**Warnings:**

1. You **must not** hardcode the **absolute path** of any file in your code. You must use the **relative path** of the files.
2. You **must not** hardcode either the character `"\"` or the character `"/"` in your paths. You **must** use the `os.path.join` function to create paths.

In [None]:
# compute and store the answer in the variable 'paths_in_small_data', then display it


In [None]:
grader.check("q3-2")

**Question 4.1**: List the **paths** of **all** the JSON files in the directory `small_data`.

Your output **must** be a **list** of **strings**. You must **ignore** files that **start with** the character `"."`, and your output **must** be sorted in **reverse alphabetical order**.

**Hint:** You can identify the JSON files as the files which end with the string `".json"`.
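
The string method `endswith` mentioned in the hint can be sketched on a hypothetical list of filenames, as below. The names `names` and `json_names` are illustrative only, and turning the filenames into paths is left out of the sketch.

```python
# hypothetical filenames, standing in for a cleaned-up directory listing
names = ["mapping_1.json", "planets_1.csv", "stars_1.csv", "mapping_2.json"]

# keep only the names with the ".json" extension
json_names = []
for name in names:
    if name.endswith(".json"):
        json_names.append(name)

print(json_names)  # ['mapping_1.json', 'mapping_2.json']
```

The companion method `startswith` works the same way for prefixes.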

In [None]:
# compute and store the answer in the variable 'json_paths', then display it


In [None]:
grader.check("q4-1")

**Question 4.2**: List the **paths** of **all** the files in the directory `small_data`, whose filename starts with `"stars"`.

Your output **must** be a **list** of **strings**. You must **ignore** files that **start with** the character `"."`, and your output **must** be sorted in **reverse alphabetical order**.

In [None]:
# compute and store the answer in the variable 'stars_paths', then display it


In [None]:
grader.check("q4-2")

## Segment 3: Creating Namedtuples

In P10, you will be reading the data in files similar to `stars_1.csv`, `stars_2.csv`, and `stars_3.csv`, and storing the data as a **dictionary** of **named tuples**. Now would be a great time to practice creating similar data structures.

### Data Structure 1: namedtuple `Star`

We will now create a new `Star` type (using namedtuple). It **must** have the following attributes:

* `spectral_type`,
* `stellar_effective_temperature`,
* `stellar_radius`,
* `stellar_mass`,
* `stellar_luminosity`,
* `stellar_surface_gravity`,
* `stellar_age`.

In [None]:
# we have done this one for you

# define the list of attributes we want in our namedtuple
star_attributes = ['spectral_type',
                  'stellar_effective_temperature',
                  'stellar_radius',
                  'stellar_mass',
                  'stellar_luminosity',
                  'stellar_surface_gravity',
                  'stellar_age']

# create the namedtuple type 'Star' with the correct attributes
Star = namedtuple("Star", star_attributes)

Let us now test whether we have defined the namedtuple properly by creating a `Star` object.

In [None]:
# run the following cell to initialize and test an example Star object

sun = Star('G2 V', 5780.0, 1.0, 1.0, 0.0, 4.44, 4.6)

sun

In [None]:
grader.check("star_object")

### Segment 3.1: Creating `Star` objects from `stars_1.csv`

Now that we have created the `Star` namedtuple, our next objective will be to read the files `stars_1.csv`, `stars_2.csv`, and `stars_3.csv` and create `Star` objects out of all the stars in there. In order to process the CSV files, you will first need to copy/paste the `process_csv` function you have been using since P6.

In [None]:
# copy & paste the process_csv function from previous projects here


You are now ready to read the data in `stars_1.csv` using `process_csv` and convert the data into `Star` objects. In the cell below, you **must** read the data in `stars_1.csv` and extract the **header** and the non-header **rows** of the file.
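
Splitting a CSV into its header and its rows can be sketched as follows. This is not the course's `process_csv` function; it is a stand-in that reads hypothetical CSV text from a string with `csv.reader` and `io.StringIO`, so that it runs without any file on disk.

```python
import csv
import io

# a small CSV text standing in for a file on disk (hypothetical data)
csv_text = "Name,Spectral Type\n55 Cnc,G8V\nDMPP-1,F8 V\n"

# csv.reader yields one list of strings per line
data = list(csv.reader(io.StringIO(csv_text)))

header = data[0]  # the first row holds the column names
rows = data[1:]   # every remaining row is data

print(header)  # ['Name', 'Spectral Type']
print(rows)    # [['55 Cnc', 'G8V'], ['DMPP-1', 'F8 V']]
```

Note that every value comes back as a **string**, which is why the values must be typecast afterwards.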

In [None]:
# replace the ... with your code

stars_1_csv = process_csv(os.path.join("small_data", "stars_1.csv")) # read the data in 'stars_1.csv'
stars_header = ...
stars_1_rows = ...

If you wish to **verify** that you have read the file and defined the variables correctly, you can check that `stars_header` has the value:

```python
['Name', 'Spectral Type', 'Stellar Effective Temperature [K]', 'Stellar Radius [Solar Radius]', 
 'Stellar Mass [Solar mass]', 'Stellar Luminosity [log(Solar)]', 'Stellar Surface Gravity [log10(cm/s**2)]',
 'Stellar Age [Gyr]']
```

and that `stars_1_rows` has the value:

```python
[['55 Cnc', 'G8V', '5172.00', '0.94', '0.91', '-0.197', '4.43', '10.200'],
 ['DMPP-1', 'F8 V', '6196.00', '1.26', '1.21', '0.320', '4.41', '2.010'],
 ['GJ 876', 'M2.5V', '3271.00', '0.30', '0.32', '-1.907', '4.87', '1.000']]
```

**Question 5**: Create a `Star` object for the **first** star in `"stars_1.csv"`.

The **attribute** of the `Star` namedtuple object, the corresponding **column** of the `stars_1.csv` file where the value should be obtained from, and the correct **data type** for the value are listed in the table below:

|Attribute of `Star` object|Column of `stars_1.csv`|Data Type|
|---------|------|---------|
|`spectral_type`|Spectral Type|**string**|
|`stellar_effective_temperature`|Stellar Effective Temperature [K]|**float**|
|`stellar_radius`|Stellar Radius [Solar Radius]|**float**|
|`stellar_mass`|Stellar Mass [Solar mass]|**float**|
|`stellar_luminosity`|Stellar Luminosity [log(Solar)]|**float**|
|`stellar_surface_gravity`|Stellar Surface Gravity [log10(cm/s**2)]|**float**|
|`stellar_age`|Stellar Age [Gyr]|**float**|
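
The pattern of locating a column by name and then typecasting the value can be sketched on unrelated, hypothetical data. The `header`, `rows`, and column names below are made up for illustration and are not part of the dataset.

```python
# hypothetical header and rows, shaped like the output of a CSV reader
header = ["Name", "Mass [kg]", "Radius [km]"]
rows = [["Earth", "5.97e24", "6371"],
        ["Mars", "6.42e23", "3390"]]

row_idx = 1  # the second row

# header.index finds the position of a column from its name
name = rows[row_idx][header.index("Name")]
radius = float(rows[row_idx][header.index("Radius [km]")])

print(name, radius)  # Mars 3390.0
```

Looking columns up by name rather than by a hardcoded number keeps the code correct even if the column order changes.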

In [None]:
# replace the ... with your code

row_idx = 0 # the index of the star we want to convert into a Star object

# extract the values from stars_1_rows
spectral_type = stars_1_rows[row_idx][stars_header.index(...)]
stellar_effective_temperature = float(stars_1_rows[row_idx][stars_header.index(...)])
stellar_radius = ...
stellar_mass = ...
stellar_luminosity = ...
stellar_surface_gravity = ...
stellar_age = ...

# initialize 'first_star'
first_star = Star(spectral_type, stellar_effective_temperature, stellar_radius, \
                  stellar_mass, stellar_luminosity, \
                  stellar_surface_gravity, stellar_age)

first_star

In [None]:
grader.check("q5")

**Question 6**: Create a `Star` object for the **second** star in `"stars_1.csv"`.

You **must** create the `Star` object similarly to what you did in the previous question.

In [None]:
# compute and store the answer in the variable 'second_star', then display it


In [None]:
grader.check("q6")

**Question 7.1**: What is the `spectral_type` of the **second** star in `"stars_1.csv"`?

You **must** answer this question by accessing the correct **attribute** of the `Star` object `second_star`.

In [None]:
# we have done this one for you

second_star_spectral_type = second_star.spectral_type

second_star_spectral_type

In [None]:
grader.check("q7-1")

**Question 7.2**: What is the `stellar_age` of the **first** star in `"stars_1.csv"`?

You **must** answer this question by accessing the correct **attribute** of the `Star` object `first_star`.

In [None]:
# compute and store the answer in the variable 'first_star_stellar_age', then display it


In [None]:
grader.check("q7-2")

**Question 7.3**: What is the **ratio** of the `stellar_radius` of the **first** star in `"stars_1.csv"` to the **second** star in `"stars_1.csv"`?

You **must** answer this question by accessing the correct **attribute** of the `Star` objects `first_star` and `second_star`.

In [None]:
# compute and store the answer in the variable 'stellar_radius_ratio', then display it


In [None]:
grader.check("q7-3")

**Question 8**: Create a **dictionary** mapping the `name` of each star in `"stars_1.csv"` to its `Star` object.

Your output **must** look like this:
```python
{'55 Cnc': Star(spectral_type='G8V', stellar_effective_temperature=5172.0, stellar_radius=0.94, 
                stellar_mass=0.91, stellar_luminosity=-0.197, stellar_surface_gravity=4.43, stellar_age=10.2),
 'DMPP-1': Star(spectral_type='F8 V', stellar_effective_temperature=6196.0, stellar_radius=1.26, 
                stellar_mass=1.21, stellar_luminosity=0.32, stellar_surface_gravity=4.41, stellar_age=2.01),
 'GJ 876': Star(spectral_type='M2.5V', stellar_effective_temperature=3271.0, stellar_radius=0.3, 
                stellar_mass=0.32, stellar_luminosity=-1.907, stellar_surface_gravity=4.87, stellar_age=1.0)}
```

In [None]:
# replace the ... with your code

stars_1_dict = {} # initialize empty dictionary to store all stars

for row_idx in range(len(stars_1_rows)):
    star_name = stars_1_rows[row_idx][stars_header.index(...)]
    spectral_type = ...
    stellar_effective_temperature = ...
    # extract the other columns from 'stars_1_rows'
    
    star = ... # initialize the 'Star' object using the variables defined above
    stars_1_dict[...] = star

stars_1_dict

In [None]:
stars_1_rows

In [None]:
grader.check("q8")

**Question 9.1**: What is the `Star` object of the star (in `stars_1.csv`) named *GJ 876*?

You **must** access the `Star` object in `stars_1_dict` **dictionary** defined above to answer this question.

In [None]:
# compute and store the answer in the variable 'gj_876', then display it


In [None]:
grader.check("q9-1")

**Question 9.2**: What is the `stellar_luminosity` of the star (in `stars_1.csv`) named *GJ 876*?

You **must** access the `Star` object in `stars_1_dict` **dictionary** defined above to answer this question.

In [None]:
# compute and store the answer in the variable 'gj_876_luminosity', then display it


In [None]:
grader.check("q9-2")

### Segment 3.2: Data Cleaning - missing data

We have already parsed the data in `stars_1.csv`. We are now ready to parse the data in **all** the star files of the `small_data` directory. However, there is one minor inconvenience - there is some missing data in `stars_2.csv` and `stars_3.csv`. For example, this is the **first** row of `stars_2.csv`:

```python
['HD 158259', 'G0', '5801.89', '1.21', '1.08', '0.212', '4.25', '']
```

As you can see, the value of the last column (`Stellar Age [Gyr]`) is `''`, which means that the data is missing. When the data is missing, we will want the value of the corresponding attribute in the `Star` object to be `None`.

So, for example, if we are to convert the row above to be a `Star` object, it should look like:

```python
Star(spectral_type='G0', stellar_effective_temperature=5801.89, stellar_radius=1.21, stellar_mass=1.08,
     stellar_luminosity=0.212, stellar_surface_gravity=4.25, stellar_age=None)
```

### Function 1: `star_cell(row_idx, col_name, stars_rows, header=stars_header)`

Since we need to clean the values of the **list** of **lists** `stars_rows` before we can create our required data structure (**dictionary** mapping **strings** to `Star` objects), now would be a good time to create a function that takes in a `row_idx`, a `col_name` and a **list** of **lists** `stars_rows` (as well as the optional argument `header`) and returns the value of the column `col_name` at the row `row_idx`.

This function **must** typecast the values it returns based on the `col_name`. If the value in `stars_rows` is missing (i.e., it is `''`), then the value returned **must** be `None`.

Recall that the **column** of `stars_rows` where the value should be obtained from, and the correct **data type** for the value are listed in the table below:

|Column of `stars_rows`|Data Type|
|------|---------|
|Name|**string**|
|Spectral Type|**string**|
|Stellar Effective Temperature [K]|**float**|
|Stellar Radius [Solar Radius]|**float**|
|Stellar Mass [Solar mass]|**float**|
|Stellar Luminosity [log(Solar)]|**float**|
|Stellar Surface Gravity [log10(cm/s**2)]|**float**|
|Stellar Age [Gyr]|**float**|

**Hint:** You can use the `cell` function defined in P6 and P7 for inspiration here.
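
The missing-value rule on its own can be sketched with a small helper. The name `to_float_or_none` is hypothetical, and the helper handles only **float** columns; it is a sketch of the idea, not the full `star_cell` function.

```python
def to_float_or_none(text):
    """Return None for an empty string, otherwise the text as a float."""
    if text == "":
        return None
    return float(text)

print(to_float_or_none("4.25"))  # 4.25
print(to_float_or_none(""))      # None
```

Checking for `''` before typecasting matters, because `float('')` raises a `ValueError`.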

In [None]:
# replace the ... with your code

# the default argument to the parameter 'header' is the global variable 'stars_header' defined above
def star_cell(row_idx, col_name, stars_rows, header=stars_header):
    col_idx = header.index(...)
    val = stars_rows[row_idx][col_idx]
    # return None if value is missing
    # else typecast 'val' and return it depending on 'col_name'

**Question 10.1**: Use the `star_cell` function to find the value of the column `"Spectral Type"` of the **first** star in `"stars_2.csv"`.

In [None]:
# we have done this one for you

# first read the data in 'stars_2.csv' as a list of lists
stars_2_data = process_csv(os.path.join("small_data", "stars_2.csv"))
stars_2_rows = stars_2_data[1:]

# use the 'star_cell' function to extract the correct value
first_star_type = star_cell(0, 'Spectral Type', stars_2_rows)

first_star_type

In [None]:
grader.check("q10-1")

**Question 10.2**: Use the `star_cell` function to find the value of the column `"Stellar Age [Gyr]"` of the **second** star in `"stars_2.csv"`.

In [None]:
# we have done this one for you
# do not worry if there is no output, the variable is expected to hold the value None

# use the 'star_cell' function to extract the correct value
second_star_age = star_cell(1, 'Stellar Age [Gyr]', stars_2_rows)

second_star_age

In [None]:
grader.check("q10-2")

**Question 10.3**: Use the `star_cell` function to find the value of the column `"Stellar Mass [Solar mass]"` of the **third** star in `"stars_2.csv"`.

In [None]:
# we have done this one for you

# use the 'star_cell' function to extract the correct value
third_star_mass = star_cell(2, 'Stellar Mass [Solar mass]', stars_2_rows)

third_star_mass

In [None]:
grader.check("q10-3")

**Question 11**: Create a **dictionary** mapping the `name` of each star in `"stars_2.csv"` to its `Star` object.

You **must** use the `star_cell` function to extract data from `stars_2.csv`.

Your output **must** look like this:
```python
{'HD 158259': Star(spectral_type='G0', stellar_effective_temperature=5801.89, stellar_radius=1.21, 
                   stellar_mass=1.08, stellar_luminosity=0.212, stellar_surface_gravity=4.25, stellar_age=None),
 'K2-187': Star(spectral_type=None, stellar_effective_temperature=5438.0, stellar_radius=0.83, 
                stellar_mass=0.97, stellar_luminosity=-0.21, stellar_surface_gravity=4.6, stellar_age=None),
 'WASP-47': Star(spectral_type=None, stellar_effective_temperature=5552.0, stellar_radius=1.14, 
                 stellar_mass=1.04, stellar_luminosity=0.032, stellar_surface_gravity=4.34, stellar_age=6.5)}
```

In [None]:
# replace the ... with your code

stars_2_dict = {} # initialize empty dictionary to store all stars

for row_idx in range(len(stars_2_rows)):
    star_name = star_cell(row_idx, 'Name', stars_2_rows)
    spectral_type = ...
    stellar_effective_temperature = ...
    # extract the other columns from 'stars_2_rows'
    
    star = ... # initialize the 'Star' object using the variables defined above
    stars_2_dict[...] = star

stars_2_dict

In [None]:
grader.check("q11")

**Question 12.1**: Create a **dictionary** mapping the `name` of each star in `"stars_3.csv"` to its `Star` object.

You **must** use the `star_cell` function to extract data from `stars_3.csv`.

Your output **must** look like this:
```python
{'K2-133': Star(spectral_type='M1.5 V', stellar_effective_temperature=3655.0, stellar_radius=0.46, 
                stellar_mass=0.46, stellar_luminosity=-1.479, stellar_surface_gravity=4.77, stellar_age=None),
 'K2-138': Star(spectral_type='G8 V', stellar_effective_temperature=5356.3, stellar_radius=0.86, 
                stellar_mass=0.94, stellar_luminosity=-0.287, stellar_surface_gravity=4.54, stellar_age=2.8),
 'GJ 667 C': Star(spectral_type='M1.5 V', stellar_effective_temperature=3350.0, stellar_radius=None, 
                  stellar_mass=0.33, stellar_luminosity=-1.863, stellar_surface_gravity=4.69, stellar_age=2.0)}
```

In [None]:
# compute and store the answer in the variable 'stars_3_dict', then display it


In [None]:
grader.check("q12-1")

**Question 12.2**: Combine the three **dictionaries** `stars_1_dict`, `stars_2_dict`, and `stars_3_dict` into a single **dictionary** with all the stars in the `small_data` directory.
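
How `dict.update` merges dictionaries can be sketched on two small, hypothetical dictionaries. The names `first`, `second`, and `combined` are illustrative only.

```python
first = {"a": 1, "b": 2}
second = {"c": 3}

combined = {}            # start from an empty dictionary
combined.update(first)   # copies every key/value pair of 'first' into 'combined'
combined.update(second)  # then does the same for 'second'

print(combined)  # {'a': 1, 'b': 2, 'c': 3}
```

If the same key appears in more than one dictionary, the value from the later `update` call replaces the earlier one.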

In [None]:
# replace the ... with your code

stars_dict = ... # initialize an empty dictionary
stars_dict.update(...) # add stars_1_dict to stars_dict
# add stars_2_dict and stars_3_dict to stars_dict

stars_dict

In [None]:
grader.check("q12-2")

### Data Structure 2: namedtuple `Planet`

Just as you did with the stars, you will be using named tuples to store the data about the planets in the `planets_1.csv`, `planets_2.csv`, and `planets_3.csv` files. Before you start reading these files, however, you **must** create a new `Planet` type (using namedtuple). It **must** have the following attributes:

* `planet_name`,
* `host_name`,
* `discovery_method`,
* `discovery_year`,
* `controversial_flag`,
* `orbital_period`,
* `planet_radius`,
* `planet_mass`,
* `semi_major_radius`,
* `eccentricity`,
* `equilibrium_temperature`,
* `insolation_flux`.

In [None]:
# define the namedtuple 'Planet' here

planets_attributes = ... # initialize the list of attributes

# define the namedtuple 'Planet'


In [None]:
# run the following cell to initialize and test an example Planet object
# if this cell fails to execute, you have likely not defined the namedtuple 'Planet' correctly
jupiter = Planet('Jupiter', 'Sun', 'Imaging', 1610, False, 4333.0, 11.209, 317.828, 5.2038, 0.0489, 110, 0.0345)

jupiter

In [None]:
grader.check("planet_object")

### Segment 3.3: Creating `Planet` objects

We are now ready to read the files in the `small_data` directory and create `Planet` objects. Creating `Planet` objects, however, is going to be more difficult than creating `Star` objects, because the data required to create a single `Planet` object is split across different files.

The `planets_1.csv`, `planets_2.csv`, and `planets_3.csv` files contain all the data required to create `Planet` objects **except** for the `host_name`. The `host_name` for each planet is to be found in the `mapping_1.json`, `mapping_2.json`, and `mapping_3.json` files.
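
The idea of combining the two sources can be sketched on simplified, hypothetical data: each row supplies the planet's own values, and the mapping supplies the host star. The two-column `rows` and the plain tuples below are assumptions made to keep the sketch short; they are not the `Planet` namedtuple required by the lab.

```python
# hypothetical planet rows (name, discovery year) and a planet-to-star mapping
rows = [["55 Cnc b", "1996"], ["GJ 876 c", "2000"]]
mapping = {"55 Cnc b": "55 Cnc", "GJ 876 c": "GJ 876"}

combined = []
for row in rows:
    planet_name = row[0]
    host_name = mapping[planet_name]  # the host comes from the mapping, not the row
    discovery_year = int(row[1])      # values read from a CSV are strings
    combined.append((planet_name, host_name, discovery_year))

print(combined)  # [('55 Cnc b', '55 Cnc', 1996), ('GJ 876 c', 'GJ 876', 2000)]
```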

First, let us read the data in `planets_1.csv`. Since this is a CSV file, you can use the `process_csv` function from above to read this file. In the cell below, you **must** read the data in `planets_1.csv` and extract the **header** and the non-header **rows** of the file.

**Question 13.1**: Read the contents of `'planets_1.csv'` into a **list** of **lists** using the `process_csv` function, and extract the **header** and the **rows** in the file.

In [None]:
# replace the ... with your code

planets_1_csv = process_csv(...) # read the data in 'planets_1.csv'
planets_header = ...
planets_1_rows = ...

In [None]:
grader.check("q13-1")

Now, you are ready to read the data in `mapping_1.json`. Since this is a JSON file, you will need a new function to read this file:

In [None]:
# this function uses the 'load' function from the json module (already imported in this notebook) to read files
def read_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

**Question 13.2**: Read the contents of `'mapping_1.json'` into a **dictionary** using the `read_json` function.

In [None]:
# we have done this for you

mapping_1_json = read_json(os.path.join("small_data", "mapping_1.json"))

mapping_1_json

In [None]:
grader.check("q13-2")

### Segment 3.4: Combining data from CSV and JSON files

We are now ready to combine the data from `planets_1_rows` and `mapping_1_json` to create `Planet` objects. Before we start, it might be useful to create a function similar to `star_cell` for preprocessing the values in the CSV files.

### Function 2: `planet_cell(row_idx, col_name, planets_rows, header=planets_header)`

Just like the data in `stars_1.csv`, `stars_2.csv`, and `stars_3.csv`, some of the data in `planets_1.csv`, `planets_2.csv`, and `planets_3.csv` is **missing**.  So, now would be a good time to create a function that takes in a `row_idx`, a `col_name` and a **list** of **lists** `planets_rows` (as well as the optional argument `header`) and returns the value of the column `col_name` at the row `row_idx`.

This function **must** typecast the values it returns based on the `col_name`. If the value in `planets_rows` is missing (i.e., it is `''`), then the value returned **must** be `None`.

The **column** of `planets_rows` where the value should be obtained from, and the correct **data type** for the value are listed in the table below:

|Column of `planets_rows`|Data Type|
|------|---------|
|Planet Name|**string**|
|Discovery Year|**int**|
|Discovery Method|**string**|
|Controversial Flag|**bool**|
|Orbital Period [days]|**float**|
|Planet Radius [Earth Radius]|**float**|
|Planet Mass [Earth Mass]|**float**|
|Orbit Semi-Major Axis [au]|**float**|
|Eccentricity|**float**|
|Equilibrium Temperature [K]|**float**|
|Insolation Flux [Earth Flux]|**float**|
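
The typecasting rule described above can be illustrated on a much smaller table. The following is only a minimal sketch: `toy_header`, `toy_rows`, and `toy_cell` are hypothetical names with made-up values, and the real `planet_cell` has to handle every column listed in the table.

```python
# Hypothetical header and rows, standing in for the output of process_csv
toy_header = ["Name", "Year", "Radius"]
toy_rows = [
    ["Toy b", "1996", "13.9"],
    ["Toy c", "2004", ""],  # the radius is missing
]

def toy_cell(row_idx, col_name, rows, header=toy_header):
    """Return the typecast value at (row_idx, col_name), or None if it is missing."""
    col_idx = header.index(col_name)
    val = rows[row_idx][col_idx]
    if val == "":
        return None
    if col_name == "Year":
        return int(val)
    if col_name == "Radius":
        return float(val)
    return val  # every other column stays a string

print(toy_cell(0, "Year", toy_rows))    # 1996
print(toy_cell(0, "Radius", toy_rows))  # 13.9
print(toy_cell(1, "Radius", toy_rows))  # None
```

Checking for `''` before typecasting matters, because `int('')` and `float('')` both raise a `ValueError`.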

**Important Warning:** Notice that the `Controversial Flag` column has to be converted into a **bool**. The data is stored in `planets_1.csv` (and consequently in `planets_rows`) as `"0"/"1"` values (with `"0"` representing `False` and `"1"` representing `True`). However, typecasting **strings** to **bools** is not straightforward. Run the following cell and try to figure out what is happening:

In [None]:
strings = ["0", "1", "", " ", "True", "False"]
for string in strings:
    print(bool(string))

If you want to convert the **strings** into **bools**, you will have to explicitly use `if/else` statements to determine whether the value is `"0"` or `"1"`, as can be seen in the starter code below:
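
The reason is that `bool` applied to a **string** only checks whether the string is empty. The sketch below shows the pitfall next to an explicit comparison; `flag_to_bool` is a hypothetical helper used only for illustration, and its comparison behaves the same way as an `if/else` that returns `True` or `False`.

```python
# bool() on a string only checks whether the string is empty
print(bool("0"))  # True, because "0" is a non-empty string
print(bool(""))   # False, because the string is empty

# Hypothetical helper: compare against the expected text instead
def flag_to_bool(text):
    return text == "1"

print(flag_to_bool("0"))  # False
print(flag_to_bool("1"))  # True
```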

In [None]:
# replace the ... with your code

def planet_cell(row_idx, col_name, planets_rows, header=planets_header):
    col_idx = ... # extract col_idx from col_name and header
    val = ... # extract the value at row_idx and col_idx
    if val == '':
        return None
    if col_name in ["Controversial Flag"]:
        if val == "1":
            return ...
        else:
            return ...
    # for all other columns typecast 'val' and return it depending on col_name

**Question 14.1**: Use the `planet_cell` function to find the value of the column `"Planet Name"` of the **first** planet in `"planets_1.csv"`.

In [None]:
# we have done this one for you

first_planet_name = planet_cell(0, 'Planet Name', planets_1_rows)

first_planet_name

In [None]:
grader.check("q14-1")

**Question 14.2**: Use the `planet_cell` function to find the value of the column `"Insolation Flux [Earth Flux]"` of the **first** planet in `"planets_1.csv"`.

In [None]:
# we have done this one for you
# do not worry if there is no output, the variable is expected to hold the value None

first_planet_flux = planet_cell(0, 'Insolation Flux [Earth Flux]', planets_1_rows)

first_planet_flux

In [None]:
grader.check("q14-2")

**Question 14.3**: Use the `planet_cell` function to find the value of the column `"Controversial Flag"` of the **second** planet in `"planets_1.csv"`.

In [None]:
# compute and store the answer in the variable 'second_planet_controversy', then display it


In [None]:
grader.check("q14-3")

**Question 15**: Create a `Planet` object for the **first** planet in `"planets_1.csv"`.

The **attribute** of the `Planet` namedtuple object, the corresponding **column** of the `planets_1.csv` file where the value should be obtained from, and the correct **data type** for the value are listed in the table below:

|Attribute of `Planet` object|Column of `planets_1.csv`|Data Type|
|---------|------|---------|
|`planet_name`|Planet Name|**string**|
|`host_name`| - |**string**|
|`discovery_method`|Discovery Method|**string**|
|`discovery_year`|Discovery Year|**int**|
|`controversial_flag`|Controversial Flag|**bool**|
|`orbital_period`|Orbital Period [days]|**float**|
|`planet_radius`|Planet Radius [Earth Radius]|**float**|
|`planet_mass`|Planet Mass [Earth Mass]|**float**|
|`semi_major_radius`|Orbit Semi-Major Axis [au]|**float**|
|`eccentricity`|Eccentricity|**float**|
|`equilibrium_temperature`|Equilibrium Temperature [K]|**float**|
|`insolation_flux`|Insolation Flux [Earth Flux]|**float**|


The value of the `host_name` attribute is found in `mapping_1.json`.
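
The idea of using a name from a CSV row as the key into a JSON mapping can be sketched on toy data. Everything here is hypothetical: the `Body` namedtuple, the rows, and the mapping are made up and are not part of the lab's dataset.

```python
from collections import namedtuple

# Hypothetical namedtuple and data, used only for illustration
Body = namedtuple("Body", ["body_name", "host_name", "radius"])

toy_rows = [["Toy b", "1.5"], ["Toy c", "2.0"]]   # as if read from a CSV file
toy_mapping = {"Toy b": "Toy", "Toy c": "Toy"}    # as if read from a JSON file

body_name = toy_rows[0][0]
host_name = toy_mapping[body_name]  # the name from the CSV row is the key into the mapping
radius = float(toy_rows[0][1])

first_body = Body(body_name, host_name, radius)
print(first_body)
# Body(body_name='Toy b', host_name='Toy', radius=1.5)
```

The arguments are passed in the same order as the field names given when the namedtuple was defined, which is also how `Planet` objects have to be built.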

In [None]:
planets_header

In [None]:
# replace the ... with your code

row_idx = 0 # the index of the planet we want to convert into a Planet object

# extract the values from planets_1_rows
planet_name = planet_cell(row_idx, 'Planet Name', planets_1_rows)
host_name = mapping_1_json[planet_name]
discovery_method = planet_cell(row_idx, 'Discovery Method', planets_1_rows)
discovery_year = ...
controversial_flag = ...
orbital_period = ...
planet_radius = ...
planet_mass = ...
semi_major_radius = ...
eccentricity = ...
equilibrium_temperature = ...
insolation_flux = ...

# initialize 'first_planet'
first_planet = Planet(planet_name, host_name, discovery_method, discovery_year,
                      controversial_flag, orbital_period, planet_radius, planet_mass,
                      semi_major_radius, eccentricity, equilibrium_temperature, insolation_flux)

first_planet

In [None]:
grader.check("q15")

**Question 16**: Create a **list** containing one `Planet` object for each planet in `"planets_1.csv"`.

Your output **must** look like this:
```python
[Planet(planet_name='55 Cnc b', host_name='55 Cnc', discovery_method='Radial Velocity', 
        discovery_year=1996, controversial_flag=False, orbital_period=14.6516, 
        planet_radius=13.9, planet_mass=263.9785, semi_major_radius=0.1134, eccentricity=0.0,
        equilibrium_temperature=700.0, insolation_flux=None),
 Planet(planet_name='55 Cnc c', host_name='55 Cnc', discovery_method='Radial Velocity', 
        discovery_year=2004, controversial_flag=False, orbital_period=44.3989, 
        planet_radius=8.51, planet_mass=54.4738, semi_major_radius=0.2373, eccentricity=0.03, 
        equilibrium_temperature=None, insolation_flux=None),
 Planet(planet_name='DMPP-1 b', host_name='DMPP-1', discovery_method='Radial Velocity', 
        discovery_year=2019, controversial_flag=False, orbital_period=18.57, 
        planet_radius=5.29, planet_mass=24.27, semi_major_radius=0.1462, eccentricity=0.083, 
        equilibrium_temperature=877.0, insolation_flux=None),
 Planet(planet_name='GJ 876 b', host_name='GJ 876', discovery_method='Radial Velocity', 
        discovery_year=1998, controversial_flag=False, orbital_period=61.1166, 
        planet_radius=13.3, planet_mass=723.2235, semi_major_radius=0.208317, eccentricity=0.0324,
        equilibrium_temperature=None, insolation_flux=None),
 Planet(planet_name='GJ 876 c', host_name='GJ 876', discovery_method='Radial Velocity', 
        discovery_year=2000, controversial_flag=False, orbital_period=30.0881, 
        planet_radius=14.0, planet_mass=226.9846, semi_major_radius=0.12959, eccentricity=0.25591, 
        equilibrium_temperature=None, insolation_flux=None)]
```

In [None]:
# compute and store the answer in the variable 'planets_1_list', then display it


In [None]:
grader.check("q16")

**Question 17.1**: What is the **fifth** `Planet` object in `'planets_1.csv'`?

You **must** access `planets_1_list` to answer this question.

In [None]:
# compute and store the answer in the variable 'fifth_planet', then display it


In [None]:
grader.check("q17-1")

**Question 17.2**: What is the `planet_name` of the **fifth** `Planet` in `'planets_1.csv'`?

You **must** access `planets_1_list` to answer this question.

In [None]:
# compute and store the answer in the variable 'fifth_planet_name', then display it


In [None]:
grader.check("q17-2")

**Question 17.3**: What is the `controversial_flag` of the **fourth** `Planet` in `'planets_1.csv'`?

You **must** access `planets_1_list` to answer this question.

In [None]:
# compute and store the answer in the variable 'fourth_planet_controversy', then display it


In [None]:
grader.check("q17-3")

### Segment 3.5: Data Cleaning - broken CSV rows

The code you have written worked well for reading the data in `planets_1.csv` and `mapping_1.json`. However, it will likely **not** work for `planets_2.csv` and `mapping_2.json`. This is because the file `planets_2.csv` is **broken**. For some reason, a few rows in `planets_2.csv` have their data jumbled up. This is what `planets_2.csv` looks like:

|Planet Name|Discovery Method|Discovery Year|Controversial Flag|Orbital Period [days]|Planet Radius [Earth Radius]|Planet Mass [Earth Mass]|Orbit Semi-Major Axis [au]|Eccentricity|Equilibrium Temperature [K]|Insolation Flux [Earth Flux]|
|-----------|----------------|--------------|------------------|---------------------|----------------------------|------------------------|--------------------------|------------|---------------------------|----------------------------|
|HD 158259 b|Radial Velocity|2020|0|2.17800000|1.292|2.22000|||1478|794.22|
|K2-187 b|Transit|2018|0|0.77401000|1.200|1.87000|0.016400||1815||
|K2-187 c|Transit|2018|0|2.87151200|1.400|2.54000|0.039200||1173||
|K2-187 d|K2-187|Transit|2018|0|7.14958400|2.400|6.35000|0.072000||865|
|WASP-47 b|2012|Transit|0|4.15914920|12.640|363.60000|0.052000|0.002800|1275|534.00|

We can see that for some reason, in the **fourth** row, the value under the column `Discovery Method` is the name of the planet's host star. This is causing all the other columns in the row to also take meaningless values.

Similarly, in the **fifth** row, we see that the values under the columns `Discovery Method` and `Discovery Year` are swapped.

We will call a **row** in a CSV file a **broken row** if the values under some column do not match the expected format. While it is sometimes possible to extract useful data from broken rows, in this lab and in P10, we will simply **skip** broken rows.

In order to **skip** broken rows, you should first know how to recognize a **broken row**. There is no general rule that identifies a broken row, because CSV rows can be **broken** in many different ways. Thankfully, we don't have to write code to catch every weird case. It will suffice for us to manually **inspect** the file `planets_2.csv`, and identify **how** the rows are broken.

The simplest way to recognize a broken row is to check whether your code runs into a **runtime error** when it processes that row. So, one simple way to skip bad rows would be to use `try/except` blocks to avoid processing any rows that cause the code to crash.
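
As a minimal sketch of this idea, the loop below skips a row whose year cannot be typecast. The rows are hypothetical toy data with only two columns, not the contents of `planets_2.csv`.

```python
# Hypothetical rows: name, discovery year
toy_rows = [
    ["Toy b", "2018"],
    ["Toy c", "Transit"],  # broken: text where a year is expected
    ["Toy d", "2012"],
]

good_years = []
for row in toy_rows:
    try:
        good_years.append(int(row[1]))  # int("Transit") raises a ValueError
    except ValueError:
        continue  # skip the broken row

print(good_years)  # [2018, 2012]
```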

**Important Note:** In this dataset, as you might have already noticed, it would be **significantly harder** to detect **broken rows** where some of the numerical values are swapped (for example, `Planet Radius [Earth Radius]` and `Planet Mass [Earth Mass]`). You may **assume** that the numerical values are **not** swapped in **any** row, and that **only the rows** in which the **data types** are not as expected are **broken**.

**Question 18**: Create a **list** containing one `Planet` object for each planet in `"planets_2.csv"`.

You **must** skip any broken rows in the CSV file. Your output **must** look like this:
```python
[Planet(planet_name='HD 158259 b', host_name='HD 158259', discovery_method='Radial Velocity', 
        discovery_year=2020, controversial_flag=False, orbital_period=2.178, 
        planet_radius=1.292, planet_mass=2.22, semi_major_radius=None, eccentricity=None, 
        equilibrium_temperature=1478.0, insolation_flux=794.22),
 Planet(planet_name='K2-187 b', host_name='K2-187', discovery_method='Transit', 
        discovery_year=2018, controversial_flag=False, orbital_period=0.77401, 
        planet_radius=1.2, planet_mass=1.87, semi_major_radius=0.0164, eccentricity=None, 
        equilibrium_temperature=1815.0, insolation_flux=None),
 Planet(planet_name='K2-187 c', host_name='K2-187', discovery_method='Transit', 
        discovery_year=2018, controversial_flag=False, orbital_period=2.871512, 
        planet_radius=1.4, planet_mass=2.54, semi_major_radius=0.0392, eccentricity=None, 
        equilibrium_temperature=1173.0, insolation_flux=None)]
```

In [None]:
# replace the ... with your code

planets_2_data = ... # read planets_2.csv
planets_2_rows = ... # extract the rows from planets_2_data
mapping_2_json = ... # read mapping_2.json

planets_2_list = []
for row_idx in range(len(planets_2_rows)):
    try:
        pass # replace with your code
        # create a Planet object and append to 'planets_2_list'
    except ValueError:
        continue

planets_2_list

In [None]:
grader.check("q18")

**Important Warning:** It is considered a bad coding practice to use *bare* `try/except` blocks. This means that you should **never** write code like this:

```python
try:
    # some code
except:
    # some other code
```

If you use *bare* `try/except` blocks, your code will appear to work even when it contains bugs, which makes it very hard to debug. You should always **explicitly** catch specific errors, like this:

```python
try:
    # some code
except ValueError:
    # some other code
except IndexError:
    # some other code
```

This way, your code will still crash if there is some other unexpected bug that needs to be fixed, and will only go to the `except` block if it runs into a `ValueError` or an `IndexError`. The starter code above already catches only `ValueError`. You **must** continue this practice in P10 as well.
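
To see why this matters, consider the following sketch. Both hypothetical functions contain the same typo (`vla` instead of `val`); the bare `except` hides it, while `except ValueError` lets the resulting `NameError` surface.

```python
def sum_bare(values):
    total = 0
    for val in values:
        try:
            total += int(vla)  # bug: 'vla' is a typo for 'val'
        except:                # bare except also swallows the NameError
            continue
    return total

def sum_specific(values):
    total = 0
    for val in values:
        try:
            total += int(vla)  # same bug
        except ValueError:     # only malformed values are skipped
            continue
    return total

print(sum_bare(["3", "oops", "5"]))  # 0, and the typo goes unnoticed

try:
    sum_specific(["3", "oops", "5"])
except NameError as error:
    print("bug revealed:", error)
```

With the bare `except`, every iteration fails silently and the function returns a wrong answer. With the specific `except`, the program stops at the typo, which can then be fixed.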

### Segment 3.6: Data Cleaning - broken JSON files

So far, we have written code that can read `planets_1.csv` and `mapping_1.json`, as well as `planets_2.csv` and `mapping_2.json`. However, if you try to read `mapping_3.json`, you are likely to run into some issues. This is because the file `mapping_3.json` is **broken**. Unlike **broken** CSV files, where we only had to skip the **broken rows**, it is much harder to parse **broken JSON files**. When a JSON file is **broken**, we often have no choice but to **skip the file entirely**.

It is also not easy to detect if a JSON file is **broken** using `if` statements. The easiest approach is to simply try to read the file using the `read_json` function and check if the code crashes.

**Question 19**: Determine if the `'mapping_3.json'` file is **broken** using a `try/except` block.

In [None]:
# we have done this one for you

try:
    mapping_3_json = read_json(os.path.join("small_data", "mapping_3.json"))
except json.JSONDecodeError:
    mapping_3_json = {}
    
mapping_3_json

In [None]:
grader.check("q19")

In the above cell, note that in the `try/except` block, we specifically checked for the `json.JSONDecodeError`. This is the error that is raised when you call `json.load` on a **broken** JSON file.
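
The same error can be reproduced without any files by parsing text with `json.loads`. This is only a sketch: the two strings are hypothetical stand-ins for file contents, and the way `mapping_3.json` is actually broken may differ.

```python
import json

good_text = '{"Toy b": "Toy"}'
broken_text = '{"Toy b": "Toy"'  # hypothetical broken content: the closing brace is missing

print(json.loads(good_text))  # {'Toy b': 'Toy'}

try:
    mapping = json.loads(broken_text)
except json.JSONDecodeError:
    mapping = {}  # fall back to an empty dictionary

print(mapping)  # {}
```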

## Segment 4: Data Analysis

We have now managed to read all the data in the `small_data` directory. Now is the time to test if our data structures work!

**Question 20.1**: What is the `host_name` of the **second** planet in `'planets_2.csv'`?

You **must** skip any broken rows, so you can access the list `planets_2_list` directly to answer this question.

In [None]:
# compute and store the answer in the variable 'second_planet_host', then display it


In [None]:
grader.check("q20-1")

**Question 20.2**: What is the `Star` object of the **third** planet in `'planets_2.csv'`?

You **must** skip any broken rows, so you can access the list `planets_2_list` directly to answer this question.

**Hint:** You can use the `stars_dict` **dictionary** defined in q12.2 to find the `Star` object.

In [None]:
# compute and store the answer in the variable 'third_planet_star', then display it


In [None]:
grader.check("q20-2")

**Question 20.3**: What is the `stellar_radius` of the star around which the **first** planet in `'planets_1.csv'` orbits?

You can access the list `planets_1_list` directly to answer this question.

In [None]:
# compute and store the answer in the variable 'first_planet_star_radius', then display it


In [None]:
grader.check("q20-3")