Pandas DataFrames

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Import Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Image

Section 1: What is a DataFrame?

A DataFrame, simply put, is a Table of data. It is a structure that contains multiple rows, each row containing the same labelled collection of data types. A DataFrame is a table with named rows (called the “index”). For example, a DataFrame might look like this:

(index) Name Age Height LikesIceCream
0 “Nick” 22 3.4 True
1 “Jenn” 55 1.2 True
2 “Joe” 25 2.2 True

Because each row contains the same data, DataFrames can also be thought of as a collection of same-length columns!

Pandas is a Python package that has a DataFrame class. Using either the DataFrame class constructor or one of Pandas’ many read_() functions, you can make your own DataFrame from a variety of sources.

Different Ways of Making DataFrames Directly

From a List of Dicts

Dicts are named collections. If you have many of the same dicts in a list, the DataFrame constructor can convert it to a Dataframe:

Exercises

friends = [
    {'Name': "Nick", "Age": 31, "Height": 2.9, "Weight": 20},
    {'Name': "Jenn", "Age": 55, "Height": 1.2},
    {"Name": "Joe", "Height": 1.2, "Age": 25, },
]
pd.DataFrame(friends)

Name Age Height Weight
0 Nick 31 2.9 20.0
1 Jenn 55 1.2 NaN
2 Joe 25 1.2 NaN

From a Dict of Lists

df = pd.DataFrame({
    'Name': ['Nick', 'Jenn', 'Joe'], 
    'Age': [31, 55, 25], 
    'Height': [2.9, 1.2, 1.2],
})
df

Name Age Height
0 Nick 31 2.9
1 Jenn 55 1.2
2 Joe 25 1.2

From a List of Lists

if you have a collection of same-length sequences, you essentially have a rectangular data structure already! All that’s needed is to add some column labels.

friends = [
    ['Nick', 31, 2.9],
    ['Jenn', 55, 1.2],
    ['Joe',  25, 1.2],
]
pd.DataFrame(friends, columns=["Name", "Age", "Height"])

Name Age Height
0 Nick 31 2.9
1 Jenn 55 1.2
2 Joe 25 1.2

From an empty DataFrame

If you prefer, you can also add columns one at a time, starting with an empty DataFrame:

df = pd.DataFrame()
df['Name'] = ['Nick', 'Jenn', 'Joe']
df['Age'] = [31, 55, 25]
df['Height'] = [2.9, 1.2, 1.2]
df

Name Age Height
0 Nick 31 2.9
1 Jenn 55 1.2
2 Joe 25 1.2

Exercise: Please use Pandas to recreate the table here as a Dataframe using one of the approaches detailed above:

Year Product Cost
2015 Apples 0.35
2016 Apples 0.45
2015 Bananas 0.75
2016 Bananas 1.10
Solution
df = pd.DataFrame()
df['Year'] = [2015, 2016, 2015, 2016]
df['Product'] = ['Apples', 'Apples', 'Bananas', 'Bananas']
df['Cost'] = [.35, .45, .75, 1.10]
df

Year Product Cost
0 2015 Apples 0.35
1 2016 Apples 0.45
2 2015 Bananas 0.75
3 2016 Bananas 1.10

Section 2: Reading Data from Files into a DataFrame

File Format File Extension read_xxx() function Dataframe Write Method
Comma-Seperated Values .csv pd.read_csv() df.to_csv()
Tab-seperated Values .tsv, .tabular, .csv pd.read_csv(sep='\t'), pd.read_table() df.to_csv(sep='\t') df.to_table()
Excel Spreadsheet .xls pd.read_excel() df.to_excel()
Excel Spreadsheet 2010 .xlsx pd.read_excel(engine='openpyxl') df.to_excel(engine='openpyxl')
JSON .json pd.read_json() df.to_json()
Tables in a Web Page (HTML) .html pd.read_html()[0] df.to_html()
HDF5 .hdf5, .h5, pd.read_hdf() df.to_hdf()

Understanding Different File Formats

Run the code below to download the Titanic passengers dataset, and transform it into different file formats

Note: Yep, that’s right, you can supply a web url and pandas reads it like a normal file!

Exercises

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df[:5]

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

Now run the code below to save the file to a comma-seperated file using the DataFrame.to_csv() method.

df.to_csv("titanic2001.csv", index=False)

Exercise: Use a text editor to examine the file that was saved on the computer. How is the file structured?

Solution

Exercise: Now read the file back into Python using the pd.read_csv() function:

Solution
pd.read_csv('titanic2001.csv')

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

JSON

Exercise: Save the dataframe to a JSON file using the df.to_json() method. Make sure to use the correct extension in the filename.

Solution
df.to_json('titanic.ignore.json', )

Exercise: Read the JSON file into Pandas again, using the pd.read_json() method.

Solution
pd.read_json('titanic.ignore.json')

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True None Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False None Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True None Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second man True None Southampton no True
887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.4500 S Third woman False None Southampton no False
889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.7500 Q Third man True None Queenstown no True

891 rows × 15 columns

HTML

Exercise: Save the dataframe to a HTML file, using the df.to_html() method.

Solution
df.to_html('titanic.ignore.html')

Exercise: Read the HTML file into Pandas again, using the pd.read_html() function.

Note: Because HTML files can contain multiple dataframes, you’ll get a list of dataframes instead of just one. Just add [0] onto the end of the line to get the first dataframe, and it will look the same as before.

Solution
pd.read_html('titanic.ignore.html')[0]

Unnamed: 0 survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 16 columns

Exercise: Try opening the file with a text editor, then with a web browser (e.g. Chrome, Firefox, etc). What does the file look like in each case? In text editor:

Solution
In the Editor:
In the Browser:

Exercise: Save the dataframe to an Excel file. DataFrame.to_excel(engine='openpyxl') Note: Because XLS and XLSX are proprietary formats, you may need to install a couple extra packages for this to work

Solution
df.to_excel('titanic.ignore.xlsx', engine='openpyxl')

Exercise: Read the Excel file into Pandas again, using the pd.read_excel(engine='openpyxl') function.

Solution
pd.read_excel('titanic.ignore.xlsx', engine='openpyxl')

Unnamed: 0 survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 16 columns

Exercise: Open it in your spreadsheet program. What does it look like?

Solution