Pandas DataFrames

Courses

Intro to Python for Scientists

Tabular Data

Pandas DataFrames

Authors

Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Open in JupyterLite Download Materials

Setup

Import Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from IPython.display import Image

Section 1: What is a DataFrame?

A DataFrame, simply put, is a Table of data. It is a structure that contains multiple rows, each row containing the same labelled collection of data types. A DataFrame is a table with named rows (called the “index”). For example, a DataFrame might look like this:

(index)	Name	Age	Height	LikesIceCream
0	“Nick”	22	3.4	True
1	“Jenn”	55	1.2	True
2	“Joe”	25	2.2	True

Because each row contains the same data, DataFrames can also be thought of as a collection of same-length columns!

Pandas is a Python package that has a DataFrame class. Using either the DataFrame class constructor or one of Pandas’ many read_() functions, you can make your own DataFrame from a variety of sources.

Different Ways of Making DataFrames Directly

From a List of Dicts

Dicts are named collections. If you have many of the same dicts in a list, the DataFrame constructor can convert it to a Dataframe:

Exercises

friends = [
    {'Name': "Nick", "Age": 31, "Height": 2.9, "Weight": 20},
    {'Name': "Jenn", "Age": 55, "Height": 1.2},
    {"Name": "Joe", "Height": 1.2, "Age": 25, },
]
pd.DataFrame(friends)

	Name	Age	Height	Weight
0	Nick	31	2.9	20.0
1	Jenn	55	1.2	NaN
2	Joe	25	1.2	NaN

From a Dict of Lists

df = pd.DataFrame({
    'Name': ['Nick', 'Jenn', 'Joe'], 
    'Age': [31, 55, 25], 
    'Height': [2.9, 1.2, 1.2],
})
df

	Name	Age	Height
0	Nick	31	2.9
1	Jenn	55	1.2
2	Joe	25	1.2

From a List of Lists

if you have a collection of same-length sequences, you essentially have a rectangular data structure already! All that’s needed is to add some column labels.

friends = [
    ['Nick', 31, 2.9],
    ['Jenn', 55, 1.2],
    ['Joe',  25, 1.2],
]
pd.DataFrame(friends, columns=["Name", "Age", "Height"])

	Name	Age	Height
0	Nick	31	2.9
1	Jenn	55	1.2
2	Joe	25	1.2

From an empty DataFrame

If you prefer, you can also add columns one at a time, starting with an empty DataFrame:

df = pd.DataFrame()
df['Name'] = ['Nick', 'Jenn', 'Joe']
df['Age'] = [31, 55, 25]
df['Height'] = [2.9, 1.2, 1.2]
df

	Name	Age	Height
0	Nick	31	2.9
1	Jenn	55	1.2
2	Joe	25	1.2

Exercise: Please use Pandas to recreate the table here as a Dataframe using one of the approaches detailed above:

Year	Product	Cost
2015	Apples	0.35
2016	Apples	0.45
2015	Bananas	0.75
2016	Bananas	1.10

Solution

df = pd.DataFrame()
df['Year'] = [2015, 2016, 2015, 2016]
df['Product'] = ['Apples', 'Apples', 'Bananas', 'Bananas']
df['Cost'] = [.35, .45, .75, 1.10]
df

	Year	Product	Cost
0	2015	Apples	0.35
1	2016	Apples	0.45
2	2015	Bananas	0.75
3	2016	Bananas	1.10

Section 2: Reading Data from Files into a DataFrame

File Format	File Extension	`read_xxx()` function	Dataframe Write Method
Comma-Seperated Values	.csv	`pd.read_csv()`	`df.to_csv()`
Tab-seperated Values	.tsv, .tabular, .csv	`pd.read_csv(sep='\t')`, `pd.read_table()`	`df.to_csv(sep='\t')` `df.to_table()`
Excel Spreadsheet	.xls	`pd.read_excel()`	`df.to_excel()`
Excel Spreadsheet 2010	.xlsx	`pd.read_excel(engine='openpyxl')`	`df.to_excel(engine='openpyxl')`
JSON	.json	`pd.read_json()`	`df.to_json()`
Tables in a Web Page (HTML)	.html	`pd.read_html()[0]`	`df.to_html()`
HDF5	.hdf5, .h5,	`pd.read_hdf()`	`df.to_hdf()`

Understanding Different File Formats

Run the code below to download the Titanic passengers dataset, and transform it into different file formats

Note: Yep, that’s right, you can supply a web url and pandas reads it like a normal file!

Exercises

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df[:5]

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Now run the code below to save the file to a comma-seperated file using the DataFrame.to_csv() method.

df.to_csv("titanic2001.csv", index=False)

Exercise: Use a text editor to examine the file that was saved on the computer. How is the file structured?

Solution

Exercise: Now read the file back into Python using the pd.read_csv() function:

Solution

pd.read_csv('titanic2001.csv')

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 15 columns

JSON

Exercise: Save the dataframe to a JSON file using the df.to_json() method. Make sure to use the correct extension in the filename.

Solution

df.to_json('titanic.ignore.json', )

Exercise: Read the JSON file into Pandas again, using the pd.read_json() method.

Solution

pd.read_json('titanic.ignore.json')

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	None	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	None	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	None	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	None	Southampton	no	True
887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	None	Southampton	no	False
889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	None	Queenstown	no	True

891 rows × 15 columns

HTML

Exercise: Save the dataframe to a HTML file, using the df.to_html() method.

Solution

df.to_html('titanic.ignore.html')

Exercise: Read the HTML file into Pandas again, using the pd.read_html() function.

Note: Because HTML files can contain multiple dataframes, you’ll get a list of dataframes instead of just one. Just add [0] onto the end of the line to get the first dataframe, and it will look the same as before.

Solution

pd.read_html('titanic.ignore.html')[0]

	Unnamed: 0	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 16 columns

Exercise: Try opening the file with a text editor, then with a web browser (e.g. Chrome, Firefox, etc). What does the file look like in each case? In text editor:

Solution

In the Editor:
In the Browser:

Exercise: Save the dataframe to an Excel file. DataFrame.to_excel(engine='openpyxl') Note: Because XLS and XLSX are proprietary formats, you may need to install a couple extra packages for this to work

Solution

df.to_excel('titanic.ignore.xlsx', engine='openpyxl')

Exercise: Read the Excel file into Pandas again, using the pd.read_excel(engine='openpyxl') function.

Solution

pd.read_excel('titanic.ignore.xlsx', engine='openpyxl')

	Unnamed: 0	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	886	0	2	male	27.0	0	0	13.0000	S	Second	man	True	NaN	Southampton	no	True
887	887	1	1	female	19.0	0	0	30.0000	S	First	woman	False	B	Southampton	yes	True
888	888	0	3	female	NaN	1	2	23.4500	S	Third	woman	False	NaN	Southampton	no	False
889	889	1	1	male	26.0	0	0	30.0000	C	First	man	True	C	Cherbourg	yes	True
890	890	0	3	male	32.0	0	0	7.7500	Q	Third	man	True	NaN	Queenstown	no	True

891 rows × 16 columns

Exercise: Open it in your spreadsheet program. What does it look like?

Solution