Pandas DataFrames
Authors
Setup
Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pdfrom IPython.display import ImageSection 1: What is a DataFrame?
A DataFrame, simply put, is a Table of data. It is a structure that contains multiple rows, each row containing the same labelled collection of data types. A DataFrame is a table with named rows (called the “index”). For example, a DataFrame might look like this:
| (index) | Name | Age | Height | LikesIceCream |
|---|---|---|---|---|
| 0 | “Nick” | 22 | 3.4 | True |
| 1 | “Jenn” | 55 | 1.2 | True |
| 2 | “Joe” | 25 | 2.2 | True |
Because each row contains the same data, DataFrames can also be thought of as a collection of same-length columns!
Pandas is a Python package that has a DataFrame class. Using either the DataFrame class constructor or one of Pandas’ many read_() functions, you can make your own DataFrame from a variety of sources.
Different Ways of Making DataFrames Directly
From a List of Dicts
Dicts are named collections. If you have many of the same dicts in a list, the DataFrame constructor can convert it to a Dataframe:
Exercises
friends = [
{'Name': "Nick", "Age": 31, "Height": 2.9, "Weight": 20},
{'Name': "Jenn", "Age": 55, "Height": 1.2},
{"Name": "Joe", "Height": 1.2, "Age": 25, },
]
pd.DataFrame(friends)| Name | Age | Height | Weight | |
|---|---|---|---|---|
| 0 | Nick | 31 | 2.9 | 20.0 |
| 1 | Jenn | 55 | 1.2 | NaN |
| 2 | Joe | 25 | 1.2 | NaN |
From a Dict of Lists
df = pd.DataFrame({
'Name': ['Nick', 'Jenn', 'Joe'],
'Age': [31, 55, 25],
'Height': [2.9, 1.2, 1.2],
})
df| Name | Age | Height | |
|---|---|---|---|
| 0 | Nick | 31 | 2.9 |
| 1 | Jenn | 55 | 1.2 |
| 2 | Joe | 25 | 1.2 |
From a List of Lists
if you have a collection of same-length sequences, you essentially have a rectangular data structure already! All that’s needed is to add some column labels.
friends = [
['Nick', 31, 2.9],
['Jenn', 55, 1.2],
['Joe', 25, 1.2],
]
pd.DataFrame(friends, columns=["Name", "Age", "Height"])| Name | Age | Height | |
|---|---|---|---|
| 0 | Nick | 31 | 2.9 |
| 1 | Jenn | 55 | 1.2 |
| 2 | Joe | 25 | 1.2 |
From an empty DataFrame
If you prefer, you can also add columns one at a time, starting with an empty DataFrame:
df = pd.DataFrame()
df['Name'] = ['Nick', 'Jenn', 'Joe']
df['Age'] = [31, 55, 25]
df['Height'] = [2.9, 1.2, 1.2]
df| Name | Age | Height | |
|---|---|---|---|
| 0 | Nick | 31 | 2.9 |
| 1 | Jenn | 55 | 1.2 |
| 2 | Joe | 25 | 1.2 |
Exercise: Please use Pandas to recreate the table here as a Dataframe using one of the approaches detailed above:
| Year | Product | Cost |
|---|---|---|
| 2015 | Apples | 0.35 |
| 2016 | Apples | 0.45 |
| 2015 | Bananas | 0.75 |
| 2016 | Bananas | 1.10 |
Solution
df = pd.DataFrame()
df['Year'] = [2015, 2016, 2015, 2016]
df['Product'] = ['Apples', 'Apples', 'Bananas', 'Bananas']
df['Cost'] = [.35, .45, .75, 1.10]
df| Year | Product | Cost | |
|---|---|---|---|
| 0 | 2015 | Apples | 0.35 |
| 1 | 2016 | Apples | 0.45 |
| 2 | 2015 | Bananas | 0.75 |
| 3 | 2016 | Bananas | 1.10 |
Section 2: Reading Data from Files into a DataFrame
| File Format | File Extension | read_xxx() function |
Dataframe Write Method |
|---|---|---|---|
| Comma-Seperated Values | .csv | pd.read_csv() |
df.to_csv() |
| Tab-seperated Values | .tsv, .tabular, .csv | pd.read_csv(sep='\t'), pd.read_table() |
df.to_csv(sep='\t') df.to_table() |
| Excel Spreadsheet | .xls | pd.read_excel() |
df.to_excel() |
| Excel Spreadsheet 2010 | .xlsx | pd.read_excel(engine='openpyxl') |
df.to_excel(engine='openpyxl') |
| JSON | .json | pd.read_json() |
df.to_json() |
| Tables in a Web Page (HTML) | .html | pd.read_html()[0] |
df.to_html() |
| HDF5 | .hdf5, .h5, | pd.read_hdf() |
df.to_hdf() |
Understanding Different File Formats
Run the code below to download the Titanic passengers dataset, and transform it into different file formats
Note: Yep, that’s right, you can supply a web url and pandas reads it like a normal file!
Exercises
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df[:5]| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
Now run the code below to save the file to a comma-seperated file using the DataFrame.to_csv() method.
df.to_csv("titanic2001.csv", index=False)Exercise: Use a text editor to examine the file that was saved on the computer. How is the file structured?
Solution
Exercise: Now read the file back into Python using the pd.read_csv() function:
Solution
pd.read_csv('titanic2001.csv')| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |
891 rows × 15 columns
JSON
Exercise: Save the dataframe to a JSON file using the df.to_json() method. Make sure to use the correct extension in the filename.
Solution
df.to_json('titanic.ignore.json', )Exercise: Read the JSON file into Pandas again, using the pd.read_json() method.
Solution
pd.read_json('titanic.ignore.json')| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | None | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | None | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | None | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | None | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | None | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | None | Queenstown | no | True |
891 rows × 15 columns
HTML
Exercise: Save the dataframe to a HTML file, using the df.to_html() method.
Solution
df.to_html('titanic.ignore.html')Exercise: Read the HTML file into Pandas again, using the pd.read_html() function.
Note: Because HTML files can contain multiple dataframes, you’ll get a list of dataframes instead of just one. Just add [0] onto the end of the line to get the first dataframe, and it will look the same as before.
Solution
pd.read_html('titanic.ignore.html')[0]| Unnamed: 0 | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
| 889 | 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |
891 rows × 16 columns
Exercise: Try opening the file with a text editor, then with a web browser (e.g. Chrome, Firefox, etc). What does the file look like in each case? In text editor:
Solution
In the Editor:
In the Browser:Exercise: Save the dataframe to an Excel file. DataFrame.to_excel(engine='openpyxl')
Note: Because XLS and XLSX are proprietary formats, you may need to install a couple extra packages for this to work
Solution
df.to_excel('titanic.ignore.xlsx', engine='openpyxl')Exercise: Read the Excel file into Pandas again, using the pd.read_excel(engine='openpyxl') function.
Solution
pd.read_excel('titanic.ignore.xlsx', engine='openpyxl')| Unnamed: 0 | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
| 889 | 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |
891 rows × 16 columns
Exercise: Open it in your spreadsheet program. What does it look like?