Analyzing Tabular Data With Pandas
Authors
Setup
Download Data
import owncloud
from pathlib import Path
Path('data').mkdir(exist_ok=True, parents=True)
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/Jzmi8zBPsbn8AHs').get_file('/', 'data/salaries.csv')
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/ddzMt8N7oBPCaT5').get_file('/', 'data/titanic.csv')TrueIn many analyses, it is useful to organize data in a tabular format.
Tables are great because they allow us to store heterogeneous data and
assign names to columns which makes exploration more intuitive. The most
popular library for working with tabular data in Python is called
pandas which derives from panel data. In this section we are
going to learn how to use pandas and how it interfaces with other
tools to create a powerful ecosystem for data analysis. At the core of
pandas is the DataFrame - an object that stores two-dimensional
tabular data. Variables that store data frames are commonly called df.
Most of pandas functionality come as so-called methods of the data frame
object. Methods are called by typing the variable name and the method
name separated by a dot. For example: df.head() will call the
.head() method of the data frame df. If you are not used to this
syntax, don’t worry - all sections contain examples for how the
respective methods are used.
Section 1: Reading and Writing Tabular Data
There are many different file formats for tabular data that pandas
supports (for a full list, see this
website). One of the
most commonly used formats is CSV which stands for comma-separated
values. A CSV file contains plain text where items are separated by
commas. Because it is plain text, CSV is independent of the programming
language which makes it a useful interoperable standard. In this
section, we are going to learn how to read data from a CSV file into
pandas and how to write it back to the CSV format.
| Code | Description |
|---|---|
import pandas as pd |
Import the pandas module under the alias pd |
df = pd.read_csv("mydata.csv") |
Read the file "mydata.csv" into a pandas data frame and assign it to the variable df |
df.head(5) |
Get the first 5 rows of df |
df.to_csv("mydata.csv") |
Write the data frame df to the file "mydata.csv" |
Exercise: Import the pandas library under the alias pd.
Solution
import pandas as pdExercise: Read the file salaries.csv into a data frame and assign it to a variable called df.
Solution
df = pd.read_csv('data/salaries.csv')
df| Name | Age | City | Salary | Department | |
|---|---|---|---|---|---|
| 0 | Alice | 25 | New York | 75000 | Marketing |
| 1 | Bob | 30 | San Francisco | 85000 | Engineering |
| 2 | Charlie | 35 | Chicago | 92000 | Sales |
| 3 | David | 40 | Boston | 78000 | HR |
| 4 | Emma | 28 | Seattle | 88000 | Engineering |
| 5 | Frank | 33 | Austin | 81000 | Marketing |
| 6 | Grace | 29 | Denver | 79000 | Sales |
| 7 | Henry | 42 | Portland | 95000 | Engineering |
| 8 | Irene | 31 | Atlanta | 72000 | HR |
| 9 | Jack | 38 | Miami | 86000 | Sales |
| 10 | Karen | 27 | Dallas | 77000 | Marketing |
| 11 | Leo | 36 | Chicago | 89000 | Engineering |
| 12 | Mia | 34 | Seattle | 83000 | Sales |
| 13 | Noah | 41 | New York | 91000 | HR |
| 14 | Olivia | 26 | San Francisco | 76000 | Marketing |
| 15 | Paul | 39 | Boston | 94000 | Engineering |
| 16 | Quinn | 32 | Austin | 80000 | Sales |
| 17 | Rachel | 37 | Denver | 87000 | HR |
| 18 | Sam | 29 | Portland | 82000 | Marketing |
| 19 | Tina | 43 | Atlanta | 96000 | Engineering |
| 20 | Uma | 31 | Miami | 79000 | Sales |
| 21 | Victor | 28 | Dallas | 84000 | HR |
| 22 | Wendy | 35 | Chicago | 90000 | Marketing |
| 23 | Xavier | 40 | Seattle | 93000 | Engineering |
| 24 | Yara | 33 | New York | 81000 | Sales |
| 25 | Zack | 27 | San Francisco | 78000 | HR |
| 26 | Amy | 36 | Boston | 85000 | Marketing |
| 27 | Ben | 30 | Austin | 89000 | Engineering |
| 28 | Cara | 34 | Denver | 77000 | Sales |
| 29 | Dan | 39 | Portland | 92000 | HR |
Exercise: Print the first 3 rows of the data frame df.
Solution
df.head(3)| Name | Age | City | Salary | Department | |
|---|---|---|---|---|---|
| 0 | Alice | 25 | New York | 75000 | Marketing |
| 1 | Bob | 30 | San Francisco | 85000 | Engineering |
| 2 | Charlie | 35 | Chicago | 92000 | Sales |
Exercise: Get the first 3 rows of df and assign them to another variable called new_df. Write new_df to a file called new_salaries.csv.
Solution
new_df = df[:3]
new_df.to_csv('new_salaries.csv')Section 2: Exploring a Data Set in Pandas
Now that we know how to read CSV files, we can start working with some
actual data! In this section, we are going to analyze demographic data
from passengers of the Titanic which are stored in the file
titanic.csv. Once we have loaded the data into a data frame, we can
access a given column by providing its name as a string inside square
brackets. For example, df["age"] will return the column "age" in the
data frame df. By extracting individual columns and using methods like
.max() or .mean(), one can quickly get an overview of the data
stored in a table.
| Code | Description |
|---|---|
df["var1"] |
Get the column with the name "var1" in df |
df["var1"].min() |
Get the minimum value of column "var1" |
df["var1"].max() |
Get the maximum value of column "var1" |
df["var1"].mean() |
Compute the mean value of column "var1" |
df["var1"].unique() |
Get all unique values of column "var1" |
df["var1"].value_counts() |
Get the count of all unique values of column "var1" |
Exercise: Read the file titanic.csv into a pandas data frame and assign it to the variable df. Display the first three rows of the dataframe.
Solution
df = pd.read_csv('data/titanic.csv')
df.head(3)| survived | pclass | sex | age | fare | embark_town | deck | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.2500 | Southampton | NaN |
| 1 | 1 | 1 | female | 38.0 | 71.2833 | Cherbourg | C |
| 2 | 1 | 3 | female | 26.0 | 7.9250 | Southampton | NaN |
Exercise: Get the column "survived" from df.
Solution
df['survived']0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: survived, Length: 891, dtype: int64Exercise: What is the mean "fare" that passengers on the Titanic paid?
Solution
df['fare'].mean()32.204207968574636Exercise: What is the mean "age" of passengers on the Titanic?
Solution
df['age'].mean()29.69911764705882Exercise: What is the minimum and maximum "age" of the Titanic’s passengers?
Solution
df['age'].min(), df['age'].max()(0.42, 80.0)Exercise: What are the different "deck"s on the Titanic?
Solution
df['deck'].unique()array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F'], dtype=object)Exercise: What is the most common "embark_town" for passengers of the Titanic? Hint: The value_counts() function may be useful here.
Solution
df['embark_town'].value_counts()embark_town
Southampton 644
Cherbourg 168
Queenstown 77
Name: count, dtype: int64Section 3: Indexing and Filtering a Data Frame
Sometimes, we want to access elements in specific elements in a data
frame. This can be done by using the df.loc attribute which gets the
data located at a specific index. The index of a data frame is the
unnamed column you see on the very left. By combining indices and column
names, df.loc can get the value stored at a specific index, in a
specific column. Also, just like arrays, a data frame can be filtered
using Boolean masks. The process is exactly the same: you create a mask
of True and False values by applying a logical condition to a data
frame. Then, you apply it to the data frame to extract the values where
the mask equals True. One can also combine indexing and filtering by
using the mask as an input for df.loc. In this section, we will use
indexing and filtering to do simple statistical analyses of the Titanic
data.
| Code | Description |
|---|---|
mask = df["var1"]=="a" |
Create a mask of Boolean values that is True where the value of column "var1" is "a" and False otherwise |
mask = df["var1"]>=0 |
Create a mask of Boolean values that is True where the value of column "var1" is greater or equal to 0 and False otherwise |
mask = df["var1"].isin([1,2,3]) |
Create a mask of Boolean values that is True where the value of column "var1" is contained in the list [1,2,3] and False otherwise |
~mask |
Invert the Boolean mask (i.e. turn all values that are True to False and vice versa) |
df.loc[2] |
Get the row at the index 2 in the data frame df |
df.loc[1:10, "var1"] |
Get the value stored in rows 1 to 10 in df |
df.loc[1:10, "var1"] |
Get the value stored in rows 1 to 10 in the column "var1" in df |
df.loc[mask, "var1"] |
Get the values in the column "var1" for all rows where the mask is True |
Exercise: Get the row at the index 4 in df.
Solution
df.loc[4]survived 0
pclass 3
sex male
age 35.0
fare 8.05
embark_town Southampton
deck NaN
Name: 4, dtype: objectExercise: Get the rows at indices 10 to 15 in df.
Solution
df.loc[10:15]| survived | pclass | sex | age | fare | embark_town | deck | |
|---|---|---|---|---|---|---|---|
| 10 | 1 | 3 | female | 4.0 | 16.7000 | Southampton | G |
| 11 | 1 | 1 | female | 58.0 | 26.5500 | Southampton | C |
| 12 | 0 | 3 | male | 20.0 | 8.0500 | Southampton | NaN |
| 13 | 0 | 3 | male | 39.0 | 31.2750 | Southampton | NaN |
| 14 | 0 | 3 | female | 14.0 | 7.8542 | Southampton | NaN |
| 15 | 1 | 2 | female | 55.0 | 16.0000 | Southampton | NaN |
Exercise: Get the "sex" of the first 6 passengers in df.
Solution
df.loc[:6, 'sex']0 male
1 female
2 female
3 female
4 male
5 male
6 male
Name: sex, dtype: objectExercise: Get the data from all passengers on deck E.
Solution
# solution without ".isin()"
mask = df['deck']=='E'
print(df[mask])
# solution with ".isin()"
mask = df['deck'].isin(['E'])
df[mask] survived pclass sex age fare embark_town deck
6 0 1 male 54.0 51.8625 Southampton E
92 0 1 male 46.0 61.1750 Southampton E
123 1 2 female 32.5 13.0000 Southampton E
166 1 1 female NaN 55.0000 Southampton E
262 0 1 male 52.0 79.6500 Southampton E
303 1 2 female NaN 12.3500 Queenstown E
309 1 1 female 30.0 56.9292 Cherbourg E
319 1 1 female 40.0 134.5000 Cherbourg E
337 1 1 female 41.0 134.5000 Cherbourg E
356 1 1 female 22.0 55.0000 Southampton E
370 1 1 male 25.0 55.4417 Cherbourg E
429 1 3 male 32.0 8.0500 Southampton E
434 0 1 male 50.0 55.9000 Southampton E
456 0 1 male 65.0 26.5500 Southampton E
460 1 1 male 48.0 26.5500 Southampton E
462 0 1 male 47.0 38.5000 Southampton E
512 1 1 male 36.0 26.2875 Southampton E
558 1 1 female 39.0 79.6500 Southampton E
572 1 1 male 36.0 26.3875 Southampton E
577 1 1 female 39.0 55.9000 Southampton E
585 1 1 female 18.0 79.6500 Southampton E
662 0 1 male 47.0 25.5875 Southampton E
701 1 1 male 35.0 26.2875 Southampton E
707 1 1 male 42.0 26.2875 Southampton E
717 1 2 female 27.0 10.5000 Southampton E
724 1 1 male 27.0 53.1000 Southampton E
751 1 3 male 6.0 12.4750 Southampton E
772 0 2 female 57.0 10.5000 Southampton E
809 1 1 female 33.0 53.1000 Southampton E
823 1 3 female 27.0 12.4750 Southampton E
835 1 1 female 39.0 83.1583 Cherbourg E
857 1 1 male 51.0 26.5500 Southampton E| survived | pclass | sex | age | fare | embark_town | deck | |
|---|---|---|---|---|---|---|---|
| 6 | 0 | 1 | male | 54.0 | 51.8625 | Southampton | E |
| 92 | 0 | 1 | male | 46.0 | 61.1750 | Southampton | E |
| 123 | 1 | 2 | female | 32.5 | 13.0000 | Southampton | E |
| 166 | 1 | 1 | female | NaN | 55.0000 | Southampton | E |
| 262 | 0 | 1 | male | 52.0 | 79.6500 | Southampton | E |
| 303 | 1 | 2 | female | NaN | 12.3500 | Queenstown | E |
| 309 | 1 | 1 | female | 30.0 | 56.9292 | Cherbourg | E |
| 319 | 1 | 1 | female | 40.0 | 134.5000 | Cherbourg | E |
| 337 | 1 | 1 | female | 41.0 | 134.5000 | Cherbourg | E |
| 356 | 1 | 1 | female | 22.0 | 55.0000 | Southampton | E |
| 370 | 1 | 1 | male | 25.0 | 55.4417 | Cherbourg | E |
| 429 | 1 | 3 | male | 32.0 | 8.0500 | Southampton | E |
| 434 | 0 | 1 | male | 50.0 | 55.9000 | Southampton | E |
| 456 | 0 | 1 | male | 65.0 | 26.5500 | Southampton | E |
| 460 | 1 | 1 | male | 48.0 | 26.5500 | Southampton | E |
| 462 | 0 | 1 | male | 47.0 | 38.5000 | Southampton | E |
| 512 | 1 | 1 | male | 36.0 | 26.2875 | Southampton | E |
| 558 | 1 | 1 | female | 39.0 | 79.6500 | Southampton | E |
| 572 | 1 | 1 | male | 36.0 | 26.3875 | Southampton | E |
| 577 | 1 | 1 | female | 39.0 | 55.9000 | Southampton | E |
| 585 | 1 | 1 | female | 18.0 | 79.6500 | Southampton | E |
| 662 | 0 | 1 | male | 47.0 | 25.5875 | Southampton | E |
| 701 | 1 | 1 | male | 35.0 | 26.2875 | Southampton | E |
| 707 | 1 | 1 | male | 42.0 | 26.2875 | Southampton | E |
| 717 | 1 | 2 | female | 27.0 | 10.5000 | Southampton | E |
| 724 | 1 | 1 | male | 27.0 | 53.1000 | Southampton | E |
| 751 | 1 | 3 | male | 6.0 | 12.4750 | Southampton | E |
| 772 | 0 | 2 | female | 57.0 | 10.5000 | Southampton | E |
| 809 | 1 | 1 | female | 33.0 | 53.1000 | Southampton | E |
| 823 | 1 | 3 | female | 27.0 | 12.4750 | Southampton | E |
| 835 | 1 | 1 | female | 39.0 | 83.1583 | Cherbourg | E |
| 857 | 1 | 1 | male | 51.0 | 26.5500 | Southampton | E |
Example: Is the survival rate higher for male or female passengers?
mask = df['sex'] == 'male'
survival_male = df[mask]['survived'].mean()
survival_female = df[~mask]['survived'].mean()
print(survival_male, survival_female)0.18890814558058924 0.7420382165605095Exercise: Is the survival rate higher for passengers below or above the age of 40?
Solution
mask = df['age'] > 40
df[mask]['survived'].mean(), df[~mask]['survived'].mean()(0.36666666666666664, 0.3873144399460189)Exercise: What was the average fare paid by passengers who survived and those who didn’t?
Solution
# this is a solution that requires slightly less code
df.groupby('survived')['fare'].mean()survived
0 22.117887
1 48.395408
Name: fare, dtype: float64Exercise: What was the average fare paid by passengers in "pclass" 1 and "pclass" 3?
Solution
# this is a solution that requires slightly less code
df.groupby('pclass')['fare'].mean()pclass
1 84.154687
2 20.662183
3 13.675550
Name: fare, dtype: float64Section 4: The Pandas Ecosystem
Another advantage of pandas is its integration with other software
which creates a powerful ecosystem for data analysis. In this section we
are going to explore two such libraries: seaborn, a tool for
visualizations and pingouin, a package for statistical analysis. Both
libraries provide a similar interface where you pass a data frame as
data and specify variables based on their column name. This
homogeneous interface makes data analysis within the pandas framework
very convenient.
Visualization with Seaborn
The Seaborn library allows us to create detailed visualizations, with
little effort. We just have to specify the data we want to
visualize and the software will take care of the details like adding
colors and labeling axes. In this section, we are going to use the
sns.catplot() function which is for plotting categorical data. This
function takes in a data frame and the labels of columns we wish to
plot on the x and y axis. It also takes other arguments to further
configure the plot such as hue or kind (for a full list of
parameters, see the
documentation).
| Code | Description |
|---|---|
import seaborn as sns |
Import the seaborn library under the alias sns |
sns.catplot(data=df, x='var1', y='var2') |
Create a categorical plot for the data frame df with "var1" on the x-axis and "var2" on the y-axis |
sns.catplot(data=df, x='var1', y='var2', kind="bar") |
Plot the same data but as a bar plot |
sns.catplot(data=df, x='var1', y='var2', hue="var3") |
Plot the same data but add color or hue to encode "var3" |
Exercise: Import the seaborn library under the alias sns.
Solution
import seaborn as snsExample: Create a catplot to visualize the distribution of "fare" across the different passenger classes ("pclass").
sns.catplot(df, x = 'pclass', y = 'fare', )Exercise: Create a bar plot for the same data by using the kind argument.
Solution
sns.catplot(df, x = 'pclass', y = 'fare', kind='bar')Exercise: Add a color or hue to the bar plot to encode passenger sex.
Solution
sns.catplot(df, x = 'pclass', y = 'fare', kind='bar', hue='sex')Exercise: Create a bar plot to show differences in survival rates ("survived") between the different passenger classes ("pclass") and add a hue to encode "sex".
Solution
sns.catplot(df, x = 'pclass', y = 'survived', kind='bar', hue='sex')Statistical Analysis with Pingouin
Pingouin is a library for performing statistical tests that is neatly integrated with pandas. In this section we are going to use Pingouin to perform an analysis of variance (ANOVA) which is a test that compares the variance between groups to the variance within groups with respect to a specific variable. This allows us to estimate if the difference between groups exceeds the “background noise” in the data. We are also going to compute pairwise correlations for all variables in the data set. By checking which variables are correlated, we can identify possible confounds in our analysis.
| Code | Description |
|---|---|
import pingouin as pg |
Import the pingouin library under the alias pg |
pg.anova(data=df, dv="var1", between="var2") |
Perform an analysis of variance (ANOVA) to check if the difference in "var1" between groups in "var2" exceeds the variance within each group of "var2" |
pg.pairwise_corr(data=df) |
Compute pairwise correlations for all |
Exercise: Import the pingouin library under the alias pg.
Solution
import pingouin as pgExample: Perform an analysis of variance (ANOVA) to check if the variance in "fare" between "embark_town"s exceeds the variance within each "embark_town".
pg.anova(df, dv='fare', between='embark_town')| Source | ddof1 | ddof2 | F | p-unc | np2 | |
|---|---|---|---|---|---|---|
| 0 | embark_town | 2 | 886 | 38.140305 | 1.289645e-16 | 0.079271 |
Exercise: Perform an analysis of variance (ANOVA) to check if the variance in "age" between "pclass"es exceeds the variance within each "pclass".
Solution
pg.anova(df, dv='age', between='pclass')| Source | ddof1 | ddof2 | F | p-unc | np2 | |
|---|---|---|---|---|---|---|
| 0 | pclass | 2 | 711 | 57.443484 | 7.487984e-24 | 0.139107 |
Exercise: Calculate the pairwise correlations for all columns in df. Assign the data frame returned by pg.pariwise_corr to a new variable called df_corr and print it.
Solution
df_corr = pg.pairwise_corr(df, )
df_corr| X | Y | method | alternative | n | r | CI95% | p-unc | BF10 | power | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | survived | pclass | pearson | two-sided | 891 | -0.338481 | [-0.4, -0.28] | 2.537047e-25 | 1.149e+22 | 1.000000 |
| 1 | survived | age | pearson | two-sided | 714 | -0.077221 | [-0.15, -0.0] | 3.912465e-02 | 0.392 | 0.541628 |
| 2 | survived | fare | pearson | two-sided | 891 | 0.257307 | [0.19, 0.32] | 6.120189e-15 | 6.56e+11 | 1.000000 |
| 3 | pclass | age | pearson | two-sided | 714 | -0.369226 | [-0.43, -0.3] | 1.756699e-24 | 1.852e+21 | 1.000000 |
| 4 | pclass | fare | pearson | two-sided | 891 | -0.549500 | [-0.59, -0.5] | 1.967386e-71 | 7.243e+67 | 1.000000 |
| 5 | age | fare | pearson | two-sided | 714 | 0.096067 | [0.02, 0.17] | 1.021628e-02 | 1.26 | 0.729450 |
Exercise: Get all rows of df_corr where "p-unc" is smaller than 0.001 and print the columns "X", "Y" and "r" (Tip: to select multiple columns, use a lists, e.g. df.loc[mask, ["var1", "var2"]]).
Solution
mask = df_corr['p-unc'] < 0.001
df_corr.loc[mask, ["X", "Y", 'r']]| X | Y | r | |
|---|---|---|---|
| 0 | survived | pclass | -0.338481 |
| 2 | survived | fare | 0.257307 |
| 3 | pclass | age | -0.369226 |
| 4 | pclass | fare | -0.549500 |