Analyzing Tabular Data With Pandas

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Download Data

import owncloud
from pathlib import Path

# create a local data folder if it does not exist yet
Path('data').mkdir(exist_ok=True, parents=True)

# download the two data sets from the public sciebo shares
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/Jzmi8zBPsbn8AHs').get_file('/', 'data/salaries.csv')
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/ddzMt8N7oBPCaT5').get_file('/', 'data/titanic.csv')
True

In many analyses, it is useful to organize data in a tabular format. Tables are great because they allow us to store heterogeneous data and assign names to columns, which makes exploration more intuitive. The most popular library for working with tabular data in Python is called pandas, a name derived from "panel data". In this section, we are going to learn how to use pandas and how it interfaces with other tools to create a powerful ecosystem for data analysis. At the core of pandas is the DataFrame - an object that stores two-dimensional tabular data. Variables that store data frames are commonly called df. Most of pandas' functionality comes as so-called methods of the data frame object. Methods are called by typing the variable name and the method name separated by a dot. For example, df.head() will call the .head() method of the data frame df. If you are not used to this syntax, don't worry - all sections contain examples of how the respective methods are used.
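
As a small illustration of this syntax (the variable name df_example and its values are made up for this sketch, not part of the course data), a data frame can be built from a dictionary and its methods called with the dot notation:

import pandas as pd

# build a small data frame from a dictionary (made-up example values)
df_example = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})

# call a method of the data frame object using the dot syntax
print(df_example.head(2))   # prints the first two rows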

Section 1: Reading and Writing Tabular Data

There are many different file formats for tabular data that pandas supports (for a full list, see this website). One of the most commonly used formats is CSV, which stands for comma-separated values. A CSV file contains plain text where items are separated by commas. Because it is plain text, CSV is independent of the programming language, which makes it a useful interoperable standard. In this section, we are going to learn how to read data from a CSV file into pandas and how to write it back to the CSV format.

Code | Description
import pandas as pd | Import the pandas module under the alias pd
df = pd.read_csv("mydata.csv") | Read the file "mydata.csv" into a pandas data frame and assign it to the variable df
df.head(5) | Get the first 5 rows of df
df.to_csv("mydata.csv") | Write the data frame df to the file "mydata.csv"
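
Putting these commands together, a minimal round trip might look like the sketch below. The file names mirror the hypothetical "mydata.csv" from the table; index=False is a standard pandas option that prevents the row index from being written as an extra column.

import pandas as pd

df = pd.read_csv("mydata.csv")              # read the CSV file into a data frame
print(df.head(5))                           # inspect the first 5 rows
df.to_csv("mydata_copy.csv", index=False)   # write a copy without the row index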

Exercise: Import the pandas library under the alias pd.

Solution
import pandas as pd

Exercise: Read the file salaries.csv into a data frame and assign it to a variable called df.

Solution
df = pd.read_csv('data/salaries.csv')
df

Name Age City Salary Department
0 Alice 25 New York 75000 Marketing
1 Bob 30 San Francisco 85000 Engineering
2 Charlie 35 Chicago 92000 Sales
3 David 40 Boston 78000 HR
4 Emma 28 Seattle 88000 Engineering
5 Frank 33 Austin 81000 Marketing
6 Grace 29 Denver 79000 Sales
7 Henry 42 Portland 95000 Engineering
8 Irene 31 Atlanta 72000 HR
9 Jack 38 Miami 86000 Sales
10 Karen 27 Dallas 77000 Marketing
11 Leo 36 Chicago 89000 Engineering
12 Mia 34 Seattle 83000 Sales
13 Noah 41 New York 91000 HR
14 Olivia 26 San Francisco 76000 Marketing
15 Paul 39 Boston 94000 Engineering
16 Quinn 32 Austin 80000 Sales
17 Rachel 37 Denver 87000 HR
18 Sam 29 Portland 82000 Marketing
19 Tina 43 Atlanta 96000 Engineering
20 Uma 31 Miami 79000 Sales
21 Victor 28 Dallas 84000 HR
22 Wendy 35 Chicago 90000 Marketing
23 Xavier 40 Seattle 93000 Engineering
24 Yara 33 New York 81000 Sales
25 Zack 27 San Francisco 78000 HR
26 Amy 36 Boston 85000 Marketing
27 Ben 30 Austin 89000 Engineering
28 Cara 34 Denver 77000 Sales
29 Dan 39 Portland 92000 HR

Exercise: Print the first 3 rows of the data frame df.

Solution
df.head(3)

Name Age City Salary Department
0 Alice 25 New York 75000 Marketing
1 Bob 30 San Francisco 85000 Engineering
2 Charlie 35 Chicago 92000 Sales

Exercise: Get the first 3 rows of df and assign them to another variable called new_df. Write new_df to a file called new_salaries.csv.

Solution
new_df = df[:3]
new_df.to_csv('new_salaries.csv')

Section 2: Exploring a Data Set in Pandas

Now that we know how to read CSV files, we can start working with some actual data! In this section, we are going to analyze demographic data from passengers of the Titanic, which is stored in the file titanic.csv. Once we have loaded the data into a data frame, we can access a given column by providing its name as a string inside square brackets. For example, df["age"] will return the column "age" in the data frame df. By extracting individual columns and using methods like .max() or .mean(), one can quickly get an overview of the data stored in a table.

Code | Description
df["var1"] | Get the column with the name "var1" in df
df["var1"].min() | Get the minimum value of column "var1"
df["var1"].max() | Get the maximum value of column "var1"
df["var1"].mean() | Compute the mean value of column "var1"
df["var1"].unique() | Get all unique values of column "var1"
df["var1"].value_counts() | Get the count of all unique values of column "var1"

Exercise: Read the file titanic.csv into a pandas data frame and assign it to the variable df. Display the first three rows of the dataframe.

Solution
df = pd.read_csv('data/titanic.csv')
df.head(3)

survived pclass sex age fare embark_town deck
0 0 3 male 22.0 7.2500 Southampton NaN
1 1 1 female 38.0 71.2833 Cherbourg C
2 1 3 female 26.0 7.9250 Southampton NaN

Exercise: Get the column "survived" from df.

Solution
df['survived']
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

Exercise: What is the mean "fare" that passengers on the Titanic paid?

Solution
df['fare'].mean()
32.204207968574636

Exercise: What is the mean "age" of passengers on the Titanic?

Solution
df['age'].mean()
29.69911764705882

Exercise: What is the minimum and maximum "age" of the Titanic’s passengers?

Solution
df['age'].min(), df['age'].max()
(0.42, 80.0)

Exercise: What are the different "deck"s on the Titanic?

Solution
df['deck'].unique()
array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F'], dtype=object)

Exercise: What is the most common "embark_town" for passengers of the Titanic? Hint: The value_counts() method may be useful here.

Solution
df['embark_town'].value_counts()
embark_town
Southampton    644
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

Section 3: Indexing and Filtering a Data Frame

Sometimes, we want to access specific elements in a data frame. This can be done with the df.loc attribute, which gets the data located at a specific index. The index of a data frame is the unnamed column you see on the very left. By combining indices and column names, df.loc can get the value stored at a specific index in a specific column. Also, just like arrays, a data frame can be filtered using Boolean masks. The process is exactly the same: you create a mask of True and False values by applying a logical condition to a data frame. Then, you apply the mask to the data frame to extract the values where it equals True. One can also combine indexing and filtering by using the mask as an input for df.loc. In this section, we will use indexing and filtering to do simple statistical analyses of the Titanic data.

Code | Description
mask = df["var1"]=="a" | Create a mask of Boolean values that is True where the value of column "var1" is "a" and False otherwise
mask = df["var1"]>=0 | Create a mask of Boolean values that is True where the value of column "var1" is greater than or equal to 0 and False otherwise
mask = df["var1"].isin([1,2,3]) | Create a mask of Boolean values that is True where the value of column "var1" is contained in the list [1,2,3] and False otherwise
~mask | Invert the Boolean mask (i.e. turn all values that are True to False and vice versa)
df.loc[2] | Get the row at the index 2 in the data frame df
df.loc[1:10] | Get the rows at indices 1 to 10 in df
df.loc[1:10, "var1"] | Get the values stored in rows 1 to 10 of the column "var1" in df
df.loc[mask, "var1"] | Get the values in the column "var1" for all rows where the mask is True
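
These building blocks compose: two masks can be combined element-wise with & (and) or | (or), which is not listed in the table but works the same way. A small sketch using the Titanic columns, with the parentheses required around each condition:

# female passengers older than 40
mask = (df["sex"] == "female") & (df["age"] > 40)
print(df.loc[mask, "fare"].mean())   # mean fare paid within that subgroup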

Exercise: Get the row at the index 4 in df.

Solution
df.loc[4]
survived                 0
pclass                   3
sex                   male
age                   35.0
fare                  8.05
embark_town    Southampton
deck                   NaN
Name: 4, dtype: object

Exercise: Get the rows at indices 10 to 15 in df.

Solution
df.loc[10:15]

survived pclass sex age fare embark_town deck
10 1 3 female 4.0 16.7000 Southampton G
11 1 1 female 58.0 26.5500 Southampton C
12 0 3 male 20.0 8.0500 Southampton NaN
13 0 3 male 39.0 31.2750 Southampton NaN
14 0 3 female 14.0 7.8542 Southampton NaN
15 1 2 female 55.0 16.0000 Southampton NaN

Exercise: Get the "sex" of the first 6 passengers in df.

Solution
# .loc slicing is inclusive of the end label, so :5 selects the first 6 rows (indices 0 to 5)
df.loc[:5, 'sex']
0      male
1    female
2    female
3    female
4      male
5      male
Name: sex, dtype: object

Exercise: Get the data from all passengers on deck E.

Solution
# solution without ".isin()"
mask = df['deck']=='E'
print(df[mask])

# solution with ".isin()"
mask = df['deck'].isin(['E'])
df[mask]
     survived  pclass     sex   age      fare  embark_town deck
6           0       1    male  54.0   51.8625  Southampton    E
92          0       1    male  46.0   61.1750  Southampton    E
123         1       2  female  32.5   13.0000  Southampton    E
166         1       1  female   NaN   55.0000  Southampton    E
262         0       1    male  52.0   79.6500  Southampton    E
303         1       2  female   NaN   12.3500   Queenstown    E
309         1       1  female  30.0   56.9292    Cherbourg    E
319         1       1  female  40.0  134.5000    Cherbourg    E
337         1       1  female  41.0  134.5000    Cherbourg    E
356         1       1  female  22.0   55.0000  Southampton    E
370         1       1    male  25.0   55.4417    Cherbourg    E
429         1       3    male  32.0    8.0500  Southampton    E
434         0       1    male  50.0   55.9000  Southampton    E
456         0       1    male  65.0   26.5500  Southampton    E
460         1       1    male  48.0   26.5500  Southampton    E
462         0       1    male  47.0   38.5000  Southampton    E
512         1       1    male  36.0   26.2875  Southampton    E
558         1       1  female  39.0   79.6500  Southampton    E
572         1       1    male  36.0   26.3875  Southampton    E
577         1       1  female  39.0   55.9000  Southampton    E
585         1       1  female  18.0   79.6500  Southampton    E
662         0       1    male  47.0   25.5875  Southampton    E
701         1       1    male  35.0   26.2875  Southampton    E
707         1       1    male  42.0   26.2875  Southampton    E
717         1       2  female  27.0   10.5000  Southampton    E
724         1       1    male  27.0   53.1000  Southampton    E
751         1       3    male   6.0   12.4750  Southampton    E
772         0       2  female  57.0   10.5000  Southampton    E
809         1       1  female  33.0   53.1000  Southampton    E
823         1       3  female  27.0   12.4750  Southampton    E
835         1       1  female  39.0   83.1583    Cherbourg    E
857         1       1    male  51.0   26.5500  Southampton    E

Example: Is the survival rate higher for male or female passengers?

mask = df['sex'] == 'male'
survival_male = df[mask]['survived'].mean()
survival_female = df[~mask]['survived'].mean()
print(survival_male, survival_female)
0.18890814558058924 0.7420382165605095

Exercise: Is the survival rate higher for passengers below or above the age of 40?

Solution
mask = df['age'] > 40
df[mask]['survived'].mean(), df[~mask]['survived'].mean()
(0.36666666666666664, 0.3873144399460189)

Exercise: What was the average fare paid by passengers who survived and those who didn’t?

Solution
# using groupby requires slightly less code than filtering each group with a mask
df.groupby('survived')['fare'].mean()
survived
0    22.117887
1    48.395408
Name: fare, dtype: float64
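
The groupby method used here was not introduced above: df.groupby('survived') splits the frame into one group per unique value of 'survived', and selecting ['fare'] and calling .mean() computes the mean fare within each group. For comparison, a sketch of the equivalent mask-based solution using only the tools from Section 3:

# the same means computed explicitly with a Boolean mask
mask = df['survived'] == 1
print(df.loc[mask, 'fare'].mean(), df.loc[~mask, 'fare'].mean())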

Exercise: What was the average fare paid by passengers in "pclass" 1 and "pclass" 3?

Solution
# using groupby requires slightly less code than filtering each group with a mask
df.groupby('pclass')['fare'].mean()
pclass
1    84.154687
2    20.662183
3    13.675550
Name: fare, dtype: float64

Section 4: The Pandas Ecosystem

Another advantage of pandas is its integration with other software, which creates a powerful ecosystem for data analysis. In this section we are going to explore two such libraries: seaborn, a tool for visualization, and pingouin, a package for statistical analysis. Both libraries provide a similar interface where you pass a data frame as data and specify variables by their column names. This homogeneous interface makes data analysis within the pandas framework very convenient.

Visualization with Seaborn

The Seaborn library allows us to create detailed visualizations with little effort. We just have to specify the data we want to visualize and the software will take care of details like adding colors and labeling axes. In this section, we are going to use the sns.catplot() function, which is for plotting categorical data. This function takes in a data frame and the labels of the columns we wish to plot on the x- and y-axes. It also takes other arguments to further configure the plot, such as hue or kind (for a full list of parameters, see the documentation).

Code | Description
import seaborn as sns | Import the seaborn library under the alias sns
sns.catplot(data=df, x='var1', y='var2') | Create a categorical plot for the data frame df with "var1" on the x-axis and "var2" on the y-axis
sns.catplot(data=df, x='var1', y='var2', kind="bar") | Plot the same data but as a bar plot
sns.catplot(data=df, x='var1', y='var2', hue="var3") | Plot the same data but add color or hue to encode "var3"
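
When running these commands in a plain Python script (outside a notebook), the figure additionally has to be displayed with matplotlib; a minimal sketch, assuming the Titanic data from Section 2 is loaded as df:

import seaborn as sns
import matplotlib.pyplot as plt

# bar plot of fare by passenger class, colored by sex
sns.catplot(data=df, x="pclass", y="fare", kind="bar", hue="sex")
plt.show()   # not needed inside a Jupyter notebook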

Exercise: Import the seaborn library under the alias sns.

Solution
import seaborn as sns

Example: Create a catplot to visualize the distribution of "fare" across the different passenger classes ("pclass").

sns.catplot(data=df, x='pclass', y='fare')

Exercise: Create a bar plot for the same data by using the kind argument.

Solution
sns.catplot(data=df, x='pclass', y='fare', kind='bar')

Exercise: Add a color or hue to the bar plot to encode passenger sex.

Solution
sns.catplot(data=df, x='pclass', y='fare', kind='bar', hue='sex')

Exercise: Create a bar plot to show differences in survival rates ("survived") between the different passenger classes ("pclass") and add a hue to encode "sex".

Solution
sns.catplot(data=df, x='pclass', y='survived', kind='bar', hue='sex')

Statistical Analysis with Pingouin

Pingouin is a library for performing statistical tests that is neatly integrated with pandas. In this section we are going to use Pingouin to perform an analysis of variance (ANOVA) which is a test that compares the variance between groups to the variance within groups with respect to a specific variable. This allows us to estimate if the difference between groups exceeds the “background noise” in the data. We are also going to compute pairwise correlations for all variables in the data set. By checking which variables are correlated, we can identify possible confounds in our analysis.

Code | Description
import pingouin as pg | Import the pingouin library under the alias pg
pg.anova(data=df, dv="var1", between="var2") | Perform an analysis of variance (ANOVA) to check if the difference in "var1" between groups in "var2" exceeds the variance within each group of "var2"
pg.pairwise_corr(data=df) | Compute pairwise correlations for all columns in the data frame df
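
Both functions return ordinary pandas data frames, so their results can be filtered and indexed with the tools from the previous sections. A small sketch using the Titanic data frame df from Section 2; the "p-unc" column holds the uncorrected p-value:

import pingouin as pg

# the ANOVA result is itself a pandas data frame with a single row at index 0
aov = pg.anova(data=df, dv="fare", between="embark_town")
print(aov.loc[0, "p-unc"] < 0.001)   # True if the effect is significant at the 0.1% level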

Exercise: Import the pingouin library under the alias pg.

Solution
import pingouin as pg

Example: Perform an analysis of variance (ANOVA) to check if the variance in "fare" between "embark_town"s exceeds the variance within each "embark_town".

pg.anova(df, dv='fare', between='embark_town')

Source ddof1 ddof2 F p-unc np2
0 embark_town 2 886 38.140305 1.289645e-16 0.079271

Exercise: Perform an analysis of variance (ANOVA) to check if the variance in "age" between "pclass"es exceeds the variance within each "pclass".

Solution
pg.anova(df, dv='age', between='pclass')

Source ddof1 ddof2 F p-unc np2
0 pclass 2 711 57.443484 7.487984e-24 0.139107

Exercise: Calculate the pairwise correlations for all columns in df. Assign the data frame returned by pg.pairwise_corr to a new variable called df_corr and print it.

Solution
df_corr = pg.pairwise_corr(df)
df_corr

X Y method alternative n r CI95% p-unc BF10 power
0 survived pclass pearson two-sided 891 -0.338481 [-0.4, -0.28] 2.537047e-25 1.149e+22 1.000000
1 survived age pearson two-sided 714 -0.077221 [-0.15, -0.0] 3.912465e-02 0.392 0.541628
2 survived fare pearson two-sided 891 0.257307 [0.19, 0.32] 6.120189e-15 6.56e+11 1.000000
3 pclass age pearson two-sided 714 -0.369226 [-0.43, -0.3] 1.756699e-24 1.852e+21 1.000000
4 pclass fare pearson two-sided 891 -0.549500 [-0.59, -0.5] 1.967386e-71 7.243e+67 1.000000
5 age fare pearson two-sided 714 0.096067 [0.02, 0.17] 1.021628e-02 1.26 0.729450

Exercise: Get all rows of df_corr where "p-unc" is smaller than 0.001 and print the columns "X", "Y" and "r" (Tip: to select multiple columns, use a list, e.g. df.loc[mask, ["var1", "var2"]]).

Solution
mask = df_corr['p-unc'] < 0.001
df_corr.loc[mask, ["X", "Y", "r"]]

X Y r
0 survived pclass -0.338481
2 survived fare 0.257307
3 pclass age -0.369226
4 pclass fare -0.549500