Analyzing Tabular Data With Pandas

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Download Data

import owncloud
from pathlib import Path

# create a local data folder if it does not exist yet
Path('data').mkdir(exist_ok=True, parents=True)

# download the two data sets from the public sciebo shares
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/Jzmi8zBPsbn8AHs').get_file('/', 'data/salaries.csv')
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/ddzMt8N7oBPCaT5').get_file('/', 'data/titanic.csv')
True

In many analyses, it is useful to organize data in a tabular format. Tables are great because they allow us to store heterogeneous data and assign names to columns, which makes exploration more intuitive. The most popular library for working with tabular data in Python is called pandas, a name derived from "panel data". In this section, we are going to learn how to use pandas and how it interfaces with other tools to create a powerful ecosystem for data analysis. At the core of pandas is the DataFrame - an object that stores two-dimensional tabular data. Variables that store data frames are commonly called df. Most of pandas' functionality comes as so-called methods of the data frame object. Methods are called by typing the variable name and the method name separated by a dot. For example, df.head() will call the .head() method of the data frame df. If you are not used to this syntax, don't worry - all sections contain examples of how the respective methods are used.
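
As a small illustration of this syntax (the variable name df_example and its values are made up for this sketch, not part of the course data), a data frame can be built from a dictionary and its methods called with the dot notation:

import pandas as pd

# build a small data frame from a dictionary (made-up example values)
df_example = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
})

# call a method of the data frame object using the dot syntax
print(df_example.head(2))   # prints the first two rows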

Section 1: Reading and Writing Tabular Data

There are many different file formats for tabular data that pandas supports (for a full list, see this website). One of the most commonly used formats is CSV, which stands for comma-separated values. A CSV file contains plain text where items are separated by commas. Because it is plain text, CSV is independent of the programming language, which makes it a useful interoperable standard. In this section, we are going to learn how to read data from a CSV file into pandas and how to write it back to the CSV format.

Code | Description
import pandas as pd | Import the pandas module under the alias pd
df = pd.read_csv("mydata.csv") | Read the file "mydata.csv" into a pandas data frame and assign it to the variable df
df.head(5) | Get the first 5 rows of df
df.to_csv("mydata.csv") | Write the data frame df to the file "mydata.csv"
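
Putting these commands together, a minimal round trip might look like the sketch below. The file names mirror the hypothetical "mydata.csv" from the table; index=False is a standard pandas option that prevents the row index from being written as an extra column.

import pandas as pd

df = pd.read_csv("mydata.csv")              # read the CSV file into a data frame
print(df.head(5))                           # inspect the first 5 rows
df.to_csv("mydata_copy.csv", index=False)   # write a copy without the row index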

Exercise: Import the pandas library under the alias pd.

Solution
import pandas as pd

Exercise: Read the file salaries.csv into a data frame and assign it to a variable called df.

Solution
df = pd.read_csv('data/salaries.csv')
df

Name Age City Salary Department
0 Alice 25 New York 75000 Marketing
1 Bob 30 San Francisco 85000 Engineering
2 Charlie 35 Chicago 92000 Sales
3 David 40 Boston 78000 HR
4 Emma 28 Seattle 88000 Engineering
5 Frank 33 Austin 81000 Marketing
6 Grace 29 Denver 79000 Sales
7 Henry 42 Portland 95000 Engineering
8 Irene 31 Atlanta 72000 HR
9 Jack 38 Miami 86000 Sales
10 Karen 27 Dallas 77000 Marketing
11 Leo 36 Chicago 89000 Engineering
12 Mia 34 Seattle 83000 Sales
13 Noah 41 New York 91000 HR
14 Olivia 26 San Francisco 76000 Marketing
15 Paul 39 Boston 94000 Engineering
16 Quinn 32 Austin 80000 Sales
17 Rachel 37 Denver 87000 HR
18 Sam 29 Portland 82000 Marketing
19 Tina 43 Atlanta 96000 Engineering
20 Uma 31 Miami 79000 Sales
21 Victor 28 Dallas 84000 HR
22 Wendy 35 Chicago 90000 Marketing
23 Xavier 40 Seattle 93000 Engineering
24 Yara 33 New York 81000 Sales
25 Zack 27 San Francisco 78000 HR
26 Amy 36 Boston 85000 Marketing
27 Ben 30 Austin 89000 Engineering
28 Cara 34 Denver 77000 Sales
29 Dan 39 Portland 92000 HR

Exercise: Print the first 3 rows of the data frame df.

Solution
df.head(3)

Name Age City Salary Department
0 Alice 25 New York 75000 Marketing
1 Bob 30 San Francisco 85000 Engineering
2 Charlie 35 Chicago 92000 Sales

Exercise: Get the first 3 rows of df and assign them to another variable called new_df. Write new_df to a file called new_salaries.csv.

Solution
new_df = df[:3]
new_df.to_csv('new_salaries.csv')

Section 2: Exploring a Data Set in Pandas

Now that we know how to read CSV files, we can start working with some actual data! In this section, we are going to analyze demographic data from passengers of the Titanic, which is stored in the file titanic.csv. Once we have loaded the data into a data frame, we can access a given column by providing its name as a string inside square brackets. For example, df["age"] will return the column "age" in the data frame df. By extracting individual columns and using methods like .max() or .mean(), one can quickly get an overview of the data stored in a table.

Code | Description
df["var1"] | Get the column with the name "var1" in df
df["var1"].min() | Get the minimum value of column "var1"
df["var1"].max() | Get the maximum value of column "var1"
df["var1"].mean() | Compute the mean value of column "var1"
df["var1"].unique() | Get all unique values of column "var1"
df["var1"].value_counts() | Get the count of all unique values of column "var1"

Exercise: Read the file titanic.csv into a pandas data frame and assign it to the variable df. Display the first three rows of the dataframe.

Solution
df = pd.read_csv('data/titanic.csv')
df.head(3)

survived pclass sex age fare embark_town deck
0 0 3 male 22.0 7.2500 Southampton NaN
1 1 1 female 38.0 71.2833 Cherbourg C
2 1 3 female 26.0 7.9250 Southampton NaN

Exercise: Get the column "survived" from df.

Solution
df['survived']
0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

Exercise: What is the mean "fare" that passengers on the Titanic paid?

Solution
df['fare'].mean()
32.204207968574636

Exercise: What is the mean "age" of passengers on the Titanic?

Solution
df['age'].mean()
29.69911764705882

Exercise: What is the minimum and maximum "age" of the Titanic’s passengers?

Solution
df['age'].min(), df['age'].max()
(0.42, 80.0)

Exercise: What are the different "deck"s on the Titanic?

Solution
df['deck'].unique()
array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F'], dtype=object)

Exercise: What is the most common "embark_town" for passengers of the Titanic? Hint: The value_counts() method may be useful here.

Solution
df['embark_town'].value_counts()
embark_town
Southampton    644
Cherbourg      168
Queenstown      77
Name: count, dtype: int64

Section 3: Indexing and Filtering a Data Frame

Sometimes, we want to access specific elements in a data frame. This can be done with the df.loc attribute, which gets the data located at a specific index. The index of a data frame is the unnamed column you see on the very left. By combining indices and column names, df.loc can get the value stored at a specific index in a specific column. Also, just like arrays, a data frame can be filtered using Boolean masks. The process is exactly the same: you create a mask of True and False values by applying a logical condition to a data frame. Then, you apply the mask to the data frame to extract the values where it equals True. One can also combine indexing and filtering by using the mask as an input for df.loc. In this section, we will use indexing and filtering to do simple statistical analyses of the Titanic data.

Code | Description
mask = df["var1"]=="a" | Create a mask of Boolean values that is True where the value of column "var1" is "a" and False otherwise
mask = df["var1"]>=0 | Create a mask of Boolean values that is True where the value of column "var1" is greater than or equal to 0 and False otherwise
mask = df["var1"].isin([1,2,3]) | Create a mask of Boolean values that is True where the value of column "var1" is contained in the list [1,2,3] and False otherwise
~mask | Invert the Boolean mask (i.e. turn all values that are True to False and vice versa)
df.loc[2] | Get the row at the index 2 in the data frame df
df.loc[1:10] | Get the rows at indices 1 to 10 in df
df.loc[1:10, "var1"] | Get the values stored in rows 1 to 10 of the column "var1" in df
df.loc[mask, "var1"] | Get the values in the column "var1" for all rows where the mask is True
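
These building blocks compose: two masks can be combined element-wise with & (and) or | (or), which is not listed in the table but works the same way. A small sketch using the Titanic columns, with the parentheses required around each condition:

# female passengers older than 40
mask = (df["sex"] == "female") & (df["age"] > 40)
print(df.loc[mask, "fare"].mean())   # mean fare paid within that subgroup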

Exercise: Get the row at the index 4 in df.

Solution
df.loc[4]
survived                 0
pclass                   3
sex                   male
age                   35.0
fare                  8.05
embark_town    Southampton
deck                   NaN
Name: 4, dtype: object

Exercise: Get the rows at indices 10 to 15 in df.

Solution
df.loc[10:15]

survived pclass sex age fare embark_town deck
10 1 3 female 4.0 16.7000 Southampton G
11 1 1 female 58.0 26.5500 Southampton C
12 0 3 male 20.0 8.0500 Southampton NaN
13 0 3 male 39.0 31.2750 Southampton NaN
14 0 3 female 14.0 7.8542 Southampton NaN
15 1 2 female 55.0 16.0000 Southampton NaN

Exercise: Get the "sex" of the first 6 passengers in df.

Solution
# .loc slicing is inclusive of the end label, so :5 selects the first 6 rows (indices 0 to 5)
df.loc[:5, 'sex']
0      male
1    female
2    female
3    female
4      male
5      male
Name: sex, dtype: object

Exercise: Get the data from all passengers on deck E.

Solution
# solution without ".isin()"
mask = df['deck']=='E'
print(df[mask])

# solution with ".isin()"
mask = df['deck'].isin(['E'])
df[mask]
     survived  pclass     sex   age      fare  embark_town deck
6           0       1    male  54.0   51.8625  Southampton    E
92          0       1    male  46.0   61.1750  Southampton    E
123         1       2  female  32.5   13.0000  Southampton    E
166         1       1  female   NaN   55.0000  Southampton    E
262         0       1    male  52.0   79.6500  Southampton    E
303         1       2  female   NaN   12.3500   Queenstown    E
309         1       1  female  30.0   56.9292    Cherbourg    E
319         1       1  female  40.0  134.5000    Cherbourg    E
337         1       1  female  41.0  134.5000    Cherbourg    E
356         1       1  female  22.0   55.0000  Southampton    E
370         1       1    male  25.0   55.4417    Cherbourg    E
429         1       3    male  32.0    8.0500  Southampton    E
434         0       1    male  50.0   55.9000  Southampton    E
456         0       1    male  65.0   26.5500  Southampton    E
460         1       1    male  48.0   26.5500  Southampton    E
462         0       1    male  47.0   38.5000  Southampton    E
512         1       1    male  36.0   26.2875  Southampton    E
558         1       1  female  39.0   79.6500  Southampton    E
572         1       1    male  36.0   26.3875  Southampton    E
577         1       1  female  39.0   55.9000  Southampton    E
585         1       1  female  18.0   79.6500  Southampton    E
662         0       1    male  47.0   25.5875  Southampton    E
701         1       1    male  35.0   26.2875  Southampton    E
707         1       1    male  42.0   26.2875  Southampton    E
717         1       2  female  27.0   10.5000  Southampton    E
724         1       1    male  27.0   53.1000  Southampton    E
751         1       3    male   6.0   12.4750  Southampton    E
772         0       2  female  57.0   10.5000  Southampton    E
809         1       1  female  33.0   53.1000  Southampton    E
823         1       3  female  27.0   12.4750  Southampton    E
835         1       1  female  39.0   83.1583    Cherbourg    E
857         1       1    male  51.0   26.5500  Southampton    E

Example: Is the survival rate higher for male or female passengers?

mask = df['sex'] == 'male'
survival_male = df[mask]['survived'].mean()
survival_female = df[~mask]['survived'].mean()
print(survival_male, survival_female)
0.18890814558058924 0.7420382165605095

Exercise: Is the survival rate higher for passengers below or above the age of 40?

Solution
mask = df['age'] > 40
df[mask]['survived'].mean(), df[~mask]['survived'].mean()
(0.36666666666666664, 0.3873144399460189)

Exercise: What was the average fare paid by passengers who survived and those who didn’t?

Solution
# using groupby requires slightly less code than filtering each group with a mask
df.groupby('survived')['fare'].mean()
survived
0    22.117887
1    48.395408
Name: fare, dtype: float64
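
The groupby method used here was not introduced above: df.groupby('survived') splits the frame into one group per unique value of 'survived', and selecting ['fare'] and calling .mean() computes the mean fare within each group. For comparison, a sketch of the equivalent mask-based solution using only the tools from Section 3:

# the same means computed explicitly with a Boolean mask
mask = df['survived'] == 1
print(df.loc[mask, 'fare'].mean(), df.loc[~mask, 'fare'].mean())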

Exercise: What was the average fare paid by passengers in "pclass" 1 and "pclass" 3?

Solution
# using groupby requires slightly less code than filtering each group with a mask
df.groupby('pclass')['fare'].mean()
pclass
1    84.154687
2    20.662183
3    13.675550
Name: fare, dtype: float64

Section 4: The Pandas Ecosystem

Another advantage of pandas is its integration with other software, which creates a powerful ecosystem for data analysis. In this section we are going to explore two such libraries: seaborn, a tool for visualization, and pingouin, a package for statistical analysis. Both libraries provide a similar interface where you pass a data frame as data and specify variables by their column names. This homogeneous interface makes data analysis within the pandas framework very convenient.

Visualization with Seaborn

The Seaborn library allows us to create detailed visualizations with little effort. We just have to specify the data we want to visualize and the software will take care of details like adding colors and labeling axes. In this section, we are going to use the sns.catplot() function, which is for plotting categorical data. This function takes in a data frame and the labels of the columns we wish to plot on the x- and y-axes. It also takes other arguments to further configure the plot, such as hue or kind (for a full list of parameters, see the documentation).

Code | Description
import seaborn as sns | Import the seaborn library under the alias sns
sns.catplot(data=df, x='var1', y='var2') | Create a categorical plot for the data frame df with "var1" on the x-axis and "var2" on the y-axis
sns.catplot(data=df, x='var1', y='var2', kind="bar") | Plot the same data but as a bar plot
sns.catplot(data=df, x='var1', y='var2', hue="var3") | Plot the same data but add color or hue to encode "var3"
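
When running these commands in a plain Python script (outside a notebook), the figure additionally has to be displayed with matplotlib; a minimal sketch, assuming the Titanic data from Section 2 is loaded as df:

import seaborn as sns
import matplotlib.pyplot as plt

# bar plot of fare by passenger class, colored by sex
sns.catplot(data=df, x="pclass", y="fare", kind="bar", hue="sex")
plt.show()   # not needed inside a Jupyter notebook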

Exercise: Import the seaborn library under the alias sns.

Solution
import seaborn as sns

Example: Create a catplot to visualize the distribution of "fare" across the different passenger classes ("pclass").

sns.catplot(data=df, x='pclass', y='fare')

Exercise: Create a bar plot for the same data by using the kind argument.

Solution
sns.catplot(data=df, x='pclass', y='fare', kind='bar')

Exercise: Add a color or hue to the bar plot to encode passenger sex.

Solution
sns.catplot(data=df, x='pclass', y='fare', kind='bar', hue='sex')

Exercise: Create a bar plot to show differences in survival rates ("survived") between the different passenger classes ("pclass") and add a hue to encode "sex".

Solution
sns.catplot(data=df, x='pclass', y='survived', kind='bar', hue='sex')

Statistical Analysis with Pingouin

Pingouin is a library for performing statistical tests that is neatly integrated with pandas. In this section we are going to use Pingouin to perform an analysis of variance (ANOVA) which is a test that compares the variance between groups to the variance within groups with respect to a specific variable. This allows us to estimate if the difference between groups exceeds the “background noise” in the data. We are also going to compute pairwise correlations for all variables in the data set. By checking which variables are correlated, we can identify possible confounds in our analysis.

Code | Description
import pingouin as pg | Import the pingouin library under the alias pg
pg.anova(data=df, dv="var1", between="var2") | Perform an analysis of variance (ANOVA) to check if the difference in "var1" between groups in "var2" exceeds the variance within each group of "var2"
pg.pairwise_corr(data=df) | Compute pairwise correlations for all columns in the data frame df
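
Both functions return ordinary pandas data frames, so their results can be filtered and indexed with the tools from the previous sections. A small sketch using the Titanic data frame df from Section 2; the "p-unc" column holds the uncorrected p-value:

import pingouin as pg

# the ANOVA result is itself a pandas data frame with a single row at index 0
aov = pg.anova(data=df, dv="fare", between="embark_town")
print(aov.loc[0, "p-unc"] < 0.001)   # True if the effect is significant at the 0.1% level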

Exercise: Import the pingouin library under the alias pg.

Solution
import pingouin as pg

Example: Perform an analysis of variance (ANOVA) to check if the variance in "fare" between "embark_town"s exceeds the variance within each "embark_town".

pg.anova(df, dv='fare', between='embark_town')

Source ddof1 ddof2 F p-unc np2
0 embark_town 2 886 38.140305 1.289645e-16 0.079271

Exercise: Perform an analysis of variance (ANOVA) to check if the variance in "age" between "pclass"es exceeds the variance within each "pclass".

Solution
pg.anova(df, dv='age', between='pclass')

Source ddof1 ddof2 F p-unc np2
0 pclass 2 711 57.443484 7.487984e-24 0.139107

Exercise: Calculate the pairwise correlations for all columns in df. Assign the data frame returned by pg.pairwise_corr to a new variable called df_corr and print it.

Solution
df_corr = pg.pairwise_corr(df)
df_corr

X Y method alternative n r CI95% p-unc BF10 power
0 survived pclass pearson two-sided 891 -0.338481 [-0.4, -0.28] 2.537047e-25 1.149e+22 1.000000
1 survived age pearson two-sided 714 -0.077221 [-0.15, -0.0] 3.912465e-02 0.392 0.541628
2 survived fare pearson two-sided 891 0.257307 [0.19, 0.32] 6.120189e-15 6.56e+11 1.000000
3 pclass age pearson two-sided 714 -0.369226 [-0.43, -0.3] 1.756699e-24 1.852e+21 1.000000
4 pclass fare pearson two-sided 891 -0.549500 [-0.59, -0.5] 1.967386e-71 7.243e+67 1.000000
5 age fare pearson two-sided 714 0.096067 [0.02, 0.17] 1.021628e-02 1.26 0.729450

Exercise: Get all rows of df_corr where "p-unc" is smaller than 0.001 and print the columns "X", "Y" and "r" (Tip: to select multiple columns, use a list, e.g. df.loc[mask, ["var1", "var2"]]).

Solution
mask = df_corr['p-unc'] < 0.001
df_corr.loc[mask, ["X", "Y", "r"]]

X Y r
0 survived pclass -0.338481
2 survived fare 0.257307
3 pclass age -0.369226
4 pclass fare -0.549500