Groupby Operations: Applying Aggregations to Groups of Data

Courses

Intro to Python for Scientists

Tabular Data

Authors

Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Download Materials

Setup

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Section 1: Get Summary Statistics for Different Subgroups in Titanic Dataset Using Groupby

Usually, you don’t just want to get a single metric from a dataset–you want to compare that metric between different subgroups of your data. For example, you want the mean monthly temperature, or the maximum firing rate of each neuron, or the total income of each country, and so on.

The groupby() method lets you specify that an operation will be done on each same-valued row for a given column. For example, to ask for the mean temperature by month:

>>> df.groupby('month').temperature.mean()

To get the maximum firing rate of each neuron:

>>> df.groupby('neuron_id').firing_rate.max()

You can also group by as many columns as you like, getting as many groups as unique combinations between the columns:

>>> df.groupby(['year', 'month']).temperature.mean()

Groupby objects are lazy, meaning they don’t start calculating anything until they know the full pipeline. This approach is called the “Split-Apply-Combine” workflow. You can get more info on it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

Let’s try this out on the Titanic Dataset!

Code	Description
`df.head()`	Display the first 5 rows of a DataFrame.
`df.groupby('column')['column2'].mean()`	Group a DataFrame by `column` and calculate the mean of `column2` for each group.
`df.groupby('column')['column2'].median()`	Group a DataFrame by `column` and calculate the median of `column2` for each group.
`df.groupby(['col1', 'col2'])['col3'].mean()`	Group a DataFrame by multiple columns `col1` and `col2` and calculate the mean.
`df[['col1', 'col2']]`	Select multiple columns from a DataFrame.
`df.value_counts()`	Count unique combinations of values in a DataFrame.

Run the code below to get the titanic dataset

Exercises

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
df.head()

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Example: What was the mean age, grouped by class?

df.groupby('class').age.mean()

class
First     38.233441
Second    29.877630
Third     25.140620
Name: age, dtype: float64

Exercise: What was the median ticket fare for each class?

Solution

df.groupby('class')['fare'].median()

class
First     60.2875
Second    14.2500
Third      8.0500
Name: fare, dtype: float64

Exercise: What was the survival rate for each class?

Solution

df.groupby('class')['survived'].mean()

class
First     0.629630
Second    0.472826
Third     0.242363
Name: survived, dtype: float64

Exercise: What was the survival rate for each sex?

Solution

df.groupby('sex')['survived'].mean()

sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

Exercise: What was the survival rate, broken down by both sex and class?

Solution

df.groupby(['class', 'sex'])['survived'].mean()

class   sex   
First   female    0.968085
        male      0.368852
Second  female    0.921053
        male      0.157407
Third   female    0.500000
        male      0.135447
Name: survived, dtype: float64

Exercise: Which class tended to travel alone more often? Did it matter where they were embarking from?

Solution

df.groupby(['embark_town', 'class'])['alone'].mean()

embark_town  class 
Cherbourg    First     0.482353
             Second    0.352941
             Third     0.575758
Queenstown   First     0.000000
             Second    1.000000
             Third     0.750000
Southampton  First     0.519685
             Second    0.579268
             Third     0.657224
Name: alone, dtype: float64

Exercise: What was the median ticket fare for each embarking town?

Solution

df.groupby('embark_town').fare.median()

embark_town
Cherbourg      29.70
Queenstown      7.75
Southampton    13.00
Name: fare, dtype: float64

Exercise: What was the median age of the survivors vs non-survivors, when sex is considered as a factor?

Solution

df.groupby(['survived', 'sex']).age.median()

survived  sex   
0         female    24.5
          male      29.0
1         female    28.0
          male      28.0
Name: age, dtype: float64

Mini-Demo:

df.groupby(['sex'])[['survived', 'age']].mean()

	survived	age
sex
female	0.742038	27.915709
male	0.188908	30.726645

Section 2: Plotting with Seaborn

Seaborn is a data visualization library that uses Pandas Dataframes to produce statistical plots; in other words, it takes Dataframes and does Groupby automatically for you

(https://seaborn.pydata.org/examples/index.html)

Most of its functions have a similar interface:

import seaborn as sns
sns.catplot(data=df, x="ColName", y="ColName", hue="ColName", row="ColName", col="ColName", kind="bar")
# kind can be "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm"

| df.sort_index() | Sort a DataFrame by its index. | | df.unstack() | Pivot the innermost level of index to columns. | | sns.catplot(data=df, x='col', y='col') | Create a categorical plot using Seaborn. | | sns.catplot(data=df, x='col', y='col', kind='bar') | Specify the plot type. Options: "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm". | | sns.catplot(data=df, x='col', y='col', order=[...]) | Specify the order of categories on the x-axis. | | sns.catplot(data=df, x='col', y='col', hue='col') | Add a categorical variable to create grouped plots with different colors. | | sns.catplot(data=df, x='col', y='col', col='col') | Create separate subplot columns for each category. | | sns.catplot(data=df, x='col', y='col', row='col') | Create separate subplot rows for each category. |

Instead of making text tables, let’s try out seaborn’s catplot function to make plots that compare one or more “categorical” variables to a “quantitative” variable.

Exercises

Example: What was the average age of the people in each class?

class_order = ['Third', 'Second', 'First']
sns.catplot(data=df, x="class", y='age', order=class_order, kind='bar');  # kind= can be "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm"

Exercise: What was the average ticket fare for each class?

Solution

sns.catplot(data=df, x='class', y='fare', order=class_order, kind='bar');

Exercise: What was the average survival rate for each class?

Solution

sns.catplot(data=df, x='class', order=class_order, y='survived', kind='bar');

Exercise: What was the survival rate, broken down by both sex and class? Hint: hue=, row=, col=

Solution

df[['class', 'survived', 'sex']].value_counts().sort_index().unstack()

	sex	female	male
class	survived
First	0	3	77
First	1	91	45
Second	0	6	91
Second	1	70	17
Third	0	72	300
Third	1	72	47

sns.catplot(data=df, x='class', y='survived', order=class_order, hue='sex', kind='bar');

Exercise: Which class tended to travel alone more often? Did it matter where they were embarking from? What if sex was also factored in?

Solution

sns.catplot(data=df, x='class', y='alone', order=class_order, hue='embark_town', col='sex', kind='bar');