Groupby Operations: Applying Aggregations to Groups of Data

Groupby Operations: Applying Aggregations to Groups of Data

Authors
Dr. Nicholas Del Grosso | Dr. Sangeetha Nandakumar | Dr. Ole Bialas | Dr. Atle E. Rimehaug

Setup

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Section 1: Get Summary Statistics for Different Subgroups in Titanic Dataset Using Groupby

Usually, you don’t just want to get a single metric from a dataset–you want to compare that metric between different subgroups of your data. For example, you want the mean monthly temperature, or the maximum firing rate of each neuron, or the total income of each country, and so on.

The groupby() method lets you specify that an operation will be done on each same-valued row for a given column. For example, to ask for the mean temperature by month:

>>> df.groupby('month').temperature.mean()

To get the maximum firing rate of each neuron:

>>> df.groupby('neuron_id').firing_rate.max()

You can also group by as many columns as you like, getting as many groups as unique combinations between the columns:

>>> df.groupby(['year', 'month']).temperature.mean()

Groupby objects are lazy, meaning they don’t start calculating anything until they know the full pipeline. This approach is called the “Split-Apply-Combine” workflow. You can get more info on it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

Let’s try this out on the Titanic Dataset!

Code Description
df.head() Display the first 5 rows of a DataFrame.
df.groupby('column')['column2'].mean() Group a DataFrame by column and calculate the mean of column2 for each group.
df.groupby('column')['column2'].median() Group a DataFrame by column and calculate the median of column2 for each group.
df.groupby(['col1', 'col2'])['col3'].mean() Group a DataFrame by multiple columns col1 and col2 and calculate the mean.
df[['col1', 'col2']] Select multiple columns from a DataFrame.
df.value_counts() Count unique combinations of values in a DataFrame.

Run the code below to get the titanic dataset

Exercises

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
df.head()

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

Example: What was the mean age, grouped by class?

df.groupby('class').age.mean()
class
First     38.233441
Second    29.877630
Third     25.140620
Name: age, dtype: float64

Exercise: What was the median ticket fare for each class?

Solution
df.groupby('class')['fare'].median()
class
First     60.2875
Second    14.2500
Third      8.0500
Name: fare, dtype: float64

Exercise: What was the survival rate for each class?

Solution
df.groupby('class')['survived'].mean()
class
First     0.629630
Second    0.472826
Third     0.242363
Name: survived, dtype: float64

Exercise: What was the survival rate for each sex?

Solution
df.groupby('sex')['survived'].mean()
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

Exercise: What was the survival rate, broken down by both sex and class?

Solution
df.groupby(['class', 'sex'])['survived'].mean()
class   sex   
First   female    0.968085
        male      0.368852
Second  female    0.921053
        male      0.157407
Third   female    0.500000
        male      0.135447
Name: survived, dtype: float64

Exercise: Which class tended to travel alone more often? Did it matter where they were embarking from?

Solution
df.groupby(['embark_town', 'class'])['alone'].mean()
embark_town  class 
Cherbourg    First     0.482353
             Second    0.352941
             Third     0.575758
Queenstown   First     0.000000
             Second    1.000000
             Third     0.750000
Southampton  First     0.519685
             Second    0.579268
             Third     0.657224
Name: alone, dtype: float64

Exercise: What was the median ticket fare for each embarking town?

Solution
df.groupby('embark_town').fare.median()
embark_town
Cherbourg      29.70
Queenstown      7.75
Southampton    13.00
Name: fare, dtype: float64

Exercise: What was the median age of the survivors vs non-survivors, when sex is considered as a factor?

Solution
df.groupby(['survived', 'sex']).age.median()
survived  sex   
0         female    24.5
          male      29.0
1         female    28.0
          male      28.0
Name: age, dtype: float64

Mini-Demo:

df.groupby(['sex'])[['survived', 'age']].mean()

survived age
sex
female 0.742038 27.915709
male 0.188908 30.726645

Section 2: Plotting with Seaborn

Seaborn is a data visualization library that uses Pandas Dataframes to produce statistical plots; in other words, it takes Dataframes and does Groupby automatically for you

(https://seaborn.pydata.org/examples/index.html )

Most of its functions have a similar interface:

import seaborn as sns
sns.catplot(data=df, x="ColName", y="ColName", hue="ColName", row="ColName", col="ColName", kind="bar")
# kind can be "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm"

| df.sort_index() | Sort a DataFrame by its index. | | df.unstack() | Pivot the innermost level of index to columns. | | sns.catplot(data=df, x='col', y='col') | Create a categorical plot using Seaborn. | | sns.catplot(data=df, x='col', y='col', kind='bar') | Specify the plot type. Options: "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm". | | sns.catplot(data=df, x='col', y='col', order=[...]) | Specify the order of categories on the x-axis. | | sns.catplot(data=df, x='col', y='col', hue='col') | Add a categorical variable to create grouped plots with different colors. | | sns.catplot(data=df, x='col', y='col', col='col') | Create separate subplot columns for each category. | | sns.catplot(data=df, x='col', y='col', row='col') | Create separate subplot rows for each category. |

Instead of making text tables, let’s try out seaborn’s catplot function to make plots that compare one or more “categorical” variables to a “quantitative” variable.

Exercises

Example: What was the average age of the people in each class?

class_order = ['Third', 'Second', 'First']
sns.catplot(data=df, x="class", y='age', order=class_order, kind='bar');  # kind= can be "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm"

Exercise: What was the average ticket fare for each class?

Solution
sns.catplot(data=df, x='class', y='fare', order=class_order, kind='bar');

Exercise: What was the average survival rate for each class?

Solution
sns.catplot(data=df, x='class', order=class_order, y='survived', kind='bar');

Exercise: What was the survival rate, broken down by both sex and class? Hint: hue=, row=, col=

Solution
df[['class', 'survived', 'sex']].value_counts().sort_index().unstack()

sex female male
class survived
First 0 3 77
1 91 45
Second 0 6 91
1 70 17
Third 0 72 300
1 72 47
sns.catplot(data=df, x='class', y='survived', order=class_order, hue='sex', kind='bar');

Exercise: Which class tended to travel alone more often? Did it matter where they were embarking from? What if sex was also factored in?

Solution
sns.catplot(data=df, x='class', y='alone', order=class_order, hue='embark_town', col='sex', kind='bar');