Groupby Operations: Applying Aggregations to Groups of Data
Authors
Setup
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as snsSection 1: Get Summary Statistics for Different Subgroups in Titanic Dataset Using Groupby
Usually, you don’t just want to get a single metric from a dataset–you want to compare that metric between different subgroups of your data. For example, you want the mean monthly temperature, or the maximum firing rate of each neuron, or the total income of each country, and so on.
The groupby() method lets you specify that an operation will be done on each same-valued row for a given column. For example, to ask for the mean temperature by month:
>>> df.groupby('month').temperature.mean()To get the maximum firing rate of each neuron:
>>> df.groupby('neuron_id').firing_rate.max()You can also group by as many columns as you like, getting as many groups as unique combinations between the columns:
>>> df.groupby(['year', 'month']).temperature.mean()Groupby objects are lazy, meaning they don’t start calculating anything until they know the full pipeline. This approach is called the “Split-Apply-Combine” workflow. You can get more info on it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
Let’s try this out on the Titanic Dataset!
| Code | Description |
|---|---|
df.head() |
Display the first 5 rows of a DataFrame. |
df.groupby('column')['column2'].mean() |
Group a DataFrame by column and calculate the mean of column2 for each group. |
df.groupby('column')['column2'].median() |
Group a DataFrame by column and calculate the median of column2 for each group. |
df.groupby(['col1', 'col2'])['col3'].mean() |
Group a DataFrame by multiple columns col1 and col2 and calculate the mean. |
df[['col1', 'col2']] |
Select multiple columns from a DataFrame. |
df.value_counts() |
Count unique combinations of values in a DataFrame. |
Run the code below to get the titanic dataset
Exercises
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
df.head()| survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
Example: What was the mean age, grouped by class?
df.groupby('class').age.mean()class
First 38.233441
Second 29.877630
Third 25.140620
Name: age, dtype: float64Exercise: What was the median ticket fare for each class?
Solution
df.groupby('class')['fare'].median()class
First 60.2875
Second 14.2500
Third 8.0500
Name: fare, dtype: float64Exercise: What was the survival rate for each class?
Solution
df.groupby('class')['survived'].mean()class
First 0.629630
Second 0.472826
Third 0.242363
Name: survived, dtype: float64Exercise: What was the survival rate for each sex?
Solution
df.groupby('sex')['survived'].mean()sex
female 0.742038
male 0.188908
Name: survived, dtype: float64Exercise: What was the survival rate, broken down by both sex and class?
Solution
df.groupby(['class', 'sex'])['survived'].mean()class sex
First female 0.968085
male 0.368852
Second female 0.921053
male 0.157407
Third female 0.500000
male 0.135447
Name: survived, dtype: float64Exercise: Which class tended to travel alone more often? Did it matter where they were embarking from?
Solution
df.groupby(['embark_town', 'class'])['alone'].mean()embark_town class
Cherbourg First 0.482353
Second 0.352941
Third 0.575758
Queenstown First 0.000000
Second 1.000000
Third 0.750000
Southampton First 0.519685
Second 0.579268
Third 0.657224
Name: alone, dtype: float64Exercise: What was the median ticket fare for each embarking town?
Solution
df.groupby('embark_town').fare.median()embark_town
Cherbourg 29.70
Queenstown 7.75
Southampton 13.00
Name: fare, dtype: float64Exercise: What was the median age of the survivors vs non-survivors, when sex is considered as a factor?
Solution
df.groupby(['survived', 'sex']).age.median()survived sex
0 female 24.5
male 29.0
1 female 28.0
male 28.0
Name: age, dtype: float64Mini-Demo:
df.groupby(['sex'])[['survived', 'age']].mean()| survived | age | |
|---|---|---|
| sex | ||
| female | 0.742038 | 27.915709 |
| male | 0.188908 | 30.726645 |
Section 2: Plotting with Seaborn
Seaborn is a data visualization library that uses Pandas Dataframes to produce statistical plots; in other words, it takes Dataframes and does Groupby automatically for you
(https://seaborn.pydata.org/examples/index.html)
Most of its functions have a similar interface:
import seaborn as sns
sns.catplot(data=df, x="ColName", y="ColName", hue="ColName", row="ColName", col="ColName", kind="bar")
# kind can be "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm"| df.sort_index() | Sort a DataFrame by its index. |
| df.unstack() | Pivot the innermost level of index to columns. |
| sns.catplot(data=df, x='col', y='col') | Create a categorical plot using Seaborn. |
| sns.catplot(data=df, x='col', y='col', kind='bar') | Specify the plot type. Options: "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm". |
| sns.catplot(data=df, x='col', y='col', order=[...]) | Specify the order of categories on the x-axis. |
| sns.catplot(data=df, x='col', y='col', hue='col') | Add a categorical variable to create grouped plots with different colors. |
| sns.catplot(data=df, x='col', y='col', col='col') | Create separate subplot columns for each category. |
| sns.catplot(data=df, x='col', y='col', row='col') | Create separate subplot rows for each category. |
Instead of making text tables, let’s try out seaborn’s catplot function to make plots that compare one or more “categorical” variables to a “quantitative” variable.
Exercises
Example: What was the average age of the people in each class?
class_order = ['Third', 'Second', 'First']
sns.catplot(data=df, x="class", y='age', order=class_order, kind='bar'); # kind= can be "point", "bar", "violin", "box", "boxen", "count", "strip", "swarm"Exercise: What was the average ticket fare for each class?
Solution
sns.catplot(data=df, x='class', y='fare', order=class_order, kind='bar');Exercise: What was the average survival rate for each class?
Solution
sns.catplot(data=df, x='class', order=class_order, y='survived', kind='bar');Exercise: What was the survival rate, broken down by both sex and class? Hint: hue=, row=, col=
Solution
df[['class', 'survived', 'sex']].value_counts().sort_index().unstack()| sex | female | male | |
|---|---|---|---|
| class | survived | ||
| First | 0 | 3 | 77 |
| 1 | 91 | 45 | |
| Second | 0 | 6 | 91 |
| 1 | 70 | 17 | |
| Third | 0 | 72 | 300 |
| 1 | 72 | 47 |
sns.catplot(data=df, x='class', y='survived', order=class_order, hue='sex', kind='bar');Exercise: Which class tended to travel alone more often? Did it matter where they were embarking from? What if sex was also factored in?
Solution
sns.catplot(data=df, x='class', y='alone', order=class_order, hue='embark_town', col='sex', kind='bar');