Statistics with Pingouin
Pingouin is a Python statistics package that wraps pandas and scipy.stats, producing more complete statistical reports and presenting the results in a readable way.
Setup
Import Libraries
import numpy as np
import pandas as pd
import pingouin as pg
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
import xarray as xr
import owncloud

Download the dataset and load it into a Pandas DataFrame
Path('data').mkdir(exist_ok=True, parents=True)
owncloud.Client.from_public_link('https://uni-bonn.sciebo.de/s/3Uf2gScrvuTPQhB').get_file('/', 'data/steinmetz_2017-01-08_Muller.nc')

Section 1: T-Tests
T-tests compare the means of two samples of data generated from a normally-distributed population and compute the probability that they have the same mean.
| Code | Statistical Test | Description |
|---|---|---|
| `pg.ttest(x, 0)` | One-Sample T-Test | “Is the mean of my data unlikely to be zero?” |
| `pg.ttest(x, y)` | Independent-Samples T-Test | “Do the means of x and y indicate they are unlikely to be from the same distribution?” |
| `pg.ttest(x, y, paired=True)` | Paired-Samples T-Test | “Comparing each value directly to its neighbor, are the mean differences of x and y’s data unlikely to be zero?” |
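Under the hood, pg.ttest builds on scipy.stats. The three variants in the table map onto the scipy calls below (a minimal sketch with made-up data, not part of the exercises):

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20)
y = rng.normal(0.5, 1.0, size=20)

t1, p1 = stats.ttest_1samp(x, 0.0)  # one-sample: is mean(x) unlikely to be 0?
t2, p2 = stats.ttest_ind(x, y)      # independent samples: same population mean?
t3, p3 = stats.ttest_rel(x, y)      # paired: is the mean of (x - y) unlikely to be 0?
```

Pingouin adds the extra columns (effect size, Bayes factor, power) on top of these basic statistics.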
Let’s do some analysis on fake data to get a feel for these statistical tests.
Generate the Data: Run the code below to create the dataset df.
Exercises
randomizer = np.random.RandomState(17) # Makes sure the pseudorandom number generators reproduce the same data for us all.
df = pd.DataFrame()
df['A'] = randomizer.normal(0, 1, size=20)
df['B'] = randomizer.normal(0.2, 1, size=20)
df['C'] = randomizer.normal(0.7, 1, size=20)
df['D'] = df.A * 0.3 + randomizer.normal(0, 0.3, size=20)
df.plot.box()

Analyze the Data with T-Tests in Pingouin
Example: A vs 0, One-Sample T-Test: Is the mean of the normally-distributed population that the dataset A is generated from unlikely to be zero?
pg.ttest(df['A'], 0)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 0.378652 | 19 | two-sided | 0.709144 | [-0.45, 0.65] | 0.084669 | 0.248 | 0.064966 |
Exercise: B vs 1, One-Sample T-Test: Is the mean of the normally-distributed population that the dataset B is generated from unlikely to be one?
Solution
pg.ttest(df['B'], 1)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -4.449603 | 19 | two-sided | 0.000275 | [-0.26, 0.55] | 0.994961 | 113.454 | 0.98794 |
Example: A vs B, Independent-Samples T-Test: Is the mean of the normally-distributed population that the dataset A is generated from unlikely to be the same as the mean of the normally-distributed population that the dataset B is generated from?
pg.ttest(df['A'], df['B'])

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -0.137189 | 38 | two-sided | 0.891606 | [-0.7, 0.61] | 0.043383 | 0.311 | 0.052052 |
Exercise: A vs C, Independent Samples T-Test: Is the mean of the normally-distributed population that the dataset A is generated from unlikely to be the same as the mean of the normally-distributed population that the dataset C is generated from?
Solution
pg.ttest(df['A'], df['C'])

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -1.978874 | 38 | two-sided | 0.055111 | [-1.42, 0.02] | 0.625775 | 1.409 | 0.487608 |
Exercise: A vs C, Paired Samples T-Test: Is the mean of the differences between each pair of samples generated from the two normally-distributed populations A and C unlikely to be 0?
Solution
pg.ttest(df['A'], df['C'], paired=True)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -1.728432 | 19 | two-sided | 0.100128 | [-1.55, 0.15] | 0.625775 | 0.818 | 0.756449 |
Exercise: A vs D, Paired Samples T-Test: Is the mean of the differences between each pair of samples generated from the two normally-distributed populations A and D unlikely to be 0?
Solution
pg.ttest(df['A'], df['D'], paired=True)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 0.082615 | 19 | two-sided | 0.935022 | [-0.45, 0.49] | 0.02152 | 0.233 | 0.050959 |
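The A-vs-D comparison hints at why pairing matters: D was built from A, so the pairwise differences have much less variance than either sample alone. A sketch with synthetic correlated data (assumed values, not from the exercises) shows the paired test gaining power:

```python
import numpy as np
from scipy import stats

# Two correlated samples with a small mean shift (synthetic, for illustration).
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=200)
b = a + 0.2 + rng.normal(0.0, 0.2, size=200)  # strongly correlated with a

p_ind = stats.ttest_ind(a, b).pvalue  # ignores the pairing; shift is buried in the spread
p_rel = stats.ttest_rel(a, b).pvalue  # tests the differences b - a, which have tiny variance
```

Here `p_rel` comes out far smaller than `p_ind`: the paired test detects the 0.2 shift easily because the differences are nearly noise-free.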
Section 2: T-Tests on Real-World Data: Pupil Position Analysis
In the experiments reported by Steinmetz et al. (2019) in Nature, mice performed a discrimination task where they moved the position of a stimulus using a steering wheel. During the experiment, a camera recorded the pupil position of the subject in the x and y directions.
| Code | Description |
|---|---|
| `np.logical_and(condition1, condition2)` | Element-wise logical AND of two boolean arrays |
| `sns.kdeplot(data, label='name')` | Create a kernel density estimate plot |
| `plt.axvline(x, color, linestyle)` | Add a vertical line to the plot |
| `plt.legend()` | Display the legend on the plot |
| `pg.ttest(x, y, paired=True/False)` | Perform a t-test (paired or independent) |
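np.logical_and from the table is equivalent to pandas’ `&` operator on boolean Series; a quick sketch with hypothetical contrast values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trial table (hypothetical contrast values).
toy = pd.DataFrame({'contrast_left':  [100, 50, 0, 0],
                    'contrast_right': [0, 0, 50, 100]})

# Both masks select rows where contrast_left >= 50 AND contrast_right == 0.
mask_a = np.logical_and(toy['contrast_left'] >= 50, toy['contrast_right'] == 0)
mask_b = (toy['contrast_left'] >= 50) & (toy['contrast_right'] == 0)  # equivalent idiom
```

Either form can be used to filter the dataframe, e.g. `toy[mask_a]`.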
Load and Filter the Data
Run the code below, which loads the file, extracts the variables, filters the data, and transforms it into a dataframe.
Exercises
dset = xr.load_dataset('data/steinmetz_2017-01-08_Muller.nc')
dd = dset[['pupil_x', 'pupil_y', 'contrast_left', 'contrast_right', 'response_type']]
dd = dd.where(
(dset.active_trials==True)
& (dset.response_type != 0)
& (dset.time >= 0.5)
& (dset.time <= 1.3)
, drop=True)
dd['contrast_diff'] = dd['contrast_right'] - dd['contrast_left']
trials = dd.median(dim='time')
rdf = trials.to_dataframe()
rdf.head()

| trial | pupil_x | pupil_y | contrast_left | contrast_right | response_type | contrast_diff |
|---|---|---|---|---|---|---|
| 1 | -1.019465 | 0.754972 | 50.0 | 0.0 | 1.0 | -50.0 |
| 2 | -0.184417 | 0.509199 | 0.0 | 50.0 | -1.0 | 50.0 |
| 3 | 0.799240 | -0.486496 | 100.0 | 25.0 | 1.0 | -75.0 |
| 4 | -0.294746 | -0.218956 | 0.0 | 100.0 | -1.0 | 100.0 |
| 5 | -0.367058 | -0.259306 | 50.0 | 50.0 | 1.0 | 0.0 |
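The `dd.median(dim='time')` step above collapses each trial’s time samples into a single number per variable. In pandas terms it is a groupby-median; a sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical long-format stand-in: two trials, three time samples each.
raw = pd.DataFrame({'trial':   [1, 1, 1, 2, 2, 2],
                    'pupil_x': [-1.0, -1.1, -0.9, 0.2, 0.3, 0.1]})

# Collapse each trial to its median, like dd.median(dim='time') does in xarray.
trials_toy = raw.groupby('trial')['pupil_x'].median()
```

The result has one row per trial, which is why `rdf` above is indexed by trial.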
Extract data from the dataframe rdf, plot its distribution(s), and use a t-test to answer the questions below:
Example: For the trials where contrast_left was greater than or equal to 50 and contrast_right was 0, is the average pupil_x position for the trial unlikely to be zero?
left = rdf[(rdf.contrast_left >= 50) & (rdf.contrast_right == 0)]
sns.kdeplot(left.pupil_x, label='left')
plt.axvline(0, color='black', linestyle='dotted')
plt.legend()
pg.ttest(left.pupil_x, 0)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 3.447425 | 24 | two-sided | 0.002098 | [0.22, 0.88] | 0.689485 | 18.185 | 0.910939 |
Exercise: Were the mice looking at the stimulus that was presented? In trials where there was only one stimulus present (i.e. contrast_left or contrast_right was zero), and the stimulus had at least a contrast level of 50 (quite visible), was there a difference in the average pupil_x position?
Solution
mask_left = np.logical_and(rdf['contrast_left'] >= 50, rdf['contrast_right'] == 0)
df_left = rdf[mask_left]
sns.kdeplot(df_left['pupil_x'], label = 'left')
mask_right = np.logical_and(rdf['contrast_right'] >= 50, rdf['contrast_left'] == 0)
df_right = rdf[mask_right]
sns.kdeplot(df_right['pupil_x'], label = 'right')
plt.legend()
pg.ttest(df_left['pupil_x'], df_right['pupil_x'], paired=False)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 2.900766 | 41.883443 | two-sided | 0.005908 | [0.19, 1.07] | 0.855661 | 7.42 | 0.784252 |
Exercise: As a sanity check, was there a difference in vertical eye position between the left-stimulus trials and the right-stimulus trials? In trials where there was only one stimulus present (i.e. contrast_left or contrast_right was zero), and the stimulus had at least a contrast level of 50 (quite visible), was there a difference in the average pupil_y position?
Solution
mask_left = np.logical_and(rdf['contrast_left'] >= 50, rdf['contrast_right'] == 0)
df_left = rdf[mask_left]
sns.kdeplot(df_left['pupil_y'], label = 'left')
mask_right = np.logical_and(rdf['contrast_right'] >= 50, rdf['contrast_left'] == 0)
df_right = rdf[mask_right]
sns.kdeplot(df_right['pupil_y'], label = 'right')
plt.legend()
pg.ttest(df_left['pupil_y'], df_right['pupil_y'], paired=False)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 0.81158 | 38.482526 | two-sided | 0.422024 | [-0.2, 0.47] | 0.247714 | 0.39 | 0.125031 |
Exercise: Did mice look in the direction they turned the wheel? Was there a difference in average pupil_x position between left-turning trials (i.e. response_type is -1) and right-turning trials (i.e. response_type is 1)?
Solution
mask_left = rdf['response_type'] == -1
df_left = rdf[mask_left]
sns.kdeplot(df_left['pupil_x'], label = 'left')
mask_right = rdf['response_type'] == 1
df_right = rdf[mask_right]
sns.kdeplot(df_right['pupil_x'], label = 'right')
plt.legend()
pg.ttest(df_left['pupil_x'], df_right['pupil_x'], paired=False)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | -2.472363 | 78.49085 | two-sided | 0.015585 | [-0.59, -0.06] | 0.50892 | 3.159 | 0.610308 |
Exercise: As a sanity check, did vertical eye position matter? Was there a difference in average pupil_y position between left-turning trials (i.e. response_type is -1) and right-turning trials (i.e. response_type is 1)?
Solution
mask_left = rdf['response_type'] == -1
df_left = rdf[mask_left]
sns.kdeplot(df_left['pupil_y'], label = 'left')
mask_right = rdf['response_type'] == 1
df_right = rdf[mask_right]
sns.kdeplot(df_right['pupil_y'], label = 'right')
plt.legend()
pg.ttest(df_left['pupil_y'], df_right['pupil_y'], paired=False)

| | T | dof | alternative | p-val | CI95% | cohen-d | BF10 | power |
|---|---|---|---|---|---|---|---|---|
| T-test | 1.218473 | 61.22414 | two-sided | 0.227719 | [-0.1, 0.42] | 0.275322 | 0.443 | 0.22797 |
Section 3: Working with Long DataFrames
If data is organized as a long dataframe, Pingouin can simply do all the comparisons between the groups for you. Not only is this very convenient, it also makes richer analyses possible.
| Code | Description |
|---|---|
| `pg.anova(data=dfl, dv='value', between=['variable'], detailed=True)` | Analysis of Variance (ANOVA) |
| `pg.pairwise_tests(data=dfl, dv='value', between='variable', padjust='fdr_bh', parametric=True/False)` | Pairwise t-tests |
Run the cell below to melt the DataFrame df into a long DataFrame dfl to be used in the following exercises.
Exercises
dfl = df.melt(ignore_index=False).reset_index()
dfl.sample(5)

| | index | variable | value |
|---|---|---|---|
| 63 | 3 | D | -0.544443 |
| 36 | 16 | B | 0.767783 |
| 61 | 1 | D | -0.520330 |
| 2 | 2 | A | 0.623901 |
| 10 | 10 | A | 2.171257 |
Example: Before doing pairwise tests, it’s often helpful to run an ANOVA analysis, which can tell you quickly whether a pairwise difference exists within a collection of potential pairs. Below, run an ANOVA test comparing the dv ‘value’ between values of ‘variable’ to see if a significant difference is expected.
pg.anova(dfl, dv='value', between=['variable'], detailed=True)

| | Source | SS | DF | MS | F | p-unc | np2 |
|---|---|---|---|---|---|---|---|
| 0 | variable | 7.259916 | 3 | 2.419972 | 2.851046 | 0.042862 | 0.101157 |
| 1 | Within | 64.508906 | 76 | 0.848801 | NaN | NaN | NaN |
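The F statistic reported here is the ratio of between-group to within-group mean squares; a hand-rolled sketch on synthetic groups (illustration only, not the exercise data):

```python
import numpy as np

def one_way_anova_F(groups):
    """F = MS_between / MS_within for a list of 1-D samples."""
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    k, n = len(groups), all_vals.size
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three synthetic groups; the third has a shifted mean, so F should be large.
rng = np.random.default_rng(4)
samples = [rng.normal(m, 1.0, size=30) for m in (0.0, 0.0, 1.0)]
F = one_way_anova_F(samples)
```

A large F (relative to the F distribution with k-1 and n-k degrees of freedom) is what drives the small p-unc value in the table.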
Exercise: Do a t-test for every possible pairwise comparison of “variable” (e.g. “A” vs “B”, “A” vs “C”, etc.). (Tip: it works similarly to the pg.anova function.)
Solution
pg.pairwise_tests(dfl, dv='value', between=['variable'])

| | Contrast | A | B | Paired | Parametric | T | dof | alternative | p-unc | BF10 | hedges |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | variable | A | B | False | True | -0.137189 | 38.0 | two-sided | 0.891606 | 0.311 | -0.042521 |
| 1 | variable | A | C | False | True | -1.978874 | 38.0 | two-sided | 0.055111 | 1.409 | -0.613342 |
| 2 | variable | A | D | False | True | 0.068052 | 38.0 | two-sided | 0.946101 | 0.309 | 0.021092 |
| 3 | variable | B | C | False | True | -2.139828 | 38.0 | two-sided | 0.038851 | 1.806 | -0.663229 |
| 4 | variable | B | D | False | True | 0.302159 | 38.0 | two-sided | 0.764179 | 0.32 | 0.093653 |
| 5 | variable | C | D | False | True | 2.845162 | 38.0 | two-sided | 0.007113 | 6.461 | 0.881844 |
Exercise: Repeat the tests, but this time, adjust the p-value to compensate for multiple comparisons using the ‘fdr_bh’ method. Are the significant differences still significant? How do the corrected p-values compare to the uncorrected p-values?
Solution
pg.pairwise_tests(dfl, dv='value', between=['variable'], padjust='fdr_bh')

| | Contrast | A | B | Paired | Parametric | T | dof | alternative | p-unc | p-corr | p-adjust | BF10 | hedges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | variable | A | B | False | True | -0.137189 | 38.0 | two-sided | 0.891606 | 0.946101 | fdr_bh | 0.311 | -0.042521 |
| 1 | variable | A | C | False | True | -1.978874 | 38.0 | two-sided | 0.055111 | 0.110222 | fdr_bh | 1.409 | -0.613342 |
| 2 | variable | A | D | False | True | 0.068052 | 38.0 | two-sided | 0.946101 | 0.946101 | fdr_bh | 0.309 | 0.021092 |
| 3 | variable | B | C | False | True | -2.139828 | 38.0 | two-sided | 0.038851 | 0.110222 | fdr_bh | 1.806 | -0.663229 |
| 4 | variable | B | D | False | True | 0.302159 | 38.0 | two-sided | 0.764179 | 0.946101 | fdr_bh | 0.32 | 0.093653 |
| 5 | variable | C | D | False | True | 2.845162 | 38.0 | two-sided | 0.007113 | 0.042681 | fdr_bh | 6.461 | 0.881844 |
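The p-corr column can be reproduced by hand: Benjamini-Hochberg sorts the p-values, multiplies each by n/rank, and enforces monotonicity from the largest down. A sketch, fed the p-unc values from the table (it reproduces p-corr up to rounding of the inputs):

```python
import numpy as np

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values (the 'fdr_bh' step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                               # sort ascending
    scaled = p[order] * n / np.arange(1, n + 1)         # p * n / rank
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# The p-unc column from the pairwise tests above:
p_unc = [0.891606, 0.055111, 0.946101, 0.038851, 0.764179, 0.007113]
p_corr = fdr_bh(p_unc)
```

Note how several distinct uncorrected p-values collapse onto the same corrected value because of the monotonicity step.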
Exercise: Do a pairwise test again on the long dataframe, comparing the value of ‘value’ between each value of ‘variable’. This time, use a non-parametric comparison (not assuming a normal distribution of each variable), and still adjust for multiple comparisons. Are your results the same?
Solution
pg.pairwise_tests(dfl, dv='value', between=['variable'], padjust='fdr_bh', parametric=False)

| | Contrast | A | B | Paired | Parametric | U-val | alternative | p-unc | p-corr | p-adjust | hedges |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | variable | A | B | False | False | 192.0 | two-sided | 0.839232 | 0.967635 | fdr_bh | -0.042521 |
| 1 | variable | A | C | False | False | 136.0 | two-sided | 0.085855 | 0.171710 | fdr_bh | -0.613342 |
| 2 | variable | A | D | False | False | 202.0 | two-sided | 0.967635 | 0.967635 | fdr_bh | 0.021092 |
| 3 | variable | B | C | False | False | 132.0 | two-sided | 0.067868 | 0.171710 | fdr_bh | -0.663229 |
| 4 | variable | B | D | False | False | 232.0 | two-sided | 0.394171 | 0.591256 | fdr_bh | 0.093653 |
| 5 | variable | C | D | False | False | 273.0 | two-sided | 0.049864 | 0.171710 | fdr_bh | 0.881844 |
Exercise: Use the rdf dataframe from the pupil measurement data. Is there a significant difference in average horizontal pupil position when comparing all the contrast_diff levels in the study pairwise?
pg.pairwise_tests(rdf, dv='pupil_x', between=['contrast_diff'], padjust='fdr_bh', parametric=False)

| | Contrast | A | B | Paired | Parametric | U-val | alternative | p-unc | p-corr | p-adjust | hedges |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | contrast_diff | -100.0 | -75.0 | False | False | 39.0 | two-sided | 0.841125 | 1.000000 | fdr_bh | 0.003482 |
| 1 | contrast_diff | -100.0 | -50.0 | False | False | 168.0 | two-sided | 0.054208 | 0.252391 | fdr_bh | 0.707904 |
| 2 | contrast_diff | -100.0 | -25.0 | False | False | 18.0 | two-sided | 0.600000 | 0.964286 | fdr_bh | 0.399767 |
| 3 | contrast_diff | -100.0 | 0.0 | False | False | 147.0 | two-sided | 0.070109 | 0.252391 | fdr_bh | 0.717534 |
| 4 | contrast_diff | -100.0 | 25.0 | False | False | 42.0 | two-sided | 0.157516 | 0.472549 | fdr_bh | 0.644120 |
| 5 | contrast_diff | -100.0 | 50.0 | False | False | 162.0 | two-sided | 0.003526 | 0.042318 | fdr_bh | 1.179199 |
| 6 | contrast_diff | -100.0 | 75.0 | False | False | 40.0 | two-sided | 0.232680 | 0.598319 | fdr_bh | 0.645830 |
| 7 | contrast_diff | -100.0 | 100.0 | False | False | 112.0 | two-sided | 0.015099 | 0.135893 | fdr_bh | 0.998026 |
| 8 | contrast_diff | -75.0 | -50.0 | False | False | 81.0 | two-sided | 0.036257 | 0.218709 | fdr_bh | 0.904273 |
| 9 | contrast_diff | -75.0 | -25.0 | False | False | 8.0 | two-sided | 0.642857 | 0.964286 | fdr_bh | 0.759261 |
| 10 | contrast_diff | -75.0 | 0.0 | False | False | 72.0 | two-sided | 0.036451 | 0.218709 | fdr_bh | 1.122901 |
| 11 | contrast_diff | -75.0 | 25.0 | False | False | 21.0 | two-sided | 0.066667 | 0.252391 | fdr_bh | 1.397436 |
| 12 | contrast_diff | -75.0 | 50.0 | False | False | 76.0 | two-sided | 0.003302 | 0.042318 | fdr_bh | 1.577027 |
| 13 | contrast_diff | -75.0 | 75.0 | False | False | 19.0 | two-sided | 0.171429 | 0.474725 | fdr_bh | 1.106513 |
| 14 | contrast_diff | -75.0 | 100.0 | False | False | 56.0 | two-sided | 0.002997 | 0.042318 | fdr_bh | 1.534350 |
| 15 | contrast_diff | -50.0 | -25.0 | False | False | 15.0 | two-sided | 0.842105 | 1.000000 | fdr_bh | -0.266070 |
| 16 | contrast_diff | -50.0 | 0.0 | False | False | 124.0 | two-sided | 0.909802 | 1.000000 | fdr_bh | -0.060808 |
| 17 | contrast_diff | -50.0 | 25.0 | False | False | 34.0 | two-sided | 1.000000 | 1.000000 | fdr_bh | -0.005404 |
| 18 | contrast_diff | -50.0 | 50.0 | False | False | 163.0 | two-sided | 0.084222 | 0.275635 | fdr_bh | 0.568939 |
| 19 | contrast_diff | -50.0 | 75.0 | False | False | 36.0 | two-sided | 0.897243 | 1.000000 | fdr_bh | 0.023733 |
| 20 | contrast_diff | -50.0 | 100.0 | False | False | 101.0 | two-sided | 0.436422 | 0.785560 | fdr_bh | 0.368721 |
| 21 | contrast_diff | -25.0 | 0.0 | False | False | 18.0 | two-sided | 0.720588 | 1.000000 | fdr_bh | 0.275528 |
| 22 | contrast_diff | -25.0 | 25.0 | False | False | 5.0 | two-sided | 0.800000 | 1.000000 | fdr_bh | 0.258185 |
| 23 | contrast_diff | -25.0 | 50.0 | False | False | 22.0 | two-sided | 0.266667 | 0.640000 | fdr_bh | 0.814132 |
| 24 | contrast_diff | -25.0 | 75.0 | False | False | 4.0 | two-sided | 1.000000 | 1.000000 | fdr_bh | 0.228942 |
| 25 | contrast_diff | -25.0 | 100.0 | False | False | 14.0 | two-sided | 0.484848 | 0.831169 | fdr_bh | 0.653006 |
| 26 | contrast_diff | 0.0 | 25.0 | False | False | 31.0 | two-sided | 0.961300 | 1.000000 | fdr_bh | 0.065340 |
| 27 | contrast_diff | 0.0 | 50.0 | False | False | 149.0 | two-sided | 0.057631 | 0.252391 | fdr_bh | 0.713574 |
| 28 | contrast_diff | 0.0 | 75.0 | False | False | 33.0 | two-sided | 0.809598 | 1.000000 | fdr_bh | 0.096288 |
| 29 | contrast_diff | 0.0 | 100.0 | False | False | 93.0 | two-sided | 0.331685 | 0.663370 | fdr_bh | 0.510201 |
| 30 | contrast_diff | 25.0 | 50.0 | False | False | 38.0 | two-sided | 0.327451 | 0.663370 | fdr_bh | 0.596998 |
| 31 | contrast_diff | 25.0 | 75.0 | False | False | 8.0 | two-sided | 1.000000 | 1.000000 | fdr_bh | 0.030278 |
| 32 | contrast_diff | 25.0 | 100.0 | False | False | 24.0 | two-sided | 0.635365 | 0.964286 | fdr_bh | 0.416863 |
| 33 | contrast_diff | 50.0 | 75.0 | False | False | 18.0 | two-sided | 0.327451 | 0.663370 | fdr_bh | -0.535031 |
| 34 | contrast_diff | 50.0 | 100.0 | False | False | 55.0 | two-sided | 0.395863 | 0.750056 | fdr_bh | -0.222583 |
| 35 | contrast_diff | 75.0 | 100.0 | False | False | 21.0 | two-sided | 0.945055 | 1.000000 | fdr_bh | 0.348996 |
Section 4: A Gallery of Plots: 2D Histograms with sns.jointplot()
Let’s try out a few different plots, for comparing two different continuous variables’ distributions.
| Code | Description |
|---|---|
| `sns.jointplot(df, x='colA', y='colB')` | Makes a joint histogram |
| `sns.jointplot(df, x='colA', y='colB', kind='kde')` | Makes a joint plot with kernel density estimates |
| `sns.jointplot(df, x='colA', y='colB', kind='hex')` | Makes a joint plot with hexagonal bins |
| `sns.jointplot(df, x='colA', y='colB', hue='colC')` | Makes a joint plot colored by a categorical variable |
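A joint histogram is just 2-D binned counts; np.histogram2d shows the idea that sns.jointplot visualizes (synthetic correlated data standing in for pupil_x/pupil_y):

```python
import numpy as np

# Synthetic correlated x/y, for illustration only.
rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(scale=0.5, size=1000)

# A 2-D histogram: counts of points falling in each (x, y) bin.
counts, xedges, yedges = np.histogram2d(x, y, bins=20)
```

sns.jointplot draws these counts as a color map or hex grid, plus the two 1-D marginal histograms along the axes.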
Exercises
Example: Make a joint plot comparing the pupil_x and pupil_y variables.
sns.jointplot(rdf, x="pupil_x", y="pupil_y");

Exercise: Make a joint plot comparing the pupil_x and pupil_y variables, using kind='kde'.
Solution
sns.jointplot(rdf, x='pupil_x', y='pupil_y', kind='kde')

Exercise: Make a joint plot comparing the pupil_x and pupil_y variables, using kind='kde' and setting hue to 'response_type'.
Solution
sns.jointplot(rdf, x='pupil_x', y='pupil_y', kind='kde', hue='response_type')

Exercise: Make a joint plot comparing the pupil_x and pupil_y variables, using kind='hex'.
Solution
sns.jointplot(rdf, x='pupil_x', y='pupil_y', kind='hex')

Section 5: Further Reading
Nice article on Pingouin here: https://towardsdatascience.com/the-new-kid-on-the-statistics-in-python-block-pingouin-6b353a1db57c
Nice summary of the different effect size metrics and when to pick which: https://www.socscistatistics.com/effectsize/default3.aspx
Section 6: Extra: Correlations and Simpson’s Paradox
| Code | Description |
|---|---|
| `pg.corr(A, B)` | The correlation between variables A and B |
| `pg.pairwise_corr(flowers)` | Pairwise correlations |
| `df.plot.scatter(x='A', y='B')` | Make a scatter plot |
| `sns.lmplot(data=df, x='Col1', y='Col2')` | Make a scatter plot with a regression line |
| `sns.lmplot(data=df, x='Col1', y='Col2', hue='Col3')` | Make a scatter plot with a regression line, colored by a categorical variable |
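The r value that pg.corr reports is the Pearson coefficient: the covariance of the two variables divided by the product of their standard deviations. A hand-rolled sketch on a toy perfectly-linear pair:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance over the product of the two spreads."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

# y is an exact linear function of x, so r should be exactly 1.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

pg.corr adds the confidence interval, Bayes factor, and power around this same number.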
Run the cell below to download the iris dataset for this section:
Exercises
from bokeh.sampledata import iris
flowers = iris.flowers
flowers.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Example: Is there a correlation between flowers’ sepal widths and their petal lengths?
pg.corr(flowers['sepal_width'], flowers['petal_length'])

| | n | r | CI95% | p-val | BF10 | power |
|---|---|---|---|---|---|---|
| pearson | 150 | -0.42844 | [-0.55, -0.29] | 4.513314e-08 | 2.762e+05 | 0.999847 |
Exercise: Use seaborn’s lmplot() function to plot a regression line showing the sepal widths and petal lengths. Does the direction of this line match the results? Looking closely at the data’s distribution, what is missing from this story?
Solution
sns.lmplot(data=flowers, x='sepal_width', y='petal_length')

Exercise: This time, factor the species variable into the lmplot() visualization. What kind of relationship do you see between sepal_width and petal_length now?
Solution
sns.lmplot(data=flowers, x='sepal_width', y='petal_length', hue='species')

Demo
Example: What’s happening here is that a third variable, "species", is influencing both the petal and the sepal shape, so the trend reverses when the data are split by group; this effect is called “Simpson’s Paradox”. Unfortunately, at this time (as far as I know) pingouin doesn’t have a way to split a correlation into multiple groups; one way to do this, though, is to use a df.groupby() analysis, concatenating the results together.
analyses = []
for name, group in flowers.groupby('species'):
analysis = pg.corr(group['sepal_width'], group['petal_length'])
analysis['name'] = name
analyses.append(analysis)
results = pd.concat(analyses)
results

| | n | r | CI95% | p-val | BF10 | power | name |
|---|---|---|---|---|---|---|---|
| pearson | 50 | 0.177700 | [-0.11, 0.43] | 0.216979 | 0.37 | 0.236771 | setosa |
| pearson | 50 | 0.560522 | [0.33, 0.73] | 0.000023 | 1061.04 | 0.992245 | versicolor |
| pearson | 50 | 0.401045 | [0.14, 0.61] | 0.003898 | 10.094 | 0.836041 | virginica |
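The same reversal can be manufactured synthetically: make y increase with x inside each group, but shift the group means in the opposite direction (illustration with made-up groups, not the iris data):

```python
import numpy as np

# Three synthetic "species": within each group y rises with x,
# but the group means are arranged so the pooled trend points the other way.
rng = np.random.default_rng(3)
xs, ys = [], []
for offset in (0.0, 2.0, 4.0):
    x = rng.normal(offset, 0.3, size=100)
    y = (x - offset) - offset + rng.normal(0.0, 0.1, size=100)
    xs.append(x)
    ys.append(y)

r_within = np.corrcoef(xs[0], ys[0])[0, 1]                             # positive
r_overall = np.corrcoef(np.concatenate(xs), np.concatenate(ys))[0, 1]  # negative
```

Pooling the groups lets the between-group trend dominate the within-group one, which is exactly what the iris correlation above fell victim to.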
To summarize, accounting for the right variables is a big part of science; we should always be aware that the patterns we see in our data can be turned completely upside down when new aspects of a problem are understood and factored in!