Aggregating Data Using Statistical Functions in SQL

Author
Dr. Nicholas Del Grosso

Setup

Data Download

We’ll be exploring some data from a Steinmetz et al NeuroPixel experiment, processed here into JSON files for our tabular analysis, along with some other familiar file types. Please run the code below to download the data. It will take 5-10 minutes to download.

from tqdm import tqdm
from pathlib import Path
from webdav4.fsspec import WebdavFileSystem

# https://uni-bonn.sciebo.de/s/oZql1bk0p1AvK0w
fs = WebdavFileSystem("https://uni-bonn.sciebo.de/public.php/webdav", auth=("oZql1bk0p1AvK0w", ""))
for name in tqdm(fs.ls("/", detail=False), desc="Downloading Data to data/stenmetz"):
    if not Path(f"data/steinmetz/{name}").exists():
        fs.download(name, f"data/steinmetz/{name}", recursive=True)
Downloading Data to data/stenmetz:   0%|                                                                                                                                     | 0/37 [00:00<?, ?it/s]
Downloading Data to data/stenmetz: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:00<00:00, 14227.10it/s]

Import Libraries

import duckdb
from duckdb import sql

Section 1: Transformations in SQL SELECT Statements

Simple transformations and renaming operations can be done in-line in a query; DuckDB (and most SQL RDBMS flavors in general) support a wide variety of operations on a wide variety of date types.

Here, we’ll look mainly at numeric and text data:

Code Description
SELECT cos(x) FROM my_table Get the cosine of the “x” column.
SELECT cos(x) as cs FROM my_table Get the cosine of the “x” column and rename it to “cs”.
SELECT cos(x) as cs, sin(x) as sc FROM my_table Calculate the sine and cosine of x and make two columns out of them.
SELECT round(x, 2), floor(x), ceil(x) from my_table Round x to 2 decimal places, round it down, and round it up.
SELECT x * 2, x + 2, x - 2, x / 2 FROM my_table Do arithmetic on x.
SELECT upper(t), lower(t), reverse(t) FROM my_table Uppercase, lowercase, and reverse the text in the t column.
SELECT left(t, 4), right(t, 4) FROM my_table Extract the left-most 4 characters and right right-most 4 characters from t.

Exercises

Example: Get the response time in milliseconds for all trials in the experiment.

query = """
SELECT                                    -- Select Specific Columns
    response_time * 1000 as resp_msecs    -- Multiply the response time column by 1000 and rename it.
FROM '**/trials.csv';                     -- Read all the CSV files into a table.
"""
sql(query).to_df()

resp_msecs
0 1433.915727
1 2001.065710
2 1251.114625
3 1334.219182
4 683.367530
... ...
13612 NaN
13613 NaN
13614 NaN
13615 NaN
13616 NaN

13617 rows × 1 columns

Exercise: Get the feedback time in milliseconds for all trials in the experiment.

Solution
query = """
SELECT 
    feedback_time * 1000 as feedback_msecs 
FROM '**/trials.csv';
"""
sql(query).to_df()

feedback_msecs
0 1471.223129
1 2004.431558
2 1286.820215
3 1372.021561
4 718.411230
... ...
13612 NaN
13613 NaN
13614 NaN
13615 NaN
13616 NaN

13617 rows × 1 columns

Exercise: Get both the response time and feedback time in milliseconds for all trials in the experiment.

Solution
query = """
SELECT 
    feedback_time * 1000 as feedback_msecs,
    response_time * 1000 as resp_msecs,
FROM '**/trials.csv';
"""
sql(query).to_df()

feedback_msecs resp_msecs
0 1471.223129 1433.915727
1 2004.431558 2001.065710
2 1286.820215 1251.114625
3 1372.021561 1334.219182
4 718.411230 683.367530
... ... ...
13612 NaN NaN
13613 NaN NaN
13614 NaN NaN
13615 NaN NaN
13616 NaN NaN

13617 rows × 2 columns

Exercise: Lowercase the mouse name in the all sessions

Solution
query = """
SELECT
    lower(mouse) as mouse,
    * EXCLUDE (mouse),
FROM '**/session.json';
"""
sql(query).to_df().head()

mouse session_date stim_onset bin_size id
0 muller 2017-01-07 0.5 0.01 b5b6
1 muller 2017-01-08 0.5 0.01 49bb
2 radnitz 2017-01-08 0.5 0.01 769e
3 muller 2017-01-09 0.5 0.01 31dc
4 radnitz 2017-01-09 0.5 0.01 99f4

Exercise: In this experiment, the mouse’s task is to compare the contrast levels of the left and right stimulus, and to decide which contrast is higher. Use SQL to make a new contrast_diff column that subtracts contrast_right from contrast_left.

Solution
query = """
SELECT
    contrast_left - contrast_right AS contrast_diff
FROM '**/trials.csv';
"""
sql(query).to_df().head()

contrast_diff
0 0
1 -100
2 -100
3 -100
4 0

Section 2: Full-Table Aggregations

We can also run some basic statistics in SQL, making it useful for some quick summaries of our data. This works the same way as transformations. Let’s try it out!

Code Description
SELECT count(*) FROM my_table The number of rows in the table.
SELECT count(colA) FROM my_table The number of values in colA in the table (same as the number of rows in the whole table.)
SELECT min(colA), max(colA) FROM my_table The minimum and maximum values of colA
SELECT avg(colA), median(colA) FROM my_table The mean and median values of colA
SELECT first(colA), last(colA) FROM my_table The first and last values of colA (based on the ordering of the data)

Exercises

Example: How many sessions were recorded in the study?

sql('SELECT count(*) FROM "data/steinmetz/**/session.json"')
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│           37 │
└──────────────┘

Exercise: How many total trials were done in the experiment, across all sessions?

Solution
sql('SELECT count(*) FROM "data/steinmetz/*/trials.csv"')
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        13617 │
└──────────────┘

Exercise: How many total cells were recorded in this experiment?

Solution
sql(
"""
SELECT 
    count(*)
FROM 
    'data/steinmetz/*/cells.parquet'
""")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        32233 │
└──────────────┘

Exercise: How many total sessions were done by the mouse named “Richards”?

Solution
sql(
"""
SELECT count(*) FROM  "data/steinmetz/*/session.json" WHERE mouse = 'Richards'
""")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            5 │
└──────────────┘

Exercise: How many cells were recorded in the hippocampus in this experiment?

Solution
sql(
"""
SELECT 
    count(*)
FROM 
    'data/steinmetz/*/cells.parquet'
WHERE brain_groups = 'hippocampus'
""")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         3456 │
└──────────────┘

Exercise: What was the min and maximum values of the Left-Right CCF dimension recorded in the data (i.e. the most left coordinate and the most right coordinate)?

Solution
sql(
"""
SELECT
    min(ccf_lr),
    max(ccf_lr)
FROM 
    "data/steinmetz/*/cells.parquet"
""")
┌─────────────┬─────────────┐
│ min(ccf_lr) │ max(ccf_lr) │
│   double    │   double    │
├─────────────┼─────────────┤
│      1078.8 │      6346.6 │
└─────────────┴─────────────┘

Exercise: What was the mean (i.e. “average”) value for each of the CCF coordinates?

Solution
sql(
"""
SELECT
    avg(ccf_lr),
    avg(ccf_ap),
    avg(ccf_dv)
FROM 
    "data/steinmetz/*/cells.parquet"
""")
┌────────────────────┬───────────────────┬────────────────────┐
│    avg(ccf_lr)     │    avg(ccf_ap)    │    avg(ccf_dv)     │
│       double       │      double       │       double       │
├────────────────────┼───────────────────┼────────────────────┤
│ 4085.9178622147992 │ 6645.750891532806 │ 2932.2004544450415 │
└────────────────────┴───────────────────┴────────────────────┘

Exercise: What was the mean (i.e. “average”) value for each of the CCF coordinates in the thalamus?

Solution
sql(
"""
SELECT
    first(brain_groups),
    avg(ccf_lr),
    avg(ccf_ap),
    avg(ccf_dv)
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'thalamus'
""")
┌─────────────────────┬───────────────────┬───────────────────┬───────────────────┐
│ first(brain_groups) │    avg(ccf_lr)    │    avg(ccf_ap)    │    avg(ccf_dv)    │
│       varchar       │      double       │      double       │      double       │
├─────────────────────┼───────────────────┼───────────────────┼───────────────────┤
│ thalamus            │ 3897.681323755769 │ 7337.369557037793 │ 3314.598170001713 │
└─────────────────────┴───────────────────┴───────────────────┴───────────────────┘

Exercise: What was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?

Solution
sql(
"""
SELECT
    avg(response_type)
FROM "**/trials.csv"
WHERE active_trials = true
"""
)
┌──────────────────────┐
│  avg(response_type)  │
│        double        │
├──────────────────────┤
│ 0.002165618232443024 │
└──────────────────────┘

Section 3: SQL GROUP BY for Groupwise Statistics

Code Description
SELECT name, avg(value) FROM my_table GROUP BY name Get the mean value for every unique value of name
SELECT name, avg(value) FROM my_table GROUP BY ALL Group the data by anything requested but not aggregated.
SELECT name, avg(value) FROM my_table GROUP BY name ORDER BY name … and order the rows by the name column.
SELECT name, avg(value) FROM my_table GROUP BY name HAVING name LIKE "G%" Only use groups where the name starts with “G”

Exercises

Example: For each contrast_left level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?

sql(
"""
SELECT
    contrast_left,
    round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_left
ORDER BY contrast_left
"""
)
┌───────────────┬────────────────────┐
│ contrast_left │ mean_response_type │
│     int64     │       double       │
├───────────────┼────────────────────┤
│             0 │              -0.35 │
│            25 │              -0.28 │
│            50 │               0.41 │
│           100 │               0.63 │
└───────────────┴────────────────────┘

Exercise: For each contrast_right level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?

Solution
sql(
"""
SELECT
    contrast_right,
    round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_right
ORDER BY contrast_right
"""
)
┌────────────────┬────────────────────┐
│ contrast_right │ mean_response_type │
│     int64      │       double       │
├────────────────┼────────────────────┤
│              0 │               0.27 │
│             25 │               0.26 │
│             50 │               -0.3 │
│            100 │              -0.58 │
└────────────────┴────────────────────┘

Exercise: For each combination of contrast_left and contrast_right level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?

Solution
sql(
"""
SELECT
    contrast_left,
    contrast_right,
    round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_left, contrast_right
ORDER BY contrast_left, contrast_right
"""
)
┌───────────────┬────────────────┬────────────────────┐
│ contrast_left │ contrast_right │ mean_response_type │
│     int64     │     int64      │       double       │
├───────────────┼────────────────┼────────────────────┤
│             0 │              0 │              -0.07 │
│             0 │             25 │              -0.65 │
│             0 │             50 │              -0.77 │
│             0 │            100 │              -0.75 │
│            25 │              0 │               0.56 │
│            25 │             25 │              -0.04 │
│            25 │             50 │              -0.46 │
│            25 │            100 │               -0.7 │
│            50 │              0 │                0.8 │
│            50 │             25 │               0.44 │
│            50 │             50 │               0.12 │
│            50 │            100 │              -0.36 │
│           100 │              0 │               0.74 │
│           100 │             25 │               0.65 │
│           100 │             50 │               0.55 │
│           100 │            100 │               0.22 │
├───────────────┴────────────────┴────────────────────┤
│ 16 rows                                   3 columns │
└─────────────────────────────────────────────────────┘

Exercise: For each brain group, how many cells were recorded?

Solution
sql(
"""
SELECT
    brain_groups,
    count(*) AS num_cells,
FROM "**/cells.parquet"
GROUP BY brain_groups
"""
)
┌───────────────────┬───────────┐
│   brain_groups    │ num_cells │
│      varchar      │   int64   │
├───────────────────┼───────────┤
│ visual cortex     │      2971 │
│ non-visual cortex │      7771 │
│ thalamus          │      5847 │
│ cortical subplate │       605 │
│ basal ganglia     │      3060 │
│ hippocampus       │      3456 │
│ TT                │       181 │
│ midbrain          │      4139 │
│ root              │      4203 │
└───────────────────┴───────────┘

Section 4: Joining Multiple Tables: JOIN

Code Description
FROM "a.csv" a Give a variable name a to a table loaded from “a.csv”
FROM "a.csv" a JOIN "b.csv" b ON a.id = b.a_id A two-way join, matching the “id” column in table a with the “a_id” column in table b.
FROM "a.csv" a JOIN "b.csv" b ON a.id = b.a_id JOIN "c.csv" c ON a.id = c.a_id A three-way join.

Docs around Join: https://duckdb.org/docs/sql/query_syntax/from

Exercises

Example: Join the sessions and trials tables on the session ids, and show the resulting column names:

query = """
DESCRIBE
SELECT
    *
FROM "**/session.json" sessions
JOIN "**/trials.csv" trials 
ON sessions.id = trials.session_id
"""
sql(query)
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ session_date   │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
│ mouse          │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ stim_onset     │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ bin_size       │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ id             │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ contrast_left  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ contrast_right │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ gocue          │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ stim_onset     │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ feedback_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ feedback_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ response_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ response_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ reaction_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ reaction_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ prev_reward    │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ active_trials  │ BOOLEAN     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 18 rows                                                    6 columns │
└──────────────────────────────────────────────────────────────────────┘

Example: Join the sessions and cells tables on the session ids, and show the resulting column names:

query = """
DESCRIBE
SELECT
    *
FROM "**/session.json" sessions
JOIN "**/cells.parquet" cells 
ON sessions.id = cells.session_id
"""
sql(query)
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ session_date   │ DATE        │ YES     │ NULL    │ NULL    │ NULL    │
│ mouse          │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ stim_onset     │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ bin_size       │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ id             │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_ap         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_dv         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_lr         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ brain_area     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ brain_groups   │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ trough_to_peak │ TINYINT     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ cell           │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 13 rows                                                    6 columns │
└──────────────────────────────────────────────────────────────────────┘

Example: Join the sessions, trials, and cells tables on the session ids, and show the resulting column names:

query = """
DESCRIBE
SELECT
    *
FROM "**/session.json" sessions
JOIN "**/cells.parquet" cells 
ON sessions.id = cells.session_id
JOIN "**/trials.csv" trials
ON sessions.id = trials.session_id
"""
sql(query).to_df()

column_name column_type null key default extra
0 session_date DATE YES None None None
1 mouse VARCHAR YES None None None
2 stim_onset DOUBLE YES None None None
3 bin_size DOUBLE YES None None None
4 id VARCHAR YES None None None
5 ccf_ap DOUBLE YES None None None
6 ccf_dv DOUBLE YES None None None
7 ccf_lr DOUBLE YES None None None
8 brain_area VARCHAR YES None None None
9 brain_groups VARCHAR YES None None None
10 trough_to_peak TINYINT YES None None None
11 session_id VARCHAR YES None None None
12 cell INTEGER YES None None None
13 contrast_left BIGINT YES None None None
14 contrast_right BIGINT YES None None None
15 gocue DOUBLE YES None None None
16 stim_onset DOUBLE YES None None None
17 feedback_type DOUBLE YES None None None
18 feedback_time DOUBLE YES None None None
19 response_type DOUBLE YES None None None
20 response_time DOUBLE YES None None None
21 reaction_type DOUBLE YES None None None
22 reaction_time DOUBLE YES None None None
23 prev_reward DOUBLE YES None None None
24 active_trials BOOLEAN YES None None None
25 session_id VARCHAR YES None None None

Section 5: (Demos) Pivoting Tables and Nesting SQL Statements

Analysis pipelines can get quite complex–there’s a lot that we want from our data! Below are some examples of analyses that uses nesting, pivoting, and exporting to Pandas to break down the analysis into multiple steps.

duckdb.sql(
"""
SELECT 
    contrast_right, 
    "0", "25", "50", "100"
FROM (
    PIVOT (
        SELECT
            contrast_left,
            contrast_right,
            round(avg(response_type), 2) AS mean_response_type
        FROM "**/trials.csv"
        WHERE active_trials = true AND response_type <> 0
        GROUP BY contrast_left, contrast_right
        ORDER BY contrast_left, contrast_right
    )
    ON contrast_left
    USING first(mean_response_type)
    GROUP BY contrast_right
)
"""
).to_df().set_index('contrast_right').style.format("{:.2f}")
result = duckdb.sql(
"""
PIVOT (
    SELECT
        mouse,
        session_id,
        session_date,
        brain_groups,
        count(cell) as num_cells,
    FROM "**/cells.parquet" cells
    INNER JOIN "**/session.json" sessions ON cells.session_id = sessions.id
    GROUP BY ALL
)
ON brain_groups
USING sum(num_cells)
GROUP BY mouse
"""
).to_df()
result.set_index('mouse').convert_dtypes().style.format()
  TT basal ganglia cortical subplate hippocampus midbrain non-visual cortex root thalamus visual cortex
mouse                  
Theiler 209 497 442 59 141
Forssmann 494 52 661 1068 1241 864 219
Radnitz 34 430 111 1017 1229 348 204 430
Moniz 297 185 377 44 801 379
Hench 137 450 506 389 1019 60 819 593
Muller 10 92 459 714 408 583 366 619
Lederberg 681 173 429 787 951 816 1250 221
Richards 562 195 235 344 1387 136 957 19
Tatum 351 185 329 703 757 433 372 205
None 220 78 100 155 145