Aggregating Data Using Statistical Functions in SQL
Author
Setup
Data Download
We’ll be exploring some data from a Steinmetz et al. Neuropixels experiment, processed here into JSON files for our tabular analysis, along with some other familiar file types. Please run the code below to download the data. It will take 5-10 minutes to download.
from tqdm import tqdm
from pathlib import Path
from webdav4.fsspec import WebdavFileSystem
# https://uni-bonn.sciebo.de/s/oZql1bk0p1AvK0w
fs = WebdavFileSystem("https://uni-bonn.sciebo.de/public.php/webdav", auth=("oZql1bk0p1AvK0w", ""))
for name in tqdm(fs.ls("/", detail=False), desc="Downloading Data to data/steinmetz"):
    if not Path(f"data/steinmetz/{name}").exists():
        fs.download(name, f"data/steinmetz/{name}", recursive=True)
Downloading Data to data/steinmetz: 100%|████████████████████████████████████████████████████████████████| 37/37 [00:00<00:00, 14227.10it/s]
Import Libraries
import duckdb
from duckdb import sql
Section 1: Transformations in SQL SELECT Statements
Simple transformations and renaming operations can be done in-line in a query; DuckDB (like most SQL RDBMS flavors) supports a wide variety of operations on a wide variety of data types.
Here, we’ll look mainly at numeric and text data:
| Code | Description |
|---|---|
| `SELECT cos(x) FROM my_table` | Get the cosine of the “x” column. |
| `SELECT cos(x) as cs FROM my_table` | Get the cosine of the “x” column and rename it to “cs”. |
| `SELECT cos(x) as cs, sin(x) as sc FROM my_table` | Calculate the cosine and sine of x and make two columns out of them. |
| `SELECT round(x, 2), floor(x), ceil(x) FROM my_table` | Round x to 2 decimal places, round it down, and round it up. |
| `SELECT x * 2, x + 2, x - 2, x / 2 FROM my_table` | Do arithmetic on x. |
| `SELECT upper(t), lower(t), reverse(t) FROM my_table` | Uppercase, lowercase, and reverse the text in the t column. |
| `SELECT left(t, 4), right(t, 4) FROM my_table` | Extract the left-most 4 characters and the right-most 4 characters from t. |
- Numeric Functions: https://duckdb.org/docs/sql/functions/numeric
- Text Functions: https://duckdb.org/docs/sql/functions/char
- Overview of all supported functions: https://duckdb.org/docs/sql/functions/overview
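To get a feel for these before touching the dataset, here is a minimal sketch that applies a few of the same functions to a small made-up inline table, using the sql helper imported above (the table and column names below are invented purely for illustration):
sql("""
    SELECT
        round(x, 2) AS x_rounded,    -- round to 2 decimal places
        x * 2 AS x_doubled,          -- simple arithmetic
        upper(t) AS t_upper,         -- uppercase the text
        left(t, 4) AS t_first4       -- left-most 4 characters
    FROM (VALUES (3.14159, 'steinmetz'), (2.71828, 'neuropixels')) AS my_table(x, t)
""").show()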
Exercises
Example: Get the response time in milliseconds for all trials in the experiment.
query = """
SELECT -- Select Specific Columns
response_time * 1000 as resp_msecs -- Multiply the response time column by 1000 and rename it.
FROM '**/trials.csv'; -- Read all the CSV files into a table.
"""
sql(query).to_df()
| | resp_msecs |
|---|---|
| 0 | 1433.915727 |
| 1 | 2001.065710 |
| 2 | 1251.114625 |
| 3 | 1334.219182 |
| 4 | 683.367530 |
| ... | ... |
| 13612 | NaN |
| 13613 | NaN |
| 13614 | NaN |
| 13615 | NaN |
| 13616 | NaN |
13617 rows × 1 columns
Exercise: Get the feedback time in milliseconds for all trials in the experiment.
Solution
query = """
SELECT
feedback_time * 1000 as feedback_msecs
FROM '**/trials.csv';
"""
sql(query).to_df()
| | feedback_msecs |
|---|---|
| 0 | 1471.223129 |
| 1 | 2004.431558 |
| 2 | 1286.820215 |
| 3 | 1372.021561 |
| 4 | 718.411230 |
| ... | ... |
| 13612 | NaN |
| 13613 | NaN |
| 13614 | NaN |
| 13615 | NaN |
| 13616 | NaN |
13617 rows × 1 columns
Exercise: Get both the response time and feedback time in milliseconds for all trials in the experiment.
Solution
query = """
SELECT
feedback_time * 1000 as feedback_msecs,
response_time * 1000 as resp_msecs,
FROM '**/trials.csv';
"""
sql(query).to_df()
| | feedback_msecs | resp_msecs |
|---|---|---|
| 0 | 1471.223129 | 1433.915727 |
| 1 | 2004.431558 | 2001.065710 |
| 2 | 1286.820215 | 1251.114625 |
| 3 | 1372.021561 | 1334.219182 |
| 4 | 718.411230 | 683.367530 |
| ... | ... | ... |
| 13612 | NaN | NaN |
| 13613 | NaN | NaN |
| 13614 | NaN | NaN |
| 13615 | NaN | NaN |
| 13616 | NaN | NaN |
13617 rows × 2 columns
Exercise: Lowercase the mouse name in all the sessions.
Solution
query = """
SELECT
lower(mouse) as mouse,  -- lowercase the mouse name
* EXCLUDE (mouse),      -- keep all other columns, dropping the original mouse column
FROM '**/session.json';
"""
sql(query).to_df().head()
| | mouse | session_date | stim_onset | bin_size | id |
|---|---|---|---|---|---|
| 0 | muller | 2017-01-07 | 0.5 | 0.01 | b5b6 |
| 1 | muller | 2017-01-08 | 0.5 | 0.01 | 49bb |
| 2 | radnitz | 2017-01-08 | 0.5 | 0.01 | 769e |
| 3 | muller | 2017-01-09 | 0.5 | 0.01 | 31dc |
| 4 | radnitz | 2017-01-09 | 0.5 | 0.01 | 99f4 |
Exercise: In this experiment, the mouse’s task is to compare the contrast levels of the left and right stimulus, and to decide which contrast is higher. Use SQL to make a new contrast_diff column that subtracts contrast_right from contrast_left.
Solution
query = """
SELECT
contrast_left - contrast_right AS contrast_diff
FROM '**/trials.csv';
"""
sql(query).to_df().head()
| | contrast_diff |
|---|---|
| 0 | 0 |
| 1 | -100 |
| 2 | -100 |
| 3 | -100 |
| 4 | 0 |
Section 2: Full-Table Aggregations
We can also run some basic statistics in SQL, making it useful for some quick summaries of our data. This works the same way as transformations. Let’s try it out!
| Code | Description |
|---|---|
| `SELECT count(*) FROM my_table` | The number of rows in the table. |
| `SELECT count(colA) FROM my_table` | The number of non-NULL values in colA (equal to the number of rows only when colA has no missing values). |
| `SELECT min(colA), max(colA) FROM my_table` | The minimum and maximum values of colA. |
| `SELECT avg(colA), median(colA) FROM my_table` | The mean and median values of colA. |
| `SELECT first(colA), last(colA) FROM my_table` | The first and last values of colA (based on the ordering of the data). |
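As a quick illustration of how these aggregations behave, including the difference between count(*) and count(colA) when NULLs are present, here is a small sketch on a made-up inline table, using the sql helper imported above (the values are invented for demonstration only):
sql("""
    SELECT
        count(*) AS n_rows,      -- counts every row, including the NULL
        count(x) AS n_values,    -- counts only non-NULL values of x
        min(x) AS x_min,
        max(x) AS x_max,
        avg(x) AS x_mean,
        median(x) AS x_median
    FROM (VALUES (1.0), (2.0), (4.0), (NULL)) AS my_table(x)
""").show()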
Exercises
Example: How many sessions were recorded in the study?
sql('SELECT count(*) FROM "data/steinmetz/**/session.json"')
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 37 │
└──────────────┘
Exercise: How many total trials were done in the experiment, across all sessions?
Solution
sql('SELECT count(*) FROM "data/steinmetz/*/trials.csv"')
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 13617 │
└──────────────┘
Exercise: How many total cells were recorded in this experiment?
Solution
sql(
"""
SELECT
count(*)
FROM
'data/steinmetz/*/cells.parquet'
""")┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 32233 │
└──────────────┘
Exercise: How many total sessions were done by the mouse named “Richards”?
Solution
sql(
"""
SELECT count(*) FROM "data/steinmetz/*/session.json" WHERE mouse = 'Richards'
""")┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 5 │
└──────────────┘
Exercise: How many cells were recorded in the hippocampus in this experiment?
Solution
sql(
"""
SELECT
count(*)
FROM
'data/steinmetz/*/cells.parquet'
WHERE brain_groups = 'hippocampus'
""")┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 3456 │
└──────────────┘
Exercise: What were the minimum and maximum values of the Left-Right CCF dimension recorded in the data (i.e. the left-most coordinate and the right-most coordinate)?
Solution
sql(
"""
SELECT
min(ccf_lr),
max(ccf_lr)
FROM
"data/steinmetz/*/cells.parquet"
""")┌─────────────┬─────────────┐
│ min(ccf_lr) │ max(ccf_lr) │
│ double │ double │
├─────────────┼─────────────┤
│ 1078.8 │ 6346.6 │
└─────────────┴─────────────┘
Exercise: What was the mean (i.e. “average”) value for each of the CCF coordinates?
Solution
sql(
"""
SELECT
avg(ccf_lr),
avg(ccf_ap),
avg(ccf_dv)
FROM
"data/steinmetz/*/cells.parquet"
""")┌────────────────────┬───────────────────┬────────────────────┐
│ avg(ccf_lr) │ avg(ccf_ap) │ avg(ccf_dv) │
│ double │ double │ double │
├────────────────────┼───────────────────┼────────────────────┤
│ 4085.9178622147992 │ 6645.750891532806 │ 2932.2004544450415 │
└────────────────────┴───────────────────┴────────────────────┘
Exercise: What was the mean (i.e. “average”) value for each of the CCF coordinates in the thalamus?
Solution
sql(
"""
SELECT
first(brain_groups),
avg(ccf_lr),
avg(ccf_ap),
avg(ccf_dv)
FROM
"data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'thalamus'
""")┌─────────────────────┬───────────────────┬───────────────────┬───────────────────┐
│ first(brain_groups) │ avg(ccf_lr) │ avg(ccf_ap) │ avg(ccf_dv) │
│ varchar │ double │ double │ double │
├─────────────────────┼───────────────────┼───────────────────┼───────────────────┤
│ thalamus │ 3897.681323755769 │ 7337.369557037793 │ 3314.598170001713 │
└─────────────────────┴───────────────────┴───────────────────┴───────────────────┘
Exercise: What was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
Solution
sql(
"""
SELECT
avg(response_type)
FROM "**/trials.csv"
WHERE active_trials = true
"""
)
┌──────────────────────┐
│ avg(response_type) │
│ double │
├──────────────────────┤
│ 0.002165618232443024 │
└──────────────────────┘
Section 3: SQL GROUP BY for Groupwise Statistics
| Code | Description |
|---|---|
| `SELECT name, avg(value) FROM my_table GROUP BY name` | Get the mean value for every unique value of name. |
| `SELECT name, avg(value) FROM my_table GROUP BY ALL` | Group the data by every selected column that isn’t aggregated. |
| `SELECT name, avg(value) FROM my_table GROUP BY name ORDER BY name` | … and order the rows by the name column. |
| `SELECT name, avg(value) FROM my_table GROUP BY name HAVING name LIKE 'G%'` | Only keep groups where the name starts with “G”. |
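Here is a minimal sketch of GROUP BY, HAVING, and ORDER BY working together on a made-up inline table, using the sql helper imported above (the names and values are invented for illustration):
sql("""
    SELECT
        name,
        avg(val) AS mean_val,
        count(*) AS n
    FROM (VALUES ('Gex', 1.0), ('Gex', 3.0), ('Rey', 2.0)) AS my_table(name, val)
    GROUP BY name
    HAVING name LIKE 'G%'    -- keep only groups whose name starts with "G"
    ORDER BY name
""").show()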
Exercises
Example: For each contrast_left level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
sql(
"""
SELECT
contrast_left,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_left
ORDER BY contrast_left
"""
)
┌───────────────┬────────────────────┐
│ contrast_left │ mean_response_type │
│ int64 │ double │
├───────────────┼────────────────────┤
│ 0 │ -0.35 │
│ 25 │ -0.28 │
│ 50 │ 0.41 │
│ 100 │ 0.63 │
└───────────────┴────────────────────┘
Exercise: For each contrast_right level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
Solution
sql(
"""
SELECT
contrast_right,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_right
ORDER BY contrast_right
"""
)
┌────────────────┬────────────────────┐
│ contrast_right │ mean_response_type │
│ int64 │ double │
├────────────────┼────────────────────┤
│ 0 │ 0.27 │
│ 25 │ 0.26 │
│ 50 │ -0.3 │
│ 100 │ -0.58 │
└────────────────┴────────────────────┘
Exercise: For each combination of contrast_left and contrast_right level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
Solution
sql(
"""
SELECT
contrast_left,
contrast_right,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_left, contrast_right
ORDER BY contrast_left, contrast_right
"""
)
┌───────────────┬────────────────┬────────────────────┐
│ contrast_left │ contrast_right │ mean_response_type │
│ int64 │ int64 │ double │
├───────────────┼────────────────┼────────────────────┤
│ 0 │ 0 │ -0.07 │
│ 0 │ 25 │ -0.65 │
│ 0 │ 50 │ -0.77 │
│ 0 │ 100 │ -0.75 │
│ 25 │ 0 │ 0.56 │
│ 25 │ 25 │ -0.04 │
│ 25 │ 50 │ -0.46 │
│ 25 │ 100 │ -0.7 │
│ 50 │ 0 │ 0.8 │
│ 50 │ 25 │ 0.44 │
│ 50 │ 50 │ 0.12 │
│ 50 │ 100 │ -0.36 │
│ 100 │ 0 │ 0.74 │
│ 100 │ 25 │ 0.65 │
│ 100 │ 50 │ 0.55 │
│ 100 │ 100 │ 0.22 │
├───────────────┴────────────────┴────────────────────┤
│ 16 rows 3 columns │
└─────────────────────────────────────────────────────┘
Exercise: For each brain group, how many cells were recorded?
Solution
sql(
"""
SELECT
brain_groups,
count(*) AS num_cells,
FROM "**/cells.parquet"
GROUP BY brain_groups
"""
)
┌───────────────────┬───────────┐
│ brain_groups │ num_cells │
│ varchar │ int64 │
├───────────────────┼───────────┤
│ visual cortex │ 2971 │
│ non-visual cortex │ 7771 │
│ thalamus │ 5847 │
│ cortical subplate │ 605 │
│ basal ganglia │ 3060 │
│ hippocampus │ 3456 │
│ TT │ 181 │
│ midbrain │ 4139 │
│ root │ 4203 │
└───────────────────┴───────────┘
Section 4: Joining Multiple Tables: JOIN
| Code | Description |
|---|---|
| `FROM "a.csv" a` | Give a table loaded from “a.csv” the alias a. |
| `FROM "a.csv" a JOIN "b.csv" b ON a.id = b.a_id` | A two-way join, matching the “id” column in table a with the “a_id” column in table b. |
| `FROM "a.csv" a JOIN "b.csv" b ON a.id = b.a_id JOIN "c.csv" c ON a.id = c.a_id` | A three-way join. |
Docs on joins: https://duckdb.org/docs/sql/query_syntax/from
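Before working with the real files, here is a minimal sketch of a two-way join followed by a groupwise aggregation, using two made-up inline tables in place of “a.csv” and “b.csv” (all names and values below are invented for illustration), with the sql helper imported above:
sql("""
    WITH
        a AS (SELECT * FROM (VALUES (1, 'mouse_a'), (2, 'mouse_b')) AS t(id, mouse)),
        b AS (SELECT * FROM (VALUES (1, 10.5), (1, 12.0), (2, 9.9)) AS t(a_id, response_time))
    SELECT
        a.mouse,
        avg(b.response_time) AS mean_response_time
    FROM a
    JOIN b ON a.id = b.a_id    -- match each row of b to its parent row in a
    GROUP BY a.mouse
    ORDER BY a.mouse
""").show()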
Exercises
Example: Join the sessions and trials tables on the session ids, and show the resulting column names:
query = """
DESCRIBE
SELECT
*
FROM "**/session.json" sessions
JOIN "**/trials.csv" trials
ON sessions.id = trials.session_id
"""
sql(query)
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ session_date │ DATE │ YES │ NULL │ NULL │ NULL │
│ mouse │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ stim_onset │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ bin_size │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ contrast_left │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ contrast_right │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ gocue │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ stim_onset │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ feedback_type │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ feedback_time │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ response_type │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ response_time │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ reaction_type │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ reaction_time │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ prev_reward │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ active_trials │ BOOLEAN │ YES │ NULL │ NULL │ NULL │
│ session_id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 18 rows 6 columns │
└──────────────────────────────────────────────────────────────────────┘
Example: Join the sessions and cells tables on the session ids, and show the resulting column names:
query = """
DESCRIBE
SELECT
*
FROM "**/session.json" sessions
JOIN "**/cells.parquet" cells
ON sessions.id = cells.session_id
"""
sql(query)
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ session_date │ DATE │ YES │ NULL │ NULL │ NULL │
│ mouse │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ stim_onset │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ bin_size │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ ccf_ap │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ ccf_dv │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ ccf_lr │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ brain_area │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ brain_groups │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ trough_to_peak │ TINYINT │ YES │ NULL │ NULL │ NULL │
│ session_id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ cell │ INTEGER │ YES │ NULL │ NULL │ NULL │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 13 rows 6 columns │
└──────────────────────────────────────────────────────────────────────┘
Example: Join the sessions, trials, and cells tables on the session ids, and show the resulting column names:
query = """
DESCRIBE
SELECT
*
FROM "**/session.json" sessions
JOIN "**/cells.parquet" cells
ON sessions.id = cells.session_id
JOIN "**/trials.csv" trials
ON sessions.id = trials.session_id
"""
sql(query).to_df()
| | column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|---|
| 0 | session_date | DATE | YES | None | None | None |
| 1 | mouse | VARCHAR | YES | None | None | None |
| 2 | stim_onset | DOUBLE | YES | None | None | None |
| 3 | bin_size | DOUBLE | YES | None | None | None |
| 4 | id | VARCHAR | YES | None | None | None |
| 5 | ccf_ap | DOUBLE | YES | None | None | None |
| 6 | ccf_dv | DOUBLE | YES | None | None | None |
| 7 | ccf_lr | DOUBLE | YES | None | None | None |
| 8 | brain_area | VARCHAR | YES | None | None | None |
| 9 | brain_groups | VARCHAR | YES | None | None | None |
| 10 | trough_to_peak | TINYINT | YES | None | None | None |
| 11 | session_id | VARCHAR | YES | None | None | None |
| 12 | cell | INTEGER | YES | None | None | None |
| 13 | contrast_left | BIGINT | YES | None | None | None |
| 14 | contrast_right | BIGINT | YES | None | None | None |
| 15 | gocue | DOUBLE | YES | None | None | None |
| 16 | stim_onset | DOUBLE | YES | None | None | None |
| 17 | feedback_type | DOUBLE | YES | None | None | None |
| 18 | feedback_time | DOUBLE | YES | None | None | None |
| 19 | response_type | DOUBLE | YES | None | None | None |
| 20 | response_time | DOUBLE | YES | None | None | None |
| 21 | reaction_type | DOUBLE | YES | None | None | None |
| 22 | reaction_time | DOUBLE | YES | None | None | None |
| 23 | prev_reward | DOUBLE | YES | None | None | None |
| 24 | active_trials | BOOLEAN | YES | None | None | None |
| 25 | session_id | VARCHAR | YES | None | None | None |
Section 5: (Demos) Pivoting Tables and Nesting SQL Statements
Analysis pipelines can get quite complex; there's a lot that we want from our data! Below are some examples of analyses that use nesting, pivoting, and exporting to Pandas to break the analysis down into multiple steps.
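Before the full demos, here is a minimal sketch of DuckDB’s PIVOT statement on a made-up inline table, just to show the basic shape: a long-format table of counts is spread into one column per brain group (all names and numbers below are invented for illustration):
duckdb.sql("""
    PIVOT (
        SELECT * FROM (VALUES
            ('Muller', 'thalamus', 12),
            ('Muller', 'midbrain', 7),
            ('Radnitz', 'thalamus', 5)
        ) AS t(mouse, brain_group, num_cells)
    )
    ON brain_group              -- the unique values here become new columns
    USING sum(num_cells)        -- how to fill each new column
    GROUP BY mouse              -- one output row per mouse
""").show()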
duckdb.sql(
"""
SELECT
contrast_right,
"0", "25", "50", "100"
FROM (
PIVOT (
SELECT
contrast_left,
contrast_right,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true AND response_type <> 0
GROUP BY contrast_left, contrast_right
ORDER BY contrast_left, contrast_right
)
ON contrast_left
USING first(mean_response_type)
GROUP BY contrast_right
)
"""
).to_df().set_index('contrast_right').style.format("{:.2f}")
result = duckdb.sql(
"""
PIVOT (
SELECT
mouse,
session_id,
session_date,
brain_groups,
count(cell) as num_cells,
FROM "**/cells.parquet" cells
INNER JOIN "**/session.json" sessions ON cells.session_id = sessions.id
GROUP BY ALL
)
ON brain_groups
USING sum(num_cells)
GROUP BY mouse
"""
).to_df()
result.set_index('mouse').convert_dtypes().style.format()
| mouse | TT | basal ganglia | cortical subplate | hippocampus | midbrain | non-visual cortex | root | thalamus | visual cortex |
|---|---|---|---|---|---|---|---|---|---|
| Theiler | 209 | 497 | 442 | 59 | 141 | ||||
| Forssmann | 494 | 52 | 661 | 1068 | 1241 | 864 | 219 | ||
| Radnitz | 34 | 430 | 111 | 1017 | 1229 | 348 | 204 | 430 | |
| Moniz | 297 | 185 | 377 | 44 | 801 | 379 | |||
| Hench | 137 | 450 | 506 | 389 | 1019 | 60 | 819 | 593 | |
| Muller | 10 | 92 | 459 | 714 | 408 | 583 | 366 | 619 | |
| Lederberg | 681 | 173 | 429 | 787 | 951 | 816 | 1250 | 221 | |
| Richards | 562 | 195 | 235 | 344 | 1387 | 136 | 957 | 19 | |
| Tatum | 351 | 185 | 329 | 703 | 757 | 433 | 372 | 205 | |
| None | 220 | 78 | 100 | 155 | 145 |