Aggregating Data Using Statistical Functions in SQL
Author
Setup
Data Download
We’ll be exploring some data from a Steinmetz et al. Neuropixels experiment, processed here into JSON files for our tabular analysis, along with some other familiar file types. Please run the code below to download the data. It will take 5-10 minutes to download.
from tqdm import tqdm
from pathlib import Path
from webdav4.fsspec import WebdavFileSystem
# https://uni-bonn.sciebo.de/s/oZql1bk0p1AvK0w
fs = WebdavFileSystem("https://uni-bonn.sciebo.de/public.php/webdav", auth=("oZql1bk0p1AvK0w", ""))
for name in tqdm(fs.ls("/", detail=False), desc="Downloading Data to data/steinmetz"):
    if not Path(f"data/steinmetz/{name}").exists():
        fs.download(name, f"data/steinmetz/{name}", recursive=True)
Downloading Data to data/steinmetz: 100%|████████████████████████████████████████████████████████████████| 37/37 [00:00<00:00, 14227.10it/s]
Import Libraries
import duckdb
from duckdb import sql
Section 1: Transformations in SQL SELECT Statements
Simple transformations and renaming operations can be done in-line in a query; DuckDB (like most SQL RDBMS flavors) supports a wide variety of operations on a wide variety of data types.
Here, we’ll look mainly at numeric and text data:
| Code | Description |
|---|---|
| `SELECT cos(x) FROM my_table` | Get the cosine of the “x” column. |
| `SELECT cos(x) as cs FROM my_table` | Get the cosine of the “x” column and rename it to “cs”. |
| `SELECT cos(x) as cs, sin(x) as sc FROM my_table` | Calculate the cosine and sine of x and make two columns out of them. |
| `SELECT round(x, 2), floor(x), ceil(x) FROM my_table` | Round x to 2 decimal places, round it down, and round it up. |
| `SELECT x * 2, x + 2, x - 2, x / 2 FROM my_table` | Do arithmetic on x. |
| `SELECT upper(t), lower(t), reverse(t) FROM my_table` | Uppercase, lowercase, and reverse the text in the t column. |
| `SELECT left(t, 4), right(t, 4) FROM my_table` | Extract the left-most 4 characters and the right-most 4 characters from t. |
- Numeric Functions: https://duckdb.org/docs/sql/functions/numeric
- Text Functions: https://duckdb.org/docs/sql/functions/char
- Overview of all supported functions: https://duckdb.org/docs/sql/functions/overview
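To get a feel for these before touching the dataset, here is a minimal sketch that applies a few of the same functions to a small made-up inline table, using the sql helper imported above (the table and column names below are invented purely for illustration):
sql("""
    SELECT
        round(x, 2) AS x_rounded,    -- round to 2 decimal places
        x * 2 AS x_doubled,          -- simple arithmetic
        upper(t) AS t_upper,         -- uppercase the text
        left(t, 4) AS t_first4       -- left-most 4 characters
    FROM (VALUES (3.14159, 'steinmetz'), (2.71828, 'neuropixels')) AS my_table(x, t)
""").show()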
Exercises
Example: Get the response time in milliseconds for all trials in the experiment.
query = """
SELECT -- Select Specific Columns
response_time * 1000 as resp_msecs -- Multiply the response time column by 1000 and rename it.
FROM '**/trials.csv'; -- Read all the CSV files into a table.
"""
sql(query).to_df()
| | resp_msecs |
|---|---|
| 0 | 1433.915727 |
| 1 | 2001.065710 |
| 2 | 1251.114625 |
| 3 | 1334.219182 |
| 4 | 683.367530 |
| ... | ... |
| 13612 | NaN |
| 13613 | NaN |
| 13614 | NaN |
| 13615 | NaN |
| 13616 | NaN |
13617 rows × 1 columns
Exercise: Get the feedback time in milliseconds for all trials in the experiment.
Solution
query = """
SELECT
feedback_time * 1000 as feedback_msecs
FROM '**/trials.csv';
"""
sql(query).to_df()
| | feedback_msecs |
|---|---|
| 0 | 1471.223129 |
| 1 | 2004.431558 |
| 2 | 1286.820215 |
| 3 | 1372.021561 |
| 4 | 718.411230 |
| ... | ... |
| 13612 | NaN |
| 13613 | NaN |
| 13614 | NaN |
| 13615 | NaN |
| 13616 | NaN |
13617 rows × 1 columns
Exercise: Get both the response time and feedback time in milliseconds for all trials in the experiment.
Solution
query = """
SELECT
feedback_time * 1000 as feedback_msecs,
response_time * 1000 as resp_msecs,
FROM '**/trials.csv';
"""
sql(query).to_df()
| | feedback_msecs | resp_msecs |
|---|---|---|
| 0 | 1471.223129 | 1433.915727 |
| 1 | 2004.431558 | 2001.065710 |
| 2 | 1286.820215 | 1251.114625 |
| 3 | 1372.021561 | 1334.219182 |
| 4 | 718.411230 | 683.367530 |
| ... | ... | ... |
| 13612 | NaN | NaN |
| 13613 | NaN | NaN |
| 13614 | NaN | NaN |
| 13615 | NaN | NaN |
| 13616 | NaN | NaN |
13617 rows × 2 columns
Exercise: Lowercase the mouse name in all the sessions.
Solution
query = """
SELECT
lower(mouse) as mouse,  -- lowercase the mouse name
* EXCLUDE (mouse),      -- keep all other columns, dropping the original mouse column
FROM '**/session.json';
"""
sql(query).to_df().head()
| | mouse | session_date | stim_onset | bin_size | id |
|---|---|---|---|---|---|
| 0 | muller | 2017-01-07 | 0.5 | 0.01 | b5b6 |
| 1 | muller | 2017-01-08 | 0.5 | 0.01 | 49bb |
| 2 | radnitz | 2017-01-08 | 0.5 | 0.01 | 769e |
| 3 | muller | 2017-01-09 | 0.5 | 0.01 | 31dc |
| 4 | radnitz | 2017-01-09 | 0.5 | 0.01 | 99f4 |
Exercise: In this experiment, the mouse’s task is to compare the contrast levels of the left and right stimulus, and to decide which contrast is higher. Use SQL to make a new contrast_diff column that subtracts contrast_right from contrast_left.
Solution
query = """
SELECT
contrast_left - contrast_right AS contrast_diff
FROM '**/trials.csv';
"""
sql(query).to_df().head()
| | contrast_diff |
|---|---|
| 0 | 0 |
| 1 | -100 |
| 2 | -100 |
| 3 | -100 |
| 4 | 0 |
Section 2: Full-Table Aggregations
We can also run some basic statistics in SQL, making it useful for some quick summaries of our data. This works the same way as transformations. Let’s try it out!
| Code | Description |
|---|---|
| `SELECT count(*) FROM my_table` | The number of rows in the table. |
| `SELECT count(colA) FROM my_table` | The number of non-NULL values in colA (equal to the number of rows only when colA has no missing values). |
| `SELECT min(colA), max(colA) FROM my_table` | The minimum and maximum values of colA. |
| `SELECT avg(colA), median(colA) FROM my_table` | The mean and median values of colA. |
| `SELECT first(colA), last(colA) FROM my_table` | The first and last values of colA (based on the ordering of the data). |
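As a quick illustration of how these aggregations behave, including the difference between count(*) and count(colA) when NULLs are present, here is a small sketch on a made-up inline table, using the sql helper imported above (the values are invented for demonstration only):
sql("""
    SELECT
        count(*) AS n_rows,      -- counts every row, including the NULL
        count(x) AS n_values,    -- counts only non-NULL values of x
        min(x) AS x_min,
        max(x) AS x_max,
        avg(x) AS x_mean,
        median(x) AS x_median
    FROM (VALUES (1.0), (2.0), (4.0), (NULL)) AS my_table(x)
""").show()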
Exercises
Example: How many sessions were recorded in the study?
sql('SELECT count(*) FROM "data/steinmetz/**/session.json"')
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 37 │
└──────────────┘
Exercise: How many total trials were done in the experiment, across all sessions?
Solution
sql('SELECT count(*) FROM "data/steinmetz/*/trials.csv"')
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 13617 │
└──────────────┘
Exercise: How many total cells were recorded in this experiment?
Solution
sql(
"""
SELECT
count(*)
FROM
'data/steinmetz/*/cells.parquet'
""")┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 32233 │
└──────────────┘
Exercise: How many total sessions were done by the mouse named “Richards”?
Solution
sql(
"""
SELECT count(*) FROM "data/steinmetz/*/session.json" WHERE mouse = 'Richards'
""")┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 5 │
└──────────────┘
Exercise: How many cells were recorded in the hippocampus in this experiment?
Solution
sql(
"""
SELECT
count(*)
FROM
'data/steinmetz/*/cells.parquet'
WHERE brain_groups = 'hippocampus'
""")┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 3456 │
└──────────────┘
Exercise: What were the minimum and maximum values of the Left-Right CCF dimension recorded in the data (i.e. the left-most coordinate and the right-most coordinate)?
Solution
sql(
"""
SELECT
min(ccf_lr),
max(ccf_lr)
FROM
"data/steinmetz/*/cells.parquet"
""")┌─────────────┬─────────────┐
│ min(ccf_lr) │ max(ccf_lr) │
│ double │ double │
├─────────────┼─────────────┤
│ 1078.8 │ 6346.6 │
└─────────────┴─────────────┘
Exercise: What was the mean (i.e. “average”) value for each of the CCF coordinates?
Solution
sql(
"""
SELECT
avg(ccf_lr),
avg(ccf_ap),
avg(ccf_dv)
FROM
"data/steinmetz/*/cells.parquet"
""")┌────────────────────┬───────────────────┬────────────────────┐
│ avg(ccf_lr) │ avg(ccf_ap) │ avg(ccf_dv) │
│ double │ double │ double │
├────────────────────┼───────────────────┼────────────────────┤
│ 4085.9178622147992 │ 6645.750891532806 │ 2932.2004544450415 │
└────────────────────┴───────────────────┴────────────────────┘
Exercise: What was the mean (i.e. “average”) value for each of the CCF coordinates in the thalamus?
Solution
sql(
"""
SELECT
first(brain_groups),
avg(ccf_lr),
avg(ccf_ap),
avg(ccf_dv)
FROM
"data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'thalamus'
""")┌─────────────────────┬───────────────────┬───────────────────┬───────────────────┐
│ first(brain_groups) │ avg(ccf_lr) │ avg(ccf_ap) │ avg(ccf_dv) │
│ varchar │ double │ double │ double │
├─────────────────────┼───────────────────┼───────────────────┼───────────────────┤
│ thalamus │ 3897.681323755769 │ 7337.369557037793 │ 3314.598170001713 │
└─────────────────────┴───────────────────┴───────────────────┴───────────────────┘
Exercise: What was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
Solution
sql(
"""
SELECT
avg(response_type)
FROM "**/trials.csv"
WHERE active_trials = true
"""
)
┌──────────────────────┐
│ avg(response_type) │
│ double │
├──────────────────────┤
│ 0.002165618232443024 │
└──────────────────────┘
Section 3: SQL GROUP BY for Groupwise Statistics
| Code | Description |
|---|---|
| `SELECT name, avg(value) FROM my_table GROUP BY name` | Get the mean value for every unique value of name. |
| `SELECT name, avg(value) FROM my_table GROUP BY ALL` | Group the data by every selected column that isn’t aggregated. |
| `SELECT name, avg(value) FROM my_table GROUP BY name ORDER BY name` | … and order the rows by the name column. |
| `SELECT name, avg(value) FROM my_table GROUP BY name HAVING name LIKE 'G%'` | Only keep groups where the name starts with “G”. |
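Here is a minimal sketch of GROUP BY, HAVING, and ORDER BY working together on a made-up inline table, using the sql helper imported above (the names and values are invented for illustration):
sql("""
    SELECT
        name,
        avg(val) AS mean_val,
        count(*) AS n
    FROM (VALUES ('Gex', 1.0), ('Gex', 3.0), ('Rey', 2.0)) AS my_table(name, val)
    GROUP BY name
    HAVING name LIKE 'G%'    -- keep only groups whose name starts with "G"
    ORDER BY name
""").show()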
Exercises
Example: For each contrast_left level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
sql(
"""
SELECT
contrast_left,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_left
ORDER BY contrast_left
"""
)
┌───────────────┬────────────────────┐
│ contrast_left │ mean_response_type │
│ int64 │ double │
├───────────────┼────────────────────┤
│ 0 │ -0.35 │
│ 25 │ -0.28 │
│ 50 │ 0.41 │
│ 100 │ 0.63 │
└───────────────┴────────────────────┘
Exercise: For each contrast_right level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
Solution
sql(
"""
SELECT
contrast_right,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_right
ORDER BY contrast_right
"""
)
┌────────────────┬────────────────────┐
│ contrast_right │ mean_response_type │
│ int64 │ double │
├────────────────┼────────────────────┤
│ 0 │ 0.27 │
│ 25 │ 0.26 │
│ 50 │ -0.3 │
│ 100 │ -0.58 │
└────────────────┴────────────────────┘
Exercise: For each combination of contrast_left and contrast_right level, what was the average response_type (-1 is leftward wheel turn, +1 is rightward wheel turn, 0 is no wheel turn), where it was considered an “Active Trial” (i.e. when the mouse wasn’t expected to just passively observe)?
Solution
sql(
"""
SELECT
contrast_left,
contrast_right,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true
GROUP BY contrast_left, contrast_right
ORDER BY contrast_left, contrast_right
"""
)
┌───────────────┬────────────────┬────────────────────┐
│ contrast_left │ contrast_right │ mean_response_type │
│ int64 │ int64 │ double │
├───────────────┼────────────────┼────────────────────┤
│ 0 │ 0 │ -0.07 │
│ 0 │ 25 │ -0.65 │
│ 0 │ 50 │ -0.77 │
│ 0 │ 100 │ -0.75 │
│ 25 │ 0 │ 0.56 │
│ 25 │ 25 │ -0.04 │
│ 25 │ 50 │ -0.46 │
│ 25 │ 100 │ -0.7 │
│ 50 │ 0 │ 0.8 │
│ 50 │ 25 │ 0.44 │
│ 50 │ 50 │ 0.12 │
│ 50 │ 100 │ -0.36 │
│ 100 │ 0 │ 0.74 │
│ 100 │ 25 │ 0.65 │
│ 100 │ 50 │ 0.55 │
│ 100 │ 100 │ 0.22 │
├───────────────┴────────────────┴────────────────────┤
│ 16 rows 3 columns │
└─────────────────────────────────────────────────────┘
Exercise: For each brain group, how many cells were recorded?
Solution
sql(
"""
SELECT
brain_groups,
count(*) AS num_cells,
FROM "**/cells.parquet"
GROUP BY brain_groups
"""
)
┌───────────────────┬───────────┐
│ brain_groups │ num_cells │
│ varchar │ int64 │
├───────────────────┼───────────┤
│ visual cortex │ 2971 │
│ non-visual cortex │ 7771 │
│ thalamus │ 5847 │
│ cortical subplate │ 605 │
│ basal ganglia │ 3060 │
│ hippocampus │ 3456 │
│ TT │ 181 │
│ midbrain │ 4139 │
│ root │ 4203 │
└───────────────────┴───────────┘
Section 4: Joining Multiple Tables: JOIN
| Code | Description |
|---|---|
| `FROM "a.csv" a` | Give a table loaded from “a.csv” the alias a. |
| `FROM "a.csv" a JOIN "b.csv" b ON a.id = b.a_id` | A two-way join, matching the “id” column in table a with the “a_id” column in table b. |
| `FROM "a.csv" a JOIN "b.csv" b ON a.id = b.a_id JOIN "c.csv" c ON a.id = c.a_id` | A three-way join. |
Docs on joins: https://duckdb.org/docs/sql/query_syntax/from
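Before working with the real files, here is a minimal sketch of a two-way join followed by a groupwise aggregation, using two made-up inline tables in place of “a.csv” and “b.csv” (all names and values below are invented for illustration), with the sql helper imported above:
sql("""
    WITH
        a AS (SELECT * FROM (VALUES (1, 'mouse_a'), (2, 'mouse_b')) AS t(id, mouse)),
        b AS (SELECT * FROM (VALUES (1, 10.5), (1, 12.0), (2, 9.9)) AS t(a_id, response_time))
    SELECT
        a.mouse,
        avg(b.response_time) AS mean_response_time
    FROM a
    JOIN b ON a.id = b.a_id    -- match each row of b to its parent row in a
    GROUP BY a.mouse
    ORDER BY a.mouse
""").show()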
Exercises
Example: Join the sessions and trials tables on the session ids, and show the resulting column names:
query = """
DESCRIBE
SELECT
*
FROM "**/session.json" sessions
JOIN "**/trials.csv" trials
ON sessions.id = trials.session_id
"""
sql(query)
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ session_date │ DATE │ YES │ NULL │ NULL │ NULL │
│ mouse │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ stim_onset │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ bin_size │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ contrast_left │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ contrast_right │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ gocue │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ stim_onset │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ feedback_type │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ feedback_time │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ response_type │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ response_time │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ reaction_type │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ reaction_time │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ prev_reward │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ active_trials │ BOOLEAN │ YES │ NULL │ NULL │ NULL │
│ session_id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 18 rows 6 columns │
└──────────────────────────────────────────────────────────────────────┘
Example: Join the sessions and cells tables on the session ids, and show the resulting column names:
query = """
DESCRIBE
SELECT
*
FROM "**/session.json" sessions
JOIN "**/cells.parquet" cells
ON sessions.id = cells.session_id
"""
sql(query)
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ session_date │ DATE │ YES │ NULL │ NULL │ NULL │
│ mouse │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ stim_onset │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ bin_size │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ ccf_ap │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ ccf_dv │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ ccf_lr │ DOUBLE │ YES │ NULL │ NULL │ NULL │
│ brain_area │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ brain_groups │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ trough_to_peak │ TINYINT │ YES │ NULL │ NULL │ NULL │
│ session_id │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ cell │ INTEGER │ YES │ NULL │ NULL │ NULL │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 13 rows 6 columns │
└──────────────────────────────────────────────────────────────────────┘
Example: Join the sessions, trials, and cells tables on the session ids, and show the resulting column names:
query = """
DESCRIBE
SELECT
*
FROM "**/session.json" sessions
JOIN "**/cells.parquet" cells
ON sessions.id = cells.session_id
JOIN "**/trials.csv" trials
ON sessions.id = trials.session_id
"""
sql(query).to_df()
| | column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|---|
| 0 | session_date | DATE | YES | None | None | None |
| 1 | mouse | VARCHAR | YES | None | None | None |
| 2 | stim_onset | DOUBLE | YES | None | None | None |
| 3 | bin_size | DOUBLE | YES | None | None | None |
| 4 | id | VARCHAR | YES | None | None | None |
| 5 | ccf_ap | DOUBLE | YES | None | None | None |
| 6 | ccf_dv | DOUBLE | YES | None | None | None |
| 7 | ccf_lr | DOUBLE | YES | None | None | None |
| 8 | brain_area | VARCHAR | YES | None | None | None |
| 9 | brain_groups | VARCHAR | YES | None | None | None |
| 10 | trough_to_peak | TINYINT | YES | None | None | None |
| 11 | session_id | VARCHAR | YES | None | None | None |
| 12 | cell | INTEGER | YES | None | None | None |
| 13 | contrast_left | BIGINT | YES | None | None | None |
| 14 | contrast_right | BIGINT | YES | None | None | None |
| 15 | gocue | DOUBLE | YES | None | None | None |
| 16 | stim_onset | DOUBLE | YES | None | None | None |
| 17 | feedback_type | DOUBLE | YES | None | None | None |
| 18 | feedback_time | DOUBLE | YES | None | None | None |
| 19 | response_type | DOUBLE | YES | None | None | None |
| 20 | response_time | DOUBLE | YES | None | None | None |
| 21 | reaction_type | DOUBLE | YES | None | None | None |
| 22 | reaction_time | DOUBLE | YES | None | None | None |
| 23 | prev_reward | DOUBLE | YES | None | None | None |
| 24 | active_trials | BOOLEAN | YES | None | None | None |
| 25 | session_id | VARCHAR | YES | None | None | None |
Section 5: (Demos) Pivoting Tables and Nesting SQL Statements
Analysis pipelines can get quite complex; there's a lot that we want from our data! Below are some examples of analyses that use nesting, pivoting, and exporting to Pandas to break the analysis down into multiple steps.
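Before the full demos, here is a minimal sketch of DuckDB’s PIVOT statement on a made-up inline table, just to show the basic shape: a long-format table of counts is spread into one column per brain group (all names and numbers below are invented for illustration):
duckdb.sql("""
    PIVOT (
        SELECT * FROM (VALUES
            ('Muller', 'thalamus', 12),
            ('Muller', 'midbrain', 7),
            ('Radnitz', 'thalamus', 5)
        ) AS t(mouse, brain_group, num_cells)
    )
    ON brain_group              -- the unique values here become new columns
    USING sum(num_cells)        -- how to fill each new column
    GROUP BY mouse              -- one output row per mouse
""").show()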
duckdb.sql(
"""
SELECT
contrast_right,
"0", "25", "50", "100"
FROM (
PIVOT (
SELECT
contrast_left,
contrast_right,
round(avg(response_type), 2) AS mean_response_type
FROM "**/trials.csv"
WHERE active_trials = true AND response_type <> 0
GROUP BY contrast_left, contrast_right
ORDER BY contrast_left, contrast_right
)
ON contrast_left
USING first(mean_response_type)
GROUP BY contrast_right
)
"""
).to_df().set_index('contrast_right').style.format("{:.2f}")
result = duckdb.sql(
"""
PIVOT (
SELECT
mouse,
session_id,
session_date,
brain_groups,
count(cell) as num_cells,
FROM "**/cells.parquet" cells
INNER JOIN "**/session.json" sessions ON cells.session_id = sessions.id
GROUP BY ALL
)
ON brain_groups
USING sum(num_cells)
GROUP BY mouse
"""
).to_df()
result.set_index('mouse').convert_dtypes().style.format()
| mouse | TT | basal ganglia | cortical subplate | hippocampus | midbrain | non-visual cortex | root | thalamus | visual cortex |
|---|---|---|---|---|---|---|---|---|---|
| Theiler | 209 | 497 | 442 | 59 | 141 | ||||
| Forssmann | 494 | 52 | 661 | 1068 | 1241 | 864 | 219 | ||
| Radnitz | 34 | 430 | 111 | 1017 | 1229 | 348 | 204 | 430 | |
| Moniz | 297 | 185 | 377 | 44 | 801 | 379 | |||
| Hench | 137 | 450 | 506 | 389 | 1019 | 60 | 819 | 593 | |
| Muller | 10 | 92 | 459 | 714 | 408 | 583 | 366 | 619 | |
| Lederberg | 681 | 173 | 429 | 787 | 951 | 816 | 1250 | 221 | |
| Richards | 562 | 195 | 235 | 344 | 1387 | 136 | 957 | 19 | |
| Tatum | 351 | 185 | 329 | 703 | 757 | 433 | 372 | 205 | |
| None | 220 | 78 | 100 | 155 | 145 |