SQL Queries over Data Files With DuckDB

Author
Dr. Nicholas Del Grosso

SQL, which stands for Structured Query Language, is a standard language used to communicate with databases. It is widely used for managing and manipulating data in databases, and is particularly good for tabular-formatted data. SQL allows you to perform tasks such as retrieving specific data from large databases, updating data, creating new tables or databases, and setting permissions on data access. It is a fundamental tool in many fields that require data management, including research, business, and software development.

DuckDB is a database management system designed to be easy to use and efficient. It is useful for anyone who needs to work with large datasets and wants results quickly, and it has bindings in a wide variety of languages (including Python and a command-line interface). Whether you are working on research data or any other large dataset, DuckDB can be a helpful tool for managing and analyzing your data. Some other properties:

  • DuckDB is an “embedded” DBMS, like SQLite, meaning that it doesn’t need a server process to run.
  • DuckDB is a “column-store” DBMS, which makes it very fast at analytical workloads such as loading a few large columns at a time. The tradeoff is that it is slower at writing transactional data, so it’s not meant for writing large amounts of real-time data at high speeds (for that, SQLite, MySQL, and PostgreSQL are popular alternatives).
  • DuckDB is unique in that it can be used directly on common tabular file types, without necessarily building a special database file and defining all the tables first.
  • DuckDB has integrations with Pandas, Polars, and PyArrow, and can both run queries on those Python variables without making any copies (high-speed!), and can convert its tables directly to those types, making it easy to integrate into most analysis pipelines.

Setup

Data Download

from tqdm import tqdm
from pathlib import Path
from webdav4.fsspec import WebdavFileSystem

# https://uni-bonn.sciebo.de/s/oZql1bk0p1AvK0w
fs = WebdavFileSystem("https://uni-bonn.sciebo.de/public.php/webdav", auth=("oZql1bk0p1AvK0w", ""))
for name in tqdm(fs.ls("/", detail=False), desc="Downloading Data to data/steinmetz"):
    if not Path(f"data/steinmetz/{name}").exists():
        fs.download(name, f"data/steinmetz/{name}", recursive=True)
Downloading Data to data/steinmetz: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:00<00:00, 15015.89it/s]

Import Libraries

import numpy
import pandas as pd
from duckdb import sql
import duckdb
import seaborn as sns

Section 1: Loading Data from DataFrames with DuckDB: the SELECT * FROM Statement with Pandas

Code Description
duckdb.sql('SELECT * FROM df') Load all the data from the df variable into DuckDB.
duckdb.sql('FROM df') A shortcut: the same as “SELECT * FROM df”.
duckdb.sql('SELECT colA, colB FROM df') Load only columns “colA” and “colB” from df.
duckdb.sql('DESCRIBE SELECT * FROM df') Show information about each column in df.
duckdb.sql('DESCRIBE FROM df') Show information about each column in df.
duckdb.sql('SELECT ColA, ColB FROM df ORDER BY ColA') Order the rows by the values in ColA.
duckdb.sql('SELECT ColA as ColumnA, * FROM df ORDER BY ColA') Rename ColA to ColumnA and select the rest of the columns as well.
duckdb.sql('FROM df LIMIT 3') Show only the first 3 rows (good for quickly previewing data).
duckdb.sql(...).to_df() Convert the result back to a Pandas DataFrame.

Exercises

mri_sessions = pd.DataFrame({
    "SessionID": [101, 102, 103, 104, 105],
    "PatientID": ["P001", "P002", "P001", "P003", "P002"],
    "Date": ["2024-02-01", "2024-02-03", "2024-02-05", "2024-02-07", "2024-02-09"],
    "Duration": [30, 45, 30, 60, 45]
})

ca2_sessions = pd.DataFrame({
    "ExperimentID": [301, 302, 303, 304, 305],
    "AnimalModel": ["Mouse", "Rat", "Zebrafish", "Mouse", "Rat"],
    "SessionDate": ["2024-04-01", "2024-04-05", "2024-04-09", "2024-04-13", "2024-04-17"],
    "DyeUsed": ["OGB-1", "Fluo-4", "GCaMP6", "OGB-1", "Fluo-4"],
    "ImagingTechnique": ["Two-photon", "Confocal", "Light-sheet", "Two-photon", "Confocal"],
    "NumCells": [200, 150, 300, 250, 180]
})

Example: Get all columns of the mri session dataframe

sql("FROM mri_sessions")
┌───────────┬───────────┬────────────┬──────────┐
│ SessionID │ PatientID │    Date    │ Duration │
│   int64   │  varchar  │  varchar   │  int64   │
├───────────┼───────────┼────────────┼──────────┤
│       101 │ P001      │ 2024-02-01 │       30 │
│       102 │ P002      │ 2024-02-03 │       45 │
│       103 │ P001      │ 2024-02-05 │       30 │
│       104 │ P003      │ 2024-02-07 │       60 │
│       105 │ P002      │ 2024-02-09 │       45 │
└───────────┴───────────┴────────────┴──────────┘

Exercise: Get all columns of the ca2 sessions.

Solution
sql("FROM ca2_sessions")
┌──────────────┬─────────────┬─────────────┬─────────┬──────────────────┬──────────┐
│ ExperimentID │ AnimalModel │ SessionDate │ DyeUsed │ ImagingTechnique │ NumCells │
│    int64     │   varchar   │   varchar   │ varchar │     varchar      │  int64   │
├──────────────┼─────────────┼─────────────┼─────────┼──────────────────┼──────────┤
│          301 │ Mouse       │ 2024-04-01  │ OGB-1   │ Two-photon       │      200 │
│          302 │ Rat         │ 2024-04-05  │ Fluo-4  │ Confocal         │      150 │
│          303 │ Zebrafish   │ 2024-04-09  │ GCaMP6  │ Light-sheet      │      300 │
│          304 │ Mouse       │ 2024-04-13  │ OGB-1   │ Two-photon       │      250 │
│          305 │ Rat         │ 2024-04-17  │ Fluo-4  │ Confocal         │      180 │
└──────────────┴─────────────┴─────────────┴─────────┴──────────────────┴──────────┘

Exercise: Select only the Patient ID and Date columns from the MRI data

Solution
sql("Select PatientID, Date FROM mri_sessions")
┌───────────┬────────────┐
│ PatientID │    Date    │
│  varchar  │  varchar   │
├───────────┼────────────┤
│ P001      │ 2024-02-01 │
│ P002      │ 2024-02-03 │
│ P001      │ 2024-02-05 │
│ P003      │ 2024-02-07 │
│ P002      │ 2024-02-09 │
└───────────┴────────────┘

Exercise: Select only the Experiment ID and Animal Model columns from the calcium imaging data

Solution
sql("SELECT ExperimentID, AnimalModel from ca2_sessions")
┌──────────────┬─────────────┐
│ ExperimentID │ AnimalModel │
│    int64     │   varchar   │
├──────────────┼─────────────┤
│          301 │ Mouse       │
│          302 │ Rat         │
│          303 │ Zebrafish   │
│          304 │ Mouse       │
│          305 │ Rat         │
└──────────────┴─────────────┘

Exercise: Reorder the CA2 data by the number of cells counted, and convert the result back to a Pandas DataFrame:

Solution
sql("FROM ca2_sessions ORDER BY NumCells").to_df()

ExperimentID AnimalModel SessionDate DyeUsed ImagingTechnique NumCells
0 302 Rat 2024-04-05 Fluo-4 Confocal 150
1 305 Rat 2024-04-17 Fluo-4 Confocal 180
2 301 Mouse 2024-04-01 OGB-1 Two-photon 200
3 304 Mouse 2024-04-13 OGB-1 Two-photon 250
4 303 Zebrafish 2024-04-09 GCaMP6 Light-sheet 300

Exercise: In the Ca2 data, rename the SessionDate column to be just “Date”.

Solution
sql("SELECT SessionDate as Date, * FROM ca2_sessions")
┌────────────┬──────────────┬─────────────┬─────────────┬─────────┬──────────────────┬──────────┐
│    Date    │ ExperimentID │ AnimalModel │ SessionDate │ DyeUsed │ ImagingTechnique │ NumCells │
│  varchar   │    int64     │   varchar   │   varchar   │ varchar │     varchar      │  int64   │
├────────────┼──────────────┼─────────────┼─────────────┼─────────┼──────────────────┼──────────┤
│ 2024-04-01 │          301 │ Mouse       │ 2024-04-01  │ OGB-1   │ Two-photon       │      200 │
│ 2024-04-05 │          302 │ Rat         │ 2024-04-05  │ Fluo-4  │ Confocal         │      150 │
│ 2024-04-09 │          303 │ Zebrafish   │ 2024-04-09  │ GCaMP6  │ Light-sheet      │      300 │
│ 2024-04-13 │          304 │ Mouse       │ 2024-04-13  │ OGB-1   │ Two-photon       │      250 │
│ 2024-04-17 │          305 │ Rat         │ 2024-04-17  │ Fluo-4  │ Confocal         │      180 │
└────────────┴──────────────┴─────────────┴─────────────┴─────────┴──────────────────┴──────────┘

Exercise: In the MRI data, load just the SessionID (renamed to be just ID) and (PatientID renamed to be Patient) columns. Sort the data by the patient ID.

Solution
sql("SELECT SessionID as ID, PatientID as Patient FROM mri_sessions ORDER BY Patient")
┌───────┬─────────┐
│  ID   │ Patient │
│ int64 │ varchar │
├───────┼─────────┤
│   101 │ P001    │
│   103 │ P001    │
│   102 │ P002    │
│   105 │ P002    │
│   104 │ P003    │
└───────┴─────────┘

Section 2: Loading Data with DuckDB: the SELECT * FROM Statement with Files

Now, let’s build on these queries, using DuckDB’s ability to work directly on data files.

Code Description
duckdb.sql('SELECT * FROM "path/to/file.json"') Load the JSON file into a table.
duckdb.sql('SELECT * FROM "path/to/file.csv"') Load the CSV file into a table.
duckdb.sql('SELECT * FROM "path/to/file.parquet"') Load the PARQUET file into a table.
duckdb.sql('SELECT * FROM "path/*/file.parquet"') Load all the “file.parquet” files into a table.
duckdb.sql('SELECT * FROM "**/*.parquet"') Load all the parquet files into a table, no matter what subfolder they are in.
duckdb.sql('SELECT * FROM read_json("path/to/file.json", filename=true)') Load the “file.json” file into a table using the read_json() function directly.

For the next sections, we’ll be exploring some data from a Steinmetz et al. Neuropixels experiment, processed here into JSON files for our tabular analysis, along with some other familiar file types.

Exercises

Example: Use duckdb to read the session.json file from the session recorded on 2017-10-11. What metadata fields were recorded on that day?

sql('FROM "data/steinmetz/*_2017-10-11_*/session.json"')
┌──────────────┬─────────┬────────────┬──────────┬─────────┐
│ session_date │  mouse  │ stim_onset │ bin_size │   id    │
│     date     │ varchar │   double   │  double  │ varchar │
├──────────────┼─────────┼────────────┼──────────┼─────────┤
│ 2017-10-11   │ Theiler │        0.5 │     0.01 │ ab16    │
└──────────────┴─────────┴────────────┴──────────┴─────────┘

Exercise: Use duckdb to read the session.json file from the session recorded on 2017-11-02. What metadata fields were recorded on that day?

Solution
sql('FROM "data/steinmetz/steinmetz_2017-11-02_Forssmann/session.json"')
┌──────────────┬───────────┬────────────┬──────────┬─────────┐
│ session_date │   mouse   │ stim_onset │ bin_size │   id    │
│     date     │  varchar  │   double   │  double  │ varchar │
├──────────────┼───────────┼────────────┼──────────┼─────────┤
│ 2017-11-02   │ Forssmann │        0.5 │     0.01 │ dda4    │
└──────────────┴───────────┴────────────┴──────────┴─────────┘

Exercise: Use duckdb to read the ’trials.csv’ file recorded on 2017-11-02. What trial variables were recorded for that session? (Note: if you cannot see all the columns, either add "DESCRIBE" to the front of the SQL statement, or you can convert the output to a Pandas DataFrame with .to_df())

Solution
sql('FROM "data/steinmetz/steinmetz_2017-11-02_Forssmann/trials.csv"').to_df()

contrast_left contrast_right gocue stim_onset feedback_type feedback_time response_type response_time reaction_type reaction_time prev_reward active_trials session_id
0 0 100 0.582163 0.5 1.0 0.827766 -1.0 0.784573 -1.0 290.0 -10.000000 True dda4
1 0 0 0.484221 0.5 1.0 2.320245 0.0 2.288648 0.0 inf -5.539448 True dda4
2 25 25 0.688418 0.5 -1.0 1.358262 -1.0 1.358262 -1.0 1230.0 -4.196441 True dda4
3 0 0 0.628084 0.5 -1.0 0.862720 1.0 0.862720 1.0 720.0 -2.937326 True dda4
4 0 0 0.748259 0.5 1.0 2.587083 0.0 2.554045 0.0 inf -2.342347 True dda4
... ... ... ... ... ... ... ... ... ... ... ... ... ...
359 0 50 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
360 0 100 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
361 0 50 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
362 0 100 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
363 0 100 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4

364 rows × 13 columns

Exercise: Use duckdb to read all of the session.json files in the dataset.

Solution
duckdb.sql('FROM "data/steinmetz/**/session.json"')
┌──────────────┬───────────┬────────────┬──────────┬─────────┐
│ session_date │   mouse   │ stim_onset │ bin_size │   id    │
│     date     │  varchar  │   double   │  double  │ varchar │
├──────────────┼───────────┼────────────┼──────────┼─────────┤
│ 2017-01-07   │ Muller    │        0.5 │     0.01 │ b5b6    │
│ 2017-01-08   │ Muller    │        0.5 │     0.01 │ 49bb    │
│ 2017-01-08   │ Radnitz   │        0.5 │     0.01 │ 769e    │
│ 2017-01-09   │ Muller    │        0.5 │     0.01 │ 31dc    │
│ 2017-01-09   │ Radnitz   │        0.5 │     0.01 │ 99f4    │
│ 2017-01-10   │ Radnitz   │        0.5 │     0.01 │ a400    │
│ 2017-01-11   │ Radnitz   │        0.5 │     0.01 │ 6207    │
│ 2017-01-12   │ Radnitz   │        0.5 │     0.01 │ 3e6c    │
│ 2017-05-15   │ Moniz     │        0.5 │     0.01 │ 40f7    │
│ 2017-05-16   │ Moniz     │        0.5 │     0.01 │ 2474    │
│     ·        │   ·       │         ·  │       ·  │  ·      │
│     ·        │   ·       │         ·  │       ·  │  ·      │
│     ·        │   ·       │         ·  │       ·  │  ·      │
│ 2017-12-06   │ Lederberg │        0.5 │     0.01 │ 1e77    │
│ 2017-12-06   │ Tatum     │        0.5 │     0.01 │ e7bd    │
│ 2017-12-07   │ Lederberg │        0.5 │     0.01 │ e119    │
│ 2017-12-07   │ Tatum     │        0.5 │     0.01 │ 2edb    │
│ 2017-12-08   │ Lederberg │        0.5 │     0.01 │ 0080    │
│ 2017-12-08   │ Tatum     │        0.5 │     0.01 │ 403f    │
│ 2017-12-09   │ Lederberg │        0.5 │     0.01 │ 4363    │
│ 2017-12-09   │ Tatum     │        0.5 │     0.01 │ 2115    │
│ 2017-12-10   │ Lederberg │        0.5 │     0.01 │ df5e    │
│ 2017-12-11   │ Lederberg │        0.5 │     0.01 │ ecb9    │
├──────────────┴───────────┴────────────┴──────────┴─────────┤
│ 37 rows (20 shown)                               5 columns │
└────────────────────────────────────────────────────────────┘

Exercise: Use DuckDB to read all the trials.csv files in the dataset.

Solution
duckdb.sql('DESCRIBE FROM "data/steinmetz/*/trials.csv"')
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ contrast_left  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ contrast_right │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ gocue          │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ stim_onset     │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ feedback_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ feedback_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ response_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ response_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ reaction_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ reaction_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ prev_reward    │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ active_trials  │ BOOLEAN     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 13 rows                                                    6 columns │
└──────────────────────────────────────────────────────────────────────┘

Exercise: Use DuckDB to get all the cells that were recorded from in the dataset.

Solution
duckdb.sql('DESCRIBE FROM "data/steinmetz/*/cells.parquet"')
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ ccf_ap         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_dv         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_lr         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ brain_area     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ brain_groups   │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ trough_to_peak │ TINYINT     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ cell           │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
└────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Exercise: What variables were stored when recording lick behaviors (Tip: Use DESCRIBE at the front of the query)?

Solution
duckdb.sql('DESCRIBE FROM "data/steinmetz/*/licks.parquet"')
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ licks       │ TINYINT     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id  │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ trial       │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
│ time        │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Section 3: Getting Unique Values with SELECT DISTINCT

Often, we want to summarize data by finding out what unique values there are in a dataset, in order to plan out an analysis. In SQL, the SELECT DISTINCT statement helps with this, and can even show distinct combinations of column values.

Code Description
SELECT * FROM my_table Select all the columns.
SELECT colA, colB FROM my_table Select just colA and colB
SELECT DISTINCT colA FROM my_table Show just the unique values of colA
SELECT DISTINCT colA, colB FROM my_table Show just the unique combinations of values between colA and colB
SELECT DISTINCT colA, colB FROM my_table ORDER BY colA …and order by colA
SELECT DISTINCT colA, colB FROM my_table ORDER BY colA, colB …and order by colA, then colB
SELECT DISTINCT colA, colB FROM my_table ORDER BY * Order the data by whatever columns are selected, in the order they appear in the table.

Exercises

Example: What mice were in the study? Show just the unique values.

sql(
"""
SELECT DISTINCT 
    mouse,
FROM 
    "data/steinmetz/*/session.json"
""")
┌───────────┐
│   mouse   │
│  varchar  │
├───────────┤
│ Forssmann │
│ Richards  │
│ Lederberg │
│ NULL      │
│ Tatum     │
│ Theiler   │
│ Moniz     │
│ Radnitz   │
│ Muller    │
│ Hench     │
├───────────┤
│  10 rows  │
└───────────┘

Exercise: What unique (“distinct”) contrast levels of the left stimulus were there in this experiment? To make it easier to read, order the rows in the resulting table.

Solution
sql(
"""
SELECT DISTINCT
    contrast_left
FROM 
    "data/steinmetz/*/trials.csv"
ORDER BY *
""")
┌───────────────┐
│ contrast_left │
│     int64     │
├───────────────┤
│             0 │
│            25 │
│            50 │
│           100 │
└───────────────┘

Exercise: What unique (“distinct”) combinations of contrast levels between the left and right stimulus were there in this experiment? To make it easier to read, order the rows in the resulting table.

Solution
sql(
"""
SELECT DISTINCT
    contrast_left,
    contrast_right
FROM 
    "data/steinmetz/*/trials.csv"
ORDER BY contrast_left, contrast_right
""")
┌───────────────┬────────────────┐
│ contrast_left │ contrast_right │
│     int64     │     int64      │
├───────────────┼────────────────┤
│             0 │              0 │
│             0 │             25 │
│             0 │             50 │
│             0 │            100 │
│            25 │              0 │
│            25 │             25 │
│            25 │             50 │
│            25 │            100 │
│            50 │              0 │
│            50 │             25 │
│            50 │             50 │
│            50 │            100 │
│           100 │              0 │
│           100 │             25 │
│           100 │             50 │
│           100 │            100 │
├───────────────┴────────────────┤
│ 16 rows              2 columns │
└────────────────────────────────┘

Exercise: What were the different stimulus onset time settings used in this experiment?

Solution
sql(
"""
SELECT DISTINCT
    stim_onset
FROM 
    "data/steinmetz/*/trials.csv"
""")
┌────────────┐
│ stim_onset │
│   double   │
├────────────┤
│        0.5 │
└────────────┘

Exercise: Which general areas of the brain (let’s use “brain_groups” here) were the cells in this study recorded from?

Solution
sql(
"""
SELECT DISTINCT
    brain_groups
FROM 
    "data/steinmetz/*/cells.parquet"
ORDER BY brain_groups
""")
┌───────────────────┐
│   brain_groups    │
│      varchar      │
├───────────────────┤
│ TT                │
│ basal ganglia     │
│ cortical subplate │
│ hippocampus       │
│ midbrain          │
│ non-visual cortex │
│ root              │
│ thalamus          │
│ visual cortex     │
└───────────────────┘

Exercise: Which brain groups were associated with which brain areas?

Solution
sql(
"""
SELECT DISTINCT
    brain_groups,
    brain_area
FROM 
    "data/steinmetz/*/cells.parquet"
ORDER BY brain_groups
""")
┌───────────────┬────────────┐
│ brain_groups  │ brain_area │
│    varchar    │  varchar   │
├───────────────┼────────────┤
│ TT            │ TT         │
│ basal ganglia │ OT         │
│ basal ganglia │ SNr        │
│ basal ganglia │ MS         │
│ basal ganglia │ ACB        │
│ basal ganglia │ LSr        │
│ basal ganglia │ LSc        │
│ basal ganglia │ CP         │
│ basal ganglia │ GPe        │
│ basal ganglia │ LS         │
│    ·          │ ·          │
│    ·          │ ·          │
│    ·          │ ·          │
│ thalamus      │ LP         │
│ thalamus      │ SPF        │
│ thalamus      │ RT         │
│ thalamus      │ VAL        │
│ visual cortex │ VISrl      │
│ visual cortex │ VISa       │
│ visual cortex │ VISl       │
│ visual cortex │ VISp       │
│ visual cortex │ VISam      │
│ visual cortex │ VISpm      │
├───────────────┴────────────┤
│     72 rows (20 shown)     │
└────────────────────────────┘

Section 4: Filtering Data with WHERE

Let’s Filter the data! The WHERE clause in SQL helps to only load up the rows that meet a specified condition:

Code Description
FROM my_table WHERE colA > 5 Only take the rows where colA is greater than 5.
FROM my_table WHERE colA = 'adult' Only take the rows where colA is “adult”.
FROM my_table WHERE colA IS NOT NULL Only take the non-missing rows of colA
FROM my_table WHERE colA LIKE 'C%' Only take the rows of ColA where the text starts with a “C”.

More filtering expressions can be found in the DuckDB documentation (they won’t be used in these exercises; just for reference).

Exercises

Example: Which brain areas of the brain group “thalamus” did we record cells from in this experiment? (Tip: use single quotes ' to reference text.)

sql(
"""
SELECT DISTINCT
    brain_groups,
    brain_area
FROM 
    'data/steinmetz/*/cells.parquet'
WHERE brain_groups = 'thalamus'
ORDER BY brain_groups
""")
┌──────────────┬────────────┐
│ brain_groups │ brain_area │
│   varchar    │  varchar   │
├──────────────┼────────────┤
│ thalamus     │ MG         │
│ thalamus     │ PO         │
│ thalamus     │ LGd        │
│ thalamus     │ POL        │
│ thalamus     │ VPM        │
│ thalamus     │ VPL        │
│ thalamus     │ CL         │
│ thalamus     │ TH         │
│ thalamus     │ MD         │
│ thalamus     │ LH         │
│ thalamus     │ LD         │
│ thalamus     │ LP         │
│ thalamus     │ VAL        │
│ thalamus     │ RT         │
│ thalamus     │ PT         │
│ thalamus     │ SPF        │
├──────────────┴────────────┤
│ 16 rows         2 columns │
└───────────────────────────┘

Exercise: Which brain areas recorded in this experiment start with the capital letter V?

Solution
sql(
"""
SELECT DISTINCT
    brain_area
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_area LIKE 'V%'
""")
┌────────────┐
│ brain_area │
│  varchar   │
├────────────┤
│ VISam      │
│ VISrl      │
│ VISl       │
│ VPL        │
│ VPM        │
│ VISa       │
│ VISpm      │
│ VISp       │
│ VAL        │
└────────────┘

Exercise: Which brain groups recorded in this experiment end with the word "cortex"?

Solution
sql(
"""
SELECT DISTINCT
    brain_groups
FROM '**/cells.parquet'
WHERE brain_groups LIKE '%cortex'
"""    
)
┌───────────────────┐
│   brain_groups    │
│      varchar      │
├───────────────────┤
│ visual cortex     │
│ non-visual cortex │
└───────────────────┘

Exercise: Which sessions had cell recordings in the visual cortex?

Solution
sql(
"""
SELECT DISTINCT
    session_id
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'visual cortex'
""")
┌────────────┐
│ session_id │
│  varchar   │
├────────────┤
│ 8125       │
│ dda4       │
│ 403f       │
│ 49bb       │
│ 40f7       │
│ 8cbd       │
│ 2474       │
│ 99f4       │
│ e7bd       │
│ 6123       │
│ 1e77       │
│ e119       │
│ 2edb       │
│ 6207       │
│ a400       │
│ 77a8       │
│ 769e       │
│ ab16       │
│ b5b6       │
│ bba2       │
│ 03ce       │
│ 31dc       │
│ 3023       │
├────────────┤
│  23 rows   │
└────────────┘

Exercise: Which sessions had recordings in the hippocampus?

Solution
sql(
"""
SELECT DISTINCT
    session_id
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'hippocampus'
""")
┌────────────┐
│ session_id │
│  varchar   │
├────────────┤
│ 0080       │
│ ab16       │
│ 77a8       │
│ 4363       │
│ bba2       │
│ 03ce       │
│ 0834       │
│ 2115       │
│ 31dc       │
│ 40f7       │
│  ·         │
│  ·         │
│  ·         │
│ 62c0       │
│ 3023       │
│ 8125       │
│ 93c7       │
│ 49bb       │
│ dda4       │
│ e119       │
│ 6123       │
│ 2edb       │
│ 1e77       │
├────────────┤
│  27 rows   │
│ (20 shown) │
└────────────┘

Section 5: (Demo) Plotting Data: Query, then Plot!

Most of the time, you don’t need any complex queries; the advantage of querying in a language like Python is that you get to use all the other Python tools!

Exercises

Example: Making a Statistical Plot of Trial Performance

df = duckdb.sql('FROM "**/trials.csv" WHERE active_trials = true AND response_type <> 0').to_df()
sns.catplot(data=df, x='contrast_left', hue='contrast_right', y='response_type', kind='point');

Example: Rendering a 3D point cloud of cell positions in the Allen Brain Atlas.

# %pip install plotly nbformat bg-atlasapi
import plotly.express as px
import plotly.graph_objects as go
from bg_atlasapi.bg_atlas import BrainGlobeAtlas
import duckdb
df = duckdb.sql("FROM '**/cells.parquet'").to_df()
atlas_mesh = BrainGlobeAtlas("allen_mouse_25um").mesh_from_structure('root')


fig1 = px.scatter_3d(x=atlas_mesh.points[:, 0], y=atlas_mesh.points[:, 1], z=atlas_mesh.points[:, 2], opacity=0.1).update_traces(marker=dict(size=.6))
fig2 = px.scatter_3d(x=df.ccf_ap, y=df.ccf_dv, z=df.ccf_lr, color=df.brain_groups, width=700, title="Cell CCF Coordinates").update_traces(marker=dict(size=2))
go.Figure(data=fig1.data + fig2.data, layout=fig1.layout).update_layout(scene=dict(xaxis={'visible': False}, yaxis={'visible': False}, zaxis={'visible': False}), width=700, scene_camera=dict(eye=dict(x=.75, y=.75, z=.75)))