SQL Queries over Data Files With DuckDB

Author
Dr. Nicholas Del Grosso

SQL, which stands for Structured Query Language, is a standard language used to communicate with databases. It is widely used for managing and manipulating data in databases, and is particularly good for tabular-formatted data. SQL allows you to perform tasks such as retrieving specific data from large databases, updating data, creating new tables or databases, and setting permissions on data access. It is a fundamental tool in many fields that require data management, including research, business, and software development.

DuckDB is a database management system designed to be easy to use and efficient. It is useful for anyone who needs to work with large datasets and wants results quickly, and it has bindings in a wide variety of languages (including Python and a command-line interface). Whether you are working on research data or any other large dataset, DuckDB can be a helpful tool for managing and analyzing your data. Some other properties:

  • DuckDB is an “embedded” DBMS, like SQLite, meaning that it doesn’t need a server process to run.
  • DuckDB is a “column-store” DBMS, which makes it very fast at analytical workloads such as loading a few large columns at a time. The tradeoff is that it is slower at writing transactional data, so it’s not meant for writing large amounts of real-time data at high speeds (for that, SQLite, MySQL, and PostgreSQL are popular alternatives).
  • DuckDB is unique in that it can be used directly on common tabular file types, without necessarily building a special database file and defining all the tables first.
  • DuckDB has integrations with Pandas, Polars, and PyArrow, and can both run queries on those Python variables without making any copies (high-speed!), and can convert its tables directly to those types, making it easy to integrate into most analysis pipelines.

Setup

Data Download

from tqdm import tqdm
from pathlib import Path
from webdav4.fsspec import WebdavFileSystem

# https://uni-bonn.sciebo.de/s/oZql1bk0p1AvK0w
fs = WebdavFileSystem("https://uni-bonn.sciebo.de/public.php/webdav", auth=("oZql1bk0p1AvK0w", ""))
for name in tqdm(fs.ls("/", detail=False), desc="Downloading Data to data/steinmetz"):
    if not Path(f"data/steinmetz/{name}").exists():
        fs.download(name, f"data/steinmetz/{name}", recursive=True)
Downloading Data to data/steinmetz: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 37/37 [00:00<00:00, 15015.89it/s]

Import Libraries

import numpy
import pandas as pd
from duckdb import sql
import duckdb
import seaborn as sns

Section 1: Loading Data from DataFrames with DuckDB: the SELECT * FROM Statement with Pandas

Code Description
duckdb.sql('SELECT * FROM df') Load all the data from the df variable into DuckDB.
duckdb.sql('FROM df') A shortcut: the same as “SELECT * FROM df”.
duckdb.sql('SELECT colA, colB FROM df') Load only columns “colA” and “colB” from df.
duckdb.sql('DESCRIBE SELECT * FROM df') Show information about each column in df.
duckdb.sql('DESCRIBE FROM df') Show information about each column in df.
duckdb.sql('SELECT ColA, ColB FROM df ORDER BY ColA') Order the rows by the values in ColA.
duckdb.sql('SELECT ColA as ColumnA, * FROM df ORDER BY ColA') Rename ColA to ColumnA and select the rest of the columns as well.
duckdb.sql('FROM df LIMIT 3') Show only the first 3 rows (good for quickly previewing data).
duckdb.sql(...).to_df() Convert the result back to a Pandas DataFrame.

Exercises

mri_sessions = pd.DataFrame({
    "SessionID": [101, 102, 103, 104, 105],
    "PatientID": ["P001", "P002", "P001", "P003", "P002"],
    "Date": ["2024-02-01", "2024-02-03", "2024-02-05", "2024-02-07", "2024-02-09"],
    "Duration": [30, 45, 30, 60, 45]
})

ca2_sessions = pd.DataFrame({
    "ExperimentID": [301, 302, 303, 304, 305],
    "AnimalModel": ["Mouse", "Rat", "Zebrafish", "Mouse", "Rat"],
    "SessionDate": ["2024-04-01", "2024-04-05", "2024-04-09", "2024-04-13", "2024-04-17"],
    "DyeUsed": ["OGB-1", "Fluo-4", "GCaMP6", "OGB-1", "Fluo-4"],
    "ImagingTechnique": ["Two-photon", "Confocal", "Light-sheet", "Two-photon", "Confocal"],
    "NumCells": [200, 150, 300, 250, 180]
})

Example: Get all columns of the mri session dataframe

sql("FROM mri_sessions")
┌───────────┬───────────┬────────────┬──────────┐
│ SessionID │ PatientID │    Date    │ Duration │
│   int64   │  varchar  │  varchar   │  int64   │
├───────────┼───────────┼────────────┼──────────┤
│       101 │ P001      │ 2024-02-01 │       30 │
│       102 │ P002      │ 2024-02-03 │       45 │
│       103 │ P001      │ 2024-02-05 │       30 │
│       104 │ P003      │ 2024-02-07 │       60 │
│       105 │ P002      │ 2024-02-09 │       45 │
└───────────┴───────────┴────────────┴──────────┘

Exercise: Get all columns of the ca2 sessions.

Solution
sql("FROM ca2_sessions")
┌──────────────┬─────────────┬─────────────┬─────────┬──────────────────┬──────────┐
│ ExperimentID │ AnimalModel │ SessionDate │ DyeUsed │ ImagingTechnique │ NumCells │
│    int64     │   varchar   │   varchar   │ varchar │     varchar      │  int64   │
├──────────────┼─────────────┼─────────────┼─────────┼──────────────────┼──────────┤
│          301 │ Mouse       │ 2024-04-01  │ OGB-1   │ Two-photon       │      200 │
│          302 │ Rat         │ 2024-04-05  │ Fluo-4  │ Confocal         │      150 │
│          303 │ Zebrafish   │ 2024-04-09  │ GCaMP6  │ Light-sheet      │      300 │
│          304 │ Mouse       │ 2024-04-13  │ OGB-1   │ Two-photon       │      250 │
│          305 │ Rat         │ 2024-04-17  │ Fluo-4  │ Confocal         │      180 │
└──────────────┴─────────────┴─────────────┴─────────┴──────────────────┴──────────┘

Exercise: Select only the Patient ID and Date columns from the MRI data

Solution
sql("Select PatientID, Date FROM mri_sessions")
┌───────────┬────────────┐
│ PatientID │    Date    │
│  varchar  │  varchar   │
├───────────┼────────────┤
│ P001      │ 2024-02-01 │
│ P002      │ 2024-02-03 │
│ P001      │ 2024-02-05 │
│ P003      │ 2024-02-07 │
│ P002      │ 2024-02-09 │
└───────────┴────────────┘

Exercise: Select only the Experiment ID and Animal Model columns from the calcium imaging data

Solution
sql("SELECT ExperimentID, AnimalModel from ca2_sessions")
┌──────────────┬─────────────┐
│ ExperimentID │ AnimalModel │
│    int64     │   varchar   │
├──────────────┼─────────────┤
│          301 │ Mouse       │
│          302 │ Rat         │
│          303 │ Zebrafish   │
│          304 │ Mouse       │
│          305 │ Rat         │
└──────────────┴─────────────┘

Exercise: Reorder the CA2 data by the number of cells counted, and convert the result back to a Pandas DataFrame:

Solution
sql("FROM ca2_sessions ORDER BY NumCells").to_df()

ExperimentID AnimalModel SessionDate DyeUsed ImagingTechnique NumCells
0 302 Rat 2024-04-05 Fluo-4 Confocal 150
1 305 Rat 2024-04-17 Fluo-4 Confocal 180
2 301 Mouse 2024-04-01 OGB-1 Two-photon 200
3 304 Mouse 2024-04-13 OGB-1 Two-photon 250
4 303 Zebrafish 2024-04-09 GCaMP6 Light-sheet 300

Exercise: In the Ca2 data, rename the SessionDate column to be just “Date”.

Solution
sql("SELECT SessionDate as Date, * FROM ca2_sessions")
┌────────────┬──────────────┬─────────────┬─────────────┬─────────┬──────────────────┬──────────┐
│    Date    │ ExperimentID │ AnimalModel │ SessionDate │ DyeUsed │ ImagingTechnique │ NumCells │
│  varchar   │    int64     │   varchar   │   varchar   │ varchar │     varchar      │  int64   │
├────────────┼──────────────┼─────────────┼─────────────┼─────────┼──────────────────┼──────────┤
│ 2024-04-01 │          301 │ Mouse       │ 2024-04-01  │ OGB-1   │ Two-photon       │      200 │
│ 2024-04-05 │          302 │ Rat         │ 2024-04-05  │ Fluo-4  │ Confocal         │      150 │
│ 2024-04-09 │          303 │ Zebrafish   │ 2024-04-09  │ GCaMP6  │ Light-sheet      │      300 │
│ 2024-04-13 │          304 │ Mouse       │ 2024-04-13  │ OGB-1   │ Two-photon       │      250 │
│ 2024-04-17 │          305 │ Rat         │ 2024-04-17  │ Fluo-4  │ Confocal         │      180 │
└────────────┴──────────────┴─────────────┴─────────────┴─────────┴──────────────────┴──────────┘

Exercise: In the MRI data, load just the SessionID (renamed to be just ID) and (PatientID renamed to be Patient) columns. Sort the data by the patient ID.

Solution
sql("SELECT SessionID as ID, PatientID as Patient FROM mri_sessions ORDER BY Patient")
┌───────┬─────────┐
│  ID   │ Patient │
│ int64 │ varchar │
├───────┼─────────┤
│   101 │ P001    │
│   103 │ P001    │
│   102 │ P002    │
│   105 │ P002    │
│   104 │ P003    │
└───────┴─────────┘

Section 2: Loading Data with DuckDB: the SELECT * FROM Statement with Files

Now, let’s build on these queries, using DuckDB’s ability to work directly on data files.

Code Description
duckdb.sql('SELECT * FROM "path/to/file.json"') Load the JSON file into a table.
duckdb.sql('SELECT * FROM "path/to/file.csv"') Load the CSV file into a table.
duckdb.sql('SELECT * FROM "path/to/file.parquet"') Load the PARQUET file into a table.
duckdb.sql('SELECT * FROM "path/*/file.parquet"') Load all the “file.parquet” files into a table.
duckdb.sql('SELECT * FROM "**/*.parquet"') Load all the parquet files into a table, no matter what subfolder they are in.
duckdb.sql('SELECT * FROM read_json("path/to/file.json", filename=true)') Load the “file.json” file into a table using the read_json() function directly.

For the next sections, we’ll be exploring some data from a Steinmetz et al. Neuropixels experiment, processed here into JSON files for our tabular analysis, along with some other familiar file types.

Exercises

Example: Use duckdb to read the session.json file from the session recorded on 2017-10-11. What metadata fields were recorded on that day?

sql('FROM "data/steinmetz/*_2017-10-11_*/session.json"')
┌──────────────┬─────────┬────────────┬──────────┬─────────┐
│ session_date │  mouse  │ stim_onset │ bin_size │   id    │
│     date     │ varchar │   double   │  double  │ varchar │
├──────────────┼─────────┼────────────┼──────────┼─────────┤
│ 2017-10-11   │ Theiler │        0.5 │     0.01 │ ab16    │
└──────────────┴─────────┴────────────┴──────────┴─────────┘

Exercise: Use duckdb to read the session.json file from the session recorded on 2017-11-02. What metadata fields were recorded on that day?

Solution
sql('FROM "data/steinmetz/steinmetz_2017-11-02_Forssmann/session.json"')
┌──────────────┬───────────┬────────────┬──────────┬─────────┐
│ session_date │   mouse   │ stim_onset │ bin_size │   id    │
│     date     │  varchar  │   double   │  double  │ varchar │
├──────────────┼───────────┼────────────┼──────────┼─────────┤
│ 2017-11-02   │ Forssmann │        0.5 │     0.01 │ dda4    │
└──────────────┴───────────┴────────────┴──────────┴─────────┘

Exercise: Use duckdb to read the ’trials.csv’ file recorded on 2017-11-02. What trial variables were recorded for that session? (Note: if you cannot see all the columns, either add "DESCRIBE" to the front of the SQL statement, or you can convert the output to a Pandas DataFrame with .to_df())

Solution
sql('FROM "data/steinmetz/steinmetz_2017-11-02_Forssmann/trials.csv"').to_df()

contrast_left contrast_right gocue stim_onset feedback_type feedback_time response_type response_time reaction_type reaction_time prev_reward active_trials session_id
0 0 100 0.582163 0.5 1.0 0.827766 -1.0 0.784573 -1.0 290.0 -10.000000 True dda4
1 0 0 0.484221 0.5 1.0 2.320245 0.0 2.288648 0.0 inf -5.539448 True dda4
2 25 25 0.688418 0.5 -1.0 1.358262 -1.0 1.358262 -1.0 1230.0 -4.196441 True dda4
3 0 0 0.628084 0.5 -1.0 0.862720 1.0 0.862720 1.0 720.0 -2.937326 True dda4
4 0 0 0.748259 0.5 1.0 2.587083 0.0 2.554045 0.0 inf -2.342347 True dda4
... ... ... ... ... ... ... ... ... ... ... ... ... ...
359 0 50 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
360 0 100 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
361 0 50 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
362 0 100 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4
363 0 100 NaN 0.5 NaN NaN NaN NaN NaN NaN NaN False dda4

364 rows × 13 columns

Exercise: Use duckdb to read all of the session.json files in the dataset.

Solution
duckdb.sql('FROM "data/steinmetz/**/session.json"')
┌──────────────┬───────────┬────────────┬──────────┬─────────┐
│ session_date │   mouse   │ stim_onset │ bin_size │   id    │
│     date     │  varchar  │   double   │  double  │ varchar │
├──────────────┼───────────┼────────────┼──────────┼─────────┤
│ 2017-01-07   │ Muller    │        0.5 │     0.01 │ b5b6    │
│ 2017-01-08   │ Muller    │        0.5 │     0.01 │ 49bb    │
│ 2017-01-08   │ Radnitz   │        0.5 │     0.01 │ 769e    │
│ 2017-01-09   │ Muller    │        0.5 │     0.01 │ 31dc    │
│ 2017-01-09   │ Radnitz   │        0.5 │     0.01 │ 99f4    │
│ 2017-01-10   │ Radnitz   │        0.5 │     0.01 │ a400    │
│ 2017-01-11   │ Radnitz   │        0.5 │     0.01 │ 6207    │
│ 2017-01-12   │ Radnitz   │        0.5 │     0.01 │ 3e6c    │
│ 2017-05-15   │ Moniz     │        0.5 │     0.01 │ 40f7    │
│ 2017-05-16   │ Moniz     │        0.5 │     0.01 │ 2474    │
│     ·        │   ·       │         ·  │       ·  │  ·      │
│     ·        │   ·       │         ·  │       ·  │  ·      │
│     ·        │   ·       │         ·  │       ·  │  ·      │
│ 2017-12-06   │ Lederberg │        0.5 │     0.01 │ 1e77    │
│ 2017-12-06   │ Tatum     │        0.5 │     0.01 │ e7bd    │
│ 2017-12-07   │ Lederberg │        0.5 │     0.01 │ e119    │
│ 2017-12-07   │ Tatum     │        0.5 │     0.01 │ 2edb    │
│ 2017-12-08   │ Lederberg │        0.5 │     0.01 │ 0080    │
│ 2017-12-08   │ Tatum     │        0.5 │     0.01 │ 403f    │
│ 2017-12-09   │ Lederberg │        0.5 │     0.01 │ 4363    │
│ 2017-12-09   │ Tatum     │        0.5 │     0.01 │ 2115    │
│ 2017-12-10   │ Lederberg │        0.5 │     0.01 │ df5e    │
│ 2017-12-11   │ Lederberg │        0.5 │     0.01 │ ecb9    │
├──────────────┴───────────┴────────────┴──────────┴─────────┤
│ 37 rows (20 shown)                               5 columns │
└────────────────────────────────────────────────────────────┘

Exercise: Use DuckDB to read all the trials.csv files in the dataset.

Solution
duckdb.sql('DESCRIBE FROM "data/steinmetz/*/trials.csv"')
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ contrast_left  │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ contrast_right │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ gocue          │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ stim_onset     │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ feedback_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ feedback_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ response_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ response_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ reaction_type  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ reaction_time  │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ prev_reward    │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ active_trials  │ BOOLEAN     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
├────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 13 rows                                                    6 columns │
└──────────────────────────────────────────────────────────────────────┘

Exercise: Use DuckDB to get all the cells that were recorded from in the dataset.

Solution
duckdb.sql('DESCRIBE FROM "data/steinmetz/*/cells.parquet"')
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ ccf_ap         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_dv         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ ccf_lr         │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
│ brain_area     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ brain_groups   │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ trough_to_peak │ TINYINT     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ cell           │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
└────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Exercise: What variables were stored when recording lick behaviors (Tip: Use DESCRIBE at the front of the query)?

Solution
duckdb.sql('DESCRIBE FROM "data/steinmetz/*/licks.parquet"')
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ licks       │ TINYINT     │ YES     │ NULL    │ NULL    │ NULL    │
│ session_id  │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ trial       │ INTEGER     │ YES     │ NULL    │ NULL    │ NULL    │
│ time        │ DOUBLE      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘

Section 3: Getting Unique Values with SELECT DISTINCT

Often, we want to summarize data by finding out what unique values there are in a dataset, in order to plan out an analysis. In SQL, the SELECT DISTINCT statement helps with this, and can even show distinct combinations of column values.

Code Description
SELECT * FROM my_table Select all the columns.
SELECT colA, colB FROM my_table Select just colA and colB
SELECT DISTINCT colA FROM my_table Show just the unique values of colA
SELECT DISTINCT colA, colB FROM my_table Show just the unique combinations of values between colA and colB
SELECT DISTINCT colA, colB FROM my_table ORDER BY colA …and order by colA
SELECT DISTINCT colA, colB FROM my_table ORDER BY colA, colB …and order by colA, then colB
SELECT DISTINCT colA, colB FROM my_table ORDER BY * Order the data by whatever columns are selected, in the order they appear in the table.

Exercises

Example: What mice were in the study? Show just the unique values.

sql(
"""
SELECT DISTINCT 
    mouse,
FROM 
    "data/steinmetz/*/session.json"
""")
┌───────────┐
│   mouse   │
│  varchar  │
├───────────┤
│ Forssmann │
│ Richards  │
│ Lederberg │
│ NULL      │
│ Tatum     │
│ Theiler   │
│ Moniz     │
│ Radnitz   │
│ Muller    │
│ Hench     │
├───────────┤
│  10 rows  │
└───────────┘

Exercise: What unique (“distinct”) contrast levels of the left stimulus were there in this experiment? To make it easier to read, order the rows in the resulting table.

Solution
sql(
"""
SELECT DISTINCT
    contrast_left
FROM 
    "data/steinmetz/*/trials.csv"
ORDER BY *
""")
┌───────────────┐
│ contrast_left │
│     int64     │
├───────────────┤
│             0 │
│            25 │
│            50 │
│           100 │
└───────────────┘

Exercise: What unique (“distinct”) combinations of contrast levels between the left and right stimulus were there in this experiment? To make it easier to read, order the rows in the resulting table.

Solution
sql(
"""
SELECT DISTINCT
    contrast_left,
    contrast_right
FROM 
    "data/steinmetz/*/trials.csv"
ORDER BY contrast_left, contrast_right
""")
┌───────────────┬────────────────┐
│ contrast_left │ contrast_right │
│     int64     │     int64      │
├───────────────┼────────────────┤
│             0 │              0 │
│             0 │             25 │
│             0 │             50 │
│             0 │            100 │
│            25 │              0 │
│            25 │             25 │
│            25 │             50 │
│            25 │            100 │
│            50 │              0 │
│            50 │             25 │
│            50 │             50 │
│            50 │            100 │
│           100 │              0 │
│           100 │             25 │
│           100 │             50 │
│           100 │            100 │
├───────────────┴────────────────┤
│ 16 rows              2 columns │
└────────────────────────────────┘

Exercise: What were the different stimulus onset time settings used in this experiment?

Solution
sql(
"""
SELECT DISTINCT
    stim_onset
FROM 
    "data/steinmetz/*/trials.csv"
""")
┌────────────┐
│ stim_onset │
│   double   │
├────────────┤
│        0.5 │
└────────────┘

Exercise: Which general areas of the brain (let’s use “brain_groups” here) were the cells in this study recorded from?

Solution
sql(
"""
SELECT DISTINCT
    brain_groups
FROM 
    "data/steinmetz/*/cells.parquet"
ORDER BY brain_groups
""")
┌───────────────────┐
│   brain_groups    │
│      varchar      │
├───────────────────┤
│ TT                │
│ basal ganglia     │
│ cortical subplate │
│ hippocampus       │
│ midbrain          │
│ non-visual cortex │
│ root              │
│ thalamus          │
│ visual cortex     │
└───────────────────┘

Exercise: Which brain groups were associated with which brain areas?

Solution
sql(
"""
SELECT DISTINCT
    brain_groups,
    brain_area
FROM 
    "data/steinmetz/*/cells.parquet"
ORDER BY brain_groups
""")
┌───────────────┬────────────┐
│ brain_groups  │ brain_area │
│    varchar    │  varchar   │
├───────────────┼────────────┤
│ TT            │ TT         │
│ basal ganglia │ OT         │
│ basal ganglia │ SNr        │
│ basal ganglia │ MS         │
│ basal ganglia │ ACB        │
│ basal ganglia │ LSr        │
│ basal ganglia │ LSc        │
│ basal ganglia │ CP         │
│ basal ganglia │ GPe        │
│ basal ganglia │ LS         │
│    ·          │ ·          │
│    ·          │ ·          │
│    ·          │ ·          │
│ thalamus      │ LP         │
│ thalamus      │ SPF        │
│ thalamus      │ RT         │
│ thalamus      │ VAL        │
│ visual cortex │ VISrl      │
│ visual cortex │ VISa       │
│ visual cortex │ VISl       │
│ visual cortex │ VISp       │
│ visual cortex │ VISam      │
│ visual cortex │ VISpm      │
├───────────────┴────────────┤
│     72 rows (20 shown)     │
└────────────────────────────┘

Section 4: Filtering Data with WHERE

Let’s Filter the data! The WHERE clause in SQL helps to only load up the rows that meet a specified condition:

Code Description
FROM my_table WHERE colA > 5 Only take the rows where colA is greater than 5.
FROM my_table WHERE colA = 'adult' Only take the rows where colA is “adult”.
FROM my_table WHERE colA IS NOT NULL Only take the non-missing rows of colA
FROM my_table WHERE colA LIKE 'C%' Only take the rows of ColA where the text starts with a “C”.

More filtering expressions can be found in the DuckDB documentation (they won’t be used in these exercises; just for reference).

Exercises

Example: Which brain areas of the brain group “thalamus” did we record cells from in this experiment? (Tip: use single quotes ' to reference text.)

sql(
"""
SELECT DISTINCT
    brain_groups,
    brain_area
FROM 
    'data/steinmetz/*/cells.parquet'
WHERE brain_groups = 'thalamus'
ORDER BY brain_groups
""")
┌──────────────┬────────────┐
│ brain_groups │ brain_area │
│   varchar    │  varchar   │
├──────────────┼────────────┤
│ thalamus     │ MG         │
│ thalamus     │ PO         │
│ thalamus     │ LGd        │
│ thalamus     │ POL        │
│ thalamus     │ VPM        │
│ thalamus     │ VPL        │
│ thalamus     │ CL         │
│ thalamus     │ TH         │
│ thalamus     │ MD         │
│ thalamus     │ LH         │
│ thalamus     │ LD         │
│ thalamus     │ LP         │
│ thalamus     │ VAL        │
│ thalamus     │ RT         │
│ thalamus     │ PT         │
│ thalamus     │ SPF        │
├──────────────┴────────────┤
│ 16 rows         2 columns │
└───────────────────────────┘

Exercise: Which brain areas recorded in this experiment start with the capital letter V?

Solution
sql(
"""
SELECT DISTINCT
    brain_area
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_area LIKE 'V%'
""")
┌────────────┐
│ brain_area │
│  varchar   │
├────────────┤
│ VISam      │
│ VISrl      │
│ VISl       │
│ VPL        │
│ VPM        │
│ VISa       │
│ VISpm      │
│ VISp       │
│ VAL        │
└────────────┘

Exercise: Which brain groups recorded in this experiment end with the word "cortex"?

Solution
sql(
"""
SELECT DISTINCT
    brain_groups
FROM '**/cells.parquet'
WHERE brain_groups LIKE '%cortex'
"""    
)
┌───────────────────┐
│   brain_groups    │
│      varchar      │
├───────────────────┤
│ visual cortex     │
│ non-visual cortex │
└───────────────────┘

Exercise: Which sessions had cell recordings in the visual cortex?

Solution
sql(
"""
SELECT DISTINCT
    session_id
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'visual cortex'
""")
┌────────────┐
│ session_id │
│  varchar   │
├────────────┤
│ 8125       │
│ dda4       │
│ 403f       │
│ 49bb       │
│ 40f7       │
│ 8cbd       │
│ 2474       │
│ 99f4       │
│ e7bd       │
│ 6123       │
│ 1e77       │
│ e119       │
│ 2edb       │
│ 6207       │
│ a400       │
│ 77a8       │
│ 769e       │
│ ab16       │
│ b5b6       │
│ bba2       │
│ 03ce       │
│ 31dc       │
│ 3023       │
├────────────┤
│  23 rows   │
└────────────┘

Exercise: Which sessions had recordings in the hippocampus?

Solution
sql(
"""
SELECT DISTINCT
    session_id
FROM 
    "data/steinmetz/*/cells.parquet"
WHERE brain_groups = 'hippocampus'
""")
┌────────────┐
│ session_id │
│  varchar   │
├────────────┤
│ 0080       │
│ ab16       │
│ 77a8       │
│ 4363       │
│ bba2       │
│ 03ce       │
│ 0834       │
│ 2115       │
│ 31dc       │
│ 40f7       │
│  ·         │
│  ·         │
│  ·         │
│ 62c0       │
│ 3023       │
│ 8125       │
│ 93c7       │
│ 49bb       │
│ dda4       │
│ e119       │
│ 6123       │
│ 2edb       │
│ 1e77       │
├────────────┤
│  27 rows   │
│ (20 shown) │
└────────────┘

Section 5: (Demo) Plotting Data: Query, then Plot!

Most of the time, you don’t need any complex queries; the advantage of querying in a language like Python is that you get to use all the other Python tools!

Exercises

Example: Making a Statistical Plot of Trial Performance

df = duckdb.sql('FROM "**/trials.csv" WHERE active_trials = true AND response_type <> 0').to_df()
sns.catplot(data=df, x='contrast_left', hue='contrast_right', y='response_type', kind='point');

Example: Rendering a 3D point cloud of cell positions in the Allen Brain Atlas.

# %pip install plotly nbformat bg-atlasapi
import plotly.express as px
import plotly.graph_objects as go
from bg_atlasapi.bg_atlas import BrainGlobeAtlas
import duckdb
df = duckdb.sql("FROM '**/cells.parquet'").to_df()
atlas_mesh = BrainGlobeAtlas("allen_mouse_25um").mesh_from_structure('root')


fig1 = px.scatter_3d(x=atlas_mesh.points[:, 0], y=atlas_mesh.points[:, 1], z=atlas_mesh.points[:, 2], opacity=0.1).update_traces(marker=dict(size=.6))
fig2 = px.scatter_3d(x=df.ccf_ap, y=df.ccf_dv, z=df.ccf_lr, color=df.brain_groups, width=700, title="Cell CCF Coordinates").update_traces(marker=dict(size=2))
go.Figure(data=fig1.data + fig2.data, layout=fig1.layout).update_layout(scene=dict(xaxis={'visible': False}, yaxis={'visible': False}, zaxis={'visible': False}), width=700, scene_camera=dict(eye=dict(x=.75, y=.75, z=.75)))