Working with a DataLad Dataset

Authors
Ole Bialas | Michał Szczepanik

Working with DataLad Datasets

Over the past decade, there has been a big increase in the number of published neuroscience datasets and open science repositories. However, managing this open science ecosystem can be challenging when repositories use different backends to share data. DataLad helps by providing a unified interface that can manage data across many different services. In this notebook, we are going to download data from OpenNeuro - a platform for hosting neuroimaging datasets that uses DataLad on the backend. Simply execute the cell below to clone the dataset into the current directory - it will be stored in a folder called ds004408/.

Note for Windows: you will get a message that says “detected a crippled filesystem”. Don’t worry, this does not mean that there is anything wrong with your computer - it just means git-annex is working slightly differently on Windows (more on this later).

import os
# deactivate DataLad's progressbar for this notebook
os.environ['DATALAD_UI_PROGRESSBAR'] = 'none'

!datalad clone https://github.com/OpenNeuroDatasets/ds004408.git
                                                                                
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore 
[INFO   ] https://github.com/OpenNeuroDatasets/ds004408.git/config download failed: Not Found 
install(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 (dataset)

In the following section we will explore the content of this DataLad dataset and learn how to access and modify it. This will give you a deeper understanding of how DataLad works and equip you with the tools to reuse publicly available datasets.

Section 1: Understanding the Structure of a Dataset

Background

DataLad is a tool that is primarily used through the terminal. Thus, when exploring the content of a DataLad dataset, it makes sense to use terminal commands like ls (Linux/macOS) or dir (Windows). In VSCode, you can open the terminal via the menu bar by clicking View > Terminal or by pressing Ctrl+`, and execute these commands there. On Windows, you can also open a terminal by navigating to a folder in Explorer, right-clicking, and selecting Open in Terminal. If you want to use the Linux terminal commands rather than the Windows alternatives, you can use Git Bash as your terminal, which comes with the Git installation on Windows.

Alternatively, you can execute terminal commands in the code cells of this Jupyter notebook by prefacing them with !. A command prefixed with ! runs as an independent subprocess, so it cannot modify the state of the notebook itself. For this reason there is a special prefix for the cd (change directory) command: %cd, which persistently changes the working directory within the notebook.
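The difference between ! and %cd can be illustrated with plain Python: a shell command spawned as a subprocess cannot change the notebook's working directory, while os.chdir (which is what %cd does under the hood) can. A minimal sketch, using a throwaway temporary directory:

```python
import os
import subprocess
import tempfile

start = os.getcwd()
target = tempfile.mkdtemp()  # throwaway directory, just for illustration

# Like a `!cd ...` cell: the subprocess changes *its own* working
# directory and then exits, so the notebook process is unaffected.
subprocess.run(f"cd {target}", shell=True, check=True)
print(os.getcwd() == start)  # True - the subprocess's cd had no effect

# Like `%cd ...`: os.chdir acts on the notebook process itself,
# so the change persists across cells.
os.chdir(target)
print(os.getcwd() == start)  # False - we really moved

os.chdir(start)  # move back
```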

Exercises

In the following exercises, we are going to explore the dataset we cloned at the beginning of the notebook. You can do this in the terminal or in the notebook using the ! and % operators, or try both - however you prefer! Here are the commands you need to know:

Linux/macOS Windows Description
ls dir List the content of the current directory
ls -a dir /a List the content of the current directory (including hidden files)
ls -a data dir /a data List the content of the data directory
cd code/ cd code/ Move to the code/ directory
cd .. cd .. Move to the parent of the current directory
cat file.txt type file.txt Display the content of file.txt

Example: Change the current directory to ds004408/ (i.e., the root directory of the cloned dataset).

%cd ds004408/
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408

Example: List the content of the current directory (i.e., ds004408/).

# Linux/macOS
!ls
CHANGES			  stimuli  sub-005  sub-010  sub-015
README			  sub-001  sub-006  sub-011  sub-016
dataset_description.json  sub-002  sub-007  sub-012  sub-017
participants.json	  sub-003  sub-008  sub-013  sub-018
participants.tsv	  sub-004  sub-009  sub-014  sub-019
# Windows
!dir

Exercise: Change the current working directory to the stimuli/ folder and list the contents.

Solution
# Linux/macOS
%cd stimuli
!ls
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408/stimuli
audio01.TextGrid  audio06.wav	    audio12.TextGrid  audio17.wav
audio01.wav	  audio07.TextGrid  audio12.wav       audio18.TextGrid
audio02.TextGrid  audio07.wav	    audio13.TextGrid  audio18.wav
audio02.wav	  audio08.TextGrid  audio13.wav       audio19.TextGrid
audio03.TextGrid  audio08.wav	    audio14.TextGrid  audio19.wav
audio03.wav	  audio09.TextGrid  audio14.wav       audio20.TextGrid
audio04.TextGrid  audio09.wav	    audio15.TextGrid  audio20.wav
audio04.wav	  audio10.TextGrid  audio15.wav       results.txt
audio05.TextGrid  audio10.wav	    audio16.TextGrid
audio05.wav	  audio11.TextGrid  audio16.wav
audio06.TextGrid  audio11.wav	    audio17.TextGrid
# Windows
%cd stimuli
!dir

Exercise: Change the directory back to ds004408/ (i.e., the parent directory of stimuli/).

Solution
%cd ..
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408

Exercise: List the contents of ds004408/ including all hidden files and folders.

Solution
# Linux/macOS
!ls -a
.		CHANGES			  stimuli  sub-005  sub-010  sub-015
..		README			  sub-001  sub-006  sub-011  sub-016
.datalad	dataset_description.json  sub-002  sub-007  sub-012  sub-017
.git		participants.json	  sub-003  sub-008  sub-013  sub-018
.gitattributes	participants.tsv	  sub-004  sub-009  sub-014  sub-019
# Windows
!dir /a

Exercise: List the contents of the .git/ folder.

Solution
# Linux/macOS
!ls .git
HEAD   branches  description  index  logs     packed-refs
annex  config	 hooks	      info   objects  refs
# Windows
!dir .git

Exercise: List the contents of the .datalad/ folder.

Solution
# Linux/macOS
!ls .datalad
config
# Windows
!dir .datalad

Example: Display the content of the README file.

# Linux/macOS
!cat README
The data in one study [^1] and then added to by another [^2] and contains EEG responses of healthy, neurotypical adults who listened to naturalistic speech. The subjects listened to segments from an audio book version of "The Old Man and the Sea" and their brain activity was recorded using a 128-channel ActiveTwo EEG system (BioSemi). 

The stimuli folder contains .wav files of the presented audiobook segments as well as a .TextGrid file for each segment, containng the timing of  words and phonemes in that segment. The text grids were generated using the forced-alignment software Prosodylab-Aligner [^3] and inspected by eye. Each subject's folder contains one EEG-recording per audio segment and their starts are aligned (the EEG recordings are longer than the audio to a varying extent).  The recordings are unfiltered, unreferenced and sampled at 512 Hz.

[^1]: Di Liberto, G. M., O’sullivan, J. A., & Lalor, E. C. (2015). Low-frequency cortical entrainment to speech reflects phoneme-level processing. Current Biology, 25(19), 2457-2465.

[^2]: Broderick, M. P., Anderson, A. J., Di Liberto, G. M., Crosse, M. J., & Lalor, E. C. (2018). Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech. Current Biology, 28(5), 803-809.

[^3]: Gorman, K., Howell, J., & Wagner, M. (2011). Prosodylab-aligner: A tool for forced alignment of laboratory speech. Canadian Acoustics, 39(3), 192-193.
# Windows
!type README

Exercise: Display the file content of participants.tsv.

Solution
# Linux/macOS
!cat participants.tsv
participant_id	age	sex	hand	weight	height
sub-001	n/a	n/a	n/a	n/a	n/a
sub-002	n/a	n/a	n/a	n/a	n/a
sub-003	n/a	n/a	n/a	n/a	n/a
sub-004	n/a	n/a	n/a	n/a	n/a
sub-005	n/a	n/a	n/a	n/a	n/a
sub-006	n/a	n/a	n/a	n/a	n/a
sub-007	n/a	n/a	n/a	n/a	n/a
sub-008	n/a	n/a	n/a	n/a	n/a
sub-009	n/a	n/a	n/a	n/a	n/a
sub-010	n/a	n/a	n/a	n/a	n/a
sub-011	n/a	n/a	n/a	n/a	n/a
sub-012	n/a	n/a	n/a	n/a	n/a
sub-013	n/a	n/a	n/a	n/a	n/a
sub-014	n/a	n/a	n/a	n/a	n/a
sub-015	n/a	n/a	n/a	n/a	n/a
sub-016	n/a	n/a	n/a	n/a	n/a
sub-017	n/a	n/a	n/a	n/a	n/a
sub-018	n/a	n/a	n/a	n/a	n/a
sub-019	n/a	n/a	n/a	n/a	n/a
# Windows
!type participants.tsv

Exercise: Display the file content of .datalad/config. This file contains a DataLad ID that uniquely identifies this dataset.

Solution
# Linux/macOS
!cat .datalad/config
[datalad "dataset"]
	id = 37b1ac65-b33e-4e44-9188-d9d57cb1e50d
# Windows
!type .datalad\config

Section 2: Managing File Content

Background

You may have noticed that, even though the dataset contains lots of different folders, cloning it was really fast. This is because DataLad manages dataset structure and file content separately. When you cloned the dataset, you didn’t actually download the file content - you merely downloaded tiny symbolic links that represent the files. To download the actual content of specific files, we have to use the datalad get command. This is very useful when working with large datasets: for example, you can clone a whole dataset to your machine, download a few sample files for testing a new analysis you are developing, and integrate your changes into the original dataset without having to move large amounts of data.

On Linux/macOS, the downloaded content is stored inside the hidden .git/ folder, and the files you see in your working tree are symbolic links pointing to it, so each file’s content exists only once on disk. On Windows this is different: because the Windows filesystem has limited support for symbolic links, DataLad has to duplicate the data and store one copy in your working tree and one backup copy in the .git/ folder.
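On Linux/macOS, a file whose content has not been retrieved yet is just a symbolic link pointing into .git/annex/objects, and that link is broken until datalad get fetches the content. The sketch below mimics such a placeholder with a hand-made broken symlink (the paths are made up for illustration; this is not how DataLad actually names its objects):

```python
import os
import tempfile

root = tempfile.mkdtemp()
# A made-up annex-style target path; the content file does not exist yet.
annex_target = os.path.join(root, ".git", "annex", "objects", "SHA256E-s123--abc.wav")
placeholder = os.path.join(root, "audio01.wav")
os.symlink(annex_target, placeholder)

# The link itself is part of the working tree ...
print(os.path.islink(placeholder))  # True
# ... but following it fails because the content is not present.
print(os.path.exists(placeholder))  # False

# After a successful `datalad get`, the link's target would exist
# and os.path.exists(placeholder) would return True.
```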

Section 3: Exercises

In the following exercises, we are going to get the file content for some of the files in the dataset we just cloned, and we are also going to drop them again. As we do that, we’ll repeatedly check the disk usage (du -sh on Linux/macOS, dir /s on Windows) to see how the size of our dataset is changing. Here are the commands you need to know - the commands for listing folders and checking disk usage are OS-specific while the DataLad commands are the same across all platforms:

DataLad Commands

Command Description
datalad get data/ Download the content of the directory data/
datalad drop data/ Delete the content of the directory data/
datalad get data/example.txt Download the content of the file data/example.txt
datalad get data/*.txt Download the content of all .txt files in data/

OS-specific commands

Linux/macOS Windows Description
du -sh . dir /s Print the disk usage of the current directory
du -sh data dir /s data Print the disk usage of the data/ directory

Example: Print the size of the current directory.

# Linux/macOS
!du -sh .
15M	.
# Windows
!dir /s

Example: Get the data for the file stimuli/audio01.wav.

!datalad get stimuli/audio01.wav
                                                                                
get(ok): stimuli/audio01.wav (file) [from s3-PUBLIC...]

Exercise: Check the disk usage of the current directory again.

Solution
# Linux/macOS
!du -sh
45M	.
# Windows
!dir /s

Exercise: Get the data for stimuli/audio02.wav, then print the disk usage for the current directory.

Solution
# Linux/macOS
!datalad get stimuli/audio02.wav
!du -sh
                                                                                
get(ok): stimuli/audio02.wav (file) [from s3-PUBLIC...]
76M	.
# Windows
!datalad get stimuli/audio02.wav
!dir /s

Exercise: Drop the data of the whole stimuli/ folder, then print the disk usage of the current directory.

Solution
# Linux/macOS
!datalad drop stimuli/
!du -sh
drop(ok): stimuli/audio01.wav (file)
drop(ok): stimuli/audio02.wav (file)
drop(ok): stimuli (directory)
action summary:
  drop (ok: 3)
15M	.
# Windows
!datalad drop stimuli/
!dir /s

Exercise: Get the disk usage of the stimuli/ folder.

Solution
# Linux/macOS
!du -sh stimuli
168K	stimuli
# Windows
!dir /s stimuli

Exercise: Get all *.TextGrid files in the stimuli/ folder, then get the folder’s disk usage again.

NOTE (for Windows): Because the Windows shell does not expand the "*" wildcard, the easiest workaround is to either get the whole stimuli/ folder (which takes a while) or just a single file.

Solution
!datalad get stimuli/*.TextGrid
!du -sh stimuli/
                                                                                
get(ok): stimuli/audio12.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio17.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio02.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio15.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio04.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio18.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio20.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio01.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio05.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio10.TextGrid (file) [from s3-PUBLIC...]
  [1 similar message has been suppressed; disable with datalad.ui.suppress-similar-results=off]
  [10 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (ok: 20)
168K	stimuli/

Exercise: Get the size of the .git/ folder.

Solution
# Linux/macOS
!du -sh .git
9.8M	.git
# Windows
!dir /s .git

Section 4: Inspecting File Identifiers

Background

DataLad is a decentralized data management system, which means it does not rely on any central issuing service. This presents a challenge: how can files be unambiguously identified when an unknown number of independently created DataLad datasets exist? The answer is checksums: alphanumeric strings generated from the file content via a hashing algorithm. Even the tiniest change in a file results in a different checksum, which makes checksums unique identifiers of file content.
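You can see this behavior with Python's standard hashlib module - the same hashing idea git-annex builds on (shown here with SHA-256; a sketch, not DataLad's actual code path):

```python
import hashlib

# Hashing the same content always yields the same checksum ...
a = hashlib.sha256(b"The Old Man and the Sea").hexdigest()
b = hashlib.sha256(b"The Old Man and the Sea").hexdigest()
print(a == b)  # True

# ... while even a one-character change yields a completely different one.
c = hashlib.sha256(b"The Old Man and the sea").hexdigest()
print(a == c)  # False
print(len(a))  # 64 hex characters = 256 bits
```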

DataLad manages these file identifiers for us using git-annex under the hood. While most of the time we don’t have to think about the git-annex operations, it can be useful to peek under the hood and use some git-annex commands directly to get more detailed information or configure the dataset’s behavior.

Exercises

In this section we are going to use git annex directly to get more detailed information on the files in our dataset, like their identifiers and storage locations. We’ll also use git annex to configure how many copies of a given file we want to keep. Here are the commands you need to know:

Code Description
git annex info Show the git-annex information for the whole dataset
git annex info folder/image.png Show the git-annex information for the file image.png
git annex whereis folder/image.png List the repositories that have the file content for image.png
git annex numcopies 2 Configure the dataset so that the required number of copies for a file is 2

Example: Get the git-annex info for the file stimuli/audio01.wav.

!git annex info stimuli/audio01.wav
file: stimuli/audio01.wav
size: 31.32 megabytes
key: SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav
present: false
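The key in the output above is not an opaque string: for the SHA256E backend it follows the layout BACKEND-sSIZE--CHECKSUM.EXT, so the file size and SHA-256 checksum can be read straight out of it. A small illustrative parser (assuming that layout):

```python
key = "SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav"

backend, rest = key.split("-s", 1)   # backend name before "-s"
size, name = rest.split("--", 1)     # file size in bytes before "--"
checksum, ext = name.rsplit(".", 1)  # SHA-256 hex digest + extension

print(backend)        # SHA256E
print(int(size))      # 31322156 -> the 31.32 megabytes reported above
print(len(checksum))  # 64 hex characters
print(ext)            # wav
```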

Exercise: Get the file content for stimuli/audio01.wav, then print the git-annex info for that file again.

Solution
!datalad get stimuli/audio01.wav
!git annex info stimuli/audio01.wav
                                                                                
get(ok): stimuli/audio01.wav (file) [from s3-PUBLIC...]
file: stimuli/audio01.wav
size: 31.32 megabytes
key: SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav
present: true

Exercise: List the repositories that contain the file content for stimuli/audio01.wav.

Solution
!git annex whereis stimuli/audio01.wav
whereis stimuli/audio01.wav (3 copies) 
  	35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
  	b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
  	f788b0ef-dbbe-43d4-a731-2f0451b5b021 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 [here]

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004408/stimuli/single_speaker/audio01.wav?versionId=DgF1hKqcMi0Mbi_Cjrcwxrhe.9fr2GRU
ok

Exercise: List the repositories that contain the file content for stimuli/audio02.wav - how is this different from the list of repositories in the previous exercise?

Solution
!git annex whereis stimuli/audio02.wav
whereis stimuli/audio02.wav (2 copies) 
  	35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
  	b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro

  s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004408/stimuli/single_speaker/audio02.wav?versionId=0k0s_818LeMgL_NnL..N5YZuAuwBUVrC
ok

Exercise: Set the number of required copies of a file to 3.

Solution
!git annex numcopies 3
numcopies 3 ok
(recording state in git...)

Exercise: Try to drop stimuli/audio01.wav. What does the error message say?

Solution
!datalad drop stimuli/audio01.wav
drop(error): stimuli/audio01.wav (file) [unsafe; Could only verify the existence of 1 out of 3 necessary copies.; (Note that these git remotes have annex-ignore set: origin); (Use --reckless availability to override this check, or adjust numcopies.)]

Exercise: Set the number of required copies of a file to 1 and drop stimuli/audio01.wav.

Solution
!git annex numcopies 1
!datalad drop stimuli/audio01.wav
numcopies 1 ok
(recording state in git...)
drop(ok): stimuli/audio01.wav (file)

Exercise: Print the git-annex info for the whole dataset.

Solution
!git annex info
trusted repositories: 0
semitrusted repositories: 5
	00000000-0000-0000-0000-000000000001 -- web
	00000000-0000-0000-0000-000000000002 -- bittorrent
	35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
	b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
	f788b0ef-dbbe-43d4-a731-2f0451b5b021 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 [here]
untrusted repositories: 0
transfers in progress: none
available local disk space: 801.6 gigabytes (+100 megabytes reserved)
local annex keys: 20
local annex size: 3.53 megabytes
annexed files in working tree: 1181
size of annexed files in working tree: 20.08 gigabytes
combined annex size of all repositories: 60.66 gigabytes
annex sizes of repositories: 
	30.41 GB: b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
	30.25 GB: 35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
	 3.53 MB: f788b0ef-dbbe-43d4-a731-2f0451b5b021 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 [here]
backend usage: 
	SHA256E: 1181
bloom filter size: 32 mebibytes (0% full)

Section 5: Examining a New Dataset

Now you are equipped to consume any DataLad dataset that has been published online - let’s try it out! Search the OpenNeuro database for a dataset that interests you and clone it. Then:

  • print the git annex info of that dataset
  • get some of the file contents and check the disk usage before and after
  • drop the file contents and check the disk usage again