Working with DataLad Datasets
Over the last decade, the number of published neuroscience datasets and open science repositories has grown rapidly. However, managing data across this open science ecosystem can be challenging because the repositories use different backends to share data. DataLad helps by providing a unified interface that can manage data across many different services. In this notebook, we are going to download data from OpenNeuro - a platform for hosting neuroimaging datasets that uses DataLad on the backend. Simply execute the cell below to clone the dataset into the current directory - it will be stored in a folder called ds004408/.
Note for Windows: you will get a message that says “detected a crippled filesystem”. Don’t worry, this does not mean that anything is wrong with your computer - it just means git-annex works slightly differently on Windows (more on this later).
import os
# deactivate DataLad's progressbar for this notebook
os.environ['DATALAD_UI_PROGRESSBAR'] = 'none'
!datalad clone https://github.com/OpenNeuroDatasets/ds004408.git
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
[INFO ] https://github.com/OpenNeuroDatasets/ds004408.git/config download failed: Not Found
install(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 (dataset)
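By the way, everything we do on the command line in this notebook is also available through DataLad's Python API, whose functions mirror the CLI commands. Here is a minimal sketch; the target folder name ds004408-api is arbitrary, chosen so it doesn't collide with the clone we just made:
import datalad.api as dl
# equivalent to `datalad clone <url>`; returns a Dataset object
# on which later commands (get, drop, ...) are available as methods
ds = dl.clone(
    source="https://github.com/OpenNeuroDatasets/ds004408.git",
    path="ds004408-api",  # arbitrary target folder (assumption, not part of the exercises)
)
print(ds.path)  # absolute path of the new clone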
In the following section we will explore the content of this DataLad dataset and learn how to access and modify it. This will give you a deeper understanding of how DataLad works and equip you with the tools to reuse publicly available datasets.
Section 1: Understanding the Structure of a Dataset
Background
DataLad is a tool that is primarily used through the terminal. Thus, when exploring the content of a DataLad dataset, it makes sense to use terminal commands like ls (Linux/macOS) or dir (Windows). In VSCode you can open the terminal via the menu bar by clicking View > Terminal or by pressing the Ctrl+` keyboard shortcut, and execute these commands there. On Windows you can also open a terminal from the Explorer: navigate to the folder, right-click, and select Open in Terminal. If you want to use the Linux terminal commands rather than their Windows alternatives, you can use Git Bash as your terminal, which comes with the Git installation on Windows.
Alternatively, you can execute the terminal commands in the code cells of this Jupyter notebook by prefacing them with !. With ! we can execute any shell command as an independent subprocess. Because these commands can’t modify the state of the notebook, there is a special prefix for the cd (change directory) command: %cd. This allows the cd command to persistently change the working directory within the notebook.
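A quick way to convince yourself of the difference is the cell below, which mixes Python, !, and % lines (this is fine in Jupyter). It assumes you run it from the folder that contains ds004408/, and it returns there at the end:
import os
print(os.getcwd())  # the notebook's current working directory
!cd ds004408        # runs in a throwaway subprocess ...
print(os.getcwd())  # ... so the notebook's directory is unchanged
%cd ds004408        # the IPython magic persistently changes the directory
print(os.getcwd())  # now inside ds004408/
%cd ..              # move back so the examples below start from the original directory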
Exercises
In the following exercises, we are going to explore the dataset we cloned at the beginning of the notebook. You can do this in the terminal or in the notebook using the ! and % operators, or try both - however you prefer! Here are the commands you need to know:
| Linux/macOS | Windows | Description |
|---|---|---|
| `ls` | `dir` | List the content of the current directory |
| `ls -a` | `dir /a` | List the content of the current directory (including hidden files) |
| `ls -a data` | `dir /a data` | List the content of the data/ directory |
| `cd code/` | `cd code/` | Move to the code/ directory |
| `cd ..` | `cd ..` | Move to the parent of the current directory |
| `cat file.txt` | `type file.txt` | Display the content of file.txt |
Example: Change the current directory to ds004408/ (i.e., the root directory of the cloned dataset).
%cd ds004408
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408
Example: List the content of the current directory (i.e., ds004408/).
# Linux/macOS
!ls
CHANGES                   stimuli  sub-005  sub-010  sub-015
README                    sub-001  sub-006  sub-011  sub-016
dataset_description.json  sub-002  sub-007  sub-012  sub-017
participants.json         sub-003  sub-008  sub-013  sub-018
participants.tsv          sub-004  sub-009  sub-014  sub-019
# Windows
!dir
Exercise: Change the current working directory to the stimuli/ folder and list the contents.
Solution
# Linux/macOS
%cd stimuli
!ls
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408/stimuli
audio01.TextGrid audio06.wav audio12.TextGrid audio17.wav
audio01.wav audio07.TextGrid audio12.wav audio18.TextGrid
audio02.TextGrid audio07.wav audio13.TextGrid audio18.wav
audio02.wav audio08.TextGrid audio13.wav audio19.TextGrid
audio03.TextGrid audio08.wav audio14.TextGrid audio19.wav
audio03.wav audio09.TextGrid audio14.wav audio20.TextGrid
audio04.TextGrid audio09.wav audio15.TextGrid audio20.wav
audio04.wav audio10.TextGrid audio15.wav results.txt
audio05.TextGrid audio10.wav audio16.TextGrid
audio05.wav audio11.TextGrid audio16.wav
audio06.TextGrid audio11.wav audio17.TextGrid
# Windows
%cd stimuli
!dir
Exercise: Change the directory back to ds004408/ (i.e., the parent directory of stimuli/).
Solution
%cd ..
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408
Exercise: List the contents of ds004408/ including all hidden files and folders.
Solution
# Linux/macOS
!ls -a
.               CHANGES                   stimuli  sub-005  sub-010  sub-015
..              README                    sub-001  sub-006  sub-011  sub-016
.datalad        dataset_description.json  sub-002  sub-007  sub-012  sub-017
.git            participants.json         sub-003  sub-008  sub-013  sub-018
.gitattributes  participants.tsv          sub-004  sub-009  sub-014  sub-019
# Windows
!dir /a
Exercise: List the contents of the .git/ folder.
Solution
# Linux/macOS
!ls .git
HEAD   branches  description  index  logs     packed-refs
annex  config    hooks        info   objects  refs
# Windows
!dir .git
Exercise: List the contents of the .datalad/ folder.
Solution
# Linux/macOS
!ls .datalad
config
# Windows
!dir .datalad
Example: Display the content of README.
# Linux/macOS
!cat README
The data in one study [^1] and then added to by another [^2] and contains EEG responses of healthy, neurotypical adults who listened to naturalistic speech. The subjects listened to segments from an audio book version of "The Old Man and the Sea" and their brain activity was recorded using a 128-channel ActiveTwo EEG system (BioSemi).
The stimuli folder contains .wav files of the presented audiobook segments as well as a .TextGrid file for each segment, containng the timing of words and phonemes in that segment. The text grids were generated using the forced-alignment software Prosodylab-Aligner [^3] and inspected by eye. Each subject's folder contains one EEG-recording per audio segment and their starts are aligned (the EEG recordings are longer than the audio to a varying extent). The recordings are unfiltered, unreferenced and sampled at 512 Hz.
[^1]: Di Liberto, G. M., O’sullivan, J. A., & Lalor, E. C. (2015). Low-frequency cortical entrainment to speech reflects phoneme-level processing. Current Biology, 25(19), 2457-2465.
[^2]: Broderick, M. P., Anderson, A. J., Di Liberto, G. M., Crosse, M. J., & Lalor, E. C. (2018). Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech. Current Biology, 28(5), 803-809.
[^3]: Gorman, K., Howell, J., & Wagner, M. (2011). Prosodylab-aligner: A tool for forced alignment of laboratory speech. Canadian Acoustics, 39(3), 192-193.
# Windows
!type README
Exercise: Display the file content of participants.tsv.
Solution
# Linux/macOS
!cat participants.tsv
participant_id age sex hand weight height
sub-001 n/a n/a n/a n/a n/a
sub-002 n/a n/a n/a n/a n/a
sub-003 n/a n/a n/a n/a n/a
sub-004 n/a n/a n/a n/a n/a
sub-005 n/a n/a n/a n/a n/a
sub-006 n/a n/a n/a n/a n/a
sub-007 n/a n/a n/a n/a n/a
sub-008 n/a n/a n/a n/a n/a
sub-009 n/a n/a n/a n/a n/a
sub-010 n/a n/a n/a n/a n/a
sub-011 n/a n/a n/a n/a n/a
sub-012 n/a n/a n/a n/a n/a
sub-013 n/a n/a n/a n/a n/a
sub-014 n/a n/a n/a n/a n/a
sub-015 n/a n/a n/a n/a n/a
sub-016 n/a n/a n/a n/a n/a
sub-017 n/a n/a n/a n/a n/a
sub-018 n/a n/a n/a n/a n/a
sub-019 n/a n/a n/a n/a n/a
# Windows
!type participants.tsv
Exercise: Display the file content of .datalad/config. This file contains a DataLad ID that uniquely identifies this dataset.
Solution
# Linux/macOS
!cat .datalad/config
[datalad "dataset"]
	id = 37b1ac65-b33e-4e44-9188-d9d57cb1e50d
# Windows
!type .datalad\config
Section 2: Managing File Content
Background
You may have noticed that, even though the dataset contains lots of different folders, cloning it was really fast. This is because DataLad manages dataset structure and file content separately. When you cloned the dataset, you didn’t actually download the file content - you merely downloaded tiny symbolic links that represent the files. To download the actual content of specific files, we have to use the datalad get command. This is very useful when we are working with large datasets. For example, you can clone the whole dataset to your machine, download some sample files to test a new analysis you are developing, and integrate your changes into the original dataset without having to move large amounts of data.
The downloaded data will be stored in the files and folders you can see in your file tree but not in the hidden .git/ folder.
On Windows this works differently: because of limitations in the Windows filesystem, DataLad has to duplicate the data, storing one copy in your working tree and one backup copy in the .git/ folder.
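If you are curious, you can verify the symbolic-link mechanism with a few lines of Python. This sketch assumes a Linux/macOS system and that the working directory is the dataset root; the file name is just one of the stimuli we have not downloaded yet (on Windows, in git-annex's adjusted mode, the files are regular copies and islink() returns False):
import os
path = "stimuli/audio03.wav"  # an annexed file whose content we have not fetched
print(os.path.islink(path))   # True: the working-tree "file" is only a symbolic link
print(os.readlink(path))      # its target lies inside .git/annex/objects/...
print(os.path.exists(path))   # False: the link dangles until `datalad get` fetches the content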
Section 3: Exercises
In the following exercises, we are going to get the file content for some of the files in the dataset we just cloned, and we are also going to drop them again.
As we do that, we’ll repeatedly check the disk usage (du -sh on Linux/macOS, dir /s on Windows) to see how the size of our dataset is changing.
Here are the commands you need to know - the commands for listing folders and checking disk usage are OS-specific while the DataLad commands are the same across all platforms:
DataLad Commands
| Command | Description |
|---|---|
| `datalad get data/` | Download the content of the directory data/ |
| `datalad drop data/` | Delete the content of the directory data/ |
| `datalad get data/example.txt` | Download the content of the file data/example.txt |
| `datalad get data/*.txt` | Download the content of all .txt files in data/ |
OS-specific commands
| Linux/macOS | Windows | Description |
|---|---|---|
| `du -sh .` | `dir /s` | Print the disk usage of the current directory |
| `du -sh data` | `dir /s data` | Print the disk usage of the data/ directory |
Example: Print the size of the current directory.
# Linux/macOS
!du -sh .
15M .
# Windows
!dir /s
Example: Get the data for the file stimuli/audio01.wav.
!datalad get stimuli/audio01.wav
get(ok): stimuli/audio01.wav (file) [from s3-PUBLIC...]
Exercise: Check the disk usage of the current directory again.
Solution
# Linux/macOS
!du -sh
45M .
# Windows
!dir /s
Exercise: Get the data for stimuli/audio02.wav, then print the disk usage for the current directory.
Solution
# Linux/macOS
!datalad get stimuli/audio02.wav
!du -sh
get(ok): stimuli/audio02.wav (file) [from s3-PUBLIC...]
76M .
# Windows
!datalad get stimuli/audio02.wav
!dir /s
Exercise: Drop the data of the whole stimuli/ folder, then print the disk usage of the current directory.
Solution
# Linux/macOS
!datalad drop stimuli/
!du -sh
drop(ok): stimuli/audio01.wav (file)
drop(ok): stimuli/audio02.wav (file)
drop(ok): stimuli (directory)
action summary:
  drop (ok: 3)
15M .
# Windows
!datalad drop stimuli/
!dir /s
Exercise: Get the disk usage of the stimuli/ folder.
Solution
# Linux/macOS
!du -sh stimuli
168K stimuli
# Windows
!dir /s stimuli
Exercise: Get all *.TextGrid files in the stimuli/ folder, then get the folder’s disk usage again.
NOTE (for Windows): Because the Windows shell does not expand "*" wildcards, the easiest way is to either get the whole stimuli/ folder (which takes a while) or just a single file.
Solution
!datalad get stimuli/*.TextGrid
!du -sh stimuli/
get(ok): stimuli/audio12.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio17.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio02.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio15.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio04.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio18.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio20.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio01.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio05.TextGrid (file) [from s3-PUBLIC...]
get(ok): stimuli/audio10.TextGrid (file) [from s3-PUBLIC...]
[1 similar message has been suppressed; disable with datalad.ui.suppress-similar-results=off]
[10 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
get (ok: 20)
168K stimuli/
Exercise: Get the size of the .git/ folder.
Solution
# Linux/macOS
!du -sh .git
9.8M .git
# Windows
!dir /s .git
Section 4: Inspecting File Identifiers
Background
DataLad is a decentralized data management system, which means it does not rely on any central issuing service. This presents a challenge: how can files be unambiguously identified when there exists an unknown number of DataLad datasets that were created independently? The answer is checksums. Checksums are alphanumeric strings that are generated from the file content via a hashing algorithm. Even the tiniest change in the file will result in a different checksum, which makes them unique identifiers of file content.
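As a quick illustration with Python's standard hashlib module (the input strings are arbitrary):
import hashlib
print(hashlib.sha256(b"The quick brown fox").hexdigest())
print(hashlib.sha256(b"The quick brown fox!").hexdigest())  # one extra character, a completely different checksum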
DataLad manages these file identifiers for us using git-annex under the hood. While most of the time we don’t have to think about the git-annex operations, it can be useful to peek under the hood and use some git-annex commands directly to get more detailed information or configure the dataset’s behavior.
Exercises
In this section we are going to use git annex directly to get more detailed information on the files in our dataset, like their identifiers and storage locations. We’ll also use git annex to configure how many copies of a given file we want to keep. Here are the commands you need to know:
| Command | Description |
|---|---|
| `git annex info` | Show the git-annex information for the whole dataset |
| `git annex info folder/image.png` | Show the git-annex information for the file image.png |
| `git annex whereis folder/image.png` | List the repositories that have the file content for image.png |
| `git annex numcopies 2` | Configure the dataset so that the required number of copies for a file is 2 |
Example: Get the git-annex info for the file stimuli/audio01.wav.
!git annex info stimuli/audio01.wav
file: stimuli/audio01.wav
size: 31.32 megabytes
key: SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav
present: false
Exercise: Get the file content for stimuli/audio01.wav, then print the git-annex info for that file again.
Solution
!datalad get stimuli/audio01.wav
!git annex info stimuli/audio01.wav
get(ok): stimuli/audio01.wav (file) [from s3-PUBLIC...]
file: stimuli/audio01.wav
size: 31.32 megabytes
key: SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav
present: true
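Note how the key shown above bundles the hashing backend, the file size, and the checksum into a single content identifier. A small sketch to pick apart the key from this example (the field layout follows git-annex's key format):
key = "SHA256E-s31322156--61207e6f7fe2f2d85a857800af6066048c5d18baa424d47d0f0ab596fafdbb12.wav"
backend, size, _, checksum = key.split("-")
print(backend)        # SHA256E: SHA-256 hashing, keeping the file extension ("E")
print(int(size[1:]))  # 31322156: the file size in bytes ("s" marks the size field)
print(checksum)       # the SHA-256 hash of the content, followed by the original .wav extension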
Exercise: List the repositories that contain the file content for stimuli/audio01.wav.
Solution
!git annex whereis stimuli/audio01.wav
whereis stimuli/audio01.wav (3 copies)
35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
f788b0ef-dbbe-43d4-a731-2f0451b5b021 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 [here]
s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004408/stimuli/single_speaker/audio01.wav?versionId=DgF1hKqcMi0Mbi_Cjrcwxrhe.9fr2GRU
ok
Exercise: List the repositories that contain the file content for stimuli/audio02.wav - how is this different from the list of repositories in the previous exercise?
Solution
!git annex whereis stimuli/audio02.wav
whereis stimuli/audio02.wav (2 copies)
35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
s3-PUBLIC: https://s3.amazonaws.com/openneuro.org/ds004408/stimuli/single_speaker/audio02.wav?versionId=0k0s_818LeMgL_NnL..N5YZuAuwBUVrC
ok
Exercise: Set the number of required copies of a file to 3.
Solution
!git annex numcopies 3
numcopies 3 ok
(recording state in git...)
Exercise: Try to drop stimuli/audio01.wav. What does the error message say?
Solution
!datalad drop stimuli/audio01.wav
drop(error): stimuli/audio01.wav (file) [unsafe; Could only verify the existence of 1 out of 3 necessary copies.; (Note that these git remotes have annex-ignore set: origin); (Use --reckless availability to override this check, or adjust numcopies.)]
Exercise: Set the number of required copies of a file to 1 and drop stimuli/audio01.wav.
Solution
!git annex numcopies 1
!datalad drop stimuli/audio01.wav
numcopies 1 ok
(recording state in git...)
drop(ok): stimuli/audio01.wav (file)
Exercise: Print the git-annex info for the whole dataset.
Solution
!git annex info
trusted repositories: 0
semitrusted repositories: 5
00000000-0000-0000-0000-000000000001 -- web
00000000-0000-0000-0000-000000000002 -- bittorrent
35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
f788b0ef-dbbe-43d4-a731-2f0451b5b021 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 [here]
untrusted repositories: 0
transfers in progress: none
available local disk space: 801.6 gigabytes (+100 megabytes reserved)
local annex keys: 20
local annex size: 3.53 megabytes
annexed files in working tree: 1181
size of annexed files in working tree: 20.08 gigabytes
combined annex size of all repositories: 60.66 gigabytes
annex sizes of repositories:
30.41 GB: b602100b-9fb2-44d7-8bf1-0ed87e6d3aa4 -- OpenNeuro
30.25 GB: 35910075-8a45-4d2f-a851-eeba11d474f8 -- [s3-PUBLIC]
3.53 MB: f788b0ef-dbbe-43d4-a731-2f0451b5b021 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/01_working_with_datalad_datasets/ds004408 [here]
backend usage:
SHA256E: 1181
bloom filter size: 32 mebibytes (0% full)
Section 5: Examining a New Data Set
Now you are equipped to consume any DataLad dataset that has been published online - let’s try it out! Search the OpenNeuro database for a dataset that interests you and clone it. Then (if you prefer to work from Python, see the sketch after this list):
- print the git annex info of that dataset
- get some of the file contents and check the disk usage before and after
- drop the file contents and check the disk usage again
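Here is a minimal sketch of this workflow using DataLad's Python API. The URL and file path below refer to the ds004408 dataset from this notebook as a stand-in - substitute the dataset you picked - and disk_usage() is a small helper we define ourselves as a rough, portable replacement for du -sh. For the git-annex info step, run !git annex info from inside the clone, as in Section 4.
import os
import datalad.api as dl

def disk_usage(root):
    """Sum the sizes of all files below root (a rough, portable `du -sh`)."""
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            fpath = os.path.join(dirpath, name)
            if os.path.exists(fpath):  # skip dangling annex symlinks (content not present)
                total += os.path.getsize(fpath)
    return total

url = "https://github.com/OpenNeuroDatasets/ds004408.git"  # stand-in: use your dataset's URL
ds = dl.clone(source=url, path="my_dataset")  # arbitrary target folder name

print(disk_usage(ds.path))      # size right after cloning
ds.get("stimuli/audio01.wav")   # adjust the path to a file that exists in *your* dataset
print(disk_usage(ds.path))      # size after fetching the content
ds.drop("stimuli/audio01.wav")  # drop the content again
print(disk_usage(ds.path))      # back to (roughly) the original size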