Research Data Management with DataLad
Learn to create datasets with version control, track computational provenance, and use open science repositories to make your science more open and reproducible
Most research datasets change and evolve as scientists collect additional samples or perform computations on the data. This can make it challenging to retrieve previous states of the dataset or to understand how a particular piece of data came into existence (this is called the provenance of the data). To further complicate matters, most research data needs to be shared at some point, within the lab, with external collaborators, or on publicly accessible repositories, and the shared copies have to be kept up to date once the original changes.
DataLad is a software tool that addresses these challenges. Built on Git and git-annex, it allows users to manage their data with version control, duplicate and share data while keeping all copies up to date, and track computational provenance to reproduce analyses. The workflows that DataLad enables are illustrated in the figure below (from Halchenko et al., 2021):
DataLad datasets can be divided into reusable modular components (e.g. raw data, preprocessed data, analysis results) which can be stored and shared separately, on private, password-protected servers or on publicly accessible repositories. Others can retrieve the data, analyze it, and even add new data to their copy of the dataset. Throughout, DataLad maintains a complete provenance trail all the way from a publication back to the original data, ensuring that the research stays transparent and reproducible.
Overview
In this course, you will learn how to use DataLad to manage your own scientific data analysis projects. In the first unit, you will retrieve public data from the OpenNeuro platform to understand the structure and usage of DataLad datasets. You will also create your own dataset from scratch and use version control to inspect and restore older versions of your data. In the second unit, you will learn how to create and manage copies of your dataset, both locally and on the open science repositories GIN and OSF. Finally, you will see how DataLad captures the provenance of computational outputs and how you can use this provenance record to repeat individual analysis steps or entire analysis pipelines.
Further Reading
- DataLad handbook: an excellent source for explanations and tutorials on basic and advanced functions of DataLad
- Command line reference: contains detailed information on all DataLad commands and their options
- Git-annex documentation: a great resource for those who want to understand the underlying file system operations in greater detail
- “DataLad: distributed system for joint management of code, data, and their relationship” (Halchenko et al., 2021): this paper contains a concise description of the scope and software architecture of DataLad
- “FAIRly big: A framework for computationally reproducible processing of large-scale data” (Wagner et al., 2022): this paper illustrates how DataLad provides transparency and scalability for large-scale data analysis
Installation
To run the course materials on your own machine, it is recommended that you:
- Install VSCode as your editor
- Install pixi or, alternatively, conda to create virtual Python environments (see the lessons on environment and package management)
- Create a dedicated folder for this course and install the virtual environment:
If you use pixi, download the pixi.toml file and install the environment:

```bash
pixi install --manifest-path pixi.toml
pixi shell
```

If you use conda, download the environment.yml file instead and install the environment:

```bash
conda env create -f environment.yml
conda activate datalad
```
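Once the environment is active, you can verify that the tools are available (assuming the environment provides DataLad and git-annex):

```bash
datalad --version
git annex version
```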
Course Contents
Creating and Using DataLad Datasets
Working with a DataLad Dataset
Clone datasets from the OpenNeuro platform, retrieve file content, and use git-annex to identify storage locations
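A minimal sketch of the workflow this lesson covers (the accession number ds000001 and the file path are just examples; any OpenNeuro dataset works):

```bash
# Clone a dataset from OpenNeuro's GitHub mirror; this fetches the file
# tree and metadata, but not the (potentially large) file contents
datalad clone https://github.com/OpenNeuroDatasets/ds000001.git
cd ds000001

# Retrieve the actual content of a single file on demand
datalad get sub-01/anat/sub-01_T1w.nii.gz

# Ask git-annex which remotes hold copies of that file
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# Drop the local copy again; it stays retrievable from the remote
datalad drop sub-01/anat/sub-01_T1w.nii.gz
```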
Creating a DataLad Dataset from Scratch
Create a new dataset, add and modify content, and use the Git history to inspect and restore old states of the data
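As a rough sketch of the commands involved (names and commit messages are placeholders):

```bash
# Create a dataset; the text2git configuration keeps text files in Git itself
datalad create -c text2git my-dataset
cd my-dataset

# Add content and record two versions of a file in the history
echo "first draft" > notes.txt
datalad save -m "Add notes"
echo "second draft" > notes.txt
datalad save -m "Revise notes"

# Inspect the history and restore the previous version of the file
git log --oneline
git checkout HEAD~1 -- notes.txt
datalad save -m "Restore first draft"
```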
Data Sharing and Provenance Tracking
Creating and Managing Sibling Datasets
Copy your dataset locally or to online repositories (OSF, GIN, GitHub) and see how changes can propagate across these sibling datasets
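In command form, this might look roughly as follows (sibling and repository names are placeholders; publishing to GIN requires an account with credentials set up, and OSF support comes from the separate datalad-osf extension):

```bash
# A clone on the same machine acts as a sibling of the original dataset
datalad clone my-dataset my-dataset-copy
datalad siblings -d my-dataset-copy      # "origin" points back to the original

# From inside the original dataset: register a sibling on GIN and push
# both the Git history and the annexed file content
cd my-dataset
datalad create-sibling-gin my-dataset -s gin
datalad push --to gin

# Later, fetch and merge changes that arrived on the sibling
datalad update -s gin --how merge
```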
Running Commands while Tracking Provenance
Run commands, track their inputs and outputs, and rerun single commands or entire pipelines from the Git history
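As a sketch (the script and file names are hypothetical): datalad run executes a command and records it, together with its inputs and outputs, as a commit in the dataset's history, which datalad rerun can later replay:

```bash
# Execute a command while capturing its provenance; {inputs}/{outputs}
# are expanded from the --input/--output specifications
datalad run -m "Smooth anatomical image" \
  --input "sub-01/anat/sub-01_T1w.nii.gz" \
  --output "derivatives/sub-01_T1w_smoothed.nii.gz" \
  "python smooth.py {inputs} {outputs}"

# Repeat the most recently recorded command, e.g. after an input changed
datalad rerun HEAD

# Replay every recorded command since a given commit (an entire pipeline)
datalad rerun --since=COMMIT
```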