Research Data Management with DataLad

Learn to create version-controlled datasets, track computational provenance, and use open science repositories to make your research more open and reproducible.


Authors
Dr. Ole Bialas | Dr. Michał Szczepanik

Most research datasets change and evolve as scientists collect additional samples or perform computations on the data. This can make it challenging to retrieve previous states of a dataset, or simply to understand how a particular piece of data came into existence (its provenance). To complicate matters further, most research data needs to be shared at some point, whether within the lab, with external collaborators, or on publicly accessible repositories, and the shared copies must stay up to date when the original changes.

DataLad is a software tool that addresses these challenges. Built on Git and git-annex, it lets users manage their data with version control, duplicate and share data while keeping the copies up to date, and track computational provenance so that analyses can be reproduced. The workflows that DataLad enables are illustrated in the figure below (from Halchenko et al., 2021):

DataLad datasets can be divided into reusable modular components (e.g. raw data, preprocessed data, analysis results) which can be stored and shared separately, on private, password-protected servers or on publicly accessible repositories. Others can retrieve the data, analyze it and even add new data to their copy of the dataset. During all of this, DataLad establishes a complete provenance trail all the way from a publication to the original data, ensuring that the research stays transparent and reproducible.

Overview

In this course, you will learn how to use DataLad to manage your own scientific data analysis projects. In the first unit, you will retrieve public data from the OpenNeuro platform to understand the structure and usage of DataLad datasets. You will also create your own dataset from scratch and use version control to inspect and restore older versions of your data. In the second unit, you will learn how to create and manage copies of your dataset, both locally and on the open science repositories GIN and OSF. Finally, you will see how DataLad captures the provenance of computational outputs and how you can use this provenance record to repeat individual analysis steps and entire analysis pipelines.

Further Reading

Credits

Dr. Ole Bialas
Dr. Michał Szczepanik

Installation

To run the course materials on your own machine, it is recommended that you set up the software environment using one of the two options below.

Option 1 (Pixi): Download the pixi.toml file and install the environment:

pixi install --manifest-path pixi.toml
pixi shell

Option 2 (conda): Download the environment.yml file, then create and activate the environment:

conda env create -f environment.yml
conda activate datalad