Research Data Management with DataLad
Learn to create datasets with version control, track computational provenance, and use open science repositories to make your science more open and reproducible
Most research datasets change and evolve as scientists collect additional samples or perform computations on the data. This can make it challenging to retrieve previous states of the dataset or to understand how a particular piece of data came into existence (this is called the provenance of the data). To further complicate matters, most research data needs to be shared at some point, within the lab, with external collaborators, or on publicly accessible repositories, and the shared copies have to be kept up to date once the original changes.
DataLad is a software tool that addresses these challenges. Built on Git and git-annex, it allows users to manage their data with version control, duplicate and share data while keeping all copies up to date, and track computational provenance to reproduce analyses. The workflows that DataLad enables are illustrated in the figure below (from Halchenko et al., 2021):
DataLad datasets can be divided into reusable modular components (e.g. raw data, preprocessed data, analysis results) which can be stored and shared separately, on private, password-protected servers or on publicly accessible repositories. Others can retrieve the data, analyze it, and even add new data to their copy of the dataset. Throughout, DataLad maintains a complete provenance trail all the way from a publication back to the original data, ensuring that the research stays transparent and reproducible.
Overview
In this course, you will learn how to use DataLad to manage your own scientific data analysis projects. In the first unit, you will retrieve public data from the OpenNeuro platform to understand the structure and usage of DataLad datasets. You will also create your own dataset from scratch and use version control to inspect and restore older versions of your data. In the second unit, you will learn how to create and manage copies of your dataset, both locally and on the open science repositories GIN and OSF. Finally, you will see how DataLad captures the provenance of computational outputs and how you can use this provenance record to repeat individual analysis steps or entire analysis pipelines.
Further Reading
- DataLad handbook: an excellent source for explanations and tutorials on basic and advanced functions of DataLad
- Command line reference: contains detailed information on all DataLad commands and their options
- Git-annex documentation: a great resource for those who want to understand the underlying file system operations in greater detail
- “DataLad: distributed system for joint management of code, data, and their relationship” (Halchenko et al., 2021): this paper contains a concise description of the scope and software architecture of DataLad
- “FAIRly big: A framework for computationally reproducible processing of large-scale data” (Wagner et al., 2022): this paper illustrates how DataLad provides transparency and scalability for large-scale data analysis
Installation
To run the course materials on your own machine, it is recommended that you:
- Install VSCode as your editor
- Install pixi or, alternatively, conda to create virtual Python environments (see the lessons on environment and package management)
- Create a dedicated folder for this course and install the virtual environment:
If you use pixi, download the pixi.toml file and install the environment:

```bash
pixi install --manifest-path pixi.toml
pixi shell
```

If you use conda, download the environment.yml file instead and install the environment:

```bash
conda env create -f environment.yml
conda activate datalad
```
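Once the environment is active, you can verify that the tools are available (assuming the environment provides DataLad and git-annex):

```bash
datalad --version
git annex version
```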
Course Contents
Creating and Using DataLad Datasets
Working with a DataLad Dataset
Clone datasets from the OpenNeuro platform, retrieve file content, and use git-annex to identify storage locations
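A minimal sketch of the workflow this lesson covers (the accession number ds000001 and the file path are just examples; any OpenNeuro dataset works):

```bash
# Clone a dataset from OpenNeuro's GitHub mirror; this fetches the file
# tree and metadata, but not the (potentially large) file contents
datalad clone https://github.com/OpenNeuroDatasets/ds000001.git
cd ds000001

# Retrieve the actual content of a single file on demand
datalad get sub-01/anat/sub-01_T1w.nii.gz

# Ask git-annex which remotes hold copies of that file
git annex whereis sub-01/anat/sub-01_T1w.nii.gz

# Drop the local copy again; it stays retrievable from the remote
datalad drop sub-01/anat/sub-01_T1w.nii.gz
```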
Creating a DataLad Dataset from Scratch
Create a new dataset, add and modify content, and use the Git history to inspect and restore old states of the data
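As a rough sketch of the commands involved (names and commit messages are placeholders):

```bash
# Create a dataset; the text2git configuration keeps text files in Git itself
datalad create -c text2git my-dataset
cd my-dataset

# Add content and record two versions of a file in the history
echo "first draft" > notes.txt
datalad save -m "Add notes"
echo "second draft" > notes.txt
datalad save -m "Revise notes"

# Inspect the history and restore the previous version of the file
git log --oneline
git checkout HEAD~1 -- notes.txt
datalad save -m "Restore first draft"
```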
Data Sharing and Provenance Tracking
Creating and Managing Sibling Datasets
Copy your dataset locally or to online repositories (OSF, GIN, GitHub) and see how changes can propagate across these sibling datasets
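In command form, this might look roughly as follows (sibling and repository names are placeholders; publishing to GIN requires an account with credentials set up, and OSF support comes from the separate datalad-osf extension):

```bash
# A clone on the same machine acts as a sibling of the original dataset
datalad clone my-dataset my-dataset-copy
datalad siblings -d my-dataset-copy      # "origin" points back to the original

# From inside the original dataset: register a sibling on GIN and push
# both the Git history and the annexed file content
cd my-dataset
datalad create-sibling-gin my-dataset -s gin
datalad push --to gin

# Later, fetch and merge changes that arrived on the sibling
datalad update -s gin --how merge
```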
Running Commands while Tracking Provenance
Run commands, track their inputs and outputs, and rerun single commands or entire pipelines from the Git history
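As a sketch (the script and file names are hypothetical): datalad run executes a command and records it, together with its inputs and outputs, as a commit in the dataset's history, which datalad rerun can later replay:

```bash
# Execute a command while capturing its provenance; {inputs}/{outputs}
# are expanded from the --input/--output specifications
datalad run -m "Smooth anatomical image" \
  --input "sub-01/anat/sub-01_T1w.nii.gz" \
  --output "derivatives/sub-01_T1w_smoothed.nii.gz" \
  "python smooth.py {inputs} {outputs}"

# Repeat the most recently recorded command, e.g. after an input changed
datalad rerun HEAD

# Replay every recorded command since a given commit (an entire pipeline)
datalad rerun --since=COMMIT
```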