Data Sharing and Provenance Tracking


This unit covers two more advanced functions of DataLad: dataset siblings and the run command. Siblings are linked copies of a dataset that can be used, for example, to back up data on an external drive or to publish a dataset to an online repository. You’ll learn how to use the Open Science Framework (OSF), the G-Node Infrastructure (GIN) and GitHub to host and publish your data, and how to create local backup copies. Once these siblings are created and registered, DataLad can keep them in sync and propagate changes across datasets. Note that the exercises on creating online siblings expect you to have a (free) account for the respective service.
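As a rough illustration of the sibling idea, the sketch below creates a throwaway dataset, registers a local copy as a sibling and pushes to it. It assumes DataLad and git-annex are installed. The dataset name, the sibling name "backup", the file name and the temporary paths are hypothetical and not taken from the course exercises. Online siblings (OSF, GIN, GitHub) need an account and credentials, so they are not shown here.

```shell
#!/bin/sh
set -eu

# Skip the demo if DataLad is not available on this machine.
if ! command -v datalad >/dev/null 2>&1; then
  echo "DataLad not found; skipping the sibling demo"
  exit 0
fi

# Throwaway identity for this demo only, so commits work without a global Git setup.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Hypothetical scratch location for the dataset and its backup copy.
work=$(mktemp -d)
cd "$work"

# Create a dataset and save one small file in it.
datalad create demo-dataset
cd demo-dataset
printf 'some data\n' > notes.txt
datalad save -m "Add notes"

# Create a sibling at a local path and register it under the name "backup".
datalad create-sibling --name backup "$work/backup-copy"

# Transfer the dataset history and file content to the sibling.
datalad push --to backup

# List the siblings that are registered for this dataset.
datalad siblings
```

After further changes are saved in the dataset, running the push command again would bring the backup up to date.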

You’ll also learn about DataLad’s run command, which can execute any command or script you would run in a terminal while keeping track of its inputs and outputs. This creates a full provenance trail for your data analysis, allowing you to easily rerun single analysis steps or entire pipelines. Together, siblings and the run command provide the tools for creating reproducible workflows. You can publish your dataset (including the code and computational environment) as a sibling in an online repository, and others can simply clone it and rerun your analysis, thanks to the provenance stored by DataLad.
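The following minimal sketch shows what recording and repeating a command might look like. It assumes DataLad and git-annex are installed. The dataset name, file names and the `tr` command are made up for illustration and do not come from the course exercises.

```shell
#!/bin/sh
set -eu

# Skip the demo if DataLad is not available on this machine.
if ! command -v datalad >/dev/null 2>&1; then
  echo "DataLad not found; skipping the run demo"
  exit 0
fi

# Throwaway identity for this demo only, so commits work without a global Git setup.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Hypothetical scratch location for the dataset.
work=$(mktemp -d)
cd "$work"

datalad create demo-analysis
cd demo-analysis

# Add a small input file and save it to the dataset history.
printf 'hello provenance\n' > words.txt
datalad save -m "Add input file"

# Execute a command and record it, its input and its output in one commit.
# {inputs} and {outputs} are placeholders that DataLad fills in from the options.
datalad run -m "Convert words to upper case" \
  --input words.txt \
  --output upper.txt \
  "tr a-z A-Z < {inputs} > {outputs}"

# Re-execute the command recorded in the most recent commit.
datalad rerun
```

The run record is stored in the commit message, so `git log` shows which command produced `upper.txt` and from which input.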

After this unit, you’ll know the following DataLad commands (note that the linked command-line reference contains a lot more detail than is required for the exercises):

Sessions

Creating and Managing Sibling Datasets

Copy your dataset locally or to online repositories (OSF, GIN, GitHub) and see how changes can propagate across these sibling datasets

Running Commands while Tracking Provenance

Run commands, track inputs and outputs and rerun single commands or entire pipelines from the Git history