Creating and Managing Sibling Datasets

Authors
Ole Bialas | Michał Szczepanik

One of the most valuable features of DataLad is the ability to create and manage multiple instances of a dataset. These so-called siblings are linked copies that can exchange changes just like Git repositories. Whether you want to back up your dataset locally, transfer it to an HPC cluster for analysis or publish it on an open science platform, DataLad’s siblings provide a convenient way of doing it without having to worry about the underlying file system operations.

In this session we will:

  1. Create a sibling on an open science platform (you can choose between GIN and OSF).
  2. Create a sibling locally, in a separate folder or external drive.
  3. Create a sibling on GitHub and link it to the data on the open science platform.

To create siblings, we first need a dataset. The cell below creates a new dataset with the -c yoda option, which configures the dataset according to the YODA principles, a set of practices for data analysis in DataLad datasets:

  1. One thing, one dataset.
  2. Record where you got it from, and where it is now.
  3. Record what you did to it, and with what.

For our purposes, it is enough that this configuration option automatically creates some folders and files (e.g., README.md and code/README.md) so we can create siblings and exchange data without having to add content ourselves. Just run the cell below to create a YODA dataset in the my-data directory.

import os
# deactivate DataLad's progressbar for this notebook
os.environ['DATALAD_UI_PROGRESSBAR'] = 'none'

!datalad create -c yoda my-data
%cd my-data
!ls
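
If you want to confirm that the YODA configuration did its job, a small check like the following can help. It is only a sketch: the helper name missing_yoda_files is our own, and it looks only for the two files mentioned above (README.md and code/README.md), not for everything the configuration may create.

```python
from pathlib import Path

# Files named in the text above; the yoda configuration may create more.
EXPECTED_FILES = ("README.md", "code/README.md")


def missing_yoda_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected files that are NOT present in dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


# After `%cd my-data`, the current directory is the dataset.
# An empty list means that both files were found.
print(missing_yoda_files("."))
```

An empty list tells you that the dataset already contains content that can be pushed to a sibling.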

Section 1: Using Open Science Repositories

Background

There are several open science repositories that can host your dataset for free. Some, like OpenNeuro and DANDI, require standardized data formats while others accept any kind of data. In this section we will use repositories of the latter kind: OSF and GIN. DataLad provides ready-made functions to export your data to these repositories: create-sibling-gin and create-sibling-osf (requires the datalad-osf extension). Once these siblings are created you can push and clone just like you would with a regular Git repository.

The create-sibling commands require that DataLad can authenticate itself to create new repositories in your name. To do this, it requires access tokens, which are like long passwords that expire after a certain time. These tokens can be created using the web interface of the respective services (see screenshots in the exercises). When creating siblings, DataLad will prompt you for the token: simply copy it from the website and paste it into your terminal.

Exercises

In the following exercises, you are going to create a sibling for your dataset on GIN or OSF. It is recommended that you choose one of these options and then proceed with the rest of the notebook. If you have time, you can then return to this section to do the other option.

NOTE: Because the datalad create-sibling commands will prompt you for access tokens, you cannot do these exercises within a Jupyter notebook and you’ll have to use the terminal.

Command | Description
datalad osf-credentials | Authenticate DataLad with OSF using an access token
datalad siblings | List all siblings of the current dataset
datalad create-sibling-gin my-repo -s gin | Create a new GIN repository called my-repo and add it as a sibling named gin
datalad create-sibling-osf my-repo -s osf | Create a new OSF repository called my-repo and add it as a sibling named osf
datalad push --to gin | Push the dataset content to the sibling named gin
datalad credentials get gin | Get the credentials stored under the name gin
datalad credentials remove gin | Delete the credentials stored under the name gin

Option 1: GIN

GIN is run by the German Neuroinformatics Node (G-Node), a research group based at the Ludwig-Maximilians-Universität München (LMU Munich) in Germany. It is a free data management platform designed for neuroscience research. GIN provides Git-based version control for scientific data and supports both a web interface and command-line access, with Git/git-annex integration for managing large datasets. In the following exercises, you’ll create an access token on GIN, create a sibling repository and publish your data.

Exercise: On GIN, go to your user settings (red box), then select “Applications” (blue box) and “Generate New Token”. Give the token any name and copy it.

NOTE: You won’t be able to see the token again once you have closed the window!

Exercise: Run the cell below to create a new repository my-data on your GIN account, register it as a sibling with the name gin, store the credential under the name gin and use the https access protocol. Paste the access token when you are prompted.

NOTE: If you entered a wrong token, you can run datalad credentials remove gin to delete it.

!datalad create-sibling-gin my-data -s gin --credential gin --access-protocol https
An access token is required for https://gin.g-node.org. Visit https://gin.g-node.org/user/settings/applications to create a token
token:
An access token is required for https://gin.g-node.org. Visit https://gin.g-node.org/user/settings/applications to create a token
token (repeat):

create_sibling_gin(ok): [sibling repository 'gin' created at https://gin.g-node.org/obi/my-data]
configure-sibling(ok): . ((sibling))
action summary:
  configure-sibling (ok: 1)
  create_sibling_gin (ok: 1)

Exercise: Check your GIN account to verify the repository was created and use datalad siblings to verify that the sibling was registered.

Solution
!datalad siblings
.: here(+) [git]
.: gin(-) [https://gin.g-node.org/obi/my-data (git)]
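
If you want to check for a sibling from code rather than by eye, the text output above can be parsed. The following is a minimal sketch, assuming the output keeps the line format shown in this notebook. The function name parse_siblings and the dictionary keys are our own choices, and the +/- marker is stored as-is rather than interpreted.

```python
import re

# One line of `datalad siblings` output as shown above, for example:
#   .: gin(-) [https://gin.g-node.org/obi/my-data (git)]
LINE = re.compile(
    r"^(?P<path>.+?): (?P<name>[^\s(]+)\((?P<flag>[+-])\) \[(?P<detail>.*)\]$"
)
# The bracket holds either "<url> (<kind>)" or just "<kind>".
DETAIL = re.compile(r"^(?P<url>.+) \((?P<kind>[^)]+)\)$")


def parse_siblings(output):
    """Turn the text printed by `datalad siblings` into a list of dicts."""
    siblings = []
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a sibling line
        detail = DETAIL.match(match["detail"])
        siblings.append({
            "name": match["name"],
            "flag": match["flag"],  # the +/- marker, kept uninterpreted
            "url": detail["url"] if detail else None,
            "kind": detail["kind"] if detail else match["detail"],
        })
    return siblings


sample = """\
.: here(+) [git]
.: gin(-) [https://gin.g-node.org/obi/my-data (git)]
"""
for sibling in parse_siblings(sample):
    print(sibling["name"], sibling["url"])
```

Parsing printed output is brittle, so treat this as a convenience for the exercises rather than a stable interface.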

Exercise: Push --to gin and enter your GIN username and password when prompted. Then, check the repository in the browser to verify the data was transferred.

Solution
!datalad push --to gin
Update availability for 'gin':  75%|█████████████████████████       | 3.00/4.00 [00:00<00:00, 16.3k Steps/s
Username for 'https://gin.g-node.org': obi
Password for 'https://obi@gin.g-node.org':
publish(ok): . (dataset) [refs/heads/master->gin:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->gin:refs/heads/git-annex [new branch]]
action summary:
  copy (notneeded: 1)
  publish (ok: 2)

Option 2: OSF

The Open Science Framework (OSF) is run by the Center for Open Science (COS), a non-profit technology organization based in Charlottesville, Virginia, USA, dedicated to increasing openness, integrity, and reproducibility of research.

DataLad connects to OSF through the datalad-osf extension. Its datalad create-sibling-osf command sets up git-annex special remotes that store annexed files on OSF while Git tracks the metadata. This enables version-controlled data sharing and collaboration through OSF’s infrastructure.

Creating a sibling on OSF requires the datalad-osf extension, which you already have if you installed the course environment. If you don’t have it, just run pip install datalad-osf. It is also recommended to configure DataLad to load the datalad-next extension, which can be done by running the following cell.

!git config --global --add datalad.extensions.load next

Exercise: Log in to OSF, go to “Settings” > “Personal Access Token” (red box) and click on “Create Token” (blue box).

Give the token a name of your choice, grant it full read and write permissions and click on “Create Token”.

Copy the token. Careful: you won’t be able to see the token again once you have closed the window!

Exercise: Run the datalad osf-credentials command and paste the access token when prompted. You should see osf_credentials(ok): [authenticated as <your name>]

NOTE: This requires interactive input and has to be done in the terminal. If your token expired you can use datalad osf-credentials --reset to enter a new token.

Solution
!datalad osf-credentials
You need to authenticate with 'https://osf.io' credentials. https://osf.io/settings/tokens provides information on how to gain access
token:
osf_credentials(ok): [authenticated as Ole Bialas]

Exercise: Use create-sibling-osf to create a new OSF repository and register it as a sibling to my-data/ with the name osf.

Solution
!datalad create-sibling-osf --title my-data -s osf
create-sibling-osf(ok): https://osf.io/hvgkc/
[INFO   ] Configure additional publication dependency on "osf-storage" 
configure-sibling(ok): . (sibling)

Exercise: Check your OSF account to confirm that the repo was created and verify it was registered as a sibling of your dataset.

Solution
!datalad siblings
.: here(+) [git]
.: osf(-) [osf://hvgkc (git)]
.: gin(-) [https://gin.g-node.org/obi/my-data (git)]
.: osf-storage(+) [osf]

Exercise: Push to the osf sibling and inspect your OSF repository in the browser.

NOTE: The OSF repository will not contain the data in a form that is human-readable. You can push to and pull from this repository but you can’t explore files in the browser. Alternatively, you can configure OSF as a human-readable special remote which contains file data but not version history. See this tutorial for a description of how to do that.

Solution
!datalad push --to osf
publish(ok): . (dataset) [refs/heads/master->osf:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->osf:refs/heads/git-annex [new branch]]
action summary:
  publish (ok: 2)

Section 2: Creating Local Backups

Background

Siblings can not only be used to publish your data online, they can also be used to create backups on an external drive or share data with collaborators via a local server. To do this, we can simply initialize a --bare git repository at the desired location and add it as a sibling to our DataLad dataset. Bare means that the git repository has no working tree - the contents that are normally hidden in the .git folder are in the main directory. The absence of a working tree prevents issues of synchronization and accidental overwriting when pushing to and pulling from the repository. While a bare repository is not suited for editing files directly, it can provide a common endpoint for multiple collaborators.
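
The layout difference described above can be inspected directly. The following sketch uses a helper of our own, looks_like_bare_repo, which only checks a few well-known entries (HEAD, objects/, refs/) and is not a full validity test. The demonstration at the end runs only if git is found on your system and works in a temporary folder.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def looks_like_bare_repo(path):
    """Heuristic: Git's internal files sit directly in `path`, not in `path/.git`."""
    root = Path(path)
    has_internals = (
        (root / "HEAD").is_file()
        and (root / "objects").is_dir()
        and (root / "refs").is_dir()
    )
    return has_internals and not (root / ".git").exists()


# Demonstration in a throwaway directory; skipped if git is not installed.
if shutil.which("git"):
    with tempfile.TemporaryDirectory() as tmp:
        bare = Path(tmp) / "backup"
        regular = Path(tmp) / "work"
        subprocess.run(["git", "init", "--quiet", "--bare", str(bare)], check=True)
        subprocess.run(["git", "init", "--quiet", str(regular)], check=True)
        print("bare:", looks_like_bare_repo(bare))
        print("regular:", looks_like_bare_repo(regular))
```

In the regular repository the same entries exist, but inside the hidden .git folder, next to a working tree where files can be edited.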

Exercises

In the following exercises, you are going to initialize a --bare git repository and add it as a sibling to your dataset. If your setup allows it, you can create the sibling repository on a separate drive, to mimic creating a backup of your data. Once the sibling is created we can clone it and see how changes can propagate across siblings. Here are the commands you need to know:

Command | Description
git init --bare ./mydir | Create a --bare repository called mydir in the current directory
datalad siblings | List all siblings of the current dataset
datalad siblings add --name new --url <path> | Add the repository at the URL as a new sibling with the name new
datalad siblings remove --name new | Remove the sibling with the name new
datalad push --to new | Push the dataset content to the sibling named new
datalad clone <source> <destination> | Clone a dataset from <source> to <destination>
echo "text" >> file.txt | Append “text” to file.txt
datalad save | Save all untracked changes in the current dataset
datalad update -s new | Update the dataset’s content from the sibling new
datalad update -s new --merge | Merge updates from sibling new

Example: Initialize a --bare git repository in the directory ../my-data-backup.

!git init --bare ../my-data-backup
Initialized empty Git repository in /home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/my-data-backup/

Exercise: Create a --bare git repository in another folder (preferably on a different drive).

Solution
!git init --bare ../my-data-backup2
Initialized empty Git repository in /home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/my-data-backup2/

Example: Add ../my-data-backup as a sibling to my-data/ with the name backup.

!datalad siblings add --name backup --url ../my-data-backup
.: backup(-) [../my-data-backup (git)]

Exercise: Add the --bare repository you created in a folder of your choice as a sibling to my-data/ with a name of your choice.

Solution
!datalad siblings add --name backup2 --url ../my-data-backup2
.: backup2(-) [../my-data-backup2 (git)]

Exercise: List all siblings of my-data/.

Solution
!datalad siblings
.: here(+) [git]
.: osf(-) [osf://hvgkc (git)]
.: backup2(-) [../my-data-backup2 (git)]
.: gin(-) [https://gin.g-node.org/obi/my-data (git)]
.: osf-storage(+) [osf]
.: backup(-) [../my-data-backup (git)]

Example: Push --to the sibling backup.

!datalad push --to backup
publish(ok): . (dataset) [refs/heads/master->backup:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->backup:refs/heads/git-annex [new branch]]
action summary:
  copy (notneeded: 1)
  publish (ok: 2)

Exercise: Push --to the sibling you created.

Solution
!datalad push --to backup2
publish(ok): . (dataset) [refs/heads/master->backup2:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->backup2:refs/heads/git-annex [new branch]]
action summary:
  copy (notneeded: 1)
  publish (ok: 2)

Exercise: Clone the dataset in my-data-backup/ to a new folder called recovery/.

Solution
%cd ..
!datalad clone ./my-data-backup ./recovery
/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets
install(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/recovery (dataset)

Exercise: Go to the my-data/ directory, add a line to README.md and save the changes. Then, push --to the sibling backup.

Solution
%cd my-data
!echo "Hello Sibling!" >> README.md
!datalad save
!datalad push --to backup
/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/my-data
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
publish(ok): . (dataset) [refs/heads/master->backup:refs/heads/master 1d7a5fe..059799d]
action summary:
  publish (notneeded: 1, ok: 1)

Exercise: Now, go to the recovery/ directory and list all siblings.

Solution
%cd ../recovery
!datalad siblings
/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/recovery
.: here(+) [git]
.: osf-storage(+) [osf]
.: origin(+) [../my-data-backup (git)]

Example: Update the dataset from the origin sibling.

!datalad update -s origin
[INFO   ] Fetching updates for Dataset(/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/recovery) 
update(ok): . (dataset)

Exercise: You fetched the updates but didn’t merge them into the working tree (i.e. recovery/README.md does not contain the updates). Update again but use the --merge flag. Then, inspect the content of recovery/README.md - it should contain the added line.

Solution
!datalad update -s origin --merge
[INFO   ] Fetching updates for Dataset(/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/recovery) 
merge(ok): . (dataset) [Merged origin/master]
update.annex_merge(ok): . (dataset) [Merged annex branch]
update(ok): . (dataset)
action summary:
  merge (ok: 1)
  update (ok: 1)
  update.annex_merge (ok: 1)
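
The difference between the two update calls can be pictured with a toy model. This is a deliberately simplified sketch and not how Git or DataLad are implemented: a branch is just a list of commit labels, the class ToyRepo and its attribute names are made up for illustration, and only the simple case without diverging histories is covered.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Toy stand-in for a clone: each branch is a list of commit labels."""
    master: list = field(default_factory=list)         # what the working tree shows
    origin_master: list = field(default_factory=list)  # last known state of the sibling

    def fetch(self, sibling_master):
        # Like `datalad update -s origin`: note what the sibling has,
        # but leave the local branch (and thus README.md) untouched.
        self.origin_master = list(sibling_master)

    def merge(self):
        # Like the extra step done by `--merge`: move the local branch to
        # the fetched state. Only the fast-forward case is modelled here.
        if self.origin_master[:len(self.master)] != self.master:
            raise ValueError("histories diverged; a real merge would be needed")
        self.master = list(self.origin_master)


backup = ["create dataset", "add 'Hello Sibling!'"]
recovery = ToyRepo(master=["create dataset"], origin_master=["create dataset"])

recovery.fetch(backup)
print("after fetch:", recovery.master)
recovery.merge()
print("after merge:", recovery.master)
```

After the fetch, the clone knows about the new commit but its own branch is unchanged. Only the merge brings the added line into the working tree.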

BONUS: Change the directory to recovery/, make a change to README.md, save it and push it --to origin. Then, change the directory to my-data/ and update from the backup sibling. You should see the change made to recovery/README.md in my-data/README.md.

Solution
!echo "Hello to you, too!" >> README.md
!datalad save
!datalad push --to origin
%cd ../my-data
!datalad update -s backup --merge
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
publish(ok): . (dataset) [refs/heads/git-annex->origin:refs/heads/git-annex 5e4225e..67a74c2]
publish(ok): . (dataset) [refs/heads/master->origin:refs/heads/master 059799d..4b7afab]
action summary:
  publish (ok: 2)
/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/my-data
[INFO   ] Fetching updates for Dataset(/home/olebi/projects/new-learning-platform/notebooks/datalad/02_data_sharing_and_provenance_tracking/01_creating_sibling_datasets/my-data) 
merge(ok): . (dataset) [Merged backup/master]
update.annex_merge(ok): . (dataset) [Merged annex branch]
update(ok): . (dataset)
action summary:
  merge (ok: 1)
  update (ok: 1)
  update.annex_merge (ok: 1)

Section 3: Creating a Sibling on GitHub

GitHub is a web-based platform for hosting Git repositories and collaborative software development, owned and operated by Microsoft Corporation since 2018. Even though a dataset sibling on GitHub does not serve the data, it constitutes a simple, findable access point for the dataset. When a user clones the dataset from GitHub and runs datalad get, DataLad will automatically get the data from other available locations that contain the annexed contents (like OSF or GIN).
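
One way to link the GitHub sibling to the sibling that holds the data is a publication dependency, so that a push to GitHub first pushes to the data sibling. The sketch below only builds and prints the corresponding datalad siblings configure call. It assumes that siblings named github and gin already exist (replace gin with osf if you chose that option), and the helper publish_depends_command is our own. Nothing is executed unless you set RUN to True inside your dataset.

```python
import shlex
import shutil
import subprocess


def publish_depends_command(sibling, depends_on):
    """Build the call that makes pushes to `sibling` push to `depends_on` first."""
    return [
        "datalad", "siblings", "configure",
        "--name", sibling,
        "--publish-depends", depends_on,
    ]


# Sibling names are assumptions taken from the exercises in this notebook.
cmd = publish_depends_command("github", "gin")
print(shlex.join(cmd))

# Set RUN to True to execute the command; run it from inside the dataset.
RUN = False
if RUN and shutil.which("datalad"):
    subprocess.run(cmd, check=True)
```

With such a dependency in place, a single datalad push --to github should also transfer the annexed data to the other sibling.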

Exercises

In the following exercises, you will create a sibling for your dataset on GitHub. The screenshots document how to authenticate DataLad for GitHub using an access token. Here are the commands you need to know:

Command | Description
datalad create-sibling-github my-repo -s github --credential github | Create a new GitHub repository called my-repo, add it as a sibling named github and store the access token in the credential manager as github
datalad credentials remove github | Remove the github entry from DataLad’s credential manager
datalad push --to github | Push the dataset content to the sibling named github

Exercise: Log in to GitHub to create an access token. First, click on your user icon in the top right and select “Settings”.

Then, select “Developer Settings” at the bottom of the menu on the left.

Select “Generate New Token (classic)”.

Grant full access to repositories, create the token and copy it. Careful: you won’t be able to see the token again after closing this window.

Exercise: Run the cell below to create a new GitHub repo called my-data and name the sibling github. Paste the token you generated when prompted. It will be stored under the name github in DataLad’s credential manager.

NOTE: If you entered a wrong token or your token expired, you can run datalad credentials remove github.

!datalad create-sibling-github my-data -s github --credential github
An access token is required for https://api.github.com. Visit https://github.com/settings/tokens to create a token.
token:
An access token is required for https://api.github.com. Visit https://github.com/settings/tokens to create a token.
token (repeat):
create_sibling_github(ok): [sibling repository 'github' created at https://github.com/OleBialas/my-data]
configure-sibling(ok): . (sibling)
action summary:
  configure-sibling (ok: 1)
  create_sibling_github (ok: 1)

Exercise: Check your GitHub account to confirm that the my-data repo was created and verify it was registered as a sibling of your dataset.

Solution
!datalad siblings
.: here(+) [git]
.: backup2(+) [../my-data-backup2 (git)]
.: osf-storage(+) [osf]
.: gin(-) [https://gin.g-node.org/obi/my-data (git)]
.: github(-) [https://github.com/OleBialas/my-data.git (git)]
.: backup(+) [../my-data-backup (git)]
.: osf(-) [osf://hvgkc (git)]

Exercise: Push --to github and inspect your repository in the browser.

Solution
!datalad push --to github
Update availability for 'github':  75%|████████████████████████████████▌        | 3.00/4.00 [00:00<00:00, 24.1k Steps/s]
Username for 'https://github.com': OleBialas
Password for 'https://OleBialas@github.com':
publish(ok): . (dataset) [refs/heads/master->github:refs/heads/master [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
action summary:
  publish (ok: 2)