Working with a DataLad Dataset

Authors
Ole Bialas | Michał Szczepanik

Git is great for tracking changes in code, but it struggles with large data files. While you can commit large files to Git, doing so will make it incredibly slow (any file greater than 50 MB will notably slow things down). This is where DataLad comes in: DataLad stores the content of large files and just commits a tiny pointer with a unique hash to Git. This way, Git knows about our files without having to handle the large files itself. You can think of DataLad as an extension to Git that allows it to track large files, enabling version control for scientific data. All of Git’s features, like reverting to older versions or creating branches, work with DataLad projects as well.
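To make the pointer idea concrete, here is a toy sketch in plain shell. It is not how DataLad or git-annex actually organize their storage: the function name make_pointer and the store directory are made up for illustration, and the sketch assumes that sha256sum (GNU coreutils) is available.

```shell
# Toy illustration of "content elsewhere, tiny pointer in place".
# NOT the real DataLad/git-annex layout; make_pointer is a hypothetical helper.
make_pointer() {
    local file="$1" store="$2"
    local hash
    # The hash is derived from the file content, so identical content gets the same name.
    hash=$(sha256sum "$file" | cut -d ' ' -f 1)
    mkdir -p "$store"
    mv "$file" "$store/$hash"          # the large content moves into the store
    printf '%s\n' "$hash" > "$file"    # a tiny pointer stays behind for Git to track
}
```

After calling make_pointer big.dat store, big.dat holds only a 64-character hash, while the original bytes live in store/ under that hash as the file name.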

In this notebook, you are going to learn the basics of DataLad: you are going to clone an existing dataset, make modifications, and then create additional copies, or siblings, of the dataset on online repositories and your local file system.

Setup

Activate the pixi environment that contains DataLad and its dependencies, then run git config to load the DataLad Next extension.

pixi shell
git config --global --add datalad.extensions.load next

Section 1: Working with DataLad Datasets

Background

The first section of this notebook will introduce you to DataLad by cloning and working with an existing dataset. When DataLad clones a dataset, it does not actually download the contents of large files. Instead, it just downloads the file pointers stored in Git. This makes cloning very fast, even if there are terabytes of data! To get the actual file content, we have to call datalad get. The reverse operation is datalad drop, which removes the file content while keeping the pointers. The ability to get and drop file content while keeping the folder structure intact is very useful when you are working with large datasets: you can clone the dataset, get the files you need for your script, run the script, and then use datalad drop to remove the file contents again.
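The get, run, drop cycle described above could be wrapped in a small shell function. This is only a sketch: with_content is a hypothetical helper name, not a DataLad command, and it assumes you call it from inside a cloned dataset.

```shell
# Hypothetical helper: fetch the content, run a command, then drop the content again.
with_content() {
    local path="$1"
    shift
    datalad get "$path" || return 1   # download the file content
    "$@"                              # run the command that needs the data
    local status=$?
    datalad drop "$path"              # free the disk space, keep the pointer
    return "$status"
}

# Example call inside the penguins dataset (counts the lines of the table):
#   with_content adelie/table_219.csv wc -l adelie/table_219.csv
```

The function returns the exit status of the command you ran, so a failing script is still reported as a failure after the content has been dropped.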

Note for Windows users: When you clone a DataLad dataset, you will get a message that says “detected a crippled filesystem”. Don’t worry, this does not mean that there is anything wrong with your computer - it just means DataLad is working slightly differently on Windows (more on this later).

Exercises

The following exercises will introduce you to the DataLad command line interface (CLI). While the DataLad commands are the same on all operating systems, commands for listing folders and printing files differ between Linux/macOS and Windows. If you are on Windows and want to use Linux terminal commands, you can use Git Bash as your terminal, which comes with the Git installation on Windows. Below are the Linux/macOS and Windows commands as well as the DataLad command required in this section:

Terminal Commands

Linux/macOS | Windows Command Prompt | Description
ls | dir | List the content of the current directory
ls -a | dir /a | List the content of the current directory (including hidden files)
ls -a data | dir /a data | List the content of the data directory
cd code/ | cd code/ | Move to the code/ directory
cd .. | cd .. | Move to the parent of the current directory
cat folder/file.txt | type folder/file.txt | Print the content of file.txt

DataLad Commands

Code | Description
datalad clone <url> | Clone the dataset at the given URL into a new directory
datalad get folder/file.csv | Get the content of file.csv
datalad get folder/ | Get the content of all annexed files in folder/
datalad drop folder/file.csv | Drop the content of file.csv
datalad drop folder/ | Drop the content of all annexed files in folder/
datalad get * | Get the content of the entire dataset
datalad drop * | Drop the content of the entire dataset

Exercise: Use datalad clone to clone the dataset from https://hub.datalad.org/edu/penguins.git.

Solution
datalad clone https://hub.datalad.org/edu/penguins.git
install(ok): /tmp/datalad/penguins (dataset)

Exercise: Change the working directory to the penguins folder and list its contents.

Solution
cd penguins
ls # dir on Windows
/tmp/datalad/penguins
LICENSE.txt  adelie	code	  gentoo
README.md    chinstrap	examples  portal.edirepository.org

Exercise: List the contents of the adelie/ folder.

Solution
ls adelie # dir on Windows
knb-lter-pal.219.5.report.xml  knb-lter-pal.219.5.xml  table_219.csv
knb-lter-pal.219.5.txt	       manifest.txt

Exercise: Try to print the content of adelie/table_219.csv using the cat command (or type on Windows). You should see an error because, by default, DataLad does not download the actual file content.

Solution
cat adelie/table_219.csv # type on Windows
cat: adelie/table_219.csv: No such file or directory
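
On Linux and macOS, the error above appears because an annexed file whose content has not been fetched is a symbolic link to a location inside the .git folder that does not exist yet. The sketch below checks for this situation with plain shell tests. The function name content_state is made up for illustration, and on file systems without symlink support (such as the Windows setup mentioned earlier) the behavior differs.

```shell
# Hypothetical helper: report whether a file's content is available locally.
# Assumes a file system with symlink support (the default on Linux/macOS).
content_state() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        echo "pointer only"       # the link exists, but its target is missing
    elif [ -e "$1" ]; then
        echo "content present"    # the file (or the link target) can be read
    else
        echo "no such file"
    fi
}

# Example call inside the dataset:
#   content_state adelie/table_219.csv
```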

Exercise: Use datalad get to download the content of adelie/table_219.csv, then print it again.

Solution
datalad get adelie/table_219.csv
cat adelie/table_219.csv # type on Windows
get(ok): adelie/table_219.csv (file) [from origin...]
get(ok): adelie (directory)
action summary:
  get (ok: 2)
studyName,"Sample Number",Species,Region,Island,Stage,"Individual ID","Clutch Completion","Date Egg","Culmen Length (mm)","Culmen Depth (mm)","Flipper Length (mm)","Body Mass (g)",Sex,"Delta 15 N (o/oo)","Delta 13 C (o/oo)",Comments
PAL0708,1,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,,,"Not enough blood for isotopes."
PAL0708,2,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,
PAL0708,3,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195,3250,FEMALE,8.36821,-25.33302,
PAL0708,4,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,,,,,"Adult not sampled."
PAL0708,5,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,
PAL0708,6,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,
PAL0708,7,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,"Nest never observed with full clutch."
PAL0708,8,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.46060,-24.89958,"Nest never observed with full clutch."
PAL0708,9,"Adelie Penguin (Pygoscelis adeliae)",Anvers,Torgersen,"Adult, 1 Egg Stage",N5A1,Yes,2007-11-09,34.1,18.1,193,3475,,,,"No blood sample obtained."

Exercise: Use datalad drop to remove the content of adelie/table_219.csv again.

Solution
datalad drop adelie/table_219.csv
drop(ok): adelie/table_219.csv (file) [locking origin...]

Exercise: Use datalad get to download the contents of the entire adelie/ folder, then use datalad drop to remove the contents again.

Solution
datalad get adelie
datalad drop adelie
get(ok): adelie/table_219.csv (file) [from origin...]
get(ok): adelie (directory)
action summary:
  get (ok: 2)
drop(ok): adelie/knb-lter-pal.219.5.report.xml (file) [locking origin...]
drop(ok): adelie/knb-lter-pal.219.5.txt (file) [locking origin...]
drop(ok): adelie/knb-lter-pal.219.5.xml (file) [locking origin...]
drop(ok): adelie/manifest.txt (file) [locking origin...]
drop(ok): adelie/table_219.csv (file) [locking origin...]
drop(ok): adelie (directory)
action summary:
  drop (ok: 6)

Exercise: Use datalad get * to download the contents of all files in this dataset.

Solution
datalad get *
get(ok): examples/adelie.jpg (file) [from origin...]
get(ok): examples/chinstrap.jpg (file) [from origin...]
get(ok): examples/gentoo.jpg (file) [from origin...]
get(ok): gentoo/knb-lter-pal.220.5.report.xml (file) [from origin...]
get(ok): gentoo/knb-lter-pal.220.5.txt (file) [from origin...]
get(ok): gentoo/knb-lter-pal.220.5.xml (file) [from origin...]
get(ok): gentoo/manifest.txt (file) [from origin...]
get(ok): gentoo/table_220.csv (file) [from origin...]
get(ok): chinstrap/knb-lter-pal.221.6.report.xml (file) [from origin...]
get(ok): chinstrap/knb-lter-pal.221.6.txt (file) [from origin...]
  [11 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
get(ok): examples (directory)
get(ok): gentoo (directory)
get(ok): chinstrap (directory)
get(ok): portal.edirepository.org (directory)
get(ok): adelie (directory)
action summary:
  get (notneeded: 3, ok: 26)

Section 2: Modifying Content and Tracking Changes

Background

Whenever we make changes to our dataset, DataLad records them in the Git commit history. The ability to capture provenance - information about which activity, initiated by which entity, produced which outputs, given a set of parameters, a computational environment, and any input data - is great for research projects because it enables transparency and reproducibility. Since DataLad is built on top of Git, you don’t need to learn anything new here: the Git commit history works exactly the same for DataLad as it does for Git.

Because DataLad wants to make sure that we don’t accidentally overwrite our files once they are committed, it locks them to make them unmodifiable. This is why we have to use datalad unlock before we can modify them. When we run datalad save to save our changes, the file will be locked again. Even though DataLad reports unlocking as a file modification, it will only create a new entry in the commit history if the file actually changed.
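
The unlock, edit, save cycle can also be scripted. The following is a minimal sketch, assuming it is run inside a dataset on Linux/macOS; edit_and_save is a hypothetical helper name, and the editing command is whatever you pass after the commit message.

```shell
# Hypothetical helper: unlock a file, run an editing command, save with a message.
edit_and_save() {
    local path="$1" message="$2"
    shift 2
    datalad unlock "$path" || return 1    # make the file writable
    "$@" || return 1                      # the command that modifies the file
    datalad save -m "$message" "$path"    # commit the change and lock the file again
}

# Example call: open the file in VSCode and wait until the editor tab is closed.
#   edit_and_save adelie/table_219.csv "removed some observations" \
#       code --wait adelie/table_219.csv
```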

Note for Windows users: The Windows file system does not support file locking in the same way that Linux/macOS does. Instead, the data is duplicated on Windows: one copy is kept in the working directory and one backup copy is kept in the .git folder for safety. This has the advantage that you don’t need to unlock files before modifying them, but it also makes your dataset twice as big! Another consequence is that DataLad creates a separate commit for this adjustment, so the most recent entry in your git log will always show the message “git-annex adjusted branch”. This means that, to find the most recent commit you made to the dataset, you have to look at the second entry from the top of git log.
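
If you use Git Bash on Windows, a small filter can skip the adjustment commit for you. This is a sketch under the assumption that the adjustment commit carries the message “git-annex adjusted branch” as described above; last_own_commit is a hypothetical helper name.

```shell
# Hypothetical helper: print the newest commit that is not the adjustment commit.
last_own_commit() {
    # One line per commit (short hash and subject), newest first.
    git log --format='%h %s' | grep -v 'git-annex adjusted branch' | head -n 1
}
```

On Linux/macOS there is no adjustment commit, so the function simply prints the most recent commit.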

Exercises

In the following exercises, we are going to inspect the git log to view the history of our dataset. We are also going to modify existing files by unlocking them and saving the changes to the commit history. Here are the commands you need to know:

Code | Description
git log | Display the commit history of the repository
code file.txt | Open file.txt in VSCode
datalad unlock data/ | Unlock the file content of the data/ folder
datalad unlock file.txt | Unlock the file content of file.txt
datalad status | Show changes in the current dataset
datalad save | Save untracked changes and lock unlocked file contents
datalad save -m "message" | Save untracked changes with a commit message

Exercise: Display the git log to view the commit history of the dataset.

Solution
git log
commit 87e48f461293e12694cda52058d7e0897b3bb593 (HEAD -> main, origin/main, origin/HEAD)
Merge: 7b6199a e598ae1
Author: adina <adina@noreply.localhost>
Date:   Thu Jan 29 09:51:59 2026 +0000

    Merge pull request 'fix: correct syntax error in f-string' (#6) from candleindark/penguins:f-string-syntax into main
    
    Reviewed-on: https://hub.datalad.org/edu/penguins/pulls/6
    Reviewed-by: adina <adina@noreply.localhost>

commit e598ae1a2e213170d50d107eeee3850354acf893
Author: candleindark <candleindark@noreply.localhost>
Date:   Wed Jan 28 06:47:32 2026 +0000

    fix: correct syntax error in f-string

commit 7b6199a3d5798e9850d165202b2c0e16b9d85d1b
Author: Stephan Heunis <s.heunis@fz-juelich.de>
Date:   Mon Sep 22 22:20:25 2025 +0200

    Update script to convert data to demo-empirical-data schema
    
    This mainly involved changin some class names, namespaces, and urls.
    Also, it was decided not to generate display_name (i.e. preflabel)
    values for all records, so that these can rather be defined by the
    shacl-vue config option for templating display names.
    
    Lastly, the readme was updated to include an improved description
    for how to use the script to generate and POST metadata to a dumpthings
    backend, and how to delete existing data beforehand and restart the
    service afterwards.

Exercise: Use datalad unlock to unlock the file adelie/table_219.csv, then check datalad status.

Solution
datalad unlock adelie/table_219.csv
datalad status
unlock(ok): adelie/table_219.csv (file)
 modified: adelie/table_219.csv (file)

Exercise: Open the file in VSCode (code adelie/table_219.csv), delete some rows, save the file, then use datalad save to save the changes with a message (-m) and use git log to see your commit.

Solution
code adelie/table_219.csv
# After modifying table_219.csv
datalad save -m "removed some observations"
git log
add(ok): adelie/table_219.csv (file) [Copied metadata from old version of adelie/table_219.csv to new version. If you don't want this copied metadata, run: git annex metadata --remove-all adelie/table_219.csv]
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Exercise: Unlock gentoo/table_220.csv and run datalad save. Check datalad status and git log. The status should be clean and there should be no new entry in the commit history if the file wasn’t modified.

Solution
datalad get gentoo/table_220.csv
datalad unlock gentoo/table_220.csv
datalad save
datalad status
git log
unlock(ok): gentoo/table_220.csv (file)
add(ok): gentoo/table_220.csv (file)
action summary:
  add (ok: 1)
  save (notneeded: 1)
nothing to save, working tree clean
commit b2647a4d5cc6540ffbda82a213401fdd0ca6978d (HEAD -> main)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Jun 3 15:52:02 2026 +0200

    removed some observations

commit 87e48f461293e12694cda52058d7e0897b3bb593 (origin/main, origin/HEAD)
Merge: 7b6199a e598ae1
Author: adina <adina@noreply.localhost>
Date:   Thu Jan 29 09:51:59 2026 +0000

    Merge pull request 'fix: correct syntax error in f-string' (#6) from candleindark/penguins:f-string-syntax into main
    
    Reviewed-on: https://hub.datalad.org/edu/penguins/pulls/6
    Reviewed-by: adina <adina@noreply.localhost>

commit e598ae1a2e213170d50d107eeee3850354acf893
Author: candleindark <candleindark@noreply.localhost>
Date:   Wed Jan 28 06:47:32 2026 +0000

    fix: correct syntax error in f-string

Section 3: Creating Siblings on Online Repositories

Background

One of DataLad’s most useful features is the ability to manage multiple copies, or siblings, of the same dataset. Siblings allow you to share your data and function as backups. When a dataset has multiple siblings, not every sibling has to contain all of the data. When downloading a file, DataLad will simply try all of the siblings until it finds one that has the data.

In this section, you will create sibling repositories on the Open Science Framework (OSF) and GitHub. These siblings are complementary: OSF provides the storage for the dataset and all of its versions, while the GitHub repository can be browsed and shared easily.

Exercises

In the following exercises, you will create sibling repositories on GitHub and OSF. For this, DataLad requires access tokens for your accounts. The exercises contain screenshots to show how they can be generated.

Command | Description
datalad osf-credentials | Authenticate DataLad with OSF using an access token
datalad create-sibling-osf --title my-repo | Create a new OSF repository called my-repo
datalad create-sibling-github my-repo | Create a new GitHub repository called my-repo
datalad push --to osf | Push the dataset content to the sibling named osf
datalad push --to github | Push the dataset content to the sibling named github
datalad osf-credentials --reset | Reset your OSF access token
datalad credentials remove api.github.com | Reset your GitHub access token
datalad siblings | List all siblings of the current dataset
datalad update -s github --merge | Merge new commits from the github sibling

To create an OSF sibling for your dataset, you must first generate an access token that allows DataLad to access your OSF account. Log in to OSF, go to “Settings” > “Personal Access Token” (red box), and click on “Create Token” (blue box).

Give the token a name of your choice, grant it full read and write permissions, and click on “Create Token”.

Copy the token. Be careful: you won’t be able to see the token again once you close the window!

Exercise: Run the datalad osf-credentials command and paste the access token when prompted. You should see osf_credentials(ok): [authenticated as <your name>]

Solution
datalad osf-credentials
You need to authenticate with 'https://osf.io' credentials. https://osf.io/settings/tokens provides information on how to gain access
token:
osf_credentials(ok): [authenticated as Ole Bialas]

Exercise: Use create-sibling-osf to publish your dataset to OSF with a --title of your choice.

Solution
datalad create-sibling-osf --title penguins
create-sibling-osf(ok): https://osf.io/tnzme/
[INFO   ] Configure additional publication dependency on "osf-storage" 
configure-sibling(ok): . (sibling)

Exercise: Push --to osf and inspect your OSF repository in the browser. The OSF repository will not contain the data in a human-readable form. You can push to and pull from this repository, but you can’t explore files in the browser (Note: only files that you have downloaded with datalad get will be pushed).

Solution
datalad push --to osf
copy(ok): gentoo/knb-lter-pal.220.5.xml (file) [to osf-storage...]
copy(ok): gentoo/manifest.txt (file) [to osf-storage...]
copy(ok): portal.edirepository.org/knb-lter-pal.219.5.zip (file) [to osf-storage...]
copy(ok): portal.edirepository.org/knb-lter-pal.220.5.zip (file) [to osf-storage...]
copy(ok): portal.edirepository.org/knb-lter-pal.221.6.zip (file) [to osf-storage...]
publish(ok): . (dataset) [refs/heads/main->osf:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->osf:refs/heads/git-annex [new branch]]
action summary:
  copy (ok: 5)
  publish (ok: 2)

To have a repository that can be easily browsed and shared, we will next create a sibling on GitHub. GitHub can’t store the actual file content, but that is not a problem - when we clone from GitHub, DataLad will automatically fetch the file content from another source that has it. To create a GitHub sibling, we need another access token. Log in to GitHub, click on your user icon in the top-right, and select “Settings”.

Then, select “Developer Settings” at the bottom of the menu on the left.

Select “Generate New Token (classic)”.

Grant full access to repositories, create the token, and paste it when prompted. Be careful: you won’t be able to see the token again after closing this window.

Exercise: Create a sibling on GitHub (you do not need the --title flag here) and push to it. Then, open the new repository in the browser.

Solution
datalad create-sibling-github penguins
datalad push --to github
create_sibling_github(ok): [sibling repository 'github' created at https://github.com/OleBialas/penguins]
configure-sibling(ok): . (sibling)
action summary:
  configure-sibling (ok: 1)
  create_sibling_github (ok: 1)
publish(ok): . (dataset) [refs/heads/main->github:refs/heads/main [new branch]]
publish(ok): . (dataset) [refs/heads/git-annex->github:refs/heads/git-annex [new branch]]
action summary:
  publish (ok: 2)

Exercise: Use the datalad siblings command to list all siblings of the current dataset.

Solution
datalad siblings
.: here(+) [git]
.: uncurl(+) [uncurl]
.: github(-) [https://github.com/OleBialas/penguins.git (git)]
.: origin(+) [https://hub.datalad.org/edu/penguins.git (git)]
.: osf(-) [osf://tnzme (git)]
.: archivist(+) [archivist]
.: osf-storage(+) [osf]

Exercise: Open the GitHub sibling in the browser, edit the README.md and commit the changes. Then run datalad update -s github --merge to merge the changes into your local dataset and inspect the README.md to confirm the change was merged.

Solution
datalad update -s github --merge
# on Linux/macOS; use `type README.md` on Windows
cat README.md
[INFO   ] Fetching updates for Dataset(/tmp/datalad/penguins) 
merge(ok): . (dataset) [Merged github/main]
update.annex_merge(ok): . (dataset) [Merged annex branch]
update(ok): . (dataset)
action summary:
  merge (ok: 1)
  update (ok: 1)
  update.annex_merge (ok: 1)
Hello

Bonus: Go to a different folder on your computer, use datalad clone to clone the dataset from GitHub, and use datalad get * to download all of the file content.

Solution
datalad clone https://github.com/<your-username>/<your-repo>
cd <your-repo>
datalad get *
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore 
[INFO   ] https://github.com/OleBialas/penguins/config download failed: Not Found 
install(ok): /tmp/datalad/test/penguins (dataset)
/tmp/datalad/test/penguins
get(ok): chinstrap/knb-lter-pal.221.6.report.xml (file) [from archivist...]
get(ok): chinstrap/knb-lter-pal.221.6.txt (file) [from archivist...]
get(ok): chinstrap/knb-lter-pal.221.6.xml (file) [from archivist...]
get(ok): chinstrap/manifest.txt (file) [from archivist...]
get(ok): chinstrap/table_221.csv (file) [from archivist...]
get(ok): gentoo/knb-lter-pal.220.5.report.xml (file) [from archivist...]
get(ok): gentoo/knb-lter-pal.220.5.txt (file) [from archivist...]
get(ok): gentoo/knb-lter-pal.220.5.xml (file) [from archivist...]
get(ok): gentoo/manifest.txt (file) [from archivist...]
get(ok): gentoo/table_220.csv (file) [from archivist...]
  [11 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
get(ok): chinstrap (directory)
get(ok): gentoo (directory)
get(ok): portal.edirepository.org (directory)
get(ok): examples (directory)
get(ok): adelie (directory)
action summary:
  get (notneeded: 3, ok: 26)

Section 4: Creating Local Backups

Background

Siblings can not only be used to publish your data online, they can also be used to create backups on an external drive or share data with collaborators via a local server. To do this, we can simply initialize a --bare git repository at the desired location and add it as a sibling to our DataLad dataset. Bare means that the git repository has no working tree - the contents that are normally hidden in the .git folder are in the main directory. The absence of a working tree prevents issues of synchronization and accidental overwriting when pushing to and pulling from the repository. While a bare repository is not suited for editing files directly, it can provide a common endpoint for multiple collaborators.
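
To see what “bare” means in practice, the following sketch creates a bare repository in a temporary directory and inspects it. The directory name is arbitrary and only chosen for this illustration.

```shell
# Create a bare repository in a throwaway location (not suited for real backups).
backup_dir=$(mktemp -d)/penguins-backup
git init --bare --quiet "$backup_dir"

# Ask Git whether the repository is bare.
git -C "$backup_dir" rev-parse --is-bare-repository   # prints: true

# The entries that normally sit inside .git (HEAD, config, objects, refs, ...)
# are located directly in the top-level directory.
ls "$backup_dir"
```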

Exercises

In the following exercises, you are going to initialize a --bare git repository and add it as a sibling to your dataset. If your setup allows it, you can create the sibling repository on a separate drive, to mimic creating a backup of your data. Once the sibling is created, we can clone it and see how changes can propagate across siblings. Here are the commands you need to know:

Command | Description
git init --bare ./mydir | Create a --bare repository called mydir in the current directory
datalad siblings | List all siblings of the current dataset
datalad siblings add --name new --url <path> | Add the repository at the URL as a new sibling with the name new
datalad siblings remove --name new | Remove the sibling with the name new
datalad push --to new | Push the dataset content to the sibling named new
datalad save | Save all untracked changes in the current dataset
datalad update -s new --merge | Merge updates from sibling new

Example: Initialize a --bare git repository in a new directory (Note: files in /tmp are temporary and not suited for actual backups).

git init --bare /tmp/penguins-backup
Initialized empty Git repository in /tmp/penguins-backup/

Exercise: Create a --bare git repository in another folder (preferably on a different drive).

Solution
git init --bare /tmp/penguins-backup2
Initialized empty Git repository in /tmp/penguins-backup2/

Example: Add /tmp/penguins-backup as a sibling to the current penguins dataset with the name backup.

datalad siblings add --name backup --url /tmp/penguins-backup
.: backup(-) [/tmp/penguins-backup (git)]

Exercise: Add the --bare repository you created in a folder of your choice as a sibling to the current penguins dataset with a name of your choice.

Solution
datalad siblings add --name backup2 --url /tmp/penguins-backup2
.: backup2(-) [/tmp/penguins-backup2 (git)]

Exercise: List all siblings of the current penguins dataset.

Solution
datalad siblings

Exercise: Push --to one of the backup siblings TWICE. You need to push twice because on the first push, DataLad initializes the data storage, and the second push transfers the actual file content.

Solution
datalad push --to backup
datalad push --to backup
copy(ok): adelie/knb-lter-pal.219.5.report.xml (file) [to backup...]
copy(ok): adelie/knb-lter-pal.219.5.txt (file) [to backup...]
copy(ok): adelie/knb-lter-pal.219.5.xml (file) [to backup...]
copy(ok): adelie/manifest.txt (file) [to backup...]
copy(ok): adelie/table_219.csv (file) [to backup...]
copy(ok): chinstrap/knb-lter-pal.221.6.report.xml (file) [to backup...]
copy(ok): chinstrap/knb-lter-pal.221.6.txt (file) [to backup...]
copy(ok): chinstrap/knb-lter-pal.221.6.xml (file) [to backup...]
copy(ok): chinstrap/manifest.txt (file) [to backup...]
copy(ok): chinstrap/table_221.csv (file) [to backup...]
  [11 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
publish(ok): . (dataset) [refs/heads/git-annex->backup:refs/heads/git-annex bc03603..1729d3f]
action summary:
  copy (ok: 21)
  publish (notneeded: 1, ok: 1)

Section 5: Bonus: Telling DataLad what to Track

Background

Under the hood, DataLad uses a program called git-annex to handle the content of files. By default, DataLad will use git-annex to handle the content of every single file in your dataset. However, this is not always desirable. For example, you may not want to annex small text files like code to avoid having to unlock them for every edit. We can tell DataLad which files should be annexed by editing the .gitattributes file. Let’s look at the .gitattributes file in the penguins dataset:

cat .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing

There are two lines in this file:

  • * annex.backend=MD5E: tells git-annex to use the MD5E backend for generating file hashes
  • **/.git* annex.largefiles=nothing: tells git-annex not to annex files whose names start with .git (such as .gitattributes or .gitignore), so that these small configuration files are tracked directly by Git

We usually don’t want to edit these default values. Instead, we want to add lines to .gitattributes to specify which contents should and shouldn’t be annexed. Note that changes in the configuration will not automatically be applied to files that are already tracked. Thus, it is best to configure .gitattributes right after initializing the dataset, before data is added.
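
Instead of editing the file by hand, rules can also be appended from the terminal. The sketch below adds a rule only if the exact same line is not already present. The function name add_attr_rule is made up for illustration, and the sketch assumes that the existing file ends with a newline.

```shell
# Hypothetical helper: append a rule to .gitattributes unless it is already there.
add_attr_rule() {
    local rule="$1" file="${2:-.gitattributes}"
    touch "$file"
    # -x matches whole lines, -F treats the rule as a fixed string (no regex).
    grep -qxF -- "$rule" "$file" || printf '%s\n' "$rule" >> "$file"
}

# Example call (single quotes keep the shell from expanding the * characters):
#   add_attr_rule '**/*.jpg annex.largefiles=nothing'
```

Calling the function twice with the same rule leaves the file unchanged the second time, so it is safe to rerun.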

Exercises

In the following exercises, we are going to modify our dataset’s .gitattributes to apply some custom configurations. Here are the different commands and configuration options that you’ll need:

Code | Description
* annex.largefiles=(mimeencoding=binary) | Only annex files with a binary encoding
myfile.pdf annex.largefiles=nothing | Don’t annex myfile.pdf
* annex.largefiles=(largerthan=5kb) | Only annex files whose size exceeds 5 KB
* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb)) | Only annex binary files and files greater than 5 KB
code .gitattributes | Open .gitattributes in VSCode
git annex unannex <files> | Unannex the content of the given files
git rm --cached <files> | Remove files from the Git index while keeping them in the working tree
git annex add <files> | Add files to git-annex according to the current .gitattributes configuration
git annex whereis <files> | Show the location of the annexed file content (empty if the file isn’t annexed)

Example: Get the location of the annexed file content of gentoo/table_220.csv. The entry marked with [here] refers to the local annexed copy.

git annex whereis gentoo/table_220.csv
whereis gentoo/table_220.csv (5 copies) 
  	34176649-4bdc-4685-84cf-96ba947110b5 -- olebi@iBOTS-7:/tmp/datalad/penguins [here]
  	83f2736d-70a7-412a-89df-01ee16326913 -- jsheunis@bnbmac48.local:~/Documents/psyinf/edu/palmerpenguins
  	92813bb4-4217-4837-b618-85d633eb9565 -- [archivist]
  	9948e337-6ea4-4df5-81a6-ef4709b6121f -- git@2f026f1d3470:/var/lib/gitea/git/repositories/jsheunis/palmerpenguins.git [origin]
  	d510bb94-be2a-41fc-aea5-b19ba3a680fa -- [osf-storage]

  archivist: dl+archive:MD5E-s17506--7c5573ab04ac7055a71197909e073894.5.zip#path=table_220.csv&size=18872
ok

Exercise: Get the location of the annexed file content of examples/gentoo.jpg.

Solution
git annex whereis examples/gentoo.jpg
whereis examples/gentoo.jpg (5 copies) 
  	34176649-4bdc-4685-84cf-96ba947110b5 -- olebi@iBOTS-7:/tmp/datalad/penguins [here]
  	41726d7d-6e14-4e23-936b-89452c0dccf8 -- [uncurl]
  	83f2736d-70a7-412a-89df-01ee16326913 -- jsheunis@bnbmac48.local:~/Documents/psyinf/edu/palmerpenguins
  	9948e337-6ea4-4df5-81a6-ef4709b6121f -- git@2f026f1d3470:/var/lib/gitea/git/repositories/jsheunis/palmerpenguins.git [origin]
  	d510bb94-be2a-41fc-aea5-b19ba3a680fa -- [osf-storage]

  uncurl: https://upload.wikimedia.org/wikipedia/commons/9/91/Brown_Bluff-2016-Tabarin_Peninsula%E2%80%93Gentoo_penguin_%28Pygoscelis_papua%29_02.jpg
ok

Exercise: Open .gitattributes in VSCode and add the line

**/*.jpg annex.largefiles=nothing

to avoid annexing JPG files.

code .gitattributes

Example: Unannex examples/gentoo.jpg and save to apply the changed configuration.

git annex unannex examples/gentoo.jpg
datalad save -m "unannex gentoo.jpg"

add(ok): .gitattributes (file)                  
save(ok): . (dataset)                           
action summary:                                                                 
  add (ok: 2)
  save (ok: 1)

Exercise: Unannex the other JPG files and save to apply the changed configuration (Hint: use examples/*.jpg for all .jpg files).

Solution
git annex unannex examples/*.jpg
datalad save -m "unannex all images"

add(ok): examples/chinstrap.jpg (file)          
save(ok): . (dataset)                           
action summary:                                                                 
  add (ok: 2)
  save (ok: 1)

Exercise: Get the location of the annexed file content of examples/gentoo.jpg again. It should not return anything because the file is no longer handled by git-annex.

Solution
git annex whereis examples/gentoo.jpg

Exercise: A configuration that is often useful is to only annex binary files (e.g. data) and leave text files, such as code, tracked directly by Git. Open .gitattributes in VSCode and change the last line to

* annex.largefiles=(mimeencoding=binary)

so that only binary files will be annexed.

code .gitattributes

Exercise: Unannex all CSV files (**/*.csv) and save to apply the changed configuration. Now you should be able to edit CSV files without having to unlock them.

Solution
git annex unannex **/*.csv
datalad save -m "annex only binary"

add(ok): chinstrap/table_221.csv (file)         
add(ok): gentoo/table_220.csv (file)            
add(ok): .gitattributes (file)                  
save(ok): . (dataset)                           
action summary:                                                                 
  add (ok: 4)
  save (ok: 1)

Exercise: Often, large text files, such as large CSV tables, are better left to git-annex. Open .gitattributes in VSCode and change the last line to

* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb))

so that files larger than 5 KB (even if they are not binary encoded) will be annexed. Then, run the code below to apply the updated configuration to existing CSV files.

code .gitattributes

Solution
git rm --cached **/*.csv
git annex add **/*.csv
datalad save -m "annex large CSV files"

save(ok): . (dataset)                           
action summary:                                                                 
  add (ok: 1)
  save (ok: 1)