Creating a DataLad Dataset from Scratch
DataLad is a flexible tool that integrates into most workflows because, at its core, a DataLad dataset is just a regular folder on your machine (with some additional metadata in the .git and .datalad folders).
In this section we are going to create a new dataset from scratch using DataLad’s create command.
Because every DataLad dataset is also a Git repository, this also initializes a Git repository automatically.
Once we create the dataset, we can add any kind of data.
We can even add other DataLad datasets as so-called subdatasets!
As we add data and make changes to our dataset, DataLad will keep track of everything in the git log.
This gives us a comprehensive history of our dataset which allows us (and anyone we share the dataset with) to understand what has been done and even restore older versions of files.
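To make the "regular folder plus metadata" idea concrete, here is a minimal Python sketch that checks whether a folder has the layout described above. The function name `looks_like_datalad_dataset` is hypothetical, and the check for a `.datalad/config` file is a simplifying assumption about what marks a dataset; it is a heuristic, not how DataLad itself identifies datasets.

```python
import tempfile
from pathlib import Path


def looks_like_datalad_dataset(folder):
    """Heuristic: a Git repository that also contains a .datalad/config file."""
    folder = Path(folder)
    has_git = (folder / ".git").exists()
    has_datalad_config = (folder / ".datalad" / "config").is_file()
    return has_git and has_datalad_config


# Build two fake folder layouts in a temporary directory (DataLad is not needed).
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "plain-folder"
    plain.mkdir()

    fake = Path(tmp) / "fake-dataset"
    (fake / ".git").mkdir(parents=True)
    (fake / ".datalad").mkdir()
    (fake / ".datalad" / "config").write_text("")

    results = (looks_like_datalad_dataset(plain), looks_like_datalad_dataset(fake))

print(results)
```

Everything else in the folder (data files, subfolders, code) is ordinary content that you can browse with your usual tools.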
Section 1: Creating a new Dataset
Background
Once we create a dataset, DataLad will watch out for changes to any file.
By using the status command, we can get a report on any new or modified files in our dataset that have not been saved yet.
When we run datalad save, these unsaved changes will be committed into the dataset’s history.
We can add a little message with the -m flag to describe what has been done, e.g., -m "added raw recordings".
While this is not required, it is a good practice that will make the dataset’s history more transparent to collaborators and your future self.
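If you prefer to drive these commands from a Python script rather than a notebook cell, one possible approach is to build the command as a list and hand it to `subprocess`. The sketch below only constructs and prints the command; the helper name `save_command` is hypothetical, and actually running it assumes that `datalad` is installed and that the working directory is inside a dataset.

```python
import shlex


def save_command(message=None):
    """Return the argument list for `datalad save`, with an optional -m message."""
    cmd = ["datalad", "save"]
    if message:
        cmd += ["-m", message]
    return cmd


cmd = save_command("added raw recordings")
printable = shlex.join(cmd)  # shell-quoted form, handy for logging
print(printable)

# To execute it inside a dataset (assumes datalad is on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the message as its own list element avoids quoting problems when the message contains spaces or quotation marks.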
Exercises
In this section we are going to create a new DataLad dataset. We are then going to add different kinds of content like text files and PDFs downloaded from the web and save them so DataLad keeps track of them. Here are the commands you need to know:
| Code | Description |
|---|---|
| `mkdir data/` | Create a new directory called data/ |
| `cd data/` | Change the working directory to data/ |
| `datalad create my-dataset` | Create a DataLad dataset in the new directory my-dataset |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save -m "adding data"` | Save all untracked changes in the current dataset with a commit message |
| `echo "hello" > file.txt` | Save the text "hello" to file.txt |
| `curl -o file.txt <URL>` | Download content from the given URL and write it to file.txt |
| `curl -s -o file.txt <URL>` | Silently download content from the given URL and write it to file.txt |
Example: Create a new DataLad dataset called my-dataset in the current directory.
import os
# deactivate DataLad's progressbar for this notebook
os.environ['DATALAD_UI_PROGRESSBAR'] = 'none'
!datalad create my-dataset
create(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/my-dataset (dataset)
Exercise: Create a new DataLad dataset called learn-datalad in the current directory.
Solution
!datalad create learn-datalad
create(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad (dataset)
Exercise: Change the current directory to learn-datalad and print the dataset’s status.
Solution
%cd learn-datalad
!datalad status
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad
nothing to save, working tree clean
Example: Create a new directory audio/ in learn-datalad/.
!mkdir audio
Exercise: Create a new directory books/ in learn-datalad/.
Solution
!mkdir books
Run the cell below to download https://homepages.uc.edu/~becktl/byte_of_python.pdf and write it to the output file byte-of-python.pdf in books/.
!curl -s -o books/byte-of-python.pdf https://homepages.uc.edu/~becktl/byte_of_python.pdf
Exercise: Check the status of the dataset.
Solution
!datalad status
untracked: books (directory)
Exercise: Save the untracked file and add the message "add a book on Python". Then, check the status of the dataset again.
Solution
!datalad save -m "add a book on Python"
!datalad status
add(ok): books/byte-of-python.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
Exercise: Run the cell below to download https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf and write it to books/progit.pdf. Then, save the untracked file with a message "add a book on Git" and check the dataset’s status to make sure there are no untracked changes.
!curl -s -o books/progit.pdf https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
Solution
!datalad save -m "add a book on Git"
!datalad status
add(ok): books/progit.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
Exercise: Run the cell below to create a new file README.md with the text "This is a DataLad dataset". Then, save the untracked file and check the dataset’s status.
!echo "This is a DataLad dataset" > README.md
Solution
!datalad save -m "add README"
!datalad status
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
Section 2: Modifying Content and Tracking Changes
Background
While DataLad wraps many functions of Git, there are some instances where we need to use git directly.
Viewing the git log is one of those instances.
This log is a critical part of any Git repository and DataLad dataset because it contains a comprehensive history of our dataset and every time we run datalad save, a new entry is created.
Each commit has a unique hash and contains the commit’s author and their email as well as the commit message.
Because DataLad wants to make sure that we don’t accidentally overwrite our files once they are committed, it locks them to make them unmodifiable.
This is why we have to use datalad unlock before we can modify them.
When we run datalad save to save our changes, the file will be locked again.
Even though DataLad reports unlocking as a file modification, it will only create a new entry in the commit history if the file actually changed.
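The lock/unlock cycle can be imitated with nothing but the Python standard library. The sketch below is a simplified model, not DataLad's or git-annex's actual implementation: it assumes that a locked file is a symbolic link pointing at read-only content, and that unlocking replaces the link with a writable copy. The `store` folder and the `content-key` file name are placeholders, and creating symlinks may require extra privileges on Windows.

```python
import stat
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Stand-in for the place where the real file content is kept.
    store = root / "store"
    store.mkdir()
    content = store / "content-key"
    content.write_text("This is a DataLad dataset\n")
    content.chmod(stat.S_IRUSR)  # "locked": the content is read-only

    # The file in the working directory is only a link to that content.
    readme = root / "README.md"
    readme.symlink_to(content)
    locked_is_link = readme.is_symlink()

    # "Unlock": replace the link with a regular, writable copy.
    text = readme.read_text()
    readme.unlink()
    readme.write_text(text)
    unlocked_is_link = readme.is_symlink()

    # Editing is now possible.
    with readme.open("a") as f:
        f.write("It uses git and git-annex\n")
    n_lines = len(readme.read_text().splitlines())

    # Restore write permission so the temporary directory can be cleaned up everywhere.
    content.chmod(stat.S_IRUSR | stat.S_IWUSR)

print(locked_is_link, unlocked_is_link, n_lines)
```

In this model, saving would store the edited content again and turn README.md back into a link, which matches the observation that files are locked again after datalad save.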
Note for Windows: The Windows file system does not support file locking in the same way that Linux/macOS does. Instead, Windows duplicates the data and keeps one copy in the working directory and one backup copy for safety in the .git folder. This has the advantage that you don’t need to unlock files before modifying them, but it also makes your dataset twice as big!
Another consequence is that DataLad is creating a separate commit for this adjustment, so the most recent entry in your git log will always show the message “git-annex adjusted branch”. This means that, to get the most recent commit you made to the dataset, you have to look at the second-to-last entry in git log.
Exercises
In the following exercises we are going to inspect the git log to view the history of our dataset. We are also going to modify existing files by unlocking them and saving the changes to the commit history. Here are the commands you need to know:
| Code | Description |
|---|---|
| `git log` | Display the commit history of the repository |
| `git log -2` | Display the last two entries in commit history |
| `git log --oneline` | Display a compact one-line view of the commit history |
| `datalad unlock data/` | Unlock the file content of the data/ folder |
| `datalad unlock file.txt` | Unlock the file content of file.txt |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save` | Save untracked changes and lock unlocked file contents |
| `datalad save -m "message"` | Save untracked changes with a commit message |
| `echo "content" >> file.txt` | Append the text "content" to file.txt |
Exercise: Display the git log to view all commits you made to the learn-datalad dataset.
Solution
!git log
commit e808ed1cbeee1cbac35783b3339a2262608c584f (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:23 2025 +0100

    add README

commit 4fc9ab1ceb12768fce2f39a785b1e3f4d4bd363c
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:22 2025 +0100

    add a book on Git

commit e7ff5e0709ee650c24616ab6e24e80918dc81b9e
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:19 2025 +0100

    add a book on Python

commit 34b88b0b1009622d1d8910dc495b684c61a8281c
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:12 2025 +0100

    [DATALAD] new dataset
Exercise: Display the git log in a compact one-line view.
Solution
!git log --oneline
e808ed1 (HEAD -> master) add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
Exercise: Unlock the content of README.md. Then, check the dataset’s status.
Solution
!datalad unlock README.md
!datalad status
unlock(ok): README.md (file)
modified: README.md (file)
Example: Append the line "It uses git and git-annex" to README.md, either using your editor or the echo command. Then, save with a message and check the dataset’s status.
!echo "It uses git and git-annex" >> README.md
!datalad save -m "add line"
!datalad status
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
Exercise: Unlock README.md and append another line "For decentralized version control". Then, save the changes and check the status.
Solution
!datalad unlock README.md
!echo "For decentralized version control" >> README.md
!datalad save -m "add another line"
!datalad status
unlock(ok): README.md (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean
Exercise: Display the last two entries in the git history.
Solution
!git log -2
commit bfddf2ec59900bc61f8175a8a2d9693db73a1898 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:27 2025 +0100

    add another line

commit a9979bd13cdd06f808dd8b14f4d19edf82988bb5
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:25 2025 +0100

    add line
Exercise: Unlock README.md and then, without making any changes, save with a message. Check the last two entries in the git history - did your save command create an entry?
Solution
!datalad unlock README.md
!datalad save -m "did nothing"
unlock(ok): README.md (file)
add(ok): README.md (file)
action summary:
  add (ok: 1)
  save (notneeded: 1)
!git log -2
commit bfddf2ec59900bc61f8175a8a2d9693db73a1898 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:27 2025 +0100

    add another line

commit a9979bd13cdd06f808dd8b14f4d19edf82988bb5
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:25 2025 +0100

    add line
Section 3: Installing Subdatasets
Background
You can add any data to your DataLad dataset, including other datasets! DataLad allows you to install datasets as submodules, which means that they are added to your repository while maintaining their own, independent git history. This allows us to modularize research projects by, for example, creating subdatasets for different modalities, conditions, or analysis methods. Modularizing the dataset often results in a cleaner history and easier-to-maintain project, and it also increases the reusability because it allows you and others to reuse only specific components of the data.
Installing subdatasets is done via DataLad’s install command.
This works similarly to clone but is more versatile and allows us to install a subdataset into a given path while automatically registering it into the superdataset’s history.
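Registered subdatasets are recorded in a .gitmodules file at the root of the superdataset. As a rough illustration, the Python sketch below parses such a file with the standard library. The file content shown is a hypothetical minimal example (a real file may contain additional keys), and using `configparser` is a convenience that works for simple cases, not a full parser for Git's configuration format.

```python
import configparser

# Hypothetical .gitmodules content after installing one subdataset.
gitmodules_text = """
[submodule "ds005131"]
    path = ds005131
    url = https://github.com/OpenNeuroDatasets/ds005131.git
"""

# Strip indentation so configparser does not treat keys as continuation lines.
cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(cleaned)

# Map each subdataset name to its recorded path and URL.
subdatasets = {
    section.split('"')[1]: dict(parser[section])
    for section in parser.sections()
}
print(subdatasets)
```

Because the superdataset only stores this small record (plus the exact commit of the subdataset), each subdataset keeps its own independent history.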
Exercises
In the following exercises, we are going to install datasets from OpenNeuro as subdatasets into our new dataset. Here are the commands you need to know:
| Code | Description |
|---|---|
| `datalad install -d my-dataset <URL>` | Install the dataset from the given URL as a subdataset into the my-dataset/ directory |
| `datalad install -d . <URL>` | Install the dataset from the given URL as a subdataset into the current directory |
| `datalad subdatasets` | List all subdatasets of the current directory |
Example: Install the dataset from the OpenNeuro URL https://github.com/OpenNeuroDatasets/ds005131.git as a subdataset into the current dataset.
!datalad install -d . https://github.com/OpenNeuroDatasets/ds005131.git
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
[INFO ] https://github.com/OpenNeuroDatasets/ds005131.git/config download failed: Not Found
install(ok): ds005131 (dataset)
add(ok): ds005131 (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)
Exercise: Change the directory to the root of the newly installed subdataset ds005131/ and check its git log.
Solution
%cd ds005131
!git log --oneline
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad/ds005131
51c1338 (HEAD -> main, tag: 1.0.1, origin/master, origin/main, origin/HEAD) [OpenNeuro] Recorded changes
7bb0e92 [OpenNeuro] Recorded changes
579b3aa [OpenNeuro] Recorded changes
95b3ce9 [OpenNeuro] Recorded changes
82286ff [OpenNeuro] Recorded changes
577f003 (tag: 1.0.0) [OpenNeuro] Recorded changes
2779065 [OpenNeuro] Recorded changes
de27cca [OpenNeuro] Recorded changes
087aafd [OpenNeuro] Recorded changes
86fb2d1 [OpenNeuro] Recorded changes
3e04d03 [OpenNeuro] Recorded changes
3a6eca0 [OpenNeuro] Recorded changes
f9191e6 [OpenNeuro] Recorded changes
bc72cc4 [OpenNeuro] Recorded changes
f941f62 [OpenNeuro] Recorded changes
7c5d834 [OpenNeuro] Dataset created
Exercise: Change the directory back to the parent learn-datalad/. Then, browse the OpenNeuro database, choose a dataset and install it as another subdataset.
Solution
%cd ..
!datalad install -d . https://github.com/OpenNeuroDatasets/ds003507.git
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
[INFO ] https://github.com/OpenNeuroDatasets/ds003507.git/config download failed: Not Found
[INFO ] access to 1 dataset sibling s3-PRIVATE not auto-enabled, enable with:
| datalad siblings -d "/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad/ds003507" enable -s s3-PRIVATE
install(ok): ds003507 (dataset)
add(ok): ds003507 (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)
Exercise: Change the directory to the newly installed subdataset and inspect its git log.
Solution
%cd ds003507
!git log --oneline
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad/ds003507
8b8fad4 (HEAD -> master, tag: 1.0.1, origin/master, origin/HEAD) [DATALAD] Recorded changes
29ce3cc [DATALAD] Recorded changes
b82c86f [DATALAD] Recorded changes
e548886 [DATALAD] Recorded changes
f212034 [DATALAD] Recorded changes
540710b [DATALAD] Recorded changes
ea7a5e4 [DATALAD] Recorded changes
80eeffd [DATALAD] Recorded changes
a821149 [DATALAD] Recorded changes
b461339 [DATALAD] Recorded changes
98e47d8 [DATALAD] Recorded changes
a9d5d59 [DATALAD] Recorded changes
5cb3c0b (tag: 1.0.0) [DATALAD] Recorded changes
2f692c4 [DATALAD] Recorded changes
7527e33 [DATALAD] exclude paths from annex'ing
cac78dc [DATALAD] new dataset
Exercise: Change the directory back to the parent learn-datalad/ and list all subdatasets.
Solution
%cd ..
!datalad subdatasets
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad
subdataset(ok): ds003507 (dataset)
subdataset(ok): ds005131 (dataset)
Section 4: Going Back and Forth in Time
Background
Because DataLad keeps track of all changes to our dataset, we can restore any previous version of a given file. This can be very useful if we made a mistake and want to restore an older version of our project, or we simply want to check how the data looked previously. In this section, we are going to learn two ways of doing this: checking out to a specific commit and resetting the repository. The checkout is mostly useful if we want to look at an older state of our project without actually changing the current state of the repository, while the reset is used to modify the repository’s state.
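Note that checking out a commit does not create a new branch. Instead, Git enters a "detached HEAD" state in which HEAD points directly at the chosen commit rather than at a branch.
You can look around and make experimental changes, but commits made in this state belong to no branch unless you create one (a branch is like a copy that can be modified independently of the original).
To switch back to the previous branch (i.e., main/master), use git switch -.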
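The difference between a checkout and a reset can be pictured with a toy model in Python. This is only a sketch of the bookkeeping involved: the `ToyRepo` class and its methods are hypothetical, a commit is reduced to its message, and nothing here touches a real repository. The key idea is that a checkout moves only HEAD (leaving it detached from the branch), while a reset moves the branch itself.

```python
class ToyRepo:
    """Toy model: a linear list of commits, a branch pointer and HEAD."""

    def __init__(self, messages):
        self.messages = list(messages)         # oldest commit first
        self.branch = len(self.messages) - 1   # commit the branch points to
        self.head = self.branch                # commit we are currently looking at
        self.detached = False

    def checkout(self, index):
        """Look at an older commit; the branch pointer stays where it is."""
        self.head = index
        self.detached = True

    def switch_back(self):
        """Return to the branch, like `git switch -` after a checkout."""
        self.head = self.branch
        self.detached = False

    def reset(self, index):
        """Move the branch itself; later commits drop out of its history."""
        self.branch = self.head = index

    def log(self):
        """Commit messages visible from HEAD, newest first."""
        return self.messages[: self.head + 1][::-1]


repo = ToyRepo(["new dataset", "add README", "add line", "add another line"])

repo.checkout(1)
log_detached = repo.log()       # only the first two commits are visible
repo.switch_back()
log_after_switch = repo.log()   # all four commits are back

repo.reset(1)
log_after_reset = repo.log()    # the branch now ends at "add README"
print(log_detached, log_after_switch, log_after_reset, sep="\n")
```

What the toy model leaves out is the working directory: a mixed reset keeps your files as they are, while a hard reset also rewrites them to match the target commit.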
Exercises
In the following exercises we are going to use the git history to look at old file versions and restore previous states of our dataset. Here are the commands you need to know:
| Code | Description |
|---|---|
| `git log --oneline` | Display a compact one-line view of the commit history |
| `git checkout d0e83f29` | Check out the state of the repository at the commit with the hash d0e83f29 |
| `git switch -` | Switch back to the previous branch |
| `git reset --mixed d0e83f29` | Reset the state of the repository to the commit with the hash d0e83f29 but keep the working directory as-is |
| `git reset --hard d0e83f29` | Reset the state of the repository to the commit with the hash d0e83f29 and overwrite tracked files in the working directory to match it |
| `datalad status` | Show any untracked changes in the current dataset |
| `datalad save -m "message"` | Save untracked changes with a commit message |
| `cat file.txt` | Display the content of file.txt (Linux/macOS) |
| `type file.txt` | Display the content of file.txt (Windows) |
Exercise: Run the cell below to print the git history, identify the commit where README.md was added to the repository and note its commit hash.
!git log --oneline
7174564 (HEAD -> master) [DATALAD] Added subdataset
f52db6a [DATALAD] Added subdataset
bfddf2e add another line
a9979bd add line
e808ed1 add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
Exercise: Check out the commit where README.md was added to the repository.
Solution
!git checkout HEAD~4
warning: unable to rmdir 'ds003507': Directory not empty
warning: unable to rmdir 'ds005131': Directory not empty
Note: switching to 'HEAD~4'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at e808ed1 add README
Exercise: Check the git commit history and inspect the content of README.md.
Solution
!git log --oneline
e808ed1 (HEAD) add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
# Linux/macOS
!cat README.md
This is a DataLad dataset
# Windows
!type README.md
Exercise: Switch back to the previous (i.e. the main/master) branch.
Solution
!git switch -
Previous HEAD position was e808ed1 add README
Switched to branch 'master'
Exercise: Identify the hash of the commit where we appended the first line to README.md. Then, check out that commit and inspect the content of README.md.
Solution
!git checkout HEAD~3
warning: unable to rmdir 'ds003507': Directory not empty
warning: unable to rmdir 'ds005131': Directory not empty
Note: switching to 'HEAD~3'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:
git switch -c <new-branch-name>
Or undo this operation with:
git switch -
Turn off this advice by setting config variable advice.detachedHead to false
HEAD is now at a9979bd add line
# Linux/macOS
!cat README.md
This is a DataLad dataset
It uses git and git-annex
# Windows
!type README.md
Exercise: Switch back to the master branch and inspect the content of README.md to make sure it was restored.
Solution
!git switch -
Previous HEAD position was a9979bd add line
Switched to branch 'master'
# Linux/macOS
!cat README.md
This is a DataLad dataset
It uses git and git-annex
For decentralized version control
# Windows
!type README.md
Exercise: Use git reset --mixed to reset the repository’s state to the point before README.md was modified. Then, check the git log and the dataset’s status.
NOTE: Using --mixed resets the repository’s state but does not affect your working directory - commits that happened after the point of reset will appear as unstaged changes.
Solution
!git reset --mixed HEAD~4
!git log --oneline
!datalad status
Unstaged changes after reset:
M README.md
e808ed1 (HEAD -> master) add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
untracked: .gitmodules (file)
untracked: ds003507 (directory)
untracked: ds005131 (directory)
modified: README.md (symlink)
Exercise: Save the unstaged changes to README.md. Then, check the content of README.md to make sure nothing got lost.
NOTE: Since you are adding what were multiple commits in a single operation, you may choose a different commit message.
Solution
!datalad save -m "adding info to README"
add(ok): ds003507 (dataset)
add(ok): ds005131 (dataset)
add(ok): .gitmodules (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 4)
  save (ok: 1)
# Linux/macOS
!cat README.md
This is a DataLad dataset
It uses git and git-annex
For decentralized version control
# Windows
!type README.md
Exercise: Use git reset --hard to reset the repository’s state to the point before README.md was modified. Then, check the git log and the dataset’s status.
NOTE: Using --hard modifies your working directory and all commits that happened after the point of reset will be gone (they can still be recovered if they haven’t been deleted by git’s garbage collector, which happens after 30 days by default). Also, this won’t remove the installed subdatasets (you can simply remove them manually).
Solution
!git reset --hard HEAD~4
!git log --oneline
!datalad status
warning: unable to rmdir 'ds003507': Directory not empty
warning: unable to rmdir 'ds005131': Directory not empty
HEAD is now at dbe3ff9 add README
dbe3ff9 (HEAD -> master) add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset
untracked: ds003507 (directory)
untracked: ds005131 (directory)
Section 5: Dataset Configurations: To Annex or not to Annex?
Background
By default, DataLad will use git-annex to handle the content of every single file in your dataset. However, this is not always desirable. For example, you may not want to annex small text files like code to avoid having to unlock them for every edit. We can tell DataLad which files should be annexed by editing the .gitattributes file. Let’s look at the default .gitattributes that was created when we initialized the dataset:
!cat .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
There are two lines in this file:
- `* annex.backend=MD5E`: tells git-annex to use the MD5E backend for generating file hashes
- `**/.git* annex.largefiles=nothing`: tells git-annex not to annex files whose names start with .git (such as .gitattributes or .gitmodules), so that these configuration files are stored directly in Git
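To give a feeling for what a hashing backend does, the Python sketch below derives an identifier from a file's size, the MD5 checksum of its content, and its extension. The key layout `MD5E-s<size>--<md5><extension>` and the function name `md5e_key` are stated here as assumptions for illustration; git-annex applies additional rules (for example, about which extensions it keeps), so treat this as an approximation rather than a reimplementation.

```python
import hashlib
import os


def md5e_key(filename, data):
    """Approximate an MD5E-style key from file name and content (simplified)."""
    digest = hashlib.md5(data).hexdigest()
    extension = os.path.splitext(filename)[1]
    return f"MD5E-s{len(data)}--{digest}{extension}"


key = md5e_key("file.txt", b"hello\n")
same_content_key = md5e_key("copy.txt", b"hello\n")
other_content_key = md5e_key("file.txt", b"hello world\n")
print(key)
```

Because the key depends only on the content (and extension), two files with identical content map to the same key, while any change to the content produces a different one.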
We usually don’t want to edit these default values. Instead, we want to add lines to .gitattributes to specify which contents should and shouldn’t be annexed. Note that changes in the configuration will not automatically be applied to files that are already tracked. Thus, it is best to configure .gitattributes right after initializing the dataset, before data is added.
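The effect of an annex.largefiles rule can be approximated in a few lines of Python. The sketch below imitates a rule that annexes binary files and files larger than 5kb. Both checks are simplifications and labeled assumptions: git-annex determines the encoding via MIME detection rather than by searching for null bytes, and the threshold of 5000 bytes is assumed for "5kb". The function name `would_be_annexed` is hypothetical.

```python
def would_be_annexed(data, size_threshold=5000):
    """Rough stand-in for a 'binary or larger than 5kb' largefiles rule."""
    looks_binary = b"\x00" in data        # crude heuristic, not real MIME detection
    is_large = len(data) > size_threshold
    return looks_binary or is_large


small_text = b"This is a DataLad dataset\n"
large_text = ("he" * 5000).encode()       # 10,000 bytes of plain text
small_binary = b"%PDF-1.4\x00\x01"        # tiny, but contains a null byte

decisions = {
    "small_text": would_be_annexed(small_text),
    "large_text": would_be_annexed(large_text),
    "small_binary": would_be_annexed(small_binary),
}
print(decisions)
```

Under this rule, a short README stays in plain Git and remains directly editable, while large or binary files are handed to the annex.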
Exercises
In the following exercises, we are going to modify our dataset’s .gitattributes to apply some custom configurations.
Here are the different commands and configuration options that you’ll need:
| Code | Description |
|---|---|
| `* annex.largefiles=(mimeencoding=binary)` | Only annex files with a binary encoding |
| `myfile.pdf annex.largefiles=nothing` | Don’t annex myfile.pdf |
| `* annex.largefiles=(largerthan=5kb)` | Only annex files whose size exceeds 5KB |
| `* annex.largefiles=((largerthan=5kb)or(mimeencoding=binary))` | Only annex binary files and files greater than 5KB |
| `git annex unannex <files>` | Unannex the content of the given files |
| `git annex whereis <files>` | Show the location of the annexed file content (empty if the file isn’t annexed) |
Exercise: Get the location of the annexed file content of books/byte-of-python.pdf.
Solution
!git annex whereis books/byte-of-python.pdf
whereis books/byte-of-python.pdf (1 copy)
eeb39b7d-85ed-43ed-8fe4-86d2512a1535 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad [here]
ok
Exercise: Add the line below to .gitattributes to avoid annexing PDFs.
**/*.pdf annex.largefiles=nothing
Example: Unannex books/byte-of-python.pdf and save to apply the changed configuration.
!git annex unannex books/byte-of-python.pdf
!datalad save -m "unannex book"
unannex books/byte-of-python.pdf ok
(recording state in git...)
add(ok): books/byte-of-python.pdf (file)
add(ok): .gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
Exercise: Get the location of the annexed file content of books/byte-of-python.pdf again. This should return nothing since the file isn’t annexed anymore.
Solution
!git annex whereis books/byte-of-python.pdf
Exercise: Change the last line in .gitattributes, so that only binary files will be annexed.
Solution
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=(mimeencoding=binary)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
Exercise: Unannex README.md and save to apply the changed configuration. Now you should be able to edit README.md without having to unlock it.
Solution
!git annex unannex README.md
!datalad save -m "annex only binary"
Exercise: Change the last line in .gitattributes so that (non-binary) files greater than 5kb will also be annexed.
Solution
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb))
Exercise: Execute the cell below to save a large text file. Then inspect README.md and the new file test.txt. Finally, get the location of the annexed file content for test.txt and README.md. If you configured .gitattributes correctly in the exercise above, test.txt should be a symlink but README.md shouldn’t.
# write 10,000 characters of text, which exceeds the 5kb threshold
with open('test.txt', 'w') as f:
    f.write('he' * 5000)
!datalad save -m "add large text file"
add(ok): test.txt (file)
add(ok): .gitattributes (file)
action summary:
  add (ok: 2)
  save (ok: 1)
Solution
!git annex whereis test.txt
!git annex whereis README.md
whereis test.txt (1 copy)
eeb39b7d-85ed-43ed-8fe4-86d2512a1535 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad [here]
ok