Creating a DataLad Dataset from Scratch

Authors
Ole Bialas | Michał Szczepanik

DataLad is a flexible tool that integrates easily into most workflows because, in essence, a DataLad dataset is just a regular folder on your machine (with some additional metadata in the .git and .datalad folders).

In this section we are going to create a new dataset from scratch using DataLad’s create command. Because every DataLad dataset is also a Git repository, this also initializes Git automatically. Once we have created the dataset, we can add any kind of data.

We can even add other DataLad datasets as so-called subdatasets! As we add data and make changes to our dataset, DataLad will keep track of everything in the git log. This gives us a comprehensive history of our dataset which allows us (and anyone we share the dataset with) to understand what has been done and even restore older versions of files.
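
Since a dataset is just a folder with extra metadata, a quick structural check is possible without DataLad itself. The following is a minimal sketch in Python (the notebook's language); the function name `looks_like_datalad_dataset` is made up for illustration, and the check only looks for the two metadata folders mentioned above rather than validating their contents.

```python
import tempfile
from pathlib import Path

def looks_like_datalad_dataset(path):
    """Return True if `path` contains a .git entry and a .datalad folder."""
    root = Path(path)
    return (root / ".git").exists() and (root / ".datalad").is_dir()

with tempfile.TemporaryDirectory() as tmp:
    # A plain folder without any metadata
    plain = Path(tmp) / "plain-folder"
    plain.mkdir()
    # A folder that only mimics the layout of a dataset (empty metadata folders)
    mimic = Path(tmp) / "mimic-dataset"
    (mimic / ".git").mkdir(parents=True)
    (mimic / ".datalad").mkdir()
    plain_result = looks_like_datalad_dataset(plain)
    mimic_result = looks_like_datalad_dataset(mimic)

print(plain_result, mimic_result)
```

The check is deliberately shallow: it tells a dataset apart from an ordinary folder, but only DataLad and Git can tell whether the metadata is valid.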

Section 1: Creating a new Dataset

Background

Once we create a dataset, DataLad will watch out for changes to any file. By using the status command, we can get a report on any files or changes in our dataset that are currently untracked. When we run datalad save, the untracked changes will be committed into the dataset’s history. We can add a little message with the -m flag to describe what has been done, e.g., -m "added raw recordings". While this is not required, it is a good practice that will make the dataset’s history more transparent to collaborators and your future self.
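
The status-then-save cycle can also be scripted instead of typed into notebook cells. The sketch below assumes the datalad command-line tool is on the PATH and simply shells out to it; the helper names `datalad_command` and `status_then_save` are illustrative and not part of DataLad.

```python
import shutil
import subprocess

def datalad_command(action, message=None):
    """Build the argument list for a DataLad call, e.g. datalad save -m "..."."""
    command = ["datalad", action]
    if message is not None:
        command += ["-m", message]
    return command

def status_then_save(dataset_dir, message):
    """Show the dataset's status, then save all changes with a commit message."""
    if shutil.which("datalad") is None:
        print("datalad is not installed; nothing was run")
        return False
    subprocess.run(datalad_command("status"), cwd=dataset_dir, check=True)
    subprocess.run(datalad_command("save", message), cwd=dataset_dir, check=True)
    return True

# Only build the command here; call status_then_save(...) inside a real dataset.
save_args = datalad_command("save", "added raw recordings")
print(save_args)
```

Passing the arguments as a list avoids shell quoting problems when the commit message contains spaces or quotes.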

Exercises

In this section we are going to create a new DataLad dataset. We are then going to add different kinds of content like text files and PDFs downloaded from the web and save them so DataLad keeps track of them. Here are the commands you need to know:

Code | Description
mkdir data/ | Create a new directory called data/
cd data/ | Change the working directory to data/
datalad create my-dataset | Create a DataLad dataset in the new directory my-dataset
datalad status | Show any untracked changes in the current dataset
datalad save -m "adding data" | Save all untracked changes in the current dataset with a commit message
echo "hello" > file.txt | Save the text "hello" to file.txt
curl -o file.txt <URL> | Download content from the given URL and write it to file.txt
curl -s -o file.txt <URL> | Silently download content from the given URL and write it to file.txt

Example: Create a new DataLad dataset called my-dataset in the current directory.

import os
# deactivate DataLad's progressbar for this notebook
os.environ['DATALAD_UI_PROGRESSBAR'] = 'none'

!datalad create my-dataset
create(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/my-dataset (dataset)

Exercise: Create a new DataLad dataset called learn-datalad in the current directory.

Solution
!datalad create learn-datalad
create(ok): /home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad (dataset)

Exercise: Change the current directory to learn-datalad and print the dataset’s status.

Solution
%cd learn-datalad
!datalad status
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad
nothing to save, working tree clean

Example: Create a new directory audio/ in learn-datalad/.

!mkdir audio

Exercise: Create a new directory books/ in learn-datalad/.

Solution
!mkdir books

Run the cell below to download https://homepages.uc.edu/~becktl/byte_of_python.pdf and write it to the output file byte-of-python.pdf in books/.

!curl -s -o books/byte-of-python.pdf https://homepages.uc.edu/~becktl/byte_of_python.pdf

Exercise: Check the status of the dataset.

Solution
!datalad status
untracked: books (directory)

Exercise: Save the untracked file and add the message "add a book on Python". Then, check the status of the dataset again.

Solution
!datalad save -m "add a book on Python"
!datalad status
add(ok): books/byte-of-python.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean

Exercise: Run the cell below to download https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf and write it to books/progit.pdf. Then, save the untracked file with a message "add a book on Git" and check the dataset’s status to make sure there are no untracked changes.

!curl -s -o books/progit.pdf https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf
Solution
!datalad save -m "add a book on Git"
!datalad status
add(ok): books/progit.pdf (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean

Exercise: Run the cell below to create a new file README.md with the text "This is a DataLad dataset". Then, save the untracked file and check the dataset’s status.

!echo "This is a DataLad dataset" > README.md
Solution
!datalad save -m "add README"
!datalad status
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean

Section 2: Modifying Content and Tracking Changes

Background

While DataLad wraps many functions of Git, there are some instances where we need to use git directly. Viewing the git log is one of those instances. This log is a critical part of any Git repository and DataLad dataset because it contains a comprehensive history of our dataset: every time datalad save records a change, a new entry is created. Each entry (commit) is identified by a unique hash and records the author’s name and email, the date, and the commit message.

Because DataLad wants to make sure that we don’t accidentally overwrite our files once they are committed, it locks them to make them unmodifiable. This is why we have to use datalad unlock before we can modify them. When we run datalad save to save our changes, the file will be locked again. Even though DataLad reports unlocking as a file modification, it will only create a new entry in the commit history if the file actually changed.
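
On Linux/macOS, a locked file is a symbolic link that points into the annex object store under .git/annex/objects, while an unlocked file is a regular file. The sketch below uses this to tell the two apart; the function name `is_locked_annex_file` and the link target used in the demo are made up for illustration (the demo link deliberately points nowhere).

```python
import os
import tempfile

def is_locked_annex_file(path):
    """Return True if `path` is a symlink pointing into .git/annex/objects."""
    if not os.path.islink(path):
        return False
    target = os.readlink(path).replace(os.sep, "/")
    return ".git/annex/objects" in target

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical link target; a real one contains the file's annex key.
    locked = os.path.join(tmp, "README.md")
    os.symlink(".git/annex/objects/aa/bb/KEY/KEY", locked)
    # A regular file, as it would look after unlocking
    unlocked = os.path.join(tmp, "notes.txt")
    with open(unlocked, "w") as f:
        f.write("regular file\n")
    locked_result = is_locked_annex_file(locked)
    unlocked_result = is_locked_annex_file(unlocked)

print(locked_result, unlocked_result)
```

This check does not apply on Windows, where files are not represented as symlinks.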

Note for Windows: The Windows file system does not support file locking in the same way that Linux/macOS does. Instead, Windows duplicates the data and keeps one copy in the working directory and one backup copy for safety in the .git folder. This has the advantage that you don’t need to unlock files before modifying them, but it also makes your dataset twice as big! Another consequence is that DataLad creates a separate commit for this adjustment, so the most recent entry in your git log will always show the message “git-annex adjusted branch”. This means that, to find the most recent commit you made to the dataset, you have to look at the second entry from the top of the git log.
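
When scripting on Windows, the adjustment commit can be skipped programmatically. The sketch below is one possible approach, not a DataLad feature: it reads the commit subjects with git log --format=%s (newest first) and returns the first one that is not the adjustment message quoted above. The function names are illustrative.

```python
import subprocess

ADJUSTED_SUBJECT = "git-annex adjusted branch"

def latest_user_commit(subjects):
    """Return the newest subject line that is not the adjustment commit."""
    for subject in subjects:
        if subject.strip() != ADJUSTED_SUBJECT:
            return subject.strip()
    return None

def read_subjects(repo_dir="."):
    """Read commit subjects, newest first (requires git and a repository)."""
    result = subprocess.run(
        ["git", "log", "--format=%s"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# Hand-written example of what read_subjects() might return on Windows
example_subjects = ["git-annex adjusted branch", "add README", "add a book on Git"]
latest = latest_user_commit(example_subjects)
print(latest)
```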

Exercises

In the following exercises we are going to inspect the git log to view the history of our dataset. We are also going to modify existing files by unlocking them and saving the changes to the commit history. Here are the commands you need to know:

Code | Description
git log | Display the commit history of the repository
git log -2 | Display the last two entries in commit history
git log --oneline | Display a compact one-line view of the commit history
datalad unlock data/ | Unlock the file content of the data/ folder
datalad unlock file.txt | Unlock the file content of file.txt
datalad status | Show any untracked changes in the current dataset
datalad save | Save untracked changes and lock unlocked file contents
datalad save -m "message" | Save untracked changes with a commit message
echo "content" >> file.txt | Append the text "content" to file.txt

Exercise: Display the git log to view all commits you made to the learn-datalad dataset.

Solution
!git log
commit e808ed1cbeee1cbac35783b3339a2262608c584f (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:23 2025 +0100

    add README

commit 4fc9ab1ceb12768fce2f39a785b1e3f4d4bd363c
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:22 2025 +0100

    add a book on Git

commit e7ff5e0709ee650c24616ab6e24e80918dc81b9e
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:19 2025 +0100

    add a book on Python

commit 34b88b0b1009622d1d8910dc495b684c61a8281c
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:12 2025 +0100

    [DATALAD] new dataset

Exercise: Display the git log in a compact one-line view.

Solution
!git log --oneline
e808ed1 (HEAD -> master) add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset

Exercise: Unlock the content of README.md. Then, check the dataset’s status.

Solution
!datalad unlock README.md
!datalad status
unlock(ok): README.md (file)
 modified: README.md (file)

Example: Append the line "It uses git and git-annex" to README.md, either using your editor or the echo command. Then, save with a message and check the dataset’s status.

!echo "It uses git and git-annex" >> README.md
!datalad save -m "add line"
!datalad status
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean

Exercise: Unlock README.md and append another line "For decentralized version control". Then, save the changes and check the status.

Solution
!datalad unlock README.md
!echo "For decentralized version control" >> README.md
!datalad save -m "add another line"
!datalad status
unlock(ok): README.md (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
nothing to save, working tree clean

Exercise: Display the last two entries in the git history.

Solution
!git log -2
commit bfddf2ec59900bc61f8175a8a2d9693db73a1898 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:27 2025 +0100

    add another line

commit a9979bd13cdd06f808dd8b14f4d19edf82988bb5
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:25 2025 +0100

    add line

Exercise: Unlock README.md and then, without making any changes, save with a message. Check the last two entries in the git history - did your save command create an entry?

Solution
!datalad unlock README.md
!datalad save -m "did nothing"
unlock(ok): README.md (file)
add(ok): README.md (file)
action summary:
  add (ok: 1)
  save (notneeded: 1)
!git log -2
commit bfddf2ec59900bc61f8175a8a2d9693db73a1898 (HEAD -> master)
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:27 2025 +0100

    add another line

commit a9979bd13cdd06f808dd8b14f4d19edf82988bb5
Author: obi <ole.bialas@posteo.de>
Date:   Wed Dec 10 15:51:25 2025 +0100

    add line

Section 3: Installing Subdatasets

Background

You can add any data to your DataLad dataset, including other datasets! DataLad allows you to install datasets as submodules, which means that they are added to your repository while maintaining their own, independent git history. This allows us to modularize research projects by, for example, creating subdatasets for different modalities, conditions, or analysis methods. Modularizing the dataset often results in a cleaner history and easier-to-maintain project, and it also increases the reusability because it allows you and others to reuse only specific components of the data.

Installing subdatasets is done via DataLad’s install command. This works similarly to clone but is more versatile and allows us to install a subdataset into a given path while automatically registering it into the superdataset’s history.
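
Registering a subdataset means, among other things, that its path and source URL are recorded in the superdataset’s .gitmodules file. The sketch below parses such a file with a few lines of Python. The sample content is a simplified, hand-written illustration (a real file written by DataLad may contain additional keys), and `parse_gitmodules` is an illustrative helper, not a DataLad function.

```python
import re

def parse_gitmodules(text):
    """Map each submodule name to a dict of its key/value settings."""
    submodules = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r'\[submodule "(.+)"\]$', line)
        if header:
            current = submodules.setdefault(header.group(1), {})
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return submodules

# Simplified example content, for illustration only
sample = (
    '[submodule "ds005131"]\n'
    "\tpath = ds005131\n"
    "\turl = https://github.com/OpenNeuroDatasets/ds005131.git\n"
)
registered = parse_gitmodules(sample)
print(registered["ds005131"]["url"])
```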

Exercises

In the following exercises, we are going to install datasets from OpenNeuro as subdatasets into our new dataset. Here are the commands you need to know:

Code | Description
datalad install -d my-dataset <URL> | Install the dataset from the given URL and register it as a subdataset of the dataset my-dataset/
datalad install -d . <URL> | Install the dataset from the given URL and register it as a subdataset of the dataset in the current directory
datalad subdatasets | List all subdatasets of the current dataset

Example: Install the dataset from the OpenNeuro URL https://github.com/OpenNeuroDatasets/ds005131.git as a subdataset into the current dataset.

!datalad install -d . https://github.com/OpenNeuroDatasets/ds005131.git
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore 
[INFO   ] https://github.com/OpenNeuroDatasets/ds005131.git/config download failed: Not Found 
install(ok): ds005131 (dataset)
add(ok): ds005131 (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)

Exercise: Change the directory to the root of the newly installed subdataset ds005131/ and check its git log.

Solution
%cd ds005131
!git log --oneline
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad/ds005131
51c1338 (HEAD -> main, tag: 1.0.1, origin/master, origin/main, origin/HEAD) [OpenNeuro] Recorded changes
7bb0e92 [OpenNeuro] Recorded changes
579b3aa [OpenNeuro] Recorded changes
95b3ce9 [OpenNeuro] Recorded changes
82286ff [OpenNeuro] Recorded changes
577f003 (tag: 1.0.0) [OpenNeuro] Recorded changes
2779065 [OpenNeuro] Recorded changes
de27cca [OpenNeuro] Recorded changes
087aafd [OpenNeuro] Recorded changes
86fb2d1 [OpenNeuro] Recorded changes
3e04d03 [OpenNeuro] Recorded changes
3a6eca0 [OpenNeuro] Recorded changes
f9191e6 [OpenNeuro] Recorded changes
bc72cc4 [OpenNeuro] Recorded changes
f941f62 [OpenNeuro] Recorded changes
7c5d834 [OpenNeuro] Dataset created

Exercise: Change the directory back to the parent learn-datalad/. Then, browse the OpenNeuro database, choose a dataset and install it as another subdataset.

Solution
%cd ..
!datalad install -d . https://github.com/OpenNeuroDatasets/ds003507.git
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore 
[INFO   ] https://github.com/OpenNeuroDatasets/ds003507.git/config download failed: Not Found 
[INFO   ] access to 1 dataset sibling s3-PRIVATE not auto-enabled, enable with:
| 		datalad siblings -d "/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad/ds003507" enable -s s3-PRIVATE 
install(ok): ds003507 (dataset)
add(ok): ds003507 (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
add(ok): .gitmodules (file)
save(ok): . (dataset)
action summary:
  add (ok: 3)
  install (ok: 1)
  save (ok: 2)

Exercise: Change the directory to the newly installed subdataset and inspect its git log.

Solution
%cd ds003507
!git log --oneline
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad/ds003507
8b8fad4 (HEAD -> master, tag: 1.0.1, origin/master, origin/HEAD) [DATALAD] Recorded changes
29ce3cc [DATALAD] Recorded changes
b82c86f [DATALAD] Recorded changes
e548886 [DATALAD] Recorded changes
f212034 [DATALAD] Recorded changes
540710b [DATALAD] Recorded changes
ea7a5e4 [DATALAD] Recorded changes
80eeffd [DATALAD] Recorded changes
a821149 [DATALAD] Recorded changes
b461339 [DATALAD] Recorded changes
98e47d8 [DATALAD] Recorded changes
a9d5d59 [DATALAD] Recorded changes
5cb3c0b (tag: 1.0.0) [DATALAD] Recorded changes
2f692c4 [DATALAD] Recorded changes
7527e33 [DATALAD] exclude paths from annex'ing
cac78dc [DATALAD] new dataset

Exercise: Change the directory back to the parent learn-datalad/ and list all subdatasets.

Solution
%cd ..
!datalad subdatasets
/home/olebi/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad
subdataset(ok): ds003507 (dataset)
subdataset(ok): ds005131 (dataset)

Section 4: Going Back and Forth in Time

Background

Because DataLad keeps track of all changes to our dataset, we can restore any previous version of a given file. This can be very useful if we made a mistake and want to restore an older version of our project, or we simply want to check how the data looked previously. In this section, we are going to learn two ways of doing this: checking out a specific commit and resetting the repository. The checkout is mostly useful if we want to look at an older state of our project without actually changing the current state of the repository, while the reset is used to modify the repository’s state.

Note that checking out a commit by its hash does not create or move a branch. Instead, Git enters a “detached HEAD” state: the working directory shows the files as they were at that commit, while the main/master branch stays untouched. To switch back to the previous branch (i.e., main/master), use git switch -.
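
The difference between looking at an old state and resetting to it can be made concrete with a toy model. The class below is not how Git is implemented; it is a deliberately simplified sketch in which the history is a list of snapshots, `branch` marks the commit the branch points to, and `head` marks the commit we are currently looking at. All names are made up for illustration.

```python
class ToyRepo:
    """Toy model: commits are snapshots (dicts of filename -> content), oldest first."""

    def __init__(self, commits):
        self.commits = commits
        self.branch = len(commits) - 1   # commit the branch points to
        self.head = self.branch          # commit we are currently looking at
        self.worktree = dict(commits[self.branch])

    def checkout(self, index):
        """Look at an older commit; the branch itself does not move."""
        self.head = index
        self.worktree = dict(self.commits[index])

    def switch_back(self):
        """Return to the tip of the branch (like git switch -)."""
        self.head = self.branch
        self.worktree = dict(self.commits[self.branch])

    def reset_mixed(self, index):
        """Move the branch back but leave the files in the working tree alone."""
        self.branch = self.head = index

    def reset_hard(self, index):
        """Move the branch back and overwrite the working tree to match."""
        self.branch = self.head = index
        self.worktree = dict(self.commits[index])

history = [
    {"README.md": "This is a DataLad dataset\n"},
    {"README.md": "This is a DataLad dataset\nIt uses git and git-annex\n"},
]
repo = ToyRepo(history)

repo.checkout(0)
after_checkout = (repo.branch, repo.worktree == history[0])

repo.switch_back()
repo.reset_mixed(0)
after_mixed = (repo.branch, repo.worktree == history[1])

repo.reset_hard(0)
after_hard = (repo.branch, repo.worktree == history[0])

print(after_checkout, after_mixed, after_hard)
```

In the model, only the two reset methods change `branch`, and only checkout and the hard reset replace the files in the working tree.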

Exercises

In the following exercises we are going to use the git history to look at old file versions and restore previous states of our dataset. Here are the commands you need to know:

Code | Description
git log --oneline | Display a compact one-line view of the commit history
git checkout d0e83f29 | Check out the state of the repository at the commit with the hash d0e83f29
git switch - | Switch back to the previous branch
git reset --mixed d0e83f29 | Reset the state of the repository to the commit with the hash d0e83f29 but keep the working directory as-is
git reset --hard d0e83f29 | Reset the state of the repository to the commit with the hash d0e83f29 and overwrite tracked files in the working directory to match it
datalad status | Show any untracked changes in the current dataset
datalad save -m "message" | Save untracked changes with a commit message
cat file.txt | Display the content of file.txt (Linux/macOS)
type file.txt | Display the content of file.txt (Windows)

Exercise: Run the cell below to print the git history, identify the commit where README.md was added to the repository and note its commit hash.

!git log --oneline
7174564 (HEAD -> master) [DATALAD] Added subdataset
f52db6a [DATALAD] Added subdataset
bfddf2e add another line
a9979bd add line
e808ed1 add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset

Exercise: Check out the commit where README.md was added to the repository.

Solution
!git checkout HEAD~4
warning: unable to rmdir 'ds003507': Directory not empty
warning: unable to rmdir 'ds005131': Directory not empty
Note: switching to 'HEAD~4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at e808ed1 add README
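
The solution refers to the commit as HEAD~4 rather than by its hash. In a linear history like this one, HEAD~n is simply the commit n entries below the top of git log --oneline. The sketch below makes that counting explicit using the log printed earlier in this section; `resolve_head_offset` is an illustrative helper and does not handle histories with merges.

```python
def resolve_head_offset(oneline_log, n):
    """Return the short hash of HEAD~n from git log --oneline text (newest first)."""
    lines = [line for line in oneline_log.splitlines() if line.strip()]
    return lines[n].split()[0]

log = """\
7174564 (HEAD -> master) [DATALAD] Added subdataset
f52db6a [DATALAD] Added subdataset
bfddf2e add another line
a9979bd add line
e808ed1 add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
"""

readme_commit = resolve_head_offset(log, 4)
print(readme_commit)
```

Using the hash directly (git checkout e808ed1) is equivalent here and does not depend on how many commits were added afterwards.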

Exercise: Check the git commit history and inspect the content of README.md.

Solution
!git log --oneline
e808ed1 (HEAD) add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
# Linux/macOS
!cat README.md
This is a DataLad dataset
# Windows
!type README.md

Exercise: Switch back to the previous (i.e. the main/master) branch.

Solution
!git switch -
Previous HEAD position was e808ed1 add README
Switched to branch 'master'

Exercise: Identify the hash of the commit where we appended the first line to README.md. Then, check out that commit and inspect the content of README.md.

Solution
!git checkout HEAD~3
warning: unable to rmdir 'ds003507': Directory not empty
warning: unable to rmdir 'ds005131': Directory not empty
Note: switching to 'HEAD~3'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at a9979bd add line
# Linux/macOS
!cat README.md
This is a DataLad dataset
It uses git and git-annex
# Windows
!type README.md

Exercise: Switch back to the master branch and inspect the content of README.md to make sure it was restored.

Solution
!git switch -
Previous HEAD position was a9979bd add line
Switched to branch 'master'
# Linux/macOS
!cat README.md
This is a DataLad dataset
It uses git and git-annex
For decentralized version control
# Windows
!type README.md

Exercise: Use git reset --mixed to reset the repository’s state to the point before README.md was modified. Then, check the git log and the dataset’s status.

NOTE: Using --mixed resets the repository’s state but does not affect your working directory: changes from commits made after the point of reset will show up as modified or untracked files.

Solution
!git reset --mixed HEAD~4
!git log --oneline
!datalad status
Unstaged changes after reset:
M	README.md
e808ed1 (HEAD -> master) add README
4fc9ab1 add a book on Git
e7ff5e0 add a book on Python
34b88b0 [DATALAD] new dataset
untracked: .gitmodules (file)
untracked: ds003507 (directory)
untracked: ds005131 (directory)
 modified: README.md (symlink)

Exercise: Save the unstaged changes to README.md. Then, check the content of README.md to make sure nothing got lost.

NOTE: Since you are adding what were multiple commits in a single operation, you may choose a different commit message.

Solution
!datalad save -m "adding info to README"
add(ok): ds003507 (dataset)
add(ok): ds005131 (dataset)
add(ok): .gitmodules (file)
add(ok): README.md (file)
save(ok): . (dataset)
action summary:
  add (ok: 4)
  save (ok: 1)
# Linux/macOS
!cat README.md
This is a DataLad dataset
It uses git and git-annex
For decentralized version control
# Windows
!type README.md

Exercise: Use git reset --hard to reset the repository’s state to the point before README.md was modified. Then, check the git log and the dataset’s status.

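NOTE: Using --hard modifies your working directory, and all commits that happened after the point of reset will be removed from the branch’s history (they can still be recovered if they haven’t been deleted by git’s garbage collector, which happens after 30 days by default). Also, this won’t remove the installed subdatasets (you can simply remove them manually). The solution uses HEAD~4, which assumes the history printed at the start of this section; if you combined the later commits into a single one in the previous exercise, the commit “add README” is only one step back, so use HEAD~1 (or its hash) instead.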

Solution
!git reset --hard HEAD~4
!git log --oneline
!datalad status
warning: unable to rmdir 'ds003507': Directory not empty
warning: unable to rmdir 'ds005131': Directory not empty
HEAD is now at dbe3ff9 add README
dbe3ff9 (HEAD -> master) add README
fa67eb5 add a book on Git
bdd7b31 add a book on Python
417f3a3 [DATALAD] new dataset
untracked: ds003507 (directory)
untracked: ds005131 (directory)

Section 5: Dataset Configurations: To Annex or not to Annex?

Background

By default, DataLad will use git-annex to handle the content of every single file in your dataset. However, this is not always desirable. For example, you may not want to annex small text files like code to avoid having to unlock them for every edit. We can tell DataLad which files should be annexed by editing the .gitattributes file. Let’s look at the default .gitattributes that was created when we initialized the dataset:

!cat .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing

There are two lines in this file:

  • * annex.backend=MD5E: tells git-annex to use the MD5E backend for generating file hashes
  • **/.git* annex.largefiles=nothing: tells git-annex not to annex files whose names start with .git, such as .gitattributes and .gitmodules, so that these small configuration files are stored directly in Git
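
The hash backend also explains why saving an unlocked but unchanged file creates no new commit: git-annex identifies file content by a key derived from the content itself, so identical content yields an identical key. The sketch below approximates the layout of an MD5E key (size, MD5 digest, file extension). It is an illustration rather than git-annex’s implementation; details such as the handling of multi-part extensions may differ, and `md5e_style_key` is a made-up name.

```python
import hashlib
import os

def md5e_style_key(content: bytes, filename: str) -> str:
    """Build a key of the form MD5E-s<size>--<md5 digest><extension>."""
    digest = hashlib.md5(content).hexdigest()
    extension = os.path.splitext(filename)[1]
    return f"MD5E-s{len(content)}--{digest}{extension}"

original = b"This is a DataLad dataset\n"
edited = original + b"It uses git and git-annex\n"

key_before = md5e_style_key(original, "README.md")
key_unchanged = md5e_style_key(original, "README.md")
key_edited = md5e_style_key(edited, "README.md")

print(key_before == key_unchanged, key_before == key_edited)
```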

We usually don’t want to edit these default values. Instead, we want to add lines to .gitattributes to specify which contents should and shouldn’t be annexed. Note that changes in the configuration will not automatically be applied to files that are already tracked. Thus, it is best to configure .gitattributes right after initializing the dataset, before data is added.

Exercises

In the following exercises, we are going to modify our dataset’s .gitattributes to apply some custom configurations. Here are the different commands and configuration options that you’ll need:

Code | Description
* annex.largefiles=(mimeencoding=binary) | Only annex files with a binary encoding
myfile.pdf annex.largefiles=nothing | Don’t annex myfile.pdf
* annex.largefiles=(largerthan=5kb) | Only annex files whose size exceeds 5KB
* annex.largefiles=((largerthan=5kb)or(mimeencoding=binary)) | Only annex binary files and files greater than 5KB
git annex unannex <files> | Unannex the content of the given files
git annex whereis <files> | Show the location of the annexed file content (empty if the file isn’t annexed)
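
To get a feeling for how the combined rule behaves, the sketch below mimics ((largerthan=5kb)or(mimeencoding=binary)) in plain Python. It rests on two assumptions that are not taken from git-annex: 5kb is treated as 5000 bytes, and a file counts as binary if it contains a NUL byte (git-annex uses proper MIME detection instead). The function names are illustrative.

```python
def looks_binary(data: bytes) -> bool:
    """Crude stand-in for mimeencoding=binary: any NUL byte means binary."""
    return b"\x00" in data

def would_annex(data: bytes, size_limit: int = 5000) -> bool:
    """Mimic the rule: annex if larger than the limit OR binary."""
    return len(data) > size_limit or looks_binary(data)

readme = b"This is a DataLad dataset\n"   # small text file
large_text = b"he" * 5000                 # 10,000 bytes of text
binary_blob = b"%PDF-1.4\n\x00\x01\x02"   # hypothetical binary snippet

decisions = {
    "README.md": would_annex(readme),
    "test.txt": would_annex(large_text),
    "book.pdf": would_annex(binary_blob),
}
print(decisions)
```

Only the small text file stays out of the annex, so it can be edited without unlocking.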

Exercise: Get the location of the annexed file content of books/byte-of-python.pdf.

Solution
!git annex whereis books/byte-of-python.pdf
whereis books/byte-of-python.pdf (1 copy) 
  	eeb39b7d-85ed-43ed-8fe4-86d2512a1535 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad [here]
ok

Exercise: Add the line below to .gitattributes to avoid annexing PDFs.

**/*.pdf annex.largefiles=nothing
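
Instead of opening an editor, the line can also be appended from Python. The sketch below adds a rule only if it is not already present, so running it twice does not duplicate the line. The helper `add_gitattributes_rule` is illustrative, and the demo works on a temporary folder rather than on your dataset.

```python
import os
import tempfile

def add_gitattributes_rule(dataset_dir, rule):
    """Append `rule` to .gitattributes unless an identical line already exists."""
    path = os.path.join(dataset_dir, ".gitattributes")
    lines = []
    if os.path.exists(path):
        with open(path) as f:
            lines = f.read().splitlines()
    if rule in lines:
        return False
    lines.append(rule)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return True

with tempfile.TemporaryDirectory() as tmp:
    # Start from the default content shown above
    with open(os.path.join(tmp, ".gitattributes"), "w") as f:
        f.write("* annex.backend=MD5E\n**/.git* annex.largefiles=nothing\n")
    rule = "**/*.pdf annex.largefiles=nothing"
    added_first = add_gitattributes_rule(tmp, rule)
    added_second = add_gitattributes_rule(tmp, rule)
    with open(os.path.join(tmp, ".gitattributes")) as f:
        final_lines = f.read().splitlines()

print(added_first, added_second, len(final_lines))
```

To apply it to the exercise, call the function with the path to learn-datalad/ and then run datalad save as usual.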

Example: Unannex books/byte-of-python.pdf and save to apply the changed configuration.

!git annex unannex books/byte-of-python.pdf
!datalad save -m "unannex book"
unannex books/byte-of-python.pdf ok
(recording state in git...)
add(ok): books/byte-of-python.pdf (file)
add(ok): .gitattributes (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)

Exercise: Get the location of the annexed file content of books/byte-of-python.pdf again. This should return nothing since the file isn’t annexed anymore.

Solution
!git annex whereis books/byte-of-python.pdf

Exercise: Change the last line in .gitattributes, so that only binary files will be annexed.

Solution
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=(mimeencoding=binary)

save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)

Exercise: Unannex README.md and save to apply the changed configuration. Now you should be able to edit README.md without having to unlock it.

Solution
!git annex unannex README.md
!datalad save -m "annex only binary"

Exercise: Change the last line in .gitattributes so that (non-binary) files greater than 5kb will also be annexed.

Solution
# new content of .gitattributes
* annex.backend=MD5E
**/.git* annex.largefiles=nothing
* annex.largefiles=((mimeencoding=binary)or(largerthan=5kb))

Exercise: Execute the cell below to save a large text file. Then inspect README.md and the new file test.txt. Then, get the location of the annexed file content for test.txt and README.md - if you configured .gitattributes correctly in the exercise above, test.txt should be a symlink but README.md shouldn’t.

# write 10,000 characters so the file is well above the 5kb threshold
with open('test.txt', 'w') as f:
    f.write('he' * 5000)
!datalad save -m "add large text file"
add(ok): test.txt (file)
add(ok): .gitattributes (file)
Total: 100%|██████████████████████████| 1.00/1.00 [00:00<00:00, 15.1 datasets/s]
action summary:
  add (ok: 2)
  save (ok: 1)

Solution
!git annex whereis test.txt
!git annex whereis README.md
whereis test.txt (1 copy) 
  	eeb39b7d-85ed-43ed-8fe4-86d2512a1535 -- olebi@iBots-7:~/projects/new-learning-platform/notebooks/datalad/01_using_and_creating_datasets/02_creating_a_dataset_from_scratch/learn-datalad [here]
ok