Static Workflows with Snakemake
Snakemake is a workflow tool for turning a collection of scripts, commands, notebooks, and expected files into a reproducible pipeline. Instead of manually remembering which script to run first, which files it produces, and which later steps depend on those files, we describe those relationships explicitly in a workflow file, named either Snakefile or given a .smk file extension.
The central idea is:
- rules describe how files are created
- outputs define what Snakemake can make
- inputs define what each rule depends on
Snakemake uses those relationships to build a DAG (directed acyclic graph) that plans which code needs to run on which data. While executing the code, it checks that each step produces the files it declared. This is especially useful in biology and neuroscience projects, where analyses often grow from a few scripts into a fragile chain of preprocessing, statistics, plots, reports, and manual reruns. Snakemake helps make that chain visible, testable, and easier to rerun.
Section 1: Running Code with run, shell, script, and notebook
A Snakemake rule needs some way to execute work. In this section, we start with the simplest possible rules: rules that just run code. These examples do not yet track output files or dependencies; the goal is only to see the different ways Snakemake can call code:
- run: executes Python code directly inside the Snakefile.
- shell: executes a shell command.
- script: runs an external script. Supports Python, Xonsh, Hy, R, R Markdown, Julia, Rust, and Bash files.
- notebook: runs a Jupyter notebook as part of the workflow. It is very useful to also include log: notebook="logs/my_notebook.ipynb" as a place to store the executed notebook.
For small examples, run: is convenient because everything is visible in one file. For real workflows, script: and notebook: are often cleaner because they keep analysis code separate from workflow structure. They also make it possible to run rules in parallel and in different computational environments.
Exercises
Preparation: Create a Snakemake workflow file called workflow.smk and use it to create the requested rules in the exercises below. To run the rule, the following command is helpful:

    snakemake --cores 1 -s workflow.smk <rule_name>

Example: Use the run directive to create a say_hello rule that prints “Hello!”.

    rule say_hello:
        run:
            print("Hello!")

Exercise: Use the run directive to create a say_bye rule that prints “Goodbye!”.
Solution

    rule say_bye:
        run:
            print("Goodbye!")

Exercise: Create a Python script file called smoke_test.py in a scripts/ directory that prints the message “successfully run”, then run it from a rule with the script directive.
Solution
Create scripts/smoke_test.py:

    print("successfully run")

Then call it from the workflow:

    rule smoke_test_script:
        script:
            "scripts/smoke_test.py"

Exercise: Create a Jupyter notebook called notebook.ipynb that prints “I ran!” in a single cell, and run it from a rule with the notebook directive. Use the log: notebook= directive to store the executed notebook (see the solution below for the syntax). Note: you may need to install papermill for this to work correctly.
Solution

    rule notebook_example:
        log:
            notebook="logs/notebooks/notebook.ipynb"
        notebook:
            "notebook.ipynb"

Section 2: Tracking Code Outputs
Running code is not enough for a reproducible workflow. Snakemake becomes useful when rules declare the files they create. The output: directive tells Snakemake what a rule promises to produce.
This gives Snakemake a way to check whether the rule succeeded. If a rule finishes but the declared output file does not exist, Snakemake treats that as an error. This is one of the main benefits of using a workflow manager instead of manually running scripts: the workflow can detect when a step did not actually produce what it claimed to produce.
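The check described above can be sketched in plain Python. This is only an illustration of the idea, not how Snakemake implements it; the helper name run_and_check and the temporary file names are made up for the example.

```python
from pathlib import Path
import tempfile


def run_and_check(action, declared_outputs):
    """Run a step, then verify that every declared output file exists.

    Illustration only: this mimics the idea behind the missing-output
    check, not Snakemake's real implementation.
    """
    action()
    missing = [str(p) for p in declared_outputs if not Path(p).exists()]
    if missing:
        raise FileNotFoundError(f"Step finished but did not create: {missing}")
    return True


# Usage: the step always writes hello.txt.
workdir = Path(tempfile.mkdtemp())
written = workdir / "hello.txt"
promised = workdir / "hello2.txt"

# Declaration matches what the step writes: the check passes.
ok = run_and_check(lambda: written.write_text("Hi!"), [written])

# Declaration promises hello2.txt instead: the check fails.
try:
    run_and_check(lambda: written.write_text("Hi!"), [promised])
    mismatch_detected = False
except FileNotFoundError:
    mismatch_detected = True
```

The step itself runs without error in both cases; only the comparison against the declared outputs reveals the problem, which is the information a manually run script would not give you.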
In this section, we will practice:
- declaring one or more output files
- referring to output paths inside run: blocks
- passing output paths to external scripts
The key habit is: do not hard-code output paths in two places if Snakemake already knows them. Prefer using output[0], output["name"], or named output attributes where possible.
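The same habit carries over to external Python scripts. When a file is run through the script: directive, Snakemake makes an object named snakemake available inside it, so the script can read snakemake.output[0] instead of repeating the path. Below is a minimal sketch of such a script; the function name write_greeting and its default text are assumptions for illustration.

```python
from pathlib import Path


def write_greeting(output_path, text="Hi!"):
    """Write `text` to `output_path`, creating the parent folder if needed."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


# When Snakemake runs this file through `script:`, it injects an object
# named `snakemake`. Outside Snakemake that name does not exist, so this
# guard keeps the file importable and testable on its own.
if "snakemake" in globals():
    write_greeting(snakemake.output[0])
```

Keeping the work in a function and the Snakemake-specific lookup in one guarded line is a design choice, not a requirement: it lets the same file be tested without running a workflow.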
Exercises
Preparation: Create a new Snakemake workflow file called workflow2.smk and use it to create the requested rules in the exercises below.
Note: Snakemake will not re-run a rule if it detects that the output was already created. You can force it to rerun all rules with:

    snakemake --cores 1 -s workflow2.smk --forceall <rule_name>

Example: Create a rule called smoke_test that writes a data/smoke.txt file that contains the text “The code was run successfully.”

    rule smoke_test:
        output:
            'data/smoke.txt'
        run:
            from pathlib import Path
            path = Path('data/smoke.txt')
            path.write_text('The code was run successfully.')

Exercise: Create a rule called write_greeting that writes a data/hello.txt file that contains the text “Hi!”
Solution

    rule write_greeting:
        output:
            'data/hello.txt'
        run:
            from pathlib import Path
            path = Path('data/hello.txt')
            path.write_text('Hi!')

Exercise: What happens if the file that Snakemake expects isn’t created by the code? In write_greeting, try modifying the filename in either output or run, and check that Snakemake raises an error.
Solution
Snakemake raises a MissingOutputException because the rule finished but the declared output file was not created. For example, this rule writes data/hello.txt while promising data/hello2.txt:

    rule write_greeting:
        output:
            "data/hello2.txt"
        run:
            from pathlib import Path
            path = Path("data/hello.txt")
            path.write_text("Hi!")

Exercise: A rule can have multiple outputs, simply by separating them with a comma (,). Create a rule called write_greetings that writes both a data/hello.txt file and a data/bye.txt file.
Solution

    rule write_greetings:
        output:
            'data/hello.txt',
            'data/bye.txt'
        run:
            from pathlib import Path
            Path('data/hello.txt').write_text('Hi!')
            Path('data/bye.txt').write_text('Bye!')

Exercise: Writing the same filename in two locations isn’t the nicest way to maintain a workflow. Modify write_greetings so that the run: code references the outputs directly, using the Snakemake-provided output variable: output[0] and output[1].
Solution

    rule write_greetings:
        output:
            "data/hello.txt",
            "data/bye.txt"
        run:
            from pathlib import Path
            Path(output[0]).write_text("Hi!")
            Path(output[1]).write_text("Bye!")

Exercise: Outputs can also be referenced by keyword. Modify write_greetings so that the run: code references the outputs by name, using the Snakemake-provided output variable: output['hi'] and output['bye'].
Syntax looks like this:

    rule example:
        output:
            a = 'file1.txt',
            b = 'file2.txt'
        run:
            print(output['a'])
            print(output['b'])

Solution

    rule write_greetings:
        output:
            hi = "data/hello.txt",
            bye = "data/bye.txt"
        run:
            from pathlib import Path
            Path(output["hi"]).write_text("Hi!")
            Path(output["bye"]).write_text("Bye!")

Exercise: The shell: directive is useful for keeping scripts separate from workflow management, and Snakemake variables can be inserted into the command string. Make a write_greetings_shell rule that runs the following code as a command-line script:

    from pathlib import Path
    import sys
    Path(sys.argv[1]).write_text('Hi!')
    Path(sys.argv[2]).write_text('Bye!')

To run the script through the shell: directive, the command string looks like this: python my_script.py {output[0]} {output[1]}
Solution
Create write_greetings.py:

    from pathlib import Path
    import sys
    Path(sys.argv[1]).write_text("Hi!")
    Path(sys.argv[2]).write_text("Bye!")

Then call it from Snakemake:

    rule write_greetings_shell:
        output:
            "data/hello.txt",
            "data/bye.txt"
        shell:
            "python write_greetings.py {output[0]} {output[1]}"

Section 3: Mapping Outputs to Inputs
So far, each rule has been isolated. Real workflows are chains: one rule creates a file, and another rule uses that file as input. Snakemake connects rules by matching input: files to output: files.
This is the core workflow model:
- rule A creates a file
- rule B needs that file
- therefore: rule A must run before rule B
Snakemake uses these input/output relationships to build a directed acyclic graph, or DAG. The DAG is the plan of which jobs need to run and in which order. So, rather than telling Snakemake “run step 1, then step 2,” we tell it “I want this final file.” Snakemake then works backward to determine which rules are needed.
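To make the "work backward from the final file" idea concrete, here is a small Python sketch. It is a simplified model under stated assumptions, not Snakemake's actual algorithm: the RULES table and the plan function are hypothetical, every input is assumed to be produced by exactly one rule, and the standard-library graphlib module (Python 3.9+) does the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical rule table for illustration: rule name -> (inputs, outputs).
RULES = {
    "say_hi": ([], ["data/hi.txt"]),
    "add_name": (["data/hi.txt"], ["data/hi_name.txt"]),
    "unrelated": ([], ["data/other.txt"]),
}


def plan(target, rules):
    """Return the rules needed to build `target`, in a valid run order."""
    # Map every output file to the rule that creates it.
    producer = {out: name for name, (_, outs) in rules.items() for out in outs}
    graph = {}  # rule name -> set of rules that must run first

    def visit(path):
        rule = producer[path]  # KeyError if no rule creates this file
        if rule in graph:
            return
        graph[rule] = set()
        for needed in rules[rule][0]:
            graph[rule].add(producer[needed])
            visit(needed)

    visit(target)
    return list(TopologicalSorter(graph).static_order())


order = plan("data/hi_name.txt", RULES)
print(order)  # ['say_hi', 'add_name']
```

Note that the rule named unrelated never appears in the plan: only rules reachable from the requested file are scheduled.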
In this section, we:
- move from individual rules to connected rules.
- introduce the common rule all pattern. rule all usually sits at the top of a Snakefile and declares the final files we want the workflow to produce.

| Option | Use |
|---|---|
| --dag \| dot -Tpdf > dag.pdf | Print the DAG in Graphviz DOT format |

Exercises
Preparation: Create a new Snakemake workflow file called workflow3.smk and use it to create the requested rules in the exercises below. If you’d like to view the DAG that Snakemake is creating, the following command is helpful (requires graphviz to be installed):

    snakemake --cores 1 -s workflow3.smk --dag | dot -Tpdf > dag.pdf

Exercise: Create two separate rules, using input and output directives:
- say_hi, which creates a file called data/hi.txt that contains just the word “Hello”
- add_name, which loads the data/hi.txt file and creates a new file, data/hi_name.txt, that appends “, World!” to the “Hello”
Solution

    from pathlib import Path

    rule say_hi:
        output:
            "data/hi.txt"
        run:
            Path(output[0]).write_text("Hello")

    rule add_name:
        input:
            "data/hi.txt"
        output:
            "data/hi_name.txt"
        run:
            greeting = Path(input[0]).read_text().strip()
            Path(output[0]).write_text(greeting + ", World!")

Exercise: Unless a rule is specified in the command line, Snakemake will always run the first rule listed. So a common pattern for generating all the files needed in a workflow is to create a rule called all that just contains an input: directive with a list of the files needed for the workflow to be successful. Then, no rule is needed in the command line!
Make an all rule that asks for the data/hi_name.txt file to be created. Then run the workflow without specifying a rule name, and confirm that it ran correctly.
Solution

    from pathlib import Path

    rule all:
        input:
            "data/hi_name.txt"

    rule say_hi:
        output:
            "data/hi.txt"
        run:
            Path(output[0]).write_text("Hello")

    rule add_name:
        input:
            "data/hi.txt"
        output:
            "data/hi_name.txt"
        run:
            greeting = Path(input[0]).read_text().strip()
            Path(output[0]).write_text(greeting + ", World!")

Run it with:

    snakemake --cores 1 -s workflow3.smk

Section 4: Extra Info: Extra Directives
The directives we have used so far are enough to build simple workflows: input:, output:, and one execution directive such as run:, shell:, script:, or notebook:.
Real workflows usually need a bit more structure. They may need parameters, log files, benchmarks, software environments, containers, or human-readable messages. These extra directives do not change the basic idea of Snakemake, but they make workflows easier to debug, reproduce, and run on different systems.
The most useful distinction is:
- inputs and outputs are scientific or workflow results
- params store non-file settings
- logs explain what happened during execution
- benchmarks measure how expensive execution was
- software directives describe the runtime environment
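The split between params and logs can be sketched from the point of view of a Python script. In a file run through script:, these values are available as snakemake.params and snakemake.log. The sketch below uses a stand-in object so it can run without Snakemake; the function name keep_significant, the p-values, and the log path are all made up for illustration.

```python
from pathlib import Path
from types import SimpleNamespace
import tempfile


def keep_significant(p_values, threshold, log_path):
    """Keep p-values below `threshold` and record what happened in a log file."""
    log_file = Path(log_path)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    kept = [p for p in p_values if p < threshold]
    log_file.write_text(
        f"threshold={threshold}; kept {len(kept)} of {len(p_values)} values\n"
    )
    return kept


# Stand-in for the object Snakemake injects into `script:` files. The
# attribute names mirror `snakemake.params` and `snakemake.log`; the
# values are invented for this example.
fake = SimpleNamespace(
    params=SimpleNamespace(threshold=0.05),
    log=[str(Path(tempfile.mkdtemp()) / "logs" / "clean.log")],
)

kept = keep_significant([0.01, 0.2, 0.04], fake.params.threshold, fake.log[0])
print(kept)  # [0.01, 0.04]
```

The threshold is a setting rather than a file, so it belongs in params; the log line describes the run rather than being a result, so it goes to the log path instead of an output.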
Exercises:
Explore the rule directives below, adding them to the workflow to get them to work.
| Directive | Use |
|---|---|
| message: "Analyzing {wildcards.sample}" | Print a human-readable message when the rule runs |
| params: threshold=0.05 | Pass non-file parameters to a rule |
| log: "logs/clean.log" | Store log files or executed notebooks |
| benchmark: "benchmarks/clean.tsv" | Record runtime and resource-use statistics |
| conda: "envs/analysis.yaml" | Attach a rule-specific Conda environment |
| container: "my_image.sif" or container: "docker://python:3.12" | Attach a rule-specific container image |