iBOTS Learn: Static Workflows with Snakemake

Software Delivery for Scientific Python Projects

Static Workflows with Snakemake

Author

Dr. Nicholas A. Del Grosso

Snakemake is a workflow tool for turning a collection of scripts, commands, notebooks, and expected files into a reproducible pipeline. Instead of manually remembering which script to run first, which files it produces, and which later steps depend on those files, we describe those relationships explicitly in a workflow file, named either Snakefile or anything with a .smk extension.

The central idea is:

rules describe how files are created
outputs define what Snakemake can make
inputs define what each rule depends on

Snakemake uses those relationships to build a DAG (Directed Acyclic Graph) that plans which code needs to run on which data. While executing the code, it checks that each step actually produces the files it declared. This is especially useful in biology and neuroscience projects, where analyses often grow from a few scripts into a fragile chain of preprocessing, statistics, plots, reports, and manual reruns. Snakemake helps make that chain visible, testable, and easier to rerun.

Section 1: Running Code with `run`, `shell`, `script`, and `notebook`

A Snakemake rule needs some way to execute work. In this section, we start with the simplest possible rules: rules that just run code. These examples do not yet track output files or dependencies; the goal is only to see the different ways Snakemake can call code:

run: executes Python code directly inside the Snakefile.
shell: executes a shell command.
script: runs a script file. Supported languages include Python, Xonsh, Hy, R, R Markdown, Julia, Rust, and Bash.
notebook: runs a Jupyter notebook as part of the workflow. It is very useful to also include log: notebook="logs/my_notebook.ipynb" as a place to store the executed notebook.

For small examples, run: is convenient because everything is visible in one file. For real workflows, script: and notebook: are often cleaner because they keep analysis code separate from workflow structure. They also make it possible to run rules in parallel and in different computational environments.

Exercises

Preparation: Create a Snakemake workflow file called workflow.smk and use it to create the requested rules in the exercises below. To run the rule, the following command is helpful:

snakemake --cores 1 -s workflow.smk <rule_name>

Example: use the run directive to create a say_hello rule that prints “Hello!”.

rule say_hello:
    run:
        print("Hello!")

Exercise: use the run directive to create a say_bye rule that prints “Goodbye!”.

Solution

rule say_bye:
    run:
        print("Goodbye!")

Exercise: Create a Python script called smoke_test.py in a scripts/ directory that prints the message “successfully run”, then run it from a rule using the script directive.

Solution

Create scripts/smoke_test.py:

print("successfully run")

Then call it from the workflow:

rule smoke_test_script:
    script:
        "scripts/smoke_test.py"

Exercise: Create a Jupyter notebook that prints “I ran!” in a single cell, and run it from a rule using the notebook directive. Use the log: notebook= directive to store the executed notebook (see below for syntax). Note: you may need to install papermill for this to work correctly.

Solution

rule notebook_example:
    log:
        notebook="logs/notebooks/notebook.ipynb"
    notebook:
        "notebook.ipynb"

Section 2: Tracking Code Outputs

Running code is not enough for a reproducible workflow. Snakemake becomes useful when rules declare the files they create. The output: directive tells Snakemake what a rule promises to produce.

This gives Snakemake a way to check whether the rule succeeded. If a rule finishes but the declared output file does not exist, Snakemake treats that as an error. This is one of the main benefits of using a workflow manager instead of manually running scripts: the workflow can detect when a step did not actually produce what it claimed to produce.
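
The check described above can be imitated in a few lines of plain Python. This is only a minimal sketch of the idea, not Snakemake's implementation: `run_rule` and `MissingOutputError` are hypothetical names invented for illustration, and the files are written to a temporary directory.

```python
from pathlib import Path
import tempfile


class MissingOutputError(Exception):
    """Toy stand-in for the error a workflow manager raises (hypothetical name)."""


def run_rule(action, outputs):
    """Run `action`, then verify that every declared output file exists."""
    action()
    missing = [str(path) for path in outputs if not Path(path).exists()]
    if missing:
        raise MissingOutputError(f"Missing declared outputs: {missing}")
    return list(outputs)


workdir = Path(tempfile.mkdtemp())
promised = workdir / "hello.txt"

# The action writes the promised file, so the check passes.
run_rule(lambda: promised.write_text("Hi!"), [promised])

# The action writes a different file than it promised, so the check fails.
try:
    run_rule(
        lambda: (workdir / "other.txt").write_text("Hi!"),
        [workdir / "hello2.txt"],
    )
    check_failed = False
except MissingOutputError:
    check_failed = True

print(check_failed)
```

The important design point is that the check happens after the code runs and is independent of it: the rule's code cannot claim success unless the promised files are really there.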

In this section, we will practice:

declaring one or more output files;
referring to output paths inside run: blocks;
passing output paths to external scripts.

The key habit is: do not hard-code output paths in two places if Snakemake already knows them. Prefer using output[0], output["name"], or named output attributes where possible.

Exercises

Preparation: Create a new Snakemake workflow file called workflow2.smk and use it to create the requested rules in the exercises below.

Note: Snakemake skips a rule if it detects that the rule's output has already been created. You can force it to rerun all rules with:

snakemake --cores 1 -s workflow2.smk --forceall <rule_name>
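
To build intuition for the skip-or-rerun decision, here is a deliberately simplified Python sketch. It assumes a rule reruns only when forced, when an output is missing, or when an input is newer than an output; Snakemake's real rerun logic considers more than this, and `needs_run` is a hypothetical helper, not part of Snakemake.

```python
import tempfile
from pathlib import Path


def needs_run(inputs, outputs, force=False):
    """Decide whether a toy rule has to run (simplified assumption, see text)."""
    if force or not outputs:
        return True  # forced, or nothing to check against
    if any(not Path(p).exists() for p in outputs):
        return True  # at least one declared output is missing
    # Rerun if any input was modified after the oldest output was written.
    newest_input = max(
        (Path(p).stat().st_mtime for p in inputs), default=float("-inf")
    )
    oldest_output = min(Path(p).stat().st_mtime for p in outputs)
    return newest_input > oldest_output


workdir = Path(tempfile.mkdtemp())
out = workdir / "smoke.txt"

before = needs_run([], [out])              # output does not exist yet
out.write_text("The code was run successfully.")
after = needs_run([], [out])               # output exists, nothing changed
forced = needs_run([], [out], force=True)  # comparable in spirit to --forceall

print(before, after, forced)
```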

Example: Create a rule called smoke_test that writes a data/smoke.txt file that contains the text “The code was run successfully.”

rule smoke_test:
    output:
        'data/smoke.txt'
    run:
        from pathlib import Path
        path = Path('data/smoke.txt')
        path.write_text('The code was run successfully.')

Exercise: Create a rule called write_greeting that writes a data/hello.txt file that contains the text “Hi!”

Solution

rule write_greeting:
    output:
        'data/hello.txt'
    run:
        from pathlib import Path
        path = Path('data/hello.txt')
        path.write_text('Hi!')

Exercise: What happens if the file that Snakemake expects isn’t created by the code? In write_greeting, try changing the filename in either the output: directive or the run: code so that they no longer match, and check that Snakemake raises an error.

Solution

Snakemake raises a MissingOutputException because the rule finished but the declared output file was not created. For example, this rule writes data/hello.txt while promising data/hello2.txt:

rule write_greeting:
    output:
        "data/hello2.txt"
    run:
        from pathlib import Path
        path = Path("data/hello.txt")
        path.write_text("Hi!")

Exercise: A rule can have multiple outputs, simply by separating them with a comma (,). Create a rule called write_greetings that writes both a data/hello.txt file and a data/bye.txt file.

Solution

rule write_greetings:
    output:
        'data/hello.txt',
        'data/bye.txt'
    run:
        from pathlib import Path
        Path('data/hello.txt').write_text('Hi!')
        Path('data/bye.txt').write_text('Bye!')

Exercise: Writing the same filename in two locations makes a workflow harder to maintain. Modify write_greetings so that the run: code refers to the output files through the Snakemake-provided output variable: output[0] and output[1].

Solution

rule write_greetings:
    output:
        "data/hello.txt",
        "data/bye.txt"
    run:
        from pathlib import Path
        Path(output[0]).write_text("Hi!")
        Path(output[1]).write_text("Bye!")

Exercise: Outputs can also be referenced by keyword. Modify write_greetings so that the run: code refers to the output files by name through the Snakemake-provided output variable: output['hi'] and output['bye'].

Syntax looks like this:

rule example:
    output:
        a = 'file1.txt',
        b = 'file2.txt'
    run:
        print(output['a'])
        print(output['b'])

Solution

rule write_greetings:
    output:
        hi = "data/hello.txt",
        bye = "data/bye.txt"
    run:
        from pathlib import Path
        Path(output["hi"]).write_text("Hi!")
        Path(output["bye"]).write_text("Bye!")

Exercise: The shell: directive is useful for separating scripts from workflow management, and you can pass Snakemake variables into its command string. Make a write_greetings_shell rule that runs the following code as a command-line script:

from pathlib import Path
import sys
Path(sys.argv[1]).write_text('Hi!')
Path(sys.argv[2]).write_text('Bye!')

To run a script through the shell from Snakemake, the syntax is: python my_script.py {output[0]} {output[1]}
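
The curly-brace placeholders behave much like Python's own string formatting. The sketch below uses plain `str.format` as an analogy only; it is an assumption that this mirrors Snakemake's handling closely enough for intuition, and Snakemake performs the substitution itself before handing the command to the shell.

```python
# Stand-in for the rule's declared outputs (Snakemake provides `output` itself).
outputs = ["data/hello.txt", "data/bye.txt"]

# The same template string that would appear under shell: in a rule.
template = "python my_script.py {output[0]} {output[1]}"

# Plain Python formatting fills each placeholder by index.
command = template.format(output=outputs)
print(command)
```

The script then receives the two paths as sys.argv[1] and sys.argv[2], in the order they appear in the command.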

Solution

Create write_greetings.py:

from pathlib import Path
import sys

Path(sys.argv[1]).write_text("Hi!")
Path(sys.argv[2]).write_text("Bye!")

Then call it from Snakemake:

rule write_greetings_shell:
    output:
        "data/hello.txt",
        "data/bye.txt"
    shell:
        "python write_greetings.py {output[0]} {output[1]}"

Section 3: Mapping Outputs to Inputs

So far, each rule has been isolated. Real workflows are chains: one rule creates a file, and another rule uses that file as input. Snakemake connects rules by matching input: files to output: files.

This is the core workflow model:

rule A creates a file
rule B needs that file
therefore: rule A must run before rule B

Snakemake uses these input/output relationships to build a directed acyclic graph, or DAG. The DAG is the plan of which jobs need to run and in which order. So, rather than telling Snakemake “run step 1, then step 2,” we tell it “I want this final file.” Snakemake then works backward to determine which rules are needed.
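
The "work backward from the final file" idea can be sketched in plain Python. This is a toy illustration under stated assumptions, not how Snakemake is implemented: the rule names, file paths, and the `plan` function are all hypothetical, and the sketch does no cycle detection.

```python
# Hypothetical toy rules: each maps a rule name to its input and output files.
RULES = {
    "plot": {"input": ["results/stats.csv"], "output": ["figures/plot.png"]},
    "stats": {"input": ["data/clean.csv"], "output": ["results/stats.csv"]},
    "preprocess": {"input": ["data/raw.csv"], "output": ["data/clean.csv"]},
}


def plan(target, rules):
    """Return rule names in the order needed to produce `target`."""
    # Index every output file by the rule that creates it.
    producers = {
        out: name for name, rule in rules.items() for out in rule["output"]
    }
    order = []

    def visit(path):
        name = producers.get(path)
        if name is None:
            return  # no rule makes this file; assume it already exists on disk
        if name in order:
            return  # this rule is already scheduled
        for dependency in rules[name]["input"]:
            visit(dependency)  # schedule upstream rules first
        order.append(name)

    visit(target)
    return order


execution_order = plan("figures/plot.png", RULES)
print(execution_order)  # ['preprocess', 'stats', 'plot']
```

Note that the order in which the rules are written does not matter: only the requested target and the input/output matches determine what runs and when.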

In this section, we:

move from individual rules to connected rules.
introduce the common rule all pattern. rule all usually sits at the top of a Snakefile and declares the final files we want the workflow to produce.

The --dag option prints the DAG in Graphviz DOT format; piping it through dot -Tpdf > dag.pdf renders it as a PDF.

Exercises

Preparation: Create a new Snakemake workflow file called workflow3.smk and use it to create the requested rules in the exercises below. If you’d like to view the DAG that Snakemake is creating, the following command is helpful (requires Graphviz to be installed):

snakemake --cores 1 -s workflow3.smk --dag | dot -Tpdf > dag.pdf

Exercise: Create two separate rules, using input and output directives:

say_hi, which creates a file called data/hi.txt that contains just the word “Hello”
add_name, which reads the data/hi.txt file and creates a new file, data/hi_name.txt, containing the greeting followed by “, World!”

Solution

from pathlib import Path

rule say_hi:
    output:
        "data/hi.txt"
    run:
        Path(output[0]).write_text("Hello")


rule add_name:
    input:
        "data/hi.txt"
    output:
        "data/hi_name.txt"
    run:
        greeting = Path(input[0]).read_text().strip()
        Path(output[0]).write_text(greeting + ", World!")

Exercise: Unless a rule is specified on the command line, Snakemake runs the first rule listed in the workflow file. So a common pattern for generating all the files needed in a workflow is to create a rule called all that contains only an input: directive listing the files the workflow should produce. Then no rule name is needed on the command line!

Make an all rule that asks for the data/hi_name.txt file to be created. Then run the workflow without specifying a rule name, and confirm that it ran correctly.

Solution

from pathlib import Path

rule all:
    input:
        "data/hi_name.txt"


rule say_hi:
    output:
        "data/hi.txt"
    run:
        Path(output[0]).write_text("Hello")


rule add_name:
    input:
        "data/hi.txt"
    output:
        "data/hi_name.txt"
    run:
        greeting = Path(input[0]).read_text().strip()
        Path(output[0]).write_text(greeting + ", World!")

Run it with:

snakemake --cores 1 -s workflow3.smk

Section 4: Extra Info: Extra Directives

The directives we have used so far are enough to build simple workflows: input:, output:, and one execution directive such as run:, shell:, script:, or notebook:.

Real workflows usually need a bit more structure. They may need parameters, log files, benchmarks, software environments, containers, or human-readable messages. These extra directives do not change the basic idea of Snakemake, but they make workflows easier to debug, reproduce, and run on different systems.

The most useful distinction is:

inputs and outputs are scientific or workflow results
params store non-file settings
logs explain what happened during execution
benchmarks measure how expensive execution was
software directives describe the runtime environment

Exercises:

Explore the rule directives below, adding them to the workflow to get them to work.

| Directive | Use |
| --- | --- |
| `message: "Analyzing {wildcards.sample}"` | Print a human-readable message when the rule runs |
| `params: threshold=0.05` | Pass non-file parameters to a rule |
| `log: "logs/clean.log"` | Store log files or executed notebooks |
| `benchmark: "benchmarks/clean.tsv"` | Record runtime and resource-use statistics |
| `conda: "envs/analysis.yaml"` | Attach a rule-specific Conda environment |
| `container: "my_image.sif"` or `container: "docker://python:3.12"` | Attach a rule-specific container image |
