iBOTS Learn: Generalizing Workflows with Wildcards

Software Delivery for Scientific Python Projects

Generalizing Workflows with Wildcards

Author

Dr. Nicholas A. Del Grosso

Many workflows repeat the same step for a set of similarly named files. Wildcards are Snakemake’s way of describing such repeated patterns. A wildcard is a named placeholder inside a file path, such as {person} in data/greetings/{person}.txt. One rule can then create many related files, as long as Snakemake is given concrete files to aim for.

The important idea is: wildcards do not tell Snakemake what to make by themselves. Snakemake still needs final target files, usually listed in rule all. Once it sees those target files, it can match parts of the filename to wildcard values and run the same rule once for each needed output.
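To make the matching step concrete, here is a minimal plain-Python sketch of how a wildcard pattern can be compared against a concrete target filename. It is not Snakemake's actual implementation. The helper names pattern_to_regex and match_wildcards are made up for this illustration, and the sketch assumes each wildcard name appears only once in the pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'data/greetings/{person}.txt' into a regex with one named group per wildcard."""
    # re.split with a capture group alternates: literal text, wildcard name, literal text, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal path text must match exactly
        else:
            regex += f"(?P<{part}>.+)"     # a wildcard matches one or more characters
    return re.compile(regex)


def match_wildcards(pattern, target):
    """Return the wildcard values if target fits pattern, otherwise None."""
    m = pattern_to_regex(pattern).fullmatch(target)
    return m.groupdict() if m else None


pattern = "data/greetings/{person}.txt"
for target in ["data/greetings/Ada.txt", "data/greetings/Grace.txt", "results/summary.csv"]:
    print(target, "->", match_wildcards(pattern, target))
```

The first two targets fit the pattern and yield a value for person. The third does not fit, so this rule could not be used to create it.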

Section 1: Using a Wildcard in a Single Rule

A wildcard appears inside curly braces in an input: or output: path. If a rule has the output path data/greetings/{person}.txt, Snakemake can use that rule to create data/greetings/Ada.txt, data/greetings/Grace.txt, or any other filename that fits the same pattern.

In this first section, we will use one wildcard in one rule. The rule will write a greeting file for one person at a time. rule all will request several concrete greeting files, and Snakemake will run the wildcard rule separately for each person.

Exercises

Preparation: Create a new Snakemake workflow file called workflow4.smk and use it for the exercises below. To run the whole workflow, use:

snakemake --cores 1 -s workflow4.smk

Example: Create a rule called write_greeting that can write one greeting file for any person. Then use rule all to request greeting files for Ada, Grace, and Linus.

Solution

rule all:
    input:
        "data/greetings/Ada.txt",
        "data/greetings/Grace.txt",
        "data/greetings/Linus.txt"


rule write_greeting:
    output:
        "data/greetings/{person}.txt"
    run:
        from pathlib import Path

        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {wildcards.person}!\n")

Exercise: Run the workflow and inspect the files in data/greetings/. Then add one more person to rule all and run it again.

Solution

rule all:
    input:
        "data/greetings/Ada.txt",
        "data/greetings/Grace.txt",
        "data/greetings/Linus.txt",
        "data/greetings/Katherine.txt"


rule write_greeting:
    output:
        "data/greetings/{person}.txt"
    run:
        from pathlib import Path

        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {wildcards.person}!\n")

Exercise: Run a dry run and look at how many jobs Snakemake plans. This is useful when you want to check the workflow before creating files.

snakemake --cores 1 -s workflow4.smk -n

Solution

If the output files do not already exist, Snakemake should plan one write_greeting job for each requested person, plus the all rule. The same rule is reused, but each job gets a different value for wildcards.person.

Section 2: Mapping Inputs and Outputs with Wildcards

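Wildcards become especially useful when rules are connected. Within one job, a wildcard has the same value in every input and output path of its rule. If one rule has output data/names/{person}.txt and another rule has input data/names/{person}.txt, then the concrete input file that a job of the second rule needs (for example, data/names/Ada.txt) matches the output pattern of the first rule, and Snakemake plans a job of the first rule with person set to Ada.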

For example, if the final target is results/greetings/Ada.txt, Snakemake can work backward and decide that it also needs data/names/Ada.txt. It does not need us to say “run the Ada version first”. The matching filenames describe the connection.
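The backward reasoning described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Snakemake is implemented: the RULES list, match_output, and plan are hypothetical names, each rule has at most one input, and the sketch ignores whether files already exist.

```python
import re

# Hypothetical stand-ins for the two rules: (rule name, input pattern, output pattern).
RULES = [
    ("write_name", None, "data/names/{person}.txt"),
    ("write_greeting", "data/names/{person}.txt", "results/greetings/{person}.txt"),
]


def match_output(pattern, target):
    """Return wildcard values if target fits the output pattern, otherwise None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def plan(target):
    """Return the jobs needed to create target, upstream jobs first."""
    for name, input_pattern, output_pattern in RULES:
        wildcards = match_output(output_pattern, target)
        if wildcards is None:
            continue  # this rule cannot produce the target
        jobs = []
        if input_pattern is not None:
            # Fill the wildcard values into the input pattern, then plan that file too.
            jobs += plan(input_pattern.format(**wildcards))
        jobs.append((name, wildcards))
        return jobs
    raise ValueError(f"No rule produces {target}")


print(plan("results/greetings/Ada.txt"))
```

Requesting results/greetings/Ada.txt yields a write_name job followed by a write_greeting job, both with person set to Ada, even though nothing in the rules mentions Ada explicitly.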

Exercises

Preparation: Create a new Snakemake workflow file called workflow5.smk and use it for the exercises below. To run the workflow, use:

snakemake --cores 1 -s workflow5.smk

Exercise: Create two connected rules:

write_name creates data/names/{person}.txt
write_greeting reads data/names/{person}.txt and creates results/greetings/{person}.txt

Use rule all to request greeting files for Ada, Grace, and Linus.

Solution

rule all:
    input:
        "results/greetings/Ada.txt",
        "results/greetings/Grace.txt",
        "results/greetings/Linus.txt"


rule write_name:
    output:
        "data/names/{person}.txt"
    run:
        from pathlib import Path

        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{wildcards.person}\n")


rule write_greeting:
    input:
        "data/names/{person}.txt"
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path

        name = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {name}!\n")

Exercise: Add a third rule called write_excited_greeting. It should read results/greetings/{person}.txt and create results/excited/{person}.txt. Change rule all so that the excited greeting files are the final outputs.

Solution

rule all:
    input:
        "results/excited/Ada.txt",
        "results/excited/Grace.txt",
        "results/excited/Linus.txt"


rule write_name:
    output:
        "data/names/{person}.txt"
    run:
        from pathlib import Path

        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{wildcards.person}\n")


rule write_greeting:
    input:
        "data/names/{person}.txt"
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path

        name = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {name}!\n")


rule write_excited_greeting:
    input:
        "results/greetings/{person}.txt"
    output:
        "results/excited/{person}.txt"
    run:
        from pathlib import Path

        greeting = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(greeting + "!!\n")

Section 3: Generating Filenames with expand() and glob_wildcards()

Writing every final filename by hand works for three people, but it quickly becomes tedious. The expand() function is a convenient way to generate concrete filenames from a pattern and one or more lists of values.

expand() is usually used in rule all, because rule all needs concrete filenames. It does not define a wildcard rule. It simply produces a list of concrete filenames, such as results/greetings/Ada.txt, results/greetings/Grace.txt, and results/greetings/Linus.txt.
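For a single list of values, the result of expand() can be pictured as an ordinary list comprehension. The sketch below uses only plain Python and the standard str.format method. It is meant to show the kind of list that rule all receives, not how Snakemake builds it internally.

```python
PEOPLE = ["Ada", "Grace", "Linus"]

# Plain-Python picture of expand("results/greetings/{person}.txt", person=PEOPLE):
# fill each value into the pattern to get one concrete filename per person.
targets = ["results/greetings/{person}.txt".format(person=p) for p in PEOPLE]

print(targets)
# ['results/greetings/Ada.txt', 'results/greetings/Grace.txt', 'results/greetings/Linus.txt']
```

Because the result is just a list of strings, no wildcards remain in it. The wildcard rule is still needed to create each of the listed files.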

Exercises

Preparation: Create a new Snakemake workflow file called workflow6.smk and use it for the exercises below.

Example: Rewrite the people-only greeting workflow so that rule all uses PEOPLE and expand() instead of a hand-written list of files.

Solution

PEOPLE = ["Ada", "Grace", "Linus"]


rule all:
    input:
        expand("results/greetings/{person}.txt", person=PEOPLE)


rule write_greeting:
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path

        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {wildcards.person}!\n")

Exercise: Add a second wildcard for language. Generate greeting files in paths like results/greetings/english/Ada.txt and results/greetings/spanish/Grace.txt. Use expand() to request all combinations of people and languages.

Solution

PEOPLE = ["Ada", "Grace", "Linus"]
LANGUAGES = ["english", "spanish", "german"]

GREETINGS = {
    "english": "Hello",
    "spanish": "Hola",
    "german": "Hallo",
}


rule all:
    input:
        expand(
            "results/greetings/{language}/{person}.txt",
            language=LANGUAGES,
            person=PEOPLE,
        )


rule write_greeting:
    output:
        "results/greetings/{language}/{person}.txt"
    run:
        from pathlib import Path

        greeting = GREETINGS[wildcards.language]
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{greeting}, {wildcards.person}!\n")

Exercise: Add another person and another language. Then run a dry run and check how many greeting jobs Snakemake plans.

snakemake --cores 1 -s workflow6.smk -n

Solution

PEOPLE = ["Ada", "Grace", "Linus", "Katherine"]
LANGUAGES = ["english", "spanish", "german", "french"]

GREETINGS = {
    "english": "Hello",
    "spanish": "Hola",
    "german": "Hallo",
    "french": "Bonjour",
}


rule all:
    input:
        expand(
            "results/greetings/{language}/{person}.txt",
            language=LANGUAGES,
            person=PEOPLE,
        )


rule write_greeting:
    output:
        "results/greetings/{language}/{person}.txt"
    run:
        from pathlib import Path

        greeting = GREETINGS[wildcards.language]
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{greeting}, {wildcards.person}!\n")

With 4 people and 4 languages, expand() creates 16 target filenames. If those files do not already exist, Snakemake should plan 16 write_greeting jobs, plus the all rule.
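The count can be checked with a few lines of plain Python. This sketch uses itertools.product to form every combination of language and person, which is the same all-combinations behaviour described above for expand() with two lists.

```python
from itertools import product

PEOPLE = ["Ada", "Grace", "Linus", "Katherine"]
LANGUAGES = ["english", "spanish", "german", "french"]

# One target per (language, person) combination.
targets = [
    f"results/greetings/{language}/{person}.txt"
    for language, person in product(LANGUAGES, PEOPLE)
]

print(len(targets))       # 16 target files: 4 languages x 4 people
print(len(set(targets)))  # 16, so every filename is unique
print(len(targets) + 1)   # 17 jobs in total once the all rule is counted
```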

Exercise: Snakemake’s glob_wildcards() function searches files that already exist and collects the wildcard values that match a path pattern. For example, PATIENT_IDS, SAMPLES = glob_wildcards("data/{patient_id}_{sample}.fastq.gz") finds the patient_id and sample values of every file that matches the pattern; the values are returned in the order in which the wildcards appear in the pattern. This reduces the need to hard-code filenames in your Snakemake workflows.

Let’s try it out: use glob_wildcards() at the top of the file to discover people from files that already exist, so that when a new file like data/names/Rocky.txt is created, the next Snakemake run automatically includes it in the workflow.

Solution

PEOPLE, = glob_wildcards("data/names/{person}.txt")


rule all:
    input:
        expand("results/greetings/{person}.txt", person=PEOPLE)


rule write_greeting:
    input:
        "data/names/{person}.txt"
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path

        name = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {name}!\n")
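To see why a newly created file is picked up on the next run, here is a rough plain-Python imitation of the discovery step. It is a sketch only: discover_people is a hypothetical helper, it handles just the data/names/{person}.txt pattern, and it sorts the names for stable output. The files are created in a temporary directory so that nothing in your project is touched.

```python
import re
import tempfile
from pathlib import Path


def discover_people(root):
    """Collect the {person} part of every file matching data/names/{person}.txt under root."""
    pattern = re.compile(r"(?P<person>.+)\.txt")
    names_dir = root / "data" / "names"
    return sorted(
        m.group("person")
        for f in names_dir.glob("*.txt")
        if (m := pattern.fullmatch(f.name))
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    names_dir = root / "data" / "names"
    names_dir.mkdir(parents=True)
    for person in ["Ada", "Grace"]:
        (names_dir / f"{person}.txt").write_text(f"{person}\n")

    before = discover_people(root)

    # A new input file appears; the next discovery includes it without any code change.
    (names_dir / "Rocky.txt").write_text("Rocky\n")
    after = discover_people(root)

print(before)  # ['Ada', 'Grace']
print(after)   # ['Ada', 'Grace', 'Rocky']
```

The list of people is computed from the files that exist at the moment the discovery runs, so the input files in data/names/ have to be present before the workflow starts.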
