Generalizing Workflows with Wildcards
Wildcards are Snakemake’s way of describing repeated patterns in filenames. A wildcard is a named placeholder inside a file path, such as {person} in data/greetings/{person}.txt. One rule can then create many related files, as long as Snakemake is given concrete files to aim for.
The important idea is that wildcards by themselves do not tell Snakemake what to make. Snakemake still needs final target files, usually listed in rule all. Once it sees those target files, it can match parts of each filename to wildcard values and run the same rule once for each needed output.
Section 1: Using a Wildcard in a Single Rule
A wildcard appears inside curly braces in an input: or output: path. If a rule has the output path data/greetings/{person}.txt, Snakemake can use that rule to create data/greetings/Ada.txt, data/greetings/Grace.txt, or any other filename that fits the same pattern.
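To make the matching step concrete, here is a minimal pure-Python sketch of how a path pattern can be compared against a requested filename. It is an illustration only, not Snakemake’s actual implementation: the helper names pattern_to_regex and match_wildcards are hypothetical, the sketch assumes each wildcard name appears once in the pattern, and it lets a wildcard match any non-empty text.

```python
import re

def pattern_to_regex(pattern):
    """Turn a path pattern with {name} placeholders into a compiled regex."""
    # Splitting on a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal piece of the path
        else:
            regex += f"(?P<{part}>.+)"      # wildcard: any non-empty text
    return re.compile(regex)

def match_wildcards(pattern, filename):
    """Return the wildcard values if filename fits pattern, else None."""
    m = pattern_to_regex(pattern).fullmatch(filename)
    return m.groupdict() if m else None

print(match_wildcards("data/greetings/{person}.txt", "data/greetings/Ada.txt"))
# {'person': 'Ada'}
print(match_wildcards("data/greetings/{person}.txt", "results/Ada.txt"))
# None
```

The first call succeeds because the requested file fits the pattern, so the rule could build it with person set to Ada. The second file does not fit, so this rule would not be a candidate for it.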
In this first section, we will use one wildcard in one rule. The rule will write a greeting file for one person at a time. rule all will request several concrete greeting files, and Snakemake will run the wildcard rule separately for each person.
Exercises
Preparation: Create a new Snakemake workflow file called workflow4.smk and use it for the exercises below. To run the whole workflow, use:
snakemake --cores 1 -s workflow4.smk

Example: Create a rule called write_greeting that can write one greeting file for any person. Then use rule all to request greeting files for Ada, Grace, and Linus.
Solution
rule all:
    input:
        "data/greetings/Ada.txt",
        "data/greetings/Grace.txt",
        "data/greetings/Linus.txt"

rule write_greeting:
    output:
        "data/greetings/{person}.txt"
    run:
        from pathlib import Path
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {wildcards.person}!\n")

Exercise: Run the workflow and inspect the files in data/greetings/. Then add one more person to rule all and run it again.
Solution
rule all:
    input:
        "data/greetings/Ada.txt",
        "data/greetings/Grace.txt",
        "data/greetings/Linus.txt",
        "data/greetings/Katherine.txt"

rule write_greeting:
    output:
        "data/greetings/{person}.txt"
    run:
        from pathlib import Path
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {wildcards.person}!\n")

Exercise: Run a dry run and look at how many jobs Snakemake plans. This is useful when you want to check the workflow before creating files.
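snakemake --cores 1 -s workflow4.smk -n

Solution
For every requested greeting file that does not exist yet, Snakemake plans one write_greeting job, plus the all rule. If all of the files have already been built, the dry run reports that there is nothing to be done. The same rule is reused, but each job gets a different value for wildcards.person.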
Section 2: Mapping Inputs and Outputs with Wildcards
Wildcards become especially useful when rules are connected. If one rule has output data/names/{person}.txt and another rule has input data/names/{person}.txt, Snakemake uses the same wildcard value for both paths in a single job.
For example, if the final target is results/greetings/Ada.txt, Snakemake can work backward and decide that it also needs data/names/Ada.txt. It does not need us to say “run the Ada version first”. The matching filenames describe the connection.
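The backward reasoning described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Snakemake is implemented: the RULES table and the match and plan helpers are hypothetical, each rule has at most one input, and the sketch ignores whether files already exist.

```python
import re

# Hypothetical rule table: rule name -> (input pattern or None, output pattern)
RULES = {
    "write_name": (None, "data/names/{person}.txt"),
    "write_greeting": ("data/names/{person}.txt", "results/greetings/{person}.txt"),
}

def match(pattern, filename):
    """Return wildcard values if filename fits pattern, else None."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None

def plan(target, jobs=None):
    """Return the jobs needed to build target, dependencies first."""
    jobs = [] if jobs is None else jobs
    for name, (input_pattern, output_pattern) in RULES.items():
        wildcards = match(output_pattern, target)
        if wildcards is None:
            continue  # this rule cannot produce the target
        if input_pattern is not None:
            # Fill the same wildcard values into the input path, then recurse.
            plan(input_pattern.format(**wildcards), jobs)
        jobs.append((name, wildcards))
        return jobs
    raise ValueError(f"no rule produces {target}")

for job in plan("results/greetings/Ada.txt"):
    print(job)
# ('write_name', {'person': 'Ada'})
# ('write_greeting', {'person': 'Ada'})
```

Only the final target is named, yet the plan contains both jobs in the right order, because the value Ada extracted from the output path is reused to construct the input path.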
Exercises
Preparation: Create a new Snakemake workflow file called workflow5.smk and use it for the exercises below. To run the workflow, use:
snakemake --cores 1 -s workflow5.smk

Exercise: Create two connected rules:
- write_name creates data/names/{person}.txt
- write_greeting reads data/names/{person}.txt and creates results/greetings/{person}.txt
Use rule all to request greeting files for Ada, Grace, and Linus.
Solution
rule all:
    input:
        "results/greetings/Ada.txt",
        "results/greetings/Grace.txt",
        "results/greetings/Linus.txt"

rule write_name:
    output:
        "data/names/{person}.txt"
    run:
        from pathlib import Path
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{wildcards.person}\n")

rule write_greeting:
    input:
        "data/names/{person}.txt"
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path
        name = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {name}!\n")

Exercise: Add a third rule called write_excited_greeting. It should read results/greetings/{person}.txt and create results/excited/{person}.txt. Change rule all so that the excited greeting files are the final outputs.
Solution
rule all:
    input:
        "results/excited/Ada.txt",
        "results/excited/Grace.txt",
        "results/excited/Linus.txt"

rule write_name:
    output:
        "data/names/{person}.txt"
    run:
        from pathlib import Path
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{wildcards.person}\n")

rule write_greeting:
    input:
        "data/names/{person}.txt"
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path
        name = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {name}!\n")

rule write_excited_greeting:
    input:
        "results/greetings/{person}.txt"
    output:
        "results/excited/{person}.txt"
    run:
        from pathlib import Path
        greeting = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(greeting + "!!\n")

Section 3: Generating Filenames with expand() and glob_wildcards()
Writing every final filename by hand works for three people, but it quickly becomes tedious. The expand() function is a convenient way to generate concrete filenames from a pattern and one or more lists of values.
expand() is usually used in rule all, because rule all needs concrete filenames. It does not define a wildcard rule by itself. It simply produces a plain list of filenames such as results/greetings/Ada.txt, results/greetings/Grace.txt, and results/greetings/Linus.txt.
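To show what kind of list this is, the following pure-Python sketch imitates the behavior described above with itertools.product. The function expand_sketch is a hypothetical stand-in written for this illustration, not Snakemake’s own expand(), and it only covers the simple case of combining every value with every other value.

```python
from itertools import product

def expand_sketch(pattern, **values):
    """Fill pattern with every combination of the given wildcard values."""
    names = list(values)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*(values[n] for n in names))
    ]

PEOPLE = ["Ada", "Grace", "Linus"]

targets = expand_sketch("results/greetings/{person}.txt", person=PEOPLE)
print(targets)
# ['results/greetings/Ada.txt', 'results/greetings/Grace.txt', 'results/greetings/Linus.txt']

# With two lists, every value of one is paired with every value of the other.
combos = expand_sketch(
    "results/greetings/{language}/{person}.txt",
    language=["english", "spanish"],
    person=PEOPLE,
)
print(len(combos))
# 6
```

The result is an ordinary list of strings with no wildcards left in it, which is exactly what rule all needs as input.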
Exercises
Preparation: Create a new Snakemake workflow file called workflow6.smk and use it for the exercises below.
Example: Rewrite the people-only greeting workflow so that rule all uses PEOPLE and expand() instead of a hand-written list of files.
Solution
PEOPLE = ["Ada", "Grace", "Linus"]

rule all:
    input:
        expand("results/greetings/{person}.txt", person=PEOPLE)

rule write_greeting:
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {wildcards.person}!\n")

Exercise: Add a second wildcard for language. Generate greeting files in paths like results/greetings/english/Ada.txt and results/greetings/spanish/Grace.txt. Use expand() to request all combinations of people and languages.
Solution
PEOPLE = ["Ada", "Grace", "Linus"]
LANGUAGES = ["english", "spanish", "german"]
GREETINGS = {
    "english": "Hello",
    "spanish": "Hola",
    "german": "Hallo",
}

rule all:
    input:
        expand(
            "results/greetings/{language}/{person}.txt",
            language=LANGUAGES,
            person=PEOPLE,
        )

rule write_greeting:
    output:
        "results/greetings/{language}/{person}.txt"
    run:
        from pathlib import Path
        greeting = GREETINGS[wildcards.language]
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{greeting}, {wildcards.person}!\n")

Exercise: Add another person and another language. Then run a dry run and check how many greeting jobs Snakemake plans.
snakemake --cores 1 -s workflow6.smk -n

Solution
PEOPLE = ["Ada", "Grace", "Linus", "Katherine"]
LANGUAGES = ["english", "spanish", "german", "french"]
GREETINGS = {
    "english": "Hello",
    "spanish": "Hola",
    "german": "Hallo",
    "french": "Bonjour",
}

rule all:
    input:
        expand(
            "results/greetings/{language}/{person}.txt",
            language=LANGUAGES,
            person=PEOPLE,
        )

rule write_greeting:
    output:
        "results/greetings/{language}/{person}.txt"
    run:
        from pathlib import Path
        greeting = GREETINGS[wildcards.language]
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"{greeting}, {wildcards.person}!\n")

With 4 people and 4 languages, expand() creates 16 target filenames. If those files do not already exist, Snakemake should plan 16 write_greeting jobs, plus the all rule.
Exercise: Snakemake’s glob_wildcards() function scans files that already exist and extracts the wildcard values that match a pattern. For example, PATIENT_IDS, SAMPLES = glob_wildcards("data/{patient_id}_{sample}.fastq.gz") collects the patient_id and sample values of every matching file; the lists come back in the order in which the wildcards appear in the pattern. This reduces the need to hard-code filenames in your Snakemake workflows. Let’s try it out: use glob_wildcards() at the top of the file to discover people from files that already exist, so that when a new file like data/names/Rocky.txt is created, the next run of Snakemake automatically includes it in the workflow.
Solution
PEOPLE, = glob_wildcards("data/names/{person}.txt")

rule all:
    input:
        expand("results/greetings/{person}.txt", person=PEOPLE)

rule write_greeting:
    input:
        "data/names/{person}.txt"
    output:
        "results/greetings/{person}.txt"
    run:
        from pathlib import Path
        name = Path(input[0]).read_text().strip()
        path = Path(output[0])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"Hello, {name}!\n")
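
The trailing comma in PEOPLE, = ... unpacks a result that holds one list per wildcard. The pure-Python sketch below imitates that behavior on a fixed list of paths instead of the real file system. It is an illustration only: glob_wildcards_sketch and the example file lists are hypothetical, and the sketch assumes each wildcard name appears once in the pattern.

```python
import re
from collections import namedtuple

def glob_wildcards_sketch(pattern, existing_files):
    """Collect the wildcard values of every path that fits pattern."""
    parts = re.split(r"\{(\w+)\}", pattern)
    names = parts[1::2]  # wildcard names, in the order they appear
    regex = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+)"
        for i, p in enumerate(parts)
    )
    Found = namedtuple("Found", names)
    found = Found(*([] for _ in names))  # one empty list per wildcard
    for path in existing_files:
        m = re.fullmatch(regex, path)
        if m:
            for name in names:
                getattr(found, name).append(m.group(name))
    return found

# Hypothetical directory contents; README.md does not fit the pattern.
files = ["data/names/Ada.txt", "data/names/Rocky.txt", "data/README.md"]
PEOPLE, = glob_wildcards_sketch("data/names/{person}.txt", files)
print(PEOPLE)
# ['Ada', 'Rocky']

# With two wildcards, the lists follow the order of the wildcards in the pattern.
PATIENT_IDS, SAMPLES = glob_wildcards_sketch(
    "data/{patient_id}_{sample}.fastq.gz",
    ["data/p1_s1.fastq.gz", "data/p2_s7.fastq.gz"],
)
print(PATIENT_IDS, SAMPLES)
# ['p1', 'p2'] ['s1', 's7']
```

Because the values are read from the files that exist when the workflow starts, adding data/names/Rocky.txt changes the list the next time the workflow is run, without editing the workflow file.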