Data Pipelines


Here, we focus on making data processing more transparent, reliable, and easier to return to. Research workflows often begin as small scripts or manual steps, but they can quickly become hard to inspect, repeat, or share. By the end of the session, we will have practised the habits that make data analysis more reproducible: describing what should exist, letting the workflow decide what needs to run, and reducing hard-coded repetition as projects grow.

First, we work with common metadata formats such as JSON, YAML, and XML, which offer flexible ways of representing information that both people and machines can read. This matters because good metadata gives our data context: what was recorded, how it was produced, and how later analysis steps should interpret it.
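
As a small illustration of metadata that is readable by both people and machines, the sketch below writes a record to JSON text and parses it back using Python's standard library. The field names and values are hypothetical and chosen only for illustration; they are not a schema used in this session.

```python
import json

# Hypothetical metadata for one recording; every field name and value here
# is an illustrative assumption, not a schema from the session material.
metadata = {
    "subject": "subject_01",
    "recorded_on": "2024-03-15",
    "sampling_rate_hz": 1000,
    "produced_by": "preprocess.py",
    "channels": ["ch1", "ch2"],
}

# Serialise to indented text so that a person can inspect it easily.
text = json.dumps(metadata, indent=2, sort_keys=True)

# Parse the text back; values return as plain Python types (str, int, list).
restored = json.loads(text)

print(text)
```

The same record could be expressed in YAML or XML; JSON is used here only because it needs no third-party package.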

We then move from storing structured information to organising whole analysis workflows with Snakemake. The goal is not just to automate code, but to make the relationships between inputs, outputs, scripts, notebooks, and final results explicit. We start with simple rules and file outputs, then build towards connected workflows, target files, wildcards, expand(), and glob_wildcards().
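
To give a feel for what wildcard-based helpers do, here is a minimal pure-Python sketch of the two ideas: building a list of target filenames from a pattern, and inferring wildcard values from existing filenames. These are simplified stand-ins written for illustration, not Snakemake's own implementation. The function names `expand_sketch` and `infer_wildcards_sketch`, and the example paths, are hypothetical. Unlike glob_wildcards(), the second function takes a list of paths rather than reading the file system.

```python
import re
from itertools import product
from string import Formatter


def expand_sketch(pattern, **wildcards):
    """Fill a pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


def infer_wildcards_sketch(pattern, paths):
    """Return the wildcard values of each path that matches the pattern."""
    # Turn "data/{sample}.csv" into a regex with one named group per wildcard.
    regex = "".join(
        re.escape(literal) + (f"(?P<{field}>[^/]+)" if field else "")
        for literal, field, _spec, _conv in Formatter().parse(pattern)
    )
    matches = (re.fullmatch(regex, path) for path in paths)
    return [m.groupdict() for m in matches if m]


# Hypothetical file names, used only to demonstrate the two helpers.
targets = expand_sketch(
    "results/{sample}_{kind}.csv", sample=["a", "b"], kind=["raw", "clean"]
)
found = infer_wildcards_sketch(
    "data/{sample}.csv", ["data/a.csv", "data/b.csv", "data/notes.txt"]
)

print(targets)
print(found)
```

The point of both helpers is the same: the list of samples is stated or discovered once, so adding a new input file does not require editing every rule by hand.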