iBOTS Learn: Data Pipelines

Data Pipelines

Here, we focus on making data processing work more transparent, reliable, and easier to return to. Research workflows often begin as small scripts or manual steps, but they can quickly become hard to inspect, repeat, or share. By the end of the session, we will have practised the habits that make data analysis more reproducible: describing what should exist, letting the workflow decide what needs to run, and reducing hard-coded repetition as projects grow.

First, we work with common metadata formats such as JSON, YAML, and XML, which offer flexible ways of representing information that both people and machines can read. This matters because good metadata gives our data context: what was recorded, how it was produced, and how later analysis steps should interpret it.
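As a small illustration of the idea, the sketch below round-trips a metadata record through JSON using Python's standard `json` module. The record itself (subject name, date, sampling rate, channel labels) is invented for the example and is not taken from the course material.

```python
import json

# Hypothetical metadata for one recording session; every field is illustrative.
metadata = {
    "subject": "mouse_01",
    "session_date": "2024-03-15",
    "sampling_rate_hz": 30000,
    "channels": ["ch1", "ch2", "ch3"],
}

# Serialize: Python dict -> JSON text that could be written to a file.
text = json.dumps(metadata, indent=2)

# Deserialize: JSON text -> Python dict.
restored = json.loads(text)

print(text)
print(restored == metadata)
```

JSON is used here only because it ships with Python; YAML and XML follow the same serialize/deserialize pattern with their own libraries.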

We then move from storing structured information to organising whole analysis workflows with Snakemake. The goal is not just to automate code, but to make the relationships between inputs, outputs, scripts, notebooks, and final results explicit. We start with simple rules and file outputs, then build towards connected workflows, target files, wildcards, expand(), and glob_wildcards().
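To give a feel for what `expand()` does before meeting it in a Snakefile, here is a plain-Python sketch of the idea. It is not Snakemake's implementation: `expand_sketch`, the filename pattern, and the subject and condition values are all hypothetical. It only mimics the default behaviour of producing one filename per combination of wildcard values.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


# Hypothetical target files: one result per subject and condition.
targets = expand_sketch(
    "results/{subject}_{condition}.csv",
    subject=["s01", "s02"],
    condition=["rest", "task"],
)

for path in targets:
    print(path)
```

Listing target files this way, rather than typing each path by hand, is what lets a workflow grow to more subjects or conditions without more hard-coded repetition.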

Sessions

Storing and Extracting Metadata From Files: Converting Lists and Dicts into JSON and YAML

Somehow, we have to get our data into files; otherwise, we'd lose our data every time we exited Python! Ideally, the way we store our data will make it easy to read and write, both in our own favorite computational environment and in those of our colleagues, without requiring that everyone develop some ultra-complex custom code. That's where standardized file formats come in. In this set of exercises, we'll practice **serializing** data into a string that we can write into a text file and **deserializing** text into Python data structures, using three different text file formats.

Static Workflows with Snakemake

Build Snakemake rules that run code, declare outputs, and connect inputs into simple reproducible workflows.

Generalizing Workflows with Wildcards

Use wildcards, expand(), and glob_wildcards() to scale Snakemake workflows across repeated filenames and connected rules.