Data Validation Patterns

Author
Dr. Nicholas A. Del Grosso

Setup

Run the helper below before working through the exercises. It provides lightweight feedback similar to a tiny test runner.

import numpy as np

def check(code, expected, exception_message="", verbose=True):
    """
    a "pytest-lite" function.
    
    Takes code to evaluate and what's expected (a value, an exception type, or additionally a substring of the exception message).
    Returns whether the expectation was met, and prints a message describing the finding.
    """
    try:
        output = eval(code) 

    except BaseException as exc:
        output = exc
        if type(expected) == type and issubclass(expected, BaseException):
            valid_exception_type = isinstance(exc, expected)
            valid_exception_message = exception_message in str(exc)
            valid = valid_exception_type and valid_exception_message
            if not valid:
                if not valid_exception_type:
                    expected_str = expected.__name__
                else:
                    expected_str = '\"...' + exception_message + '...\"'
            else:
                expected_str = ''
        else:
            valid = False
            expected_str = str(expected)

        output_str = type(output).__name__
    
    else:
        if type(expected) == type and issubclass(expected, Exception):
            valid = False
            expected_str = expected.__name__
        elif type(expected) == type:
            # A class was expected: pass only if the result is an instance of it.
            valid = isinstance(output, expected)
            expected_str = expected.__name__
            
        else:
            valid = output == expected
            expected_str = str(expected)

        if " object at " in str(output):
            output_str = type(output).__name__
        else:
            output_str = str(output)

    
    if verbose:
        valid_str = "✅" if valid else "❌"

        # output_str = output if not isinstance(output, Exception) else type(output).__name__
        print(valid_str, code, "->", output_str, "" if valid else f"(Expected: {expected_str})")
    return valid
        

Section 1: Background

Why Validate at the Edges

Realistically, most bugs in scientific code come from bad assumptions about inputs: wrong types, missing keys, unexpected shapes, empty strings, negative values that “should never happen,” and so on. Guard clauses are the blunt tool that prevents this mess. Here, we’ll practice the basic pattern behind data validation before reaching for larger frameworks: check assumptions at the boundary, raise clear errors, and keep invalid objects from existing in the first place.

The goal is to:

  • Fail fast: you don’t waste time debugging downstream errors caused by nonsense inputs.
  • Reduce nesting: you don’t wrap the real logic in “if” jungles.
  • Document assumptions: the function states clearly what it won’t accept.
  • Reduce ambiguity: users can’t silently pass the wrong thing and hope for the best.
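
As a minimal sketch of the "fail fast" and "document assumptions" points, compare two versions of a hypothetical function, `mean_rate` (the name and its parameters are invented for illustration and are not part of the exercises). Without guards, an empty list fails with a ZeroDivisionError that says nothing about the real cause, and a negative duration silently returns a negative rate. With guards, both cases are rejected with a message that names the broken assumption.

```python
def mean_rate_unguarded(spike_counts, duration_s):
    # No checks: bad inputs fail wherever the arithmetic happens to break,
    # or do not fail at all.
    return sum(spike_counts) / len(spike_counts) / duration_s


def mean_rate(spike_counts, duration_s):
    # Guard clauses: state the assumptions first, then do the real work.
    if len(spike_counts) == 0:
        raise ValueError("spike_counts must not be empty")
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return sum(spike_counts) / len(spike_counts) / duration_s


# The unguarded version accepts a nonsensical duration without complaint.
print(mean_rate_unguarded([4, 6], -2.0))

# Both versions fail on an empty list, but only one explains why.
for func in (mean_rate_unguarded, mean_rate):
    try:
        func([], 2.0)
    except Exception as exc:
        print(f"{func.__name__}: {type(exc).__name__}: {exc}")
```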

Practice: Validate at the Edges

When you’re writing code for another researcher, the cleanest way to keep things stable is to validate inputs right at the boundary — the moment they enter your function, class, or pipeline. Your client will give you messy, half-specified data; that’s normal. If you don’t check it immediately, the mistake shows up later in a place that looks like your bug. Validating at the edges prevents that: you reject bad inputs early so the rest of the code can stay simple and trustworthy.

The Data Will Get More Complex

In this notebook, we’re focusing on very simple data (just single values) so we can see how challenging validation is even in these cases. In later sessions, we’ll leverage frameworks for validating much more complex data structures, including those that live in scientific data files.

In the exercises in this notebook, you will practice the basics of data validation and gradually move from manual guard clauses to framework-supported validation:

  1. Data Validation when Running Functions
  2. Data Validation when Instantiating Objects
  3. Data Validation when Writing Dataclasses
  4. Pydantic for Data Validation on Custom Classes

The “Guard Clause” Pattern: “Check Yourself Before You Break Yourself”

A guard clause is a short, early check at the top of a function or method that refuses to continue when something is off. No ceremony, no clever abstractions. You validate the input and immediately raise, instead of letting the code wander forward and fail three layers deeper.

They’re the simplest, most reliable form of defensive programming.
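
To make the contrast concrete, here is a small sketch of the same validation written twice: once with nested conditionals and once with guard clauses. The function names `area_nested` and `area_guarded` are hypothetical and exist only for this comparison. Both behave identically; the difference is how far the real logic sits from the left margin.

```python
def area_nested(length, width):
    # Validation by nesting: the real logic ends up buried three levels deep,
    # and each error sits far from the condition that triggers it.
    if isinstance(length, (int, float)) and isinstance(width, (int, float)):
        if length > 0:
            if width > 0:
                return length * width
            else:
                raise ValueError("width must be positive")
        else:
            raise ValueError("length must be positive")
    else:
        raise TypeError("length and width must be numbers")


def area_guarded(length, width):
    # Guard clauses: each check exits early, so the happy path stays flat.
    if not isinstance(length, (int, float)) or not isinstance(width, (int, float)):
        raise TypeError("length and width must be numbers")
    if length <= 0:
        raise ValueError("length must be positive")
    if width <= 0:
        raise ValueError("width must be positive")

    return length * width


print(area_nested(2, 3), area_guarded(2, 3))
```

In the guarded version, each condition sits directly above the error it raises, and the final line is the only one that does real work.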

Exercises

Example: Make all the checks pass.

def greet(name):
    """
    Says Hi to whoever you want!
    """

    ## Guard Clauses Go Here: ######
    if isinstance(name, (float, int)):
        raise TypeError("`name` should be a string. You are not a number.")
    ##################################
    
    return f"Hi, {name}!"


check("greet('Nicholas')", "Hi, Nicholas!");
check("greet(24601)", TypeError, "You are not a number");
✅ greet('Nicholas') -> Hi, Nicholas! 
✅ greet(24601) -> TypeError 

Exercise: Make all the checks pass.

def total_length(x, y):
    """
    Computes the total of two lengths of wire.

    Arguments:
      - x: a positive number
      - y: another positive number

    """

    ## Guard Clauses Go Here: ####


    ##############################

    return x + y


check("total_length(3.2, 1.2)", 4.4)
check("total_length([1, 2], [])", TypeError, "number")
check("total_length(-3, 5)", ValueError, "positive")
check("total_length(3, -5.2)", ValueError, "positive")
check("total_length(3, 'a')", TypeError, "number")
check("total_length('hello, ', 'world')", TypeError, "number")
check("total_length(1., 2)", 3.)
check("total_length(np.float32(3), 3)", np.float32(6));
✅ total_length(3.2, 1.2) -> 4.4 
❌ total_length([1, 2], []) -> [1, 2] (Expected: TypeError)
❌ total_length(-3, 5) -> 2 (Expected: ValueError)
❌ total_length(3, -5.2) -> -2.2 (Expected: ValueError)
❌ total_length(3, 'a') -> TypeError (Expected: "...number...")
❌ total_length('hello, ', 'world') -> hello, world (Expected: TypeError)
✅ total_length(1., 2) -> 3.0 
✅ total_length(np.float32(3), 3) -> 6.0 
Solution
def total_length(x, y):
    """
    Computes the total of two lengths of wire.

    Arguments:
      - x: a positive number
      - y: another positive number
    """

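    # np.number is included so that NumPy scalars such as np.float32 are accepted.
    if not isinstance(x, (int, float, np.number)):
        raise TypeError("x must be a number")
    if not isinstance(y, (int, float, np.number)):
        raise TypeError("y must be a number")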
    if x <= 0 or y <= 0:
        raise ValueError("lengths must be positive")

    return x + y


check("total_length(3.2, 1.2)", 4.4)
check("total_length([1, 2], [])", TypeError, "number")
check("total_length(-3, 5)", ValueError, "positive")
check("total_length(3, -5.2)", ValueError, "positive")
check("total_length(3, 'a')", TypeError, "number")
check("total_length('hello, ', 'world')", TypeError, "number")
check("total_length(1., 2)", 3.)
check("total_length(np.float32(3), 3)", np.float32(6));

Exercise: Make all the checks pass.


def translate(rna):
    """
    Translate an RNA sequence into a peptide sequence.
    """

    ## Guard Clauses Go Here: ##################



    ############################################

    from urllib.request import urlopen
    import json

    codons_url = "https://raw.githubusercontent.com/nickdelgrosso/dna-transcription-kata/refs/heads/master/data/codons.json"
    with urlopen(codons_url) as response_c:
        peptides = json.loads(response_c.read())

    peptides_url = "https://raw.githubusercontent.com/nickdelgrosso/dna-transcription-kata/refs/heads/master/data/peptides.json"
    with urlopen(peptides_url) as response_p:
        peptides_shorts = json.loads(response_p.read())
    
    
    out = []
    for c0, c1, c2 in zip(rna[::3], rna[1::3], rna[2::3]):
        codon = (c0 + c1 + c2)
        peptide = peptides[codon]
        peptide_short = peptides_shorts[peptide.lower()]
        out.append(peptide_short)

    return "".join(out)
    

check("translate('CCC')", 'P');
check("translate('GCAUUA')", 'AL');
check("translate('gca')", ValueError, "upper")
check("translate('TTT')", ValueError, "GCAU")
check("translate('GG')", ValueError, "three");
✅ translate('CCC') -> P 
✅ translate('GCAUUA') -> AL 
❌ translate('gca') -> KeyError (Expected: ValueError)
❌ translate('TTT') -> KeyError (Expected: ValueError)
❌ translate('GG') ->  (Expected: ValueError)
Solution
def translate(rna):
    """
    Translate an RNA sequence into a peptide sequence.
    """

    if not isinstance(rna, str):
        raise TypeError("rna must be a string")
    if rna != rna.upper():
        raise ValueError("rna must be upper-case")
    if any(base not in "GCAU" for base in rna):
        raise ValueError("rna must only contain the bases GCAU")
    if len(rna) % 3 != 0:
        raise ValueError("rna length must be divisible by three")

    from urllib.request import urlopen
    import json

    codons_url = "https://raw.githubusercontent.com/nickdelgrosso/dna-transcription-kata/refs/heads/master/data/codons.json"
    with urlopen(codons_url) as response_c:
        peptides = json.loads(response_c.read())

    peptides_url = "https://raw.githubusercontent.com/nickdelgrosso/dna-transcription-kata/refs/heads/master/data/peptides.json"
    with urlopen(peptides_url) as response_p:
        peptides_shorts = json.loads(response_p.read())

    out = []
    for c0, c1, c2 in zip(rna[::3], rna[1::3], rna[2::3]):
        codon = c0 + c1 + c2
        peptide = peptides[codon]
        peptide_short = peptides_shorts[peptide.lower()]
        out.append(peptide_short)

    return "".join(out)


check("translate('CCC')", 'P');
check("translate('GCAUUA')", 'AL');
check("translate('gca')", ValueError, "upper")
check("translate('TTT')", ValueError, "GCAU")
check("translate('GG')", ValueError, "three")

Data Validation when Instantiating Objects

When you create an object, you’re claiming: “This thing represents something real and internally consistent.” Most bugs show up because that claim quietly isn’t true.

In OOP, the constructor __init__ is the boundary where you decide what counts as a valid object. If you let invalid data slip through here, the error will surface later in a place that’s harder to diagnose. That leads to the classic Python debugging experience: the real mistake happened 40 lines earlier, but you only notice when something unrelated explodes.

So the rule is simple: If your object must obey certain constraints, enforce them at creation time.

Exercises

Example:

class Rectangle:

    def __init__(self, length, width):

        self.length = length
        self.width = width

        ## Data Validation Goes Here: #################
        if not isinstance(self.length, (int, float)):
            raise TypeError("length must be a number.")
        if self.length <= 0:
            raise ValueError("length must be positive")
        
        if not isinstance(self.width, (int, float)):
            raise TypeError("width must be a number.")
        if self.width <= 0:
            raise ValueError("width must be positive")
        
        ###############################################
    

check("Rectangle(4, 5)", Rectangle);
check("Rectangle('wide', 'tall')", TypeError);
check("Rectangle(-2, 2)", ValueError, "positive");
✅ Rectangle(4, 5) -> Rectangle 
✅ Rectangle('wide', 'tall') -> TypeError 
✅ Rectangle(-2, 2) -> ValueError 

Exercise: Make all the checks pass.

class Person:

    def __init__(self, name, age) -> None:

        self.name = name
        self.age = age

        ## Data Validation Goes Here: ##########


        ####################################



check("Person('Nick', 37)", Person)
check("Person('Santa', 'old')", TypeError, "integer")
check("Person('', -200)", ValueError, "positive")
check("Person('', 12)", ValueError, "empty");
✅ Person('Nick', 37) -> Person 
❌ Person('Santa', 'old') -> Person (Expected: TypeError)
❌ Person('', -200) -> Person (Expected: ValueError)
❌ Person('', 12) -> Person (Expected: ValueError)
Solution
class Person:

    def __init__(self, name, age) -> None:
        if not isinstance(age, int):
            raise TypeError("age must be an integer")
        if age <= 0:
            raise ValueError("age must be positive")
        if not isinstance(name, str):
            raise TypeError("name must be a string")
        if len(name) == 0:
            raise ValueError("name must not be empty")

        self.name = name
        self.age = age


check("Person('Nick', 37)", Person)
check("Person('Santa', 'old')", TypeError, "integer")
check("Person('', -200)", ValueError, "positive")
check("Person('', 12)", ValueError, "empty")

Data Validation when Writing Dataclasses

Python classes are often just bags of data with a little validation sprinkled on top. Writing all the boilerplate (__init__, __repr__, comparisons, etc.) is tedious and error-prone. dataclasses solve this by generating the boring parts for you.

When you mark a class with @dataclass, Python automatically creates:

  • an __init__ assigning your fields,
  • a readable __repr__,
  • and other convenience defaults.

However, dataclasses do not automatically ensure that your data is correct; that part we still have to write ourselves. To provide a place for data validation, __post_init__ runs immediately after the automatically generated __init__. This is the hook where you enforce invariants: the things that must always be true for a valid instance.
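
The sketch below illustrates both halves of that paragraph with two hypothetical classes, `Unchecked` and `Checked` (invented for this illustration). The first shows that a field annotation on a plain dataclass is documentation only: the generated __init__ stores whatever it is given. The second adds a __post_init__ that rejects the same input.

```python
from dataclasses import dataclass


@dataclass
class Unchecked:
    weight_g: float  # the annotation is not enforced at runtime


@dataclass
class Checked:
    weight_g: float

    def __post_init__(self):
        # Runs right after the generated __init__ has assigned the fields.
        if not isinstance(self.weight_g, (int, float)):
            raise TypeError("weight_g must be a number")
        if self.weight_g <= 0:
            raise ValueError("weight_g must be positive")


# The plain dataclass accepts a string for a float field without complaint.
print(Unchecked("heavy"))

# The generated __eq__ compares instances field by field.
print(Unchecked(2.5) == Unchecked(2.5))

# The validated dataclass refuses to create an invalid instance.
try:
    Checked("heavy")
except TypeError as exc:
    print("rejected:", exc)
```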

Exercises

Example: Make all the checks pass.

from dataclasses import dataclass

@dataclass
class Rectangle:
    length: float
    width: float

    def __post_init__(self):
        if not isinstance(self.length, (int, float)):
            raise TypeError("length must be a number.")
        if self.length <= 0:
            raise ValueError("length must be positive")
        
        if not isinstance(self.width, (int, float)):
            raise TypeError("width must be a number.")
        if self.width <= 0:
            raise ValueError("width must be positive")
    

check("Rectangle(4, 5)", Rectangle);
check("Rectangle('wide', 'tall')", TypeError);
check("Rectangle(-2, 2)", ValueError, "positive");
✅ Rectangle(4, 5) -> Rectangle(length=4, width=5) 
✅ Rectangle('wide', 'tall') -> TypeError 
✅ Rectangle(-2, 2) -> ValueError 

Exercise: Make all the checks pass.

from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

    def __post_init__(self):
        ...
        ## Guard Clauses Go Here: #########
        

        ###################################


check("Person('Nick', 37)", Person)
check("Person('Santa', 'old')", TypeError, "integer")
check("Person('', -200)", ValueError, "positive")
check("Person('', 12)", ValueError, "empty")
Solution
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

    def __post_init__(self):
        if not isinstance(self.age, int):
            raise TypeError("age must be an integer")
        if self.age <= 0:
            raise ValueError("age must be positive")
        if not isinstance(self.name, str):
            raise TypeError("name must be a string")
        if len(self.name) == 0:
            raise ValueError("name must not be empty")


check("Person('Nick', 37)", Person)
check("Person('Santa', 'old')", TypeError, "integer")
check("Person('', -200)", ValueError, "positive")
check("Person('', 12)", ValueError, "empty")

Pydantic: a Framework that simplifies Data Validation in Custom Classes

Manual guard clauses are fine for simple functions, but they get tiresome the moment you start defining structured objects — experiment configs, stimulus definitions, trial parameters, behavioral logs, etc. You end up repeating checks, writing boilerplate, and missing edge cases.

Pydantic exists to remove that tedium. It wraps your class in a validation layer that:

  • Enforces types automatically.
  • Runs field-level validation without you writing the same guard clauses over and over.
  • Builds errors that are actually readable, instead of stack traces buried in your own code.
  • Makes malformed data impossible to instantiate, which is exactly what you want for models representing “real-world” entities.

The key idea: your class shouldn’t exist in an invalid state. Pydantic makes that rule the default, not something you hope developers remember.

If your analysis pipeline depends on structured configuration or repeatedly loaded data formats, Pydantic pays for itself immediately. It standardizes validation, cuts boilerplate, and forces correctness at the boundary — before bad inputs poison the rest of your workflow.
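
As a rough sketch of what "enforces types automatically" and "readable errors" look like in practice, the example below defines a hypothetical `Trial` model (its name and fields are invented for illustration) and assumes Pydantic version 2 is installed. Note that in Pydantic's default mode, compatible inputs such as the string "3" for an int field are converted rather than rejected, and all failing fields are reported together in a single ValidationError.

```python
from pydantic import ValidationError
from pydantic.dataclasses import dataclass as p_dataclass


@p_dataclass
class Trial:  # hypothetical model, for illustration only
    subject_id: str
    n_repeats: int
    duration_s: float


# Compatible values are converted to the annotated types.
trial = Trial(subject_id="s01", n_repeats="3", duration_s=2)
print(trial)

# Incompatible values are rejected, and every failing field is reported.
try:
    Trial(subject_id="s01", n_repeats="many", duration_s="long")
except ValidationError as exc:
    for err in exc.errors():
        print(err["loc"], err["msg"])
```

No guard clauses were written here; the type annotations alone drive the checks. Constraints beyond types, such as "must be positive", still need a field validator.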

Exercises

Example: Make all the checks pass.

from pydantic import ValidationError, field_validator
from pydantic.dataclasses import dataclass as p_dataclass

@p_dataclass
class Rectangle:
    length: float
    width: float

    @field_validator('length', 'width')
    @classmethod
    def validate_positive(cls, value: float):
        if value <= 0:
            raise ValueError("must be positive")
        # A field validator must return the value; otherwise the field is set to None.
        return value
    

check("Rectangle(4, 5)", Rectangle);
check("Rectangle('wide', 'tall')", ValidationError);
check("Rectangle(-2, 2)", ValidationError, "positive");
✅ Rectangle(4, 5) -> Rectangle(length=4.0, width=5.0) 
✅ Rectangle('wide', 'tall') -> ValidationError 
✅ Rectangle(-2, 2) -> ValidationError 

Exercise: Make all the checks pass.

from pydantic import ValidationError
from pydantic.dataclasses import dataclass as p_dataclass

@p_dataclass
class Person:
    name: str
    age: int

    ## Add field validators Here: ####



    ###################################


check("Person('Nick', 37)", Person);
check("Person('Santa', 'old')", ValidationError, "integer");
check("Person('', -200)", ValidationError, "positive");
check("Person('', 12)", ValidationError, "empty");
✅ Person('Nick', 37) -> Person(name='Nick', age=37) 
✅ Person('Santa', 'old') -> ValidationError 
❌ Person('', -200) -> Person(name='', age=-200) (Expected: ValidationError)
❌ Person('', 12) -> Person(name='', age=12) (Expected: ValidationError)
Solution
from pydantic import ValidationError, field_validator
from pydantic.dataclasses import dataclass as p_dataclass

@p_dataclass
class Person:
    name: str
    age: int

    @field_validator("name")
    @classmethod
    def validate_name(cls, value: str):
        if len(value) == 0:
            raise ValueError("name must not be empty")
        return value

    @field_validator("age")
    @classmethod
    def validate_age(cls, value: int):
        if value <= 0:
            raise ValueError("age must be positive")
        return value


check("Person('Nick', 37)", Person);
check("Person('Santa', 'old')", ValidationError, "integer");
check("Person('', -200)", ValidationError, "positive");
check("Person('', 12)", ValidationError, "empty");

Section 2: Conclusion

Data validation isn’t decoration; it’s the difference between code that quietly corrupts results and code you can trust. The pattern is always the same:

  • Reject invalid data early.
  • Fail fast and loudly.
  • Make illegal states unrepresentable.

Guard clauses handle the simple cases. The dataclass __post_init__ hook gives you a clean place to enforce invariants. Pydantic scales the whole approach when your objects get complicated.

The Take-Away: Good validation eliminates entire classes of bugs before they exist.