Specialized Data Validation Frameworks

Author
Dr. Nicholas A. Del Grosso

Use specialized libraries to validate data at common application boundaries: tabular datasets, LLM workflows, serialized messages, and command-line arguments.

Section 1: Pandera: Data Validation for DataFrames

Pandera brings schema validation to pandas, Polars, and other DataFrame libraries. It lets you define expectations for columns—types, ranges, nullability, custom checks—and validates entire datasets at runtime. It’s essentially unit tests for data pipelines.

Exercises

Example: Validate a Simple DataFrame Schema

import pandera.pandas as pa
import pandas as pd

class PersonSchema(pa.DataFrameModel):
    name: pa.typing.Series[str] = pa.Field(nullable=False)
    age: pa.typing.Series[int] = pa.Field(ge=0, le=120)

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 41, -5],   # <- invalid, age < 0
})

PersonSchema.validate(df)
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[6], line 13
      6     age: pa.typing.Series[int] = pa.Field(ge=0, le=120)
      8 df = pd.DataFrame({
      9     "name": ["Alice", "Bob", "Charlie"],
     10     "age": [30, 41, -5],   # <- invalid, age < 0
     11 })
---> 13 PersonSchema.validate(df)

SchemaError: Column 'age' failed element-wise validator number 0: greater_than_or_equal_to(0) failure cases: -5

Exercise: Create a schema for a DataFrame of rectangles, and validate the DataFrame below:

  • columns: length: float, width: float
  • both must be positive
  • add a custom check: area = length * width must be less than 100

df = pd.DataFrame({
    "length": [3.0, 20.0],
    "width": [4.0, 1.0]
})

Solution

class RectangleSchema(pa.DataFrameModel):
    length: pa.typing.Series[float] = pa.Field(gt=0)
    width: pa.typing.Series[float] = pa.Field(gt=0)

    @pa.dataframe_check
    @classmethod
    def area_less_than_100(cls, df):
        return (df["length"] * df["width"]) < 100

RectangleSchema.validate(df)

Section 2: Pydantic-AI: Validated Inputs + LLM Reasoning

Pydantic-AI is an agent framework built on Pydantic that wraps LLM calls in typed Python. It lets you pass typed dependencies (such as an already-validated Pydantic model) into a run, and it can validate tool-call arguments and structured outputs against Pydantic models. It’s helpful when you need reproducible LLM workflows instead of loose free-form strings.

Exercises

Example: A Validated Agent Input Model

from getpass import getpass
import os

# Needs an OpenAI API Key: https://platform.openai.com/login
if not os.getenv("OPENAI_API_KEY"):
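    os.environ["OPENAI_API_KEY"] = getpass()

from pydantic_ai import Agent, RunContext
from pydantic import BaseModel, Field

class Rectangle(BaseModel):
    length: float = Field(gt=0)
    width: float = Field(gt=0)

# deps_type declares the type of the object passed to each run via deps=
agent = Agent("openai:gpt-4o-mini", deps_type=Rectangle)

@agent.system_prompt
def describe_rectangle(ctx: RunContext[Rectangle]) -> str:
    # Dependencies are not sent to the model automatically, so put them in the prompt
    return f"The rectangle has length {ctx.deps.length} and width {ctx.deps.width}."

rect = Rectangle(length=3, width=4)   # validated here, before any LLM call

result = await agent.run("Compute area.", deps=rect)

print(result.output)   # LLM output
print(rect)            # validated input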
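
The input validation in this example is done by plain Pydantic, before any request is sent to the LLM. The sketch below isolates that step so it can be tried without an API key. The helper parse_rectangle is a hypothetical name introduced only for this illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError

class Rectangle(BaseModel):
    length: float = Field(gt=0)
    width: float = Field(gt=0)

def parse_rectangle(raw: dict) -> Optional[Rectangle]:
    """Return a validated Rectangle, or None if the raw values are rejected."""
    try:
        return Rectangle.model_validate(raw)
    except ValidationError as exc:
        # Each error names the offending field and the rule it broke
        for err in exc.errors():
            print(err["loc"], err["msg"])
        return None

good = parse_rectangle({"length": "3", "width": 4})   # numeric string is coerced to float
bad = parse_rectangle({"length": -1, "width": 4})     # fails the gt=0 constraint
```

Only a Rectangle that survives this step would be handed to the agent, so malformed values never reach the model.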

Exercise: Add a Person Model and Ask the LLM.

  • Create an agent
  • Ask the model: “How old will this person be in five years?”
  • Run it with a Person(name="Emma", age=3) object
  • Confirm Pydantic blocks invalid values like age=-10
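Solution

from pydantic import BaseModel, Field, ValidationError
from pydantic_ai import Agent, RunContext

class Person(BaseModel):
    name: str = Field(min_length=1)
    age: int = Field(ge=0)

agent = Agent("openai:gpt-4o-mini", deps_type=Person)

@agent.system_prompt
def describe_person(ctx: RunContext[Person]) -> str:
    # Pass the validated person to the model as part of the prompt
    return f"The person is named {ctx.deps.name} and is {ctx.deps.age} years old."

result = await agent.run(
    "How old will this person be in five years?",
    deps=Person(name="Emma", age=3),
)
print(result.output)

# Invalid values never reach the LLM: Pydantic raises before the agent runs
try:
    Person(name="Emma", age=-10)
except ValidationError as exc:
    print(exc)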

Section 3: msgspec: Fast, Typed, and Strict Structured Data

msgspec provides fast, typed data structures that are checked against their type annotations while decoding. Think of it as dataclasses + validation + serialization, implemented in C.

It’s especially good for:

  • JSON / MessagePack APIs
  • high-performance pipelines
  • applications needing strict schemas but minimal overhead

Exercises

Example: Define a Strict Typed Structure

import msgspec

class Person(msgspec.Struct):
    name: str
    age: int

    def __post_init__(self):
        if self.age < 0:
            raise ValueError("age must be non-negative")

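data = b'{"name": "Alice", "age": -5}'  # <- invalid in your domain

# Decoding runs __post_init__, which rejects age < 0, so this raises
# msgspec.ValidationError and the print below is never reached.
person = msgspec.json.decode(data, type=Person)
print(person)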

Exercise: Write JSON validation code to reject this invalid Rectangle:

data = b'{"length": 3.0, "width": -2.0}'

Solution

import msgspec

class Rectangle(msgspec.Struct):
    length: float
    width: float

    def __post_init__(self):
        if self.length <= 0 or self.width <= 0:
            raise ValueError("length and width must be positive")

data = b'{"length": 3.0, "width": -2.0}'

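try:
    msgspec.json.decode(data, type=Rectangle)
except msgspec.ValidationError as exc:
    # During decoding, msgspec reports a ValueError from __post_init__
    # as msgspec.ValidationError
    print(exc)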

Section 4: Typer: Validation and Structure for Command-Line Interfaces

Typer is a modern library for building command-line interfaces using Python type hints. It automatically parses arguments, enforces basic validation (types, required/optional values), and generates helpful error messages and documentation. While Typer isn’t a “data validation” library in the traditional sense, it does validate user input at the command boundary — one of the most critical validation layers in real applications.

Typer will build a CLI and:

  • convert values
  • reject invalid types
  • show a nice help message if you pass invalid flags

Exercises

Example: A CLI Command With Typed Arguments

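Put this into a file called app.py and run it:

python app.py rectangle-area --length 3 --width 4

import typer

app = typer.Typer()

@app.callback()
def main():
    """Rectangle tools."""
    # With only one command, Typer would run it directly; this callback
    # keeps rectangle-area as an explicit subcommand.

@app.command()
def rectangle_area(
    length: float = typer.Option(...),  # required --length option
    width: float = typer.Option(...),   # required --width option
):
    """
    Compute the area of a rectangle.
    """
    if length <= 0 or width <= 0:
        typer.echo("Both length and width must be positive!")
        raise typer.Exit(code=1)

    area = length * width
    typer.echo(f"Area: {area}")

if __name__ == "__main__":
    app()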
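
Typer's type conversion can also be exercised from Python, without a shell, using the test runner that ships with Typer. The sketch below uses a hypothetical single-command app. It registers only one command and no callback, so Typer runs that command directly and no subcommand name is passed.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer()

@app.command()
def rectangle_area(
    length: float = typer.Option(...),  # required --length option
    width: float = typer.Option(...),   # required --width option
):
    """Compute the area of a rectangle."""
    typer.echo(f"Area: {length * width}")

runner = CliRunner()

# Valid input: the strings "3" and "4" are converted to floats
ok = runner.invoke(app, ["--length", "3", "--width", "4"])

# Invalid input: "abc" is not a float, so the function body never runs
bad = runner.invoke(app, ["--length", "abc", "--width", "4"])

print(ok.exit_code, ok.output.strip())
print(bad.exit_code)
```

A non-zero exit code for the second call indicates that the value was rejected at the command boundary, before any of your own validation code was reached.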

Exercise: Build a Validated Person CLI Tool

Create a CLI command:

python app.py create-person --name "Emma" --age 3

Requirements:

  1. The command should define parameters with type hints:

    name: str  
    age: int  
  2. Validate inside the function that:

    • name is not empty
    • age ≥ 0
  3. On success, print: "Person(name='Emma', age=3) created!"

  4. On failure, print an error message and exit with typer.Exit(code=1).

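Solution

import typer

app = typer.Typer()

@app.callback()
def main():
    """Person tools."""
    # The callback keeps create-person as an explicit subcommand

@app.command()
def create_person(
    name: str = typer.Option(...),  # required --name option
    age: int = typer.Option(...),   # required --age option
):
    if not name:
        typer.echo("name must not be empty")
        raise typer.Exit(code=1)
    if age < 0:
        typer.echo("age must be non-negative")
        raise typer.Exit(code=1)

    typer.echo(f"Person(name={name!r}, age={age}) created!")

if __name__ == "__main__":
    app()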