Specialized Data Validation Frameworks
Use specialized libraries to validate data at common application boundaries: tabular datasets, LLM workflows, serialized messages, and command-line arguments.
Section 1: Pandera: Data Validation for DataFrames
Pandera brings schema validation to pandas, Polars, and other DataFrame libraries. It lets you define expectations for columns—types, ranges, nullability, custom checks—and validates entire datasets at runtime. It’s essentially unit tests for data pipelines.
Exercises
Example: Validate a Simple DataFrame Schema
import pandera.pandas as pa
import pandas as pd

class PersonSchema(pa.DataFrameModel):
    name: pa.typing.Series[str] = pa.Field(nullable=False)
    age: pa.typing.Series[int] = pa.Field(ge=0, le=120)

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 41, -5],  # <- invalid, age < 0
})
PersonSchema.validate(df)

Output (traceback shortened to its final line):
SchemaError: Column 'age' failed element-wise validator number 0: greater_than_or_equal_to(0) failure cases: -5
Exercise: Create a schema for a DataFrame of rectangles, and validate the DataFrame below:
- columns: length: float, width: float - both must be positive
- add a custom check: area = length * width must be less than 100
df = pd.DataFrame({
    "length": [3.0, 20.0],
    "width": [4.0, 1.0]
})
Solution
class RectangleSchema(pa.DataFrameModel):
    length: pa.typing.Series[float] = pa.Field(gt=0)
    width: pa.typing.Series[float] = pa.Field(gt=0)

    @pa.dataframe_check
    @classmethod
    def area_less_than_100(cls, df):
        return (df["length"] * df["width"]) < 100

RectangleSchema.validate(df)
Section 2: Pydantic-AI: Validated Inputs + LLM Reasoning
Pydantic-AI extends Pydantic models into “agents” that control LLM inputs and outputs. It enforces strong structure around prompts, validated parameters, and model reasoning steps. It’s helpful when you need reproducible LLM workflows instead of loose free-form strings.
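To make "validated outputs" concrete, here is a minimal sketch that uses plain Pydantic rather than the Pydantic-AI API, so it runs without an API key. The `AreaAnswer` model and the two reply strings are hypothetical stand-ins for what a model might return; the point is only that a reply is accepted when it parses into the declared structure and rejected otherwise.

```python
from pydantic import BaseModel, Field, ValidationError


class AreaAnswer(BaseModel):
    """Hypothetical structure we expect an LLM reply to follow."""

    area: float = Field(gt=0)
    unit: str = Field(min_length=1)


# Hypothetical raw replies, standing in for real model output.
good_reply = '{"area": 12.0, "unit": "m^2"}'
bad_reply = '{"area": -12.0, "unit": ""}'

# A well-formed reply becomes a typed object.
answer = AreaAnswer.model_validate_json(good_reply)
print(answer.area, answer.unit)

# A malformed reply raises ValidationError listing every violated constraint.
try:
    AreaAnswer.model_validate_json(bad_reply)
    error_count = 0
except ValidationError as exc:
    error_count = exc.error_count()
    print(exc)
```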
Exercises
Example: A Validated Agent Input Model
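from getpass import getpass
import os

# Needs an OpenAI API key: https://platform.openai.com/login
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass()

from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext

class Rectangle(BaseModel):
    length: float = Field(gt=0)
    width: float = Field(gt=0)

# deps_type declares which validated object each run receives.
agent = Agent("openai:gpt-4o-mini", deps_type=Rectangle)

# deps are not sent to the model automatically; expose them in the system prompt.
@agent.system_prompt
def describe_rectangle(ctx: RunContext[Rectangle]) -> str:
    return f"The rectangle has length {ctx.deps.length} and width {ctx.deps.width}."

rectangle = Rectangle(length=3, width=4)  # validated on construction

# Top-level await works in a notebook; in a script, use agent.run_sync(...).
result = await agent.run("Compute area.", deps=rectangle)
print(result.output)  # LLM output
print(rectangle)      # validated input
Exercise: Add a Person Model and Ask the LLM.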
- Create an agent
- Ask the model: “How old will this person be in five years?”
- Run it with a Person(name="Emma", age=3) object
- Confirm Pydantic blocks invalid values like age=-10
Solution
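from pydantic import BaseModel, Field, ValidationError
from pydantic_ai import Agent, RunContext

class Person(BaseModel):
    name: str = Field(min_length=1)
    age: int = Field(ge=0)

agent = Agent("openai:gpt-4o-mini", deps_type=Person)

# Pass the validated person to the model through the system prompt.
@agent.system_prompt
def describe_person(ctx: RunContext[Person]) -> str:
    return f"The person is named {ctx.deps.name} and is {ctx.deps.age} years old."

result = await agent.run(
    "How old will this person be in five years?",
    deps=Person(name="Emma", age=3),
)
print(result.output)

# Invalid values never reach the agent: Pydantic rejects them at construction.
try:
    Person(name="Emma", age=-10)
except ValidationError as exc:
    print(exc)
Section 3: msgspec: Fast, Typed, and Strict Structured Data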
msgspec provides fast, typed data structures (Structs) and validates incoming data against their declared types when decoding.
Think of it as dataclasses + validation + serialization, all optimized in C.
It’s especially good for:
- JSON / MessagePack APIs
- high-performance pipelines
- applications needing strict schemas but minimal overhead
Exercises
Example: Define a Strict Typed Structure
import msgspec

class Person(msgspec.Struct):
    name: str
    age: int

    def __post_init__(self):
        if self.age < 0:
            raise ValueError("age must be non-negative")

data = b'{"name": "Alice", "age": -5}'  # <- invalid in your domain
# Raises msgspec.ValidationError: the ValueError from __post_init__ is converted during decoding.
person = msgspec.json.decode(data, type=Person)
print(person)
Exercise: Write JSON validation code to reject this invalid Rectangle:
{"length": 3.0, "width": -2.0}
Solution
import msgspec

class Rectangle(msgspec.Struct):
    length: float
    width: float

    def __post_init__(self):
        if self.length <= 0 or self.width <= 0:
            raise ValueError("length and width must be positive")

data = b'{"length": 3.0, "width": -2.0}'
try:
    msgspec.json.decode(data, type=Rectangle)
except msgspec.ValidationError as exc:
    # When decoding, msgspec re-raises the ValueError from __post_init__ as ValidationError.
    print(exc)
Section 4: Typer: Validation and Structure for Command-Line Interfaces
Typer is a modern library for building command-line interfaces using Python type hints. It automatically parses arguments, enforces basic validation (types, required/optional values), and generates helpful error messages and documentation. While Typer isn’t a “data validation” library in the traditional sense, it does validate user input at the command boundary — one of the most critical validation layers in real applications.
Typer will build a CLI and:
- convert values
- reject invalid types
- show a nice help message if you pass invalid flags
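The behaviours listed above can be observed without a terminal. The sketch below is an illustration with a hypothetical `area` command: it uses `typer.testing.CliRunner` to invoke the app in-process and compares a valid call, a value that cannot be converted to `float`, and an unknown flag. Because the app has a single command, it is invoked without a command name.

```python
from typing import Annotated

import typer
from typer.testing import CliRunner

app = typer.Typer()


@app.command()
def area(
    length: Annotated[float, typer.Option()],  # required option: --length
    width: Annotated[float, typer.Option()],   # required option: --width
):
    """Print the area of a rectangle."""
    typer.echo(f"Area: {length * width}")


runner = CliRunner()

# Strings from the command line are converted to float before the function runs.
ok = runner.invoke(app, ["--length", "3", "--width", "4"])

# "abc" cannot be converted to float, so Typer rejects the call.
bad_type = runner.invoke(app, ["--length", "abc", "--width", "4"])

# --height is not a declared option, so Typer rejects it as well.
bad_flag = runner.invoke(app, ["--height", "3"])

print(ok.output)
print(bad_type.exit_code, bad_flag.exit_code)
```

In both failing cases the function body never runs, which is what makes the command boundary a validation layer.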
Exercises
Example: A CLI Command With Typed Arguments
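Put this into a file called app.py and run it:
python app.py rectangle-area --length 3 --width 4

from typing import Annotated

import typer

app = typer.Typer()

@app.callback()
def main():
    """Rectangle tools."""
    # With a callback, Typer requires the command name even for a single command.

@app.command()
def rectangle_area(
    length: Annotated[float, typer.Option()],  # required option: --length
    width: Annotated[float, typer.Option()],   # required option: --width
):
    """
    Compute the area of a rectangle.
    """
    if length <= 0 or width <= 0:
        typer.echo("Both length and width must be positive!")
        raise typer.Exit(code=1)
    area = length * width
    typer.echo(f"Area: {area}")

if __name__ == "__main__":
    app()
Exercise: Build a Validated Person CLI Tool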
Create a CLI command:
python app.py create-person --name "Emma" --age 3
Requirements:
- The command should define parameters with type hints: name: str, age: int
- Validate inside the function that:
  - name is not empty
  - age ≥ 0
- On success, print: "Person(name='Emma', age=3) created!"
- On failure, print an error message and exit with typer.Exit(code=1).
Solution
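from typing import Annotated

import typer

app = typer.Typer()

@app.callback()
def main():
    """Person tools."""
    # The callback makes Typer require the create-person command name.

@app.command()
def create_person(
    name: Annotated[str, typer.Option()],  # required option: --name
    age: Annotated[int, typer.Option()],   # required option: --age
):
    if not name:
        typer.echo("name must not be empty")
        raise typer.Exit(code=1)
    if age < 0:
        typer.echo("age must be non-negative")
        raise typer.Exit(code=1)
    typer.echo(f"Person(name={name!r}, age={age}) created!")

if __name__ == "__main__":
    app()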