Structured Scientific Data with HDF5: Design, Access, and Compression

Author
Dr. Nicholas Del Grosso

# %pip install h5py numpy
import h5py
import numpy as np
import time

HDF5 is one of the most widely used scientific data formats. Many neuroscience tools, analysis pipelines, and standardised formats rely on it internally. In this notebook, we explore how HDF5 works at a practical level using the h5py library, which provides a NumPy-friendly interface to HDF5 files.

Our goals are to:

Understand how to store structured data and metadata

Read data efficiently and selectively

Explore how compression affects disk usage and performance

Throughout, we focus on design decisions: how we organise data, how we represent it, and how those choices influence performance.

Setup

Utility Functions

import os
import psutil

def _format_bytes(bytes: float, precision: int = 2) -> str:
    """
    Takes a number of bytes and returns a more human-readable string (e.g. '1.23 MB').

    Looking to do this in a real project?  Some alternatives:
      - `humanfriendly`: https://pypi.org/project/humanfriendly/#getting-started
    """

    if bytes < 0:
        raise ValueError("bytes must be non-negative")

    units = [("KB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]

    for unit, scale in reversed(units):
        if bytes >= scale:
            value = bytes / scale
            return f"{value:.{precision}f} {unit}"

    return f"{bytes} B"
    

def _disk_read() -> int:
    return psutil.Process(os.getpid()).io_counters().read_bytes

def _disk_write() -> int:
    return psutil.Process(os.getpid()).io_counters().write_bytes


class utils:
    format_bytes = _format_bytes
    bytes_read = _disk_read
    bytes_written = _disk_write
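A quick check of what the formatter produces. The scaling logic is duplicated here (as a standalone copy of `utils.format_bytes`) so the cell runs on its own:

```python
def format_bytes(n_bytes: float, precision: int = 2) -> str:
    # Same scaling logic as utils.format_bytes above, standalone for illustration.
    units = [("KB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]
    for unit, scale in reversed(units):
        if n_bytes >= scale:
            return f"{n_bytes / scale:.{precision}f} {unit}"
    return f"{n_bytes} B"

print(format_bytes(512))          # 512 B
print(format_bytes(1_234_567))    # 1.23 MB
print(format_bytes(240_000_000))  # 240.00 MB
```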

Section 1: Writing Mixed-Type, Labelled Data and Metadata to HDF5 Files using h5py

HDF5 allows us to move beyond simple array storage. Instead of writing one array per file, we can:

  • Store multiple datasets in a single file
  • Organise data hierarchically using groups
  • Attach metadata directly to datasets
  • Combine numeric arrays with strings and other data types

This makes HDF5 suitable for structured scientific data, where measurements, metadata, and experimental parameters belong together.

In this section, we focus on:

  • Creating datasets
  • Organising them into groups
  • Adding metadata using attributes
  • Designing clear, interpretable file structures

The emphasis is not on complexity, but on clarity. We aim to create files that are both structured and understandable.

Reference

Code Description
f = h5py.File('filename.h5', 'w') Open an h5py file object for writing
f.close() Closes the h5py file and releases the linked file back to the operating system.
f.create_dataset('temp', data=x) Write an array called ‘temp’ with the data in the numpy array x into the HDF5 file
f.create_dataset('data/temp', data=x) Write an array called ‘temp’ in the group called ‘data’ with the data in the numpy array x into the HDF5 file
f.attrs['name'] = 'Session 1' Set an attribute as metadata on the root group of the HDF5 file – this works like a normal Python dictionary
f['x'].attrs['id'] = 'ABC' Set an attribute as metadata on the ‘x’ node of the HDF5 file
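The calls above compose into a short round-trip sketch (the file, dataset, and attribute names here are just illustrative):

```python
import h5py
import numpy as np

x = np.arange(10, dtype=np.float32)

with h5py.File('sketch.h5', 'w') as f:
    f.attrs['name'] = 'Session 1'            # metadata on the root group
    f.create_dataset('data/temp', data=x)    # creates the 'data' group implicitly
    f['data/temp'].attrs['units'] = 'degC'   # metadata attached to the dataset

with h5py.File('sketch.h5', 'r') as f:
    print(list(f.keys()))                    # ['data']
    print(f['data/temp'].attrs['units'])     # degC
```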

Exercises

Example: Write an HDF5 File called temp.h5 with the following schema:

root/
└── temp: uint16, 1000 x 1  (temperature measurements over time)
temp = np.random.randint(15, 22, size=(1000,1)).astype(np.uint16)
with h5py.File('temp.h5', 'w') as f:
    f.create_dataset('temp', data=temp)
    f['temp'].attrs.update({
        'description': 'Temperature measurements over time'
    })

Exercise: Open up this temp.h5 file with the myHDF5 data browser and explore the data in the GUI. Does this look how you expected?

Solution

Exercise: Write an HDF5 File called ephys.h5 with the following schema and descriptions:

root/
├── time: float32, 1 x 1000 (trial time, in seconds)
├── voltage: float32, 4 x 1000 (voltage measurements for each recording channel)
└── chan_names: S, 4 (channel names for each recording channel)
t = np.linspace(0, 3, 1000).astype(np.float32)
voltage = np.random.normal(1, 1, size=(4, 1000)).astype(np.float32)
chan_names = ['CH01', 'CH02', 'CH03', 'CH05']
Solution
with h5py.File('ephys.h5', 'w') as f:
    f.create_dataset('time', data=t)
    f['time'].attrs.update({
        'description': 'Trial time, in seconds.'
    })
    f.create_dataset('voltage', data=voltage)
    f['voltage'].attrs.update({
        'description': 'voltage measurements for each recording channel.'
    })
    f.create_dataset('chan_names', data=chan_names)
    f['chan_names'].attrs.update({
        'description': 'channel names for each recording channel.'
    })

Exercise: Open up your ephys.h5 file with the myHDF5 data browser and explore your data in the GUI. Does this look how you expected?

Solution

Exercise: Write an HDF5 File called motion_tracking.h5 with the following schema

root/
├── session_date: str
├── subject_id: str
├── camera: 
│   ├── black_noise_image: uint16, 640 x 640 (reference image taken with lights out)
│   ├── image_width: uint16
│   ├── image_height: uint16
│   ├── shutter_speed: uint16
│   └── aperture: float32
│
└── motion_tracking: 
    ├── time: uint32 1 x 3000 (session time, in milliseconds)
    ├── rb_pos: float32  2 x 3 x 3000 (XYZ coordinates of the center of each tracked rigid body)
    ├── rb_rot: float32  2 x 3 x 3000 (XYZ Euler rotations of each tracked rigid body)
    ├── xyz_names: str 1 x 3 (The spatial coordinate names)
    └── rb_names: str 1 x 2 (The name of each rigid body)
session_date = '2024-04-22'
subject_id = 'AD11'
camera_black_noise_im = np.random.randint(0, 30, size=(640, 640)).astype(np.uint16)
im_width = 640
im_height = 640
shutter_speed = 800
aperture = 2.8
motion_time = (np.arange(0, 1000, step=1/shutter_speed)[:3000] * 1000).astype(np.uint32)
rb_pos = np.random.random(size=(2, 3, 3000)).astype(np.float32)
rb_rot = np.random.random(size=(2, 3, 3000)).astype(np.float32)
xyz_names = ['X', 'Y', 'Z']
rb_names = ['head', 'tail_base']
Solution
with h5py.File('motion_tracking.h5', 'w') as f:
    f.attrs.update({
        'session_date': session_date,
        'subject_id': subject_id,
    })
    f.create_dataset('camera/black_noise_image', data=camera_black_noise_im)
    f['camera'].attrs.update({
        'image_width': im_width,
        'image_height': im_height,
        'shutter_speed': shutter_speed,
        'aperture': aperture,
    })

    f.create_dataset('motion_tracking/time', data=motion_time)
    f.create_dataset('motion_tracking/rb_pos', data=rb_pos)
    f.create_dataset('motion_tracking/rb_rot', data=rb_rot)
    f.create_dataset('motion_tracking/rb_names', data=rb_names)
    f.create_dataset('motion_tracking/xyz_names', data=xyz_names)
    

Exercise: Open up your motion_tracking.h5 file with the myHDF5 data browser and explore your data in the GUI. Does this look how you expected?

Solution

Section 2: Reading Data from HDF5 Files

One of the strengths of HDF5 is that we do not need to load an entire dataset into memory; without loading all the data, we can:

  • Inspect file structure
  • Read selected datasets
  • Access metadata
  • Load only slices of large arrays

In this section, we practise navigating HDF5 files and reading data efficiently. We treat HDF5 files as structured containers and learn how to explore them programmatically.

Our goal is to become comfortable:

  • Inspecting file contents
  • Accessing nested groups
  • Reading partial data
  • Retrieving metadata stored as attributes
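One handy tool for exploring a file programmatically is `visititems`, which walks every group and dataset beneath a node. A minimal sketch (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Build a small file, then walk every group and dataset inside it.
with h5py.File('walk.h5', 'w') as f:
    f.create_dataset('a/b', data=np.zeros(3))
    f.create_dataset('c', data=np.ones(2))

def show(name, obj):
    kind = 'Group' if isinstance(obj, h5py.Group) else 'Dataset'
    print(f'{name}: {kind}')

with h5py.File('walk.h5', 'r') as f:
    f.visititems(show)   # prints a, a/b, and c with their kinds
```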

Reference

Code Description
f = h5py.File('file.h5') Opens an h5py file object for reading
f.close() Closes the h5py file and releases the linked file back to the operating system.
f.keys() See a list of datasets and groups at the root node
f.attrs Get the dict-like attributes at the root node
f.attrs['a'] Get the ‘a’ attribute at the root node
f['x'][:] Read in the ‘x’ dataset as a numpy array
f['x'][5:20] Read in a slice of the ‘x’ dataset as a numpy array
f['x'].keys() See a list of datasets and groups at the ‘x’ node
f['folder']['x'] Get the ‘x’ dataset in the ‘folder’ group
f['folder/x'] (Alternative Syntax) Get the ‘x’ dataset in the ‘folder’ group
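Putting the reading calls together, a small round-trip sketch (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write a small file so there is something to read back.
with h5py.File('reads.h5', 'w') as f:
    f.create_dataset('folder/x', data=np.arange(100))
    f['folder/x'].attrs['id'] = 'ABC'

with h5py.File('reads.h5', 'r') as f:
    part = f['folder/x'][5:20]         # only this slice is read from disk
    ident = f['folder/x'].attrs['id']  # metadata comes back as a dict-like lookup

print(part.shape)   # (15,)
print(ident)        # ABC
```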

Exercises

Example: From the temperature file, read in only the last 5 temperature measurements as a numpy array.

f = h5py.File('temp.h5')
temp = f['temp'][-5:, :]
f.close()
temp
array([[19],
       [15],
       [18],
       [15],
       [21]], dtype=uint16)

Exercise: From the ephys.h5 file, read in the first 10 voltage measurements as a numpy array.

Solution
f = h5py.File('ephys.h5')
volts = f['voltage'][:, :10]
f.close()
volts

Exercise: From the motion_tracking.h5 file, get all the XYZ positions of the first rigid body during the recording.

Solution
f = h5py.File('motion_tracking.h5')
pos = f['motion_tracking']['rb_pos'][0, :, :]
f.close()
pos
array([[3.3986607e-01, 2.6604038e-01, 8.7452948e-01, ..., 3.7884522e-02,
        5.9882879e-01, 1.7544292e-01],
       [8.1301540e-01, 3.5222918e-01, 6.6543019e-01, ..., 7.2900415e-04,
        5.3880429e-01, 2.0586400e-01],
       [7.3296028e-01, 1.1116295e-01, 9.4686520e-01, ..., 3.7584129e-01,
        5.3968757e-01, 4.0583685e-01]], shape=(3, 3000), dtype=float32)

Exercise: from the ephys.h5 file, load the name of the second recording channel. (Note: Because HDF5 stores byte arrays for strings, to get it as a single string, it can help to use the .asstr() method)

Solution
f = h5py.File('ephys.h5')
name = f['chan_names'].asstr()[1]
f.close()
name
'CH02'

Example: From the ephys.h5 file, load the description of the voltage dataset

f = h5py.File('ephys.h5')
desc = f['voltage'].attrs['description']
f.close()
desc

Exercise: From the motion tracking file, get the camera’s shutter speed during the session

Solution
f = h5py.File('motion_tracking.h5')
ss = f['camera'].attrs['shutter_speed']
f.close()
ss
np.int64(800)

Section 3: Compression in HDF5 — Trading CPU for Disk

When we store large datasets in HDF5, we can choose whether to compress them. Compression reduces file size, but it introduces trade-offs:

  • Smaller files
  • Potentially faster disk reads (less data transferred)
  • Increased CPU usage during compression and decompression

In this section, we will:

  • Create datasets with and without compression
  • Compare file sizes
  • Measure write time and read time
  • Observe CPU versus wall time trade-offs

Reference

Code Description
h5py.File() Open or create an HDF5 file
create_dataset() Create a dataset inside the file
compression="gzip" Enable gzip compression
compression_opts= Set gzip compression level
dataset[:] Read entire dataset
utils.bytes_written() Measure disk writes
array.nbytes Measure array size in RAM
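As a sketch of how the gzip level is set with `compression_opts` (levels run from 1, fastest, to 9, smallest; the file and dataset names here are illustrative):

```python
import h5py
import numpy as np

data = np.arange(1_000_000, dtype=np.int64)

# Store the same array twice, at two different gzip levels.
with h5py.File('levels.h5', 'w') as f:
    f.create_dataset('fast', data=data, compression='gzip', compression_opts=1)
    f.create_dataset('small', data=data, compression='gzip', compression_opts=9)

# The filter settings are stored in the file and can be inspected later.
with h5py.File('levels.h5', 'r') as f:
    for name in ('fast', 'small'):
        d = f[name]
        print(name, d.compression, d.compression_opts)
```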

Exercises

Exercise: “Writing an Uncompressed Dataset”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size?

data = np.arange(30_000_000, dtype=np.int64)
print("Size in RAM:", utils.format_bytes(data.nbytes))

dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()

with h5py.File("data.h5", "w") as f:
    f.create_dataset("data", data=data)

t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()

print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
Size in RAM: 240.00 MB
Disk written: 240.00 MB
Wall time: 0.18284069999936037
CPU time: 0.125

Exercise: “Writing Structured Data with Gzip Compression”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size? What changed from the uncompressed case?

data = np.arange(30_000_000, dtype=np.int64)
print("Size in RAM:", utils.format_bytes(data.nbytes))


dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()

with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        compression="gzip"
    )

t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()

print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
Size in RAM: 240.00 MB
Disk written: 45.53 MB
Wall time: 1.7229763999930583
CPU time: 1.640625

Exercise: “Writing Random Noise with Gzip Compression”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size? What changed from the structured case? Is it always worth it to compress data?

data = np.random.rand(30_000, 1_000)
print("Size in RAM:", utils.format_bytes(data.nbytes))



dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()

with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        compression="gzip"
    )

t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()

print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
Size in RAM: 240.00 MB
Disk written: 226.58 MB
Wall time: 6.494938899995759
CPU time: 6.203125
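The exercises above time only writes. Reading shows the other side of the trade-off: compressed data must also be decompressed on load. A sketch along the same lines (a smaller array and hypothetical file names, to keep it quick):

```python
import time
import h5py
import numpy as np

data = np.arange(3_000_000, dtype=np.int64)

# One uncompressed copy, one gzip-compressed copy.
with h5py.File('plain.h5', 'w') as f:
    f.create_dataset('data', data=data)
with h5py.File('gz.h5', 'w') as f:
    f.create_dataset('data', data=data, compression='gzip')

for fname in ('plain.h5', 'gz.h5'):
    t0 = time.perf_counter()
    with h5py.File(fname, 'r') as f:
        out = f['data'][:]  # reads (and, for gzip, decompresses) the full dataset
    t1 = time.perf_counter()
    print(fname, 'read in', t1 - t0, 's')
```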