Structured Scientific Data with HDF5: Design, Access, and Compression

Author
Dr. Nicholas Del Grosso

# %pip install h5py numpy
import h5py
import numpy as np
import time

HDF5 is one of the most widely used scientific data formats. Many neuroscience tools, analysis pipelines, and standardised formats rely on it internally. In this notebook, we explore how HDF5 works at a practical level using the h5py library, which provides a NumPy-friendly interface to HDF5 files.

Our goals are to:

Understand how to store structured data and metadata

Read data efficiently and selectively

Explore how compression affects disk usage and performance

Throughout, we focus on design decisions: how we organise data, how we represent it, and how those choices influence performance.

Setup

Utility Functions

import os
import psutil

def _format_bytes(bytes: float, precision: int = 2) -> str:
    """
    Takes a number of bytes and returns a more human-readable string (e.g. '1.23 MB').

    Looking to do this in a real project?  Some alternatives:
      - `humanfriendly`: https://pypi.org/project/humanfriendly/#getting-started
    """

    if bytes < 0:
        raise ValueError("bytes must be non-negative")

    units = [("KB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]

    for unit, scale in reversed(units):
        if bytes >= scale:
            value = bytes / scale
            return f"{value:.{precision}f} {unit}"

    return f"{bytes} B"
    

def _disk_read() -> int:
    return psutil.Process(os.getpid()).io_counters().read_bytes

def _disk_write() -> int:
    return psutil.Process(os.getpid()).io_counters().write_bytes


class utils:
    format_bytes = _format_bytes
    bytes_read = _disk_read
    bytes_written = _disk_write
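A quick check of what the formatter produces. The scaling logic is duplicated here (as a standalone copy of `utils.format_bytes`) so the cell runs on its own:

```python
def format_bytes(n_bytes: float, precision: int = 2) -> str:
    # Same scaling logic as utils.format_bytes above, standalone for illustration.
    units = [("KB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]
    for unit, scale in reversed(units):
        if n_bytes >= scale:
            return f"{n_bytes / scale:.{precision}f} {unit}"
    return f"{n_bytes} B"

print(format_bytes(512))          # 512 B
print(format_bytes(1_234_567))    # 1.23 MB
print(format_bytes(240_000_000))  # 240.00 MB
```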

Section 1: Writing Mixed-Type, Labelled Data and Metadata to HDF5 Files using h5py

HDF5 allows us to move beyond simple array storage. Instead of writing one array per file, we can:

  • Store multiple datasets in a single file
  • Organise data hierarchically using groups
  • Attach metadata directly to datasets
  • Combine numeric arrays with strings and other data types

This makes HDF5 suitable for structured scientific data, where measurements, metadata, and experimental parameters belong together.

In this section, we focus on:

  • Creating datasets
  • Organising them into groups
  • Adding metadata using attributes
  • Designing clear, interpretable file structures

The emphasis is not on complexity, but on clarity. We aim to create files that are both structured and understandable.

Reference

Code Description
f = h5py.File('filename.h5', 'w') Open an h5py file object for writing
f.close() Closes the h5py file and releases the linked file back to the operating system.
f.create_dataset('temp', data=x) Write an array called ‘temp’ with the data in the numpy array x into the HDF5 file
f.create_dataset('data/temp', data=x) Write an array called ‘temp’ in the group called ‘data’ with the data in the numpy array x into the HDF5 file
f.attrs['name'] = 'Session 1' Set an attribute as metadata on the root group of the HDF5 file – this works like a normal Python dictionary
f['x'].attrs['id'] = 'ABC' Set an attribute as metadata on the ‘x’ node of the HDF5 file
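The calls above compose into a short round-trip sketch (the file, dataset, and attribute names here are just illustrative):

```python
import h5py
import numpy as np

x = np.arange(10, dtype=np.float32)

with h5py.File('sketch.h5', 'w') as f:
    f.attrs['name'] = 'Session 1'            # metadata on the root group
    f.create_dataset('data/temp', data=x)    # creates the 'data' group implicitly
    f['data/temp'].attrs['units'] = 'degC'   # metadata attached to the dataset

with h5py.File('sketch.h5', 'r') as f:
    print(list(f.keys()))                    # ['data']
    print(f['data/temp'].attrs['units'])     # degC
```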

Exercises

Example: Write an HDF5 File called temp.h5 with the following schema:

root/
└── temp: uint16, 1000 x 1  (temperature measurements over time)
temp = np.random.randint(15, 22, size=(1000,1)).astype(np.uint16)
with h5py.File('temp.h5', 'w') as f:
    f.create_dataset('temp', data=temp)
    f['temp'].attrs.update({
        'description': 'Temperature measurements over time'
    })

Exercise: Open up this temp.h5 file with the myHDF5 data browser and explore the data in the GUI. Does this look how you expected?

Solution

Exercise: Write an HDF5 File called ephys.h5 with the following schema and descriptions:

root/
├── time: float32, 1 x 1000 (trial time, in seconds)
├── voltage: float32, 4 x 1000 (voltage measurements for each recording channel)
└── chan_names: S, 4 (channel names for each recording channel)
t = np.linspace(0, 3, 1000).astype(np.float32)
voltage = np.random.normal(1, 1, size=(4, 1000)).astype(np.float32)
chan_names = ['CH01', 'CH02', 'CH03', 'CH05']
Solution
with h5py.File('ephys.h5', 'w') as f:
    f.create_dataset('time', data=t)
    f['time'].attrs.update({
        'description': 'Trial time, in seconds.'
    })
    f.create_dataset('voltage', data=voltage)
    f['voltage'].attrs.update({
        'description': 'voltage measurements for each recording channel.'
    })
    f.create_dataset('chan_names', data=chan_names)
    f['chan_names'].attrs.update({
        'description': 'channel names for each recording channel.'
    })

Exercise: Open up your ephys.h5 file with the myHDF5 data browser and explore your data in the GUI. Does this look how you expected?

Solution

Exercise: Write an HDF5 File called motion_tracking.h5 with the following schema

root/
├── session_date: str
├── subject_id: str
├── camera: 
│   ├── black_noise_image: uint16, 640 x 640 (reference image taken with lights out)
│   ├── image_width: uint16
│   ├── image_height: uint16
│   ├── shutter_speed: uint16
│   └── aperture: float32
│
└── motion_tracking: 
    ├── time: uint32 1 x 3000 (session time, in milliseconds)
    ├── rb_pos: float32  2 x 3 x 3000 (XYZ coordinates of the center of each tracked rigid body)
    ├── rb_rot: float32  2 x 3 x 3000 (XYZ Euler rotations of each tracked rigid body)
    ├── xyz_names: str 1 x 3 (The spatial coordinate names)
    └── rb_names: str 1 x 2 (The name of each rigid body)
session_date = '2024-04-22'
subject_id = 'AD11'
camera_black_noise_im = np.random.randint(0, 30, size=(640, 640)).astype(np.uint16)
im_width = 640
im_height = 640
shutter_speed = 800
aperture = 2.8
motion_time = (np.arange(0, 1000, step=1/shutter_speed)[:3000] * 1000).astype(np.uint32)
rb_pos = np.random.random(size=(2, 3, 3000)).astype(np.float32)
rb_rot = np.random.random(size=(2, 3, 3000)).astype(np.float32)
xyz_names = ['X', 'Y', 'Z']
rb_names = ['head', 'tail_base']
Solution
with h5py.File('motion_tracking.h5', 'w') as f:
    f.attrs.update({
        'session_date': session_date,
        'subject_id': subject_id,
    })
    f.create_dataset('camera/black_noise_image', data=camera_black_noise_im)
    f['camera'].attrs.update({
        'image_width': im_width,
        'image_height': im_height,
        'shutter_speed': shutter_speed,
        'aperture': aperture,
    })

    f.create_dataset('motion_tracking/time', data=motion_time)
    f.create_dataset('motion_tracking/rb_pos', data=rb_pos)
    f.create_dataset('motion_tracking/rb_rot', data=rb_rot)
    f.create_dataset('motion_tracking/rb_names', data=rb_names)
    f.create_dataset('motion_tracking/xyz_names', data=xyz_names)
    

Exercise: Open up your motion_tracking.h5 file with the myHDF5 data browser and explore your data in the GUI. Does this look how you expected?

Solution

Section 2: Reading Data from HDF5 Files

One of the strengths of HDF5 is that we do not need to load an entire dataset into memory; without loading all the data, we can:

  • Inspect file structure
  • Read selected datasets
  • Access metadata
  • Load only slices of large arrays

In this section, we practise navigating HDF5 files and reading data efficiently. We treat HDF5 files as structured containers and learn how to explore them programmatically.

Our goal is to become comfortable:

  • Inspecting file contents
  • Accessing nested groups
  • Reading partial data
  • Retrieving metadata stored as attributes
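One handy tool for exploring a file programmatically is `visititems`, which walks every group and dataset beneath a node. A minimal sketch (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Build a small file, then walk every group and dataset inside it.
with h5py.File('walk.h5', 'w') as f:
    f.create_dataset('a/b', data=np.zeros(3))
    f.create_dataset('c', data=np.ones(2))

def show(name, obj):
    kind = 'Group' if isinstance(obj, h5py.Group) else 'Dataset'
    print(f'{name}: {kind}')

with h5py.File('walk.h5', 'r') as f:
    f.visititems(show)   # prints a, a/b, and c with their kinds
```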

Reference

Code Description
f = h5py.File('file.h5') Opens an h5py file object for reading
f.close() Closes the h5py file and releases the linked file back to the operating system.
f.keys() See a list of datasets and groups at the root node
f.attrs Get the dict-like attributes at the root node
f.attrs['a'] Get the ‘a’ attribute at the root node
f['x'][:] Read in the ‘x’ dataset as a numpy array
f['x'][5:20] Read in a slice of the ‘x’ dataset as a numpy array
f['x'].keys() See a list of datasets and groups at the ‘x’ node
f['folder']['x'] Get the ‘x’ dataset in the ‘folder’ group
f['folder/x'] (Alternative Syntax) Get the ‘x’ dataset in the ‘folder’ group
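Putting the reading calls together, a small round-trip sketch (the file and dataset names are illustrative):

```python
import h5py
import numpy as np

# Write a small file so there is something to read back.
with h5py.File('reads.h5', 'w') as f:
    f.create_dataset('folder/x', data=np.arange(100))
    f['folder/x'].attrs['id'] = 'ABC'

with h5py.File('reads.h5', 'r') as f:
    part = f['folder/x'][5:20]         # only this slice is read from disk
    ident = f['folder/x'].attrs['id']  # metadata comes back as a dict-like lookup

print(part.shape)   # (15,)
print(ident)        # ABC
```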

Exercises

Example: From the temperature file, read in only the last 5 temperature measurements as a numpy array.

f = h5py.File('temp.h5')
temp = f['temp'][-5:, :]
f.close()
temp
array([[19],
       [15],
       [18],
       [15],
       [21]], dtype=uint16)

Exercise: From the ephys.h5 file, read in the first 10 voltage measurements as a numpy array.

Solution
f = h5py.File('ephys.h5')
volts = f['voltage'][:, :10]
f.close()
volts

Exercise: From the motion_tracking.h5 file, get all the XYZ positions of the first rigid body during the recording.

Solution
f = h5py.File('motion_tracking.h5')
pos = f['motion_tracking']['rb_pos'][0, :, :]
f.close()
pos
array([[3.3986607e-01, 2.6604038e-01, 8.7452948e-01, ..., 3.7884522e-02,
        5.9882879e-01, 1.7544292e-01],
       [8.1301540e-01, 3.5222918e-01, 6.6543019e-01, ..., 7.2900415e-04,
        5.3880429e-01, 2.0586400e-01],
       [7.3296028e-01, 1.1116295e-01, 9.4686520e-01, ..., 3.7584129e-01,
        5.3968757e-01, 4.0583685e-01]], shape=(3, 3000), dtype=float32)

Exercise: from the ephys.h5 file, load the name of the second recording channel. (Note: Because HDF5 stores byte arrays for strings, to get it as a single string, it can help to use the .asstr() method)

Solution
f = h5py.File('ephys.h5')
name = f['chan_names'].asstr()[1]
f.close()
name
'CH02'

Example: From the ephys.h5 file, load the description of the voltage dataset

f = h5py.File('ephys.h5')
desc = f['voltage'].attrs['description']
f.close()
desc

Exercise: From the motion tracking file, get the camera’s shutter speed during the session

Solution
f = h5py.File('motion_tracking.h5')
ss = f['camera'].attrs['shutter_speed']
f.close()
ss
np.int64(800)

Section 3: Compression in HDF5 — Trading CPU for Disk

When we store large datasets in HDF5, we can choose whether to compress them. Compression reduces file size, but it introduces trade-offs:

  • Smaller files
  • Potentially faster disk reads (less data transferred)
  • Increased CPU usage during compression and decompression

In this section, we will:

  • Create datasets with and without compression
  • Compare file sizes
  • Measure write time and read time
  • Observe CPU versus wall time trade-offs

Reference

Code Description
h5py.File() Open or create an HDF5 file
create_dataset() Create a dataset inside the file
compression="gzip" Enable gzip compression
compression_opts= Set gzip compression level
dataset[:] Read entire dataset
utils.bytes_written() Measure disk writes
array.nbytes Measure array size in RAM
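As a sketch of how the gzip level is set with `compression_opts` (levels run from 1, fastest, to 9, smallest; the file and dataset names here are illustrative):

```python
import h5py
import numpy as np

data = np.arange(1_000_000, dtype=np.int64)

# Store the same array twice, at two different gzip levels.
with h5py.File('levels.h5', 'w') as f:
    f.create_dataset('fast', data=data, compression='gzip', compression_opts=1)
    f.create_dataset('small', data=data, compression='gzip', compression_opts=9)

# The filter settings are stored in the file and can be inspected later.
with h5py.File('levels.h5', 'r') as f:
    for name in ('fast', 'small'):
        d = f[name]
        print(name, d.compression, d.compression_opts)
```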

Exercises

Exercise: “Writing an Uncompressed Dataset”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size?

data = np.arange(30_000_000, dtype=np.int64)
print("Size in RAM:", utils.format_bytes(data.nbytes))

dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()

with h5py.File("data.h5", "w") as f:
    f.create_dataset("data", data=data)

t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()

print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
Size in RAM: 240.00 MB
Disk written: 240.00 MB
Wall time: 0.18284069999936037
CPU time: 0.125

Exercise: “Writing Structured Data with Gzip Compression”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size? What changed from the uncompressed case?

data = np.arange(30_000_000, dtype=np.int64)
print("Size in RAM:", utils.format_bytes(data.nbytes))


dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()

with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        compression="gzip"
    )

t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()

print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
Size in RAM: 240.00 MB
Disk written: 45.53 MB
Wall time: 1.7229763999930583
CPU time: 1.640625

Exercise: “Writing Random Noise with Gzip Compression”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size? What changed from the structured case? Is it always worth it to compress data?

data = np.random.rand(30_000, 1_000)
print("Size in RAM:", utils.format_bytes(data.nbytes))



dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()

with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        compression="gzip"
    )

t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()

print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
Size in RAM: 240.00 MB
Disk written: 226.58 MB
Wall time: 6.494938899995759
CPU time: 6.203125
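The exercises above time only writes. Reading shows the other side of the trade-off: compressed data must also be decompressed on load. A sketch along the same lines (a smaller array and hypothetical file names, to keep it quick):

```python
import time
import h5py
import numpy as np

data = np.arange(3_000_000, dtype=np.int64)

# One uncompressed copy, one gzip-compressed copy.
with h5py.File('plain.h5', 'w') as f:
    f.create_dataset('data', data=data)
with h5py.File('gz.h5', 'w') as f:
    f.create_dataset('data', data=data, compression='gzip')

for fname in ('plain.h5', 'gz.h5'):
    t0 = time.perf_counter()
    with h5py.File(fname, 'r') as f:
        out = f['data'][:]  # reads (and, for gzip, decompresses) the full dataset
    t1 = time.perf_counter()
    print(fname, 'read in', t1 - t0, 's')
```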