Structured Scientific Data with HDF5: Design, Access, and Compression
# %pip install h5py numpy

import h5py
import numpy as np
import time
HDF5 is one of the most widely used scientific data formats. Many neuroscience tools, analysis pipelines, and standardised formats rely on it internally. In this notebook, we explore how HDF5 works at a practical level using the h5py library, which provides a NumPy-friendly interface to HDF5 files.
Our goals are to:
Understand how to store structured data and metadata
Read data efficiently and selectively
Explore how compression affects disk usage and performance
Throughout, we focus on design decisions: how we organise data, how we represent it, and how those choices influence performance.
Setup
Utility Functions
import os
import psutil
def _format_bytes(bytes: float, precision: int = 2) -> str:
    """
    Takes a number of bytes and returns a more human-readable string (e.g. "1.50 MB").
    Looking to do this in a real project? Some alternatives:
    - `humanfriendly`: https://pypi.org/project/humanfriendly/#getting-started
    """
    if bytes < 0:
        raise ValueError("bytes must be non-negative")
    units = [("KB", 1000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]
    for unit, scale in reversed(units):
        if bytes >= scale:
            value = bytes / scale
            return f"{value:.{precision}f} {unit}"
    return f"{bytes} B"
def _disk_read() -> int:
    return psutil.Process(os.getpid()).io_counters().read_bytes

def _disk_write() -> int:
    return psutil.Process(os.getpid()).io_counters().write_bytes
class utils:
    format_bytes = _format_bytes
    bytes_read = _disk_read
    bytes_written = _disk_write

Section 1: Writing Mixed-Type, Labelled Data and Metadata to HDF5 Files using h5py
HDF5 allows us to move beyond simple array storage. Instead of writing one array per file, we can:
- Store multiple datasets in a single file
- Organise data hierarchically using groups
- Attach metadata directly to datasets
- Combine numeric arrays with strings and other data types
This makes HDF5 suitable for structured scientific data, where measurements, metadata, and experimental parameters belong together.
In this section, we focus on:
- Creating datasets
- Organising them into groups
- Adding metadata using attributes
- Designing clear, interpretable file structures
The emphasis is not on complexity, but on clarity. We aim to create files that are both structured and understandable.
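As a quick preview of the ideas above, here is a minimal sketch of a structured file. The file name, group names, and attribute keys are purely illustrative, not part of the exercises below.

```python
import h5py
import numpy as np

# Illustrative names only: a session file combining arrays, strings, and metadata.
with h5py.File("example_session.h5", "w") as f:
    f.attrs["experimenter"] = "A. Researcher"              # metadata on the root group
    grp = f.create_group("recording")                      # hierarchical organisation
    grp.create_dataset("signal", data=np.zeros(100, dtype=np.float32))
    grp.create_dataset("labels", data=["baseline", "stimulus"])  # string data
    grp["signal"].attrs["units"] = "mV"                    # metadata attached to a dataset

with h5py.File("example_session.h5", "r") as f:
    print(f["recording/signal"].attrs["units"])            # prints: mV
```

Everything the exercises below practise (datasets, groups, attributes, mixed types) appears here in miniature.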
Reference
| Code | Description |
|---|---|
| `f = h5py.File('filename.h5', 'w')` | Open an h5py file object for writing |
| `f.close()` | Closes the h5py file and releases the linked file back to the operating system. |
| `f.create_dataset('temp', data=x)` | Write an array called 'temp' with the data in the numpy array `x` into the HDF5 file |
| `f.create_dataset('data/temp', data=x)` | Write an array called 'temp' in the group called 'data' with the data in the numpy array `x` into the HDF5 file |
| `f.attrs['name'] = 'Session 1'` | Set an attribute as metadata on the root group of the HDF5 file; this works like a normal Python dictionary |
| `f['x'].attrs['id'] = 'ABC'` | Set an attribute as metadata on the 'x' node of the HDF5 file |
Exercises
Example: Write an HDF5 File called temp.h5 with the following schema:

root/
└── temp: uint16, 1000 x 1 (temperature measurements over time)

temp = np.random.randint(15, 22, size=(1000, 1)).astype(np.uint16)

with h5py.File('temp.h5', 'w') as f:
    f.create_dataset('temp', data=temp)
    f['temp'].attrs.update({
        'description': 'Temperature measurements over time'
    })

Exercise: Open up this temp.h5 file with the myHDF5 data browser and explore the data in the GUI. Does this look how you expected?
Solution
Exercise: Write an HDF5 File called ephys.h5 with the following schema and descriptions:

root/
├── time: float32, 1 x 1000 (trial time, in seconds)
├── voltage: float32, 4 x 1000 (voltage measurements for each recording channel)
└── chan_names: S, 4 (channel names for each recording channel)

t = np.linspace(0, 3, 1000).astype(np.float32)
voltage = np.random.normal(1, 1, size=(4, 1000)).astype(np.float32)
chan_names = ['CH01', 'CH02', 'CH03', 'CH05']

Solution
with h5py.File('ephys.h5', 'w') as f:
    f.create_dataset('time', data=t)
    f['time'].attrs.update({
        'description': 'Trial time, in seconds.'
    })
    f.create_dataset('voltage', data=voltage)
    f['voltage'].attrs.update({
        'description': 'Voltage measurements for each recording channel.'
    })
    f.create_dataset('chan_names', data=chan_names)
    f['chan_names'].attrs.update({
        'description': 'Channel names for each recording channel.'
    })

Exercise: Open up your ephys.h5 file with the myHDF5 data browser and explore your data in the GUI. Does this look how you expected?
Solution
Exercise: Write an HDF5 File called motion_tracking.h5 with the following schema

root/
├── session_date: str
├── subject_id: str
├── camera:
│   ├── black_noise_image: uint16, 640 x 640 (reference image taken with lights out)
│   ├── image_width: uint16
│   ├── image_height: uint16
│   ├── shutter_speed: uint16
│   └── aperture: float32
│
└── motion_tracking:
    ├── time: uint32, 1 x 3000 (session time, in milliseconds)
    ├── rb_pos: float32, 2 x 3 x 3000 (XYZ coordinates of the center of each tracked rigid body)
    ├── rb_rot: float32, 2 x 3 x 3000 (XYZ Euler rotations of each tracked rigid body)
    ├── xyz_names: str, 1 x 3 (the spatial coordinate names)
    └── rb_names: str, 1 x 2 (the name of each rigid body)

session_date = '2024-04-22'
subject_id = 'AD11'
camera_black_noise_im = np.random.randint(0, 30, size=(640, 640)).astype(np.uint16)
im_width = 640
im_height = 640
shutter_speed = 800
aperture = 2.8
motion_time = (np.arange(0, 1000, step=1/shutter_speed)[:3000] * 1000).astype(np.uint32)
rb_pos = np.random.random(size=(2, 3, 3000)).astype(np.float32)
rb_rot = np.random.random(size=(2, 3, 3000)).astype(np.float32)
xyz_names = ['X', 'Y', 'Z']
rb_names = ['head', 'tail_base']

Solution
with h5py.File('motion_tracking.h5', 'w') as f:
    f.attrs.update({
        'session_date': session_date,
        'subject_id': subject_id,
    })
    f.create_dataset('camera/black_noise_image', data=camera_black_noise_im)
    f['camera'].attrs.update({
        'image_width': im_width,
        'image_height': im_height,
        'shutter_speed': shutter_speed,
        'aperture': aperture,
    })
    f.create_dataset('motion_tracking/time', data=motion_time)
    f.create_dataset('motion_tracking/rb_pos', data=rb_pos)
    f.create_dataset('motion_tracking/rb_rot', data=rb_rot)
    f.create_dataset('motion_tracking/rb_names', data=rb_names)
    f.create_dataset('motion_tracking/xyz_names', data=xyz_names)

Exercise: Open up your motion_tracking.h5 file with the myHDF5 data browser and explore your data in the GUI. Does this look how you expected?
Solution
Section 2: Reading Data from HDF5 Files
One of the strengths of HDF5 is that we do not need to load an entire file into memory. Without loading all the data, we can:
- Inspect file structure
- Read selected datasets
- Access metadata
- Load only slices of large arrays
In this section, we practise navigating HDF5 files and reading data efficiently. We treat HDF5 files as structured containers and learn how to explore them programmatically.
Our goal is to become comfortable:
- Inspecting file contents
- Accessing nested groups
- Reading partial data
- Retrieving metadata stored as attributes
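One handy way to inspect a file's contents programmatically is `visititems`, which walks every group and dataset without reading any array data. The file and dataset names below are illustrative, not from the exercises.

```python
import h5py
import numpy as np

# Build a small throwaway file to walk (names are illustrative).
with h5py.File("inspect_me.h5", "w") as f:
    f.create_dataset("data/x", data=np.arange(10))
    f.create_dataset("data/y", data=np.arange(5))

# visititems calls the given function once per group and dataset,
# without loading any array contents into memory.
with h5py.File("inspect_me.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, type(obj).__name__))
```

This prints one line per node (here: `data`, `data/x`, `data/y`), which is a quick way to see a file's structure before deciding what to read.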
Reference
| Code | Description |
|---|---|
| `f = h5py.File('file.h5')` | Opens an h5py file object for reading |
| `f.close()` | Closes the h5py file and releases the linked file back to the operating system. |
| `f.keys()` | See a list of datasets and groups at the root node |
| `f.attrs` | Get the dict-like attributes at the root node |
| `f.attrs['a']` | Get the 'a' attribute at the root node |
| `f['x'][:]` | Read in the 'x' dataset as a numpy array |
| `f['x'][5:20]` | Read in a slice of the 'x' dataset as a numpy array |
| `f['x'].keys()` | See a list of datasets and groups at the 'x' node |
| `f['folder']['x']` | Get the 'x' dataset in the 'folder' group |
| `f['folder/x']` | (Alternative syntax) Get the 'x' dataset in the 'folder' group |
Exercises
Example: From the temperature file, read in only the last 5 temperature measurements as a numpy array.

f = h5py.File('temp.h5')
temp = f['temp'][-5:, :]
f.close()
temp

array([[19],
       [15],
       [18],
       [15],
       [21]], dtype=uint16)

Exercise: From the ephys.h5 file, read in the first 10 voltage measurements as a numpy array.
Solution
f = h5py.File('ephys.h5')
volts = f['voltage'][:, :10]
f.close()
volts

Exercise: From the motion_tracking.h5 file, get all the XYZ positions of the first rigid body during the recording.
Solution
f = h5py.File('motion_tracking.h5')
pos = f['motion_tracking']['rb_pos'][0, :, :]
f.close()
pos

array([[3.3986607e-01, 2.6604038e-01, 8.7452948e-01, ..., 3.7884522e-02,
        5.9882879e-01, 1.7544292e-01],
       [8.1301540e-01, 3.5222918e-01, 6.6543019e-01, ..., 7.2900415e-04,
        5.3880429e-01, 2.0586400e-01],
       [7.3296028e-01, 1.1116295e-01, 9.4686520e-01, ..., 3.7584129e-01,
        5.3968757e-01, 4.0583685e-01]], shape=(3, 3000), dtype=float32)

Exercise: From the ephys.h5 file, load the name of the second recording channel. (Note: Because HDF5 stores strings as byte arrays, it can help to use the .asstr() method to get it back as a single Python string.)
Solution
f = h5py.File('ephys.h5')
name = f['chan_names'].asstr()[1]
f.close()
name

'CH02'

Example: From the ephys.h5 file, load the description of the voltage dataset.

f = h5py.File('ephys.h5')
desc = f['voltage'].attrs['description']
f.close()
desc

Exercise: From the motion tracking file, get the camera’s shutter speed during the session.
Solution
f = h5py.File('motion_tracking.h5')
ss = f['camera'].attrs['shutter_speed']
f.close()
ss

np.int64(800)

Section 3: Compression in HDF5 — Trading CPU for Disk
When we store large datasets in HDF5, we can choose whether to compress them. Compression reduces file size, but it introduces trade-offs:
- Smaller files
- Potentially faster disk reads (less data transferred)
- Increased CPU usage during compression and decompression
In this section, we will:
- Create datasets with and without compression
- Compare file sizes
- Measure write time and read time
- Observe CPU versus wall time trade-offs
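Before the timed exercises, a small sketch of the size trade-off alone can be useful. The snippet below compares on-disk file size at a few gzip levels; the file name `level_test.h5` is a throwaway name for this illustration.

```python
import os
import h5py
import numpy as np

data = np.arange(1_000_000, dtype=np.int64)   # a highly compressible ramp
sizes = {}
for level in (None, 1, 9):
    # level None means no compression; 1 is fastest gzip, 9 is smallest
    opts = {} if level is None else {"compression": "gzip", "compression_opts": level}
    with h5py.File("level_test.h5", "w") as f:
        f.create_dataset("data", data=data, **opts)
    sizes[level] = os.path.getsize("level_test.h5")
print(sizes)
```

For structured data like this ramp, even the fastest gzip level shrinks the file dramatically; the timed exercises below show what that costs in CPU time.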
Reference
| Code | Description |
|---|---|
| `h5py.File()` | Open or create an HDF5 file |
| `create_dataset()` | Create a dataset inside the file |
| `compression="gzip"` | Enable gzip compression |
| `compression_opts=` | Set the gzip compression level (0–9) |
| `dataset[:]` | Read an entire dataset |
| `utils.bytes_written()` | Measure disk writes |
| `array.nbytes` | Measure array size in RAM |
Exercises
Exercise: “Writing an Uncompressed Dataset”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size?
data = np.arange(30_000_000, dtype=np.int64)
print("Size in RAM:", utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()
with h5py.File("data.h5", "w") as f:
    f.create_dataset("data", data=data)
t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()
print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)

Size in RAM: 240.00 MB
Disk written: 240.00 MB
Wall time: 0.18284069999936037
CPU time: 0.125

Exercise: “Writing Structured Data with Gzip Compression”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size? What changed from the uncompressed case?
data = np.arange(30_000_000, dtype=np.int64)
print("Size in RAM:", utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()
with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        compression="gzip"
    )
t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()
print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)

Size in RAM: 240.00 MB
Disk written: 45.53 MB
Wall time: 1.7229763999930583
CPU time: 1.640625

Exercise: “Writing Random Noise with Gzip Compression”. Please run the code below, and look at the printed performance. Does disk usage closely match RAM size? What changed from the structured case? Is it always worth it to compress data?
data = np.random.rand(30_000, 1_000)
print("Size in RAM:", utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
t0_wall = time.perf_counter()
t0_cpu = time.process_time()
with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "data",
        data=data,
        compression="gzip"
    )
t1_wall = time.perf_counter()
t1_cpu = time.process_time()
dr1 = utils.bytes_written()
print("Disk written:", utils.format_bytes(dr1 - dr0))
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)

Size in RAM: 240.00 MB
Disk written: 226.58 MB
Wall time: 6.494938899995759
CPU time: 6.203125
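The section intro also mentions measuring read time, which the exercises above do not cover. As a sketch (recreating a compressed file under the same `data.h5` name so the cell is self-contained), a timed read might look like this:

```python
import time
import h5py
import numpy as np

# Recreate a compressed file so this cell stands alone.
data = np.arange(5_000_000, dtype=np.int64)
with h5py.File("data.h5", "w") as f:
    f.create_dataset("data", data=data, compression="gzip")

t0_wall = time.perf_counter()
t0_cpu = time.process_time()
with h5py.File("data.h5", "r") as f:
    out = f["data"][:]   # decompression happens during the read
t1_wall = time.perf_counter()
t1_cpu = time.process_time()
print("Wall time:", t1_wall - t0_wall)
print("CPU time:", t1_cpu - t0_cpu)
```

As with writes, reading compressed data trades CPU time (decompression) for less data transferred from disk; whether that is a net win depends on the data and the storage device.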