Data Representation and Disk IO: Performance Beyond RAM
import numpy as np
from numpy import testing as npt
from scipy import sparse
In this session, we extend our understanding of performance beyond CPU and memory. Many scientific workflows become slow not because of computation, but because of how we store and retrieve data. Here we examine:
- How file formats affect size and speed
- How precision choices influence storage
- How disk activity differs from memory activity
We will measure what happens when we write arrays to disk, compare text and binary formats, and explore how data types determine both memory usage and file size. Throughout, we focus on making informed design decisions rather than relying on defaults.
Setup
Utility Functions
import os
import psutil
def _format_bytes(bytes: float, precision: int = 2) -> str:
    """
    Takes a number of bytes and returns a more human-readable string (e.g. '1.60 MB').
    Looking to do this in a real project? Some alternatives:
    - `humanfriendly`: https://pypi.org/project/humanfriendly/#getting-started
    """
    if bytes < 0:
        raise ValueError("bytes must be non-negative")
    units = [("KB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]
    for unit, scale in reversed(units):
        if bytes >= scale:
            return f"{bytes / scale:.{precision}f} {unit}"
    return f"{bytes} B"

def _disk_read() -> int:
    return psutil.Process(os.getpid()).io_counters().read_bytes

def _disk_write() -> int:
    return psutil.Process(os.getpid()).io_counters().write_bytes
class utils:
    format_bytes = _format_bytes
    bytes_read = _disk_read
    bytes_written = _disk_write

Section 1: Reading Should Be Simple: Text vs Binary File Formats
When we save arrays to disk, we choose a representation. That choice affects:
- File size
- Write speed
- Read speed
- CPU usage during IO (Reading and Writing)
In this section, we compare binary storage (.npy) with text-based storage (.txt). We measure disk writes directly and observe how closely file size reflects the array’s size in RAM. Our goal is to understand what is predictable, what is expensive, and why.
We will:
- Measure memory usage with .nbytes
- Measure disk writes using psutil
- Compare CPU time and wall time
- Examine how formatting affects text output
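As a quick preview of the measurements ahead, we can compare an array's .nbytes to the size of the .npy file it produces. This is a minimal sketch; the temp-file path is just a throwaway location for illustration.

```python
import os
import tempfile
import numpy as np

# A .npy file on disk should be the array's raw bytes plus a small header.
data = np.arange(10_000, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "check.npy")
np.save(path, data)
overhead = os.path.getsize(path) - data.nbytes
print(data.nbytes)  # 80000 bytes of data in RAM
print(overhead)     # header overhead, typically ~128 bytes
```

The near-zero overhead is why binary formats are so predictable: file size tracks .nbytes almost exactly.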
Reference
| Code | Description |
|---|---|
np.arange() |
Create arrays of controlled size and dtype for experiments |
array.nbytes |
Number of bytes used by the array in RAM |
np.save() |
Save array to binary .npy format |
np.savetxt() |
Save array to text file |
np.savetxt(fmt=...) |
Control formatting of values when writing text |
%time |
Measure CPU time and wall time in notebooks |
utils.bytes_written() |
Measure bytes written to disk by this process |
utils.format_bytes() |
Convert byte counts into human-readable units |
Exercises
Writing to Binary NPY files with np.save()
The np.save() function is very simple: it puts two things into an .npy file:
- Writes a text header explaining the size, shape, and dtype of the array stored.
- Writes the array data in binary (a.k.a. “bytes”) format.
That should make np.save() predictable and straightforward. Let’s try it out!
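The two-part layout can be seen directly by reading the first bytes of a saved file: a magic string identifying the format, then a small text header describing the dtype, byte order, and shape. A small sketch (the temp path is just for illustration):

```python
import os
import tempfile
import numpy as np

arr = np.arange(5, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "peek.npy")
np.save(path, arr)

with open(path, "rb") as f:
    start = f.read(80)

print(start[:6])   # b'\x93NUMPY' -- the .npy magic string
print(start[10:])  # text header, e.g. {'descr': '<f8', 'fortran_order': False, 'shape': (5,), ...}
```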
Example: Make an array of 200,000 float64 values and save it to disk with np.save(), and print:
- How many bytes does the array take up in RAM?
- How many bytes did the disk write when writing the file?
Is the amount of data written to these binary files about the same amount as the array stored in RAM?
data = np.arange(200_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 1.60 MB
Bytes Written: 1.61 MB

Exercise: Make an array of 1,000,000 float64 values and save it to disk with np.save(), and print:
- How many bytes does the array take up in RAM?
- How many bytes did the disk write when writing the file?
Is the amount of data written to these binary files about the same amount as the array stored in RAM?
Solution
data = np.arange(1_000_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 8.00 MB

Exercise: Does data type affect the data sizes, and/or the relative sizes between RAM and disk, when saving binary data with np.save()? Let’s try it again, this time with 1,000,000 int64 values:
Solution
data = np.arange(1_000_000, dtype=np.int64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 8.00 MB

Exercise: How much time does it take to write these files? Let’s do a basic measurement:
- Create a data array that takes up 500 MB when written to disk.
- Use %time to measure how much CPU and wall time the writing took.
How long did the data take to write? Did the CPU have to do much work during that process?
Solution
data = np.arange(62_500_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
%time np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 500.00 MB
CPU times: user 0 ns, sys: 724 ms, total: 724 ms
Wall time: 725 ms
Bytes Written: 500.00 MB

Writing to Text files with np.savetxt()
The np.savetxt() function is also very simple: for each value in an array, it writes out the value as text, with a separator character (often a newline) between values.
This is the same way that we record values on to paper, and makes reading these files in text editors to browse the data quite convenient.
That should make np.savetxt() also predictable and straightforward. Let’s try it out!
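The per-value cost of text can be predicted before touching the disk. Each float64 written with np.savetxt's default format ('%.18e') becomes a fixed-width 24-character string plus a newline, so roughly 25 bytes per value on disk versus 8 bytes per value in RAM. A quick sketch:

```python
# The default savetxt format: scientific notation, 18 digits after the decimal.
line = '%.18e' % 1.0
print(repr(line))  # '1.000000000000000000e+00'
print(len(line))   # 24 characters, plus a newline per value on disk
```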
Exercise: Make an array of 1,000,000 float64 values and save it to disk with np.savetxt(), and print:
- How many bytes does the array take up in RAM?
- How many bytes did the disk write when writing the file?
Is the amount of data written to text files about the same amount as the array stored in RAM?
Solution
data = np.arange(1_000_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.savetxt('data.txt', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 25.00 MB

Exercise: Does data type affect the data sizes, and/or the relative sizes between RAM and disk, when saving text data with np.savetxt(fmt='%d')? Let’s try it again, this time with 1,000,000 int64 values:
Note: the fmt= option changes how the text is formatted when the data is written. '%d' means “write as integers”. '%.18e' is the default; it means “write in scientific notation with 18 digits after the decimal point”.
Solution
data = np.arange(1_000_000, dtype=np.int64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.savetxt('data.txt', data, fmt='%d')
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 6.91 MB

Exercise: Does the formatting affect the data sizes when saving text data with np.savetxt()? Let’s try it again with 1,000,000 int64 values, this time using the default format:
Solution
data = np.arange(1_000_000, dtype=np.int64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.savetxt('data.txt', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 25.02 MB

Exercise: How much time does it take to write these files? Let’s do a basic measurement:
- Create a data array that takes up roughly 300 MB when written to disk as text.
- Use %time to measure how much CPU and wall time the writing took.
How long did the data take to write? Did the CPU have to do much work during that process?
Solution
data = np.arange(12_500_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
%time np.savetxt('data.txt', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 100.00 MB
CPU times: user 20.8 s, sys: 1.91 s, total: 22.7 s
Wall time: 22.7 s
Bytes Written: 312.50 MB

Section 2: Precision as a Design Decision
Data type is not an implementation detail — it is a design choice.
The number of bits we allocate determines:
- Range of representable values
- Precision of numerical values
- Memory usage
- Disk size
- Compression behaviour
In many scientific workflows, default data types (such as float64) are used automatically. However, this may be unnecessary and costly. In this section, we explore integer and floating-point limits and learn how to safely reduce storage size without compromising data integrity.
We will:
- Inspect value ranges of integer types
- Examine floating-point precision limits
- Use astype() to reduce storage size
- Verify correctness using NumPy testing utilities
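To preview both sides of the trade-off before the exercises, here is a minimal sketch comparing the same million values stored at two widths:

```python
import numpy as np

# Memory scales directly with the dtype's width.
a64 = np.zeros(1_000_000, dtype=np.float64)
a16 = np.zeros(1_000_000, dtype=np.float16)
print(a64.nbytes)  # 8000000
print(a16.nbytes)  # 2000000

# The cost of the narrower type is precision: float16 has an 11-bit
# significand, so integers above 2048 are no longer exactly representable.
print(np.float16(2049.0))  # rounds to 2048.0
```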
Exercises
Values, Precision, and Memory Size of Data Types
Exercise: How much space do these data types take up, in bytes? Use np.dtype().itemsize to print the byte size of:
- 16-bit floats (np.float16)
- 32-bit floats (np.float32)
- 64-bit floats (np.float64)
How many bytes do they take up? Are they different?
Code snippet: np.dtype(np.float16).itemsize
Solution
np.dtype(np.float16).itemsize, np.dtype(np.float32).itemsize, np.dtype(np.float64).itemsize

(2, 4, 8)

Exercise: How much space do these data types take up, in bytes? Use np.dtype().itemsize to print the byte size of:
- 16-bit floats (np.float16)
- 16-bit ints (np.int16)
- 16-bit unsigned ints (np.uint16)
How many bytes do they take up? Are they different?
Solution
np.dtype(np.float16).itemsize, np.dtype(np.int16).itemsize, np.dtype(np.uint16).itemsize

(2, 2, 2)

Exercise: Boolean values are sometimes surprising in how they are stored in memory by NumPy; because they only store True and False values, by rights they should only take up 1 bit (1/8 of a byte). Let’s check if that’s true. How much space do these data types take up, in bytes? Use np.dtype().itemsize to print the byte size of:
- bools (np.bool_)
- 8-bit ints (np.int8)
How many bytes do they take up? Are they different from each other?
Solution
np.dtype(np.bool_).itemsize, np.dtype(np.int8).itemsize

(1, 1)

Exercise: What values can unsigned integers hold? Is there a big difference between, say, an 8-bit integer and a 32-bit integer? Print the output of np.iinfo() to compare np.uint8, np.uint16, np.uint32, and np.uint64, and examine the minimum and maximum values for those data types.
Code Snippet: print(np.iinfo(np.uint8))
Solution
print(np.iinfo(np.uint8))
print(np.iinfo(np.uint16))
print(np.iinfo(np.uint32))
print(np.iinfo(np.uint64))

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------
Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------
Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------
Machine parameters for uint64
---------------------------------------------------------------
min = 0
max = 18446744073709551615
---------------------------------------------------------------

Exercise: What values can signed integers hold? Print the output of np.iinfo() to compare np.int8, np.int16, np.int32, and np.int64.
What makes the values different from their corresponding unsigned values?
Solution
print(np.iinfo(np.int8))
print(np.iinfo(np.int16))
print(np.iinfo(np.int32))
print(np.iinfo(np.int64))

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

Exercise: What values can floats hold, and what is an attribute that makes them different from ints? Print the output of np.finfo() to compare np.float16, np.float32, and np.float64:
Solution
print(np.finfo(np.float16))
print(np.finfo(np.float32))
print(np.finfo(np.float64))

Machine parameters for float16
---------------------------------------------------------------
precision = 3 resolution = 0.001
machep = -10 eps = 0.000977
negep = -11 epsneg = 0.0004883
minexp = -14 tiny = 6.104e-05
maxexp = 16 max = 6.55e+04
nexp = 5 min = -max
smallest_normal = 6.104e-05 smallest_subnormal = 6e-08
---------------------------------------------------------------
Machine parameters for float32
---------------------------------------------------------------
precision = 6 resolution = 1e-06
machep = -23 eps = 1.1920929e-07
negep = -24 epsneg = 5.9604645e-08
minexp = -126 tiny = 1.1754944e-38
maxexp = 128 max = 3.4028235e+38
nexp = 8 min = -max
smallest_normal = 1.1754944e-38 smallest_subnormal = 1e-45
---------------------------------------------------------------
Machine parameters for float64
---------------------------------------------------------------
precision = 15 resolution = 1e-15
machep = -52 eps = 2.220446049250313e-16
negep = -53 epsneg = 1.1102230246251565e-16
minexp = -1022 tiny = 2.2250738585072014e-308
maxexp = 1024 max = 1.7976931348623157e+308
nexp = 11 min = -max
smallest_normal = 2.2250738585072014e-308 smallest_subnormal = 5e-324
---------------------------------------------------------------

Reducing Data Size with np.astype()
Example: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
temperature_c = np.random.randint(low=-15, high=35, size=10_000)
utils.format_bytes(temperature_c.nbytes)

temperature_c2 = temperature_c.astype(np.int8)
print(utils.format_bytes(temperature_c2.nbytes))
npt.assert_equal(temperature_c, temperature_c2)

10.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
pixel_brightness = np.random.randint(low=0, high=256, size=10_000)
utils.format_bytes(pixel_brightness.nbytes)

Solution
pixel_brightness2 = pixel_brightness.astype(np.uint8)
print(utils.format_bytes(pixel_brightness2.nbytes))
npt.assert_equal(pixel_brightness, pixel_brightness2)

10.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
binned_spike_counts = np.random.poisson(lam=4, size=10_000)
utils.format_bytes(binned_spike_counts.nbytes)

Solution
binned_spike_counts2 = binned_spike_counts.astype(np.uint8)
print(utils.format_bytes(binned_spike_counts2.nbytes))
npt.assert_equal(binned_spike_counts, binned_spike_counts2)

10.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
time_samples = np.arange(0, 10_000)
utils.format_bytes(time_samples.nbytes)

Solution
time_samples2 = time_samples.astype(np.uint16)
print(utils.format_bytes(time_samples2.nbytes))
npt.assert_equal(time_samples, time_samples2)

20.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
head_velocities = np.random.randint(low=-100000, high=100000, size=10_000)
utils.format_bytes(head_velocities.nbytes)

Solution
head_velocities2 = head_velocities.astype(np.int32)
print(utils.format_bytes(head_velocities2.nbytes))
npt.assert_equal(head_velocities, head_velocities2)

40.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then verify the data’s integrity with npt.assert_allclose().
Note: These are floats, so you’ll instead need to use npt.assert_allclose() to verify that the transformed data has approximately the same values as the original data. Note that here you should specify an “absolute tolerance” (atol) and a “relative tolerance” (rtol) to say how different the new data is allowed to be from the old data.
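The tolerance rule can be previewed on plain scalars: assert_allclose passes when the difference is within atol + rtol * |desired| (the 1.05-vs-1.0 numbers below are made up purely for illustration):

```python
from numpy import testing as npt

# |actual - desired| must be <= atol + rtol * |desired| (atol defaults to 0).
npt.assert_allclose(1.05, 1.0, rtol=0.1)        # 0.05 <= 0.1 * 1.0: passes
try:
    npt.assert_allclose(1.05, 1.0, rtol=0.01)   # 0.05 > 0.01 * 1.0: raises
except AssertionError:
    print("outside tolerance")
```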
time_seconds = np.arange(0, 200, .001)
utils.format_bytes(time_seconds.nbytes)

Solution
time_seconds2 = time_seconds.astype(np.float16)
print(utils.format_bytes(time_seconds2.nbytes))
npt.assert_allclose(time_seconds, time_seconds2, atol=0.001, rtol=0.1)

400.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then verify the data’s integrity with npt.assert_allclose().
firing_rates = np.random.exponential(scale=4, size=10_000)
utils.format_bytes(firing_rates.nbytes)

Solution
firing_rates2 = firing_rates.astype(np.float32)
print(utils.format_bytes(firing_rates2.nbytes))
npt.assert_allclose(firing_rates, firing_rates2, atol=firing_rates.min()/2, rtol=0.001)

40.00 KB

Section 3: Reducing Repetition with Dictionary Encoding
Many scientific datasets contain repeated categorical values:
- Animal IDs
- Experimental conditions
- Trial labels
- Brain region names
- Behavioural states
Text takes up a lot of space; if we store these values as full strings (or large integers), we repeat the same information many times.
Dictionary encoding is a common solution for this situation. It reduces storage by splitting the categorical array into two arrays:
- An array of unique text values, storing each unique value only once.
- An array of integers, containing the index code of each category.
This changes the representation of the data without changing its meaning.
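On a toy array, the two-array representation looks like this (the animal names here are made up for illustration):

```python
import numpy as np

labels = np.array(["rat", "mouse", "rat", "rat", "mouse"])
uniques, codes = np.unique(labels, return_inverse=True)
print(uniques)  # ['mouse' 'rat']: each distinct string stored once
print(codes)    # [1 0 1 1 0]: small integers indexing into uniques
print((uniques[codes] == labels).all())  # True: the encoding is fully reversible
```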
In this section, we will:
- Encode repeated values using np.unique()
- Compare memory usage before and after encoding
- Downcast integer codes to reduce size further
- Reconstruct the original data
- Compare with pandas.factorize()
Reference

| Code | Description |
|---|---|
| `np.unique(return_inverse=True)` | Return unique values and integer codes mapping each element |
| `array.nbytes` | Measure memory usage of an array |
| `astype()` | Convert codes to smaller integer dtype |
| `npt.assert_equal()` | Verify exact equality after reconstruction |
| `pd.factorize()` | Encode repeated values into integer codes |
| `codes.max()` | Inspect largest encoded index |
Exercises
Exercise: Take the array of repeated animal IDs below and encode it using uniques, codes = np.unique(animal_ids, return_inverse=True). How much memory do the encoded representation and lookup table use together?
np.random.seed(42)
animal_ids = np.random.choice(["mouse", "rat", "human", "zebrafish"], size=1_000_000)
utils.format_bytes(animal_ids.nbytes)

36.00 MB

Solution
uniques, codes = np.unique(animal_ids, return_inverse=True)
utils.format_bytes(codes.nbytes + uniques.nbytes)

'8.00 MB'

Exercise: Use the encoded representation to reconstruct the original array (reconstructed = uniques[codes]), then:
- Verify that the reconstructed data matches the original (npt.assert_equal()).
- Verify that the reconstructed data has the same size as the original.
Solution
reconstructed = uniques[codes]
npt.assert_equal(animal_ids, reconstructed)
utils.format_bytes(reconstructed.nbytes)

36.00 MB

Exercise: Reducing Code Size Further with astype(). Dictionary-encode the data below, then reduce the number of bits the codes array takes up, to further save space. How small can we get the total transformed data?
np.random.seed(42)
brain_regions = np.random.choice(["visual cortex", "hippocampus", "thalamus", "auditory cortex"], size=3_000_000)
utils.format_bytes(brain_regions.nbytes)

'180.00 MB'

Solution
uniques, codes = np.unique(brain_regions, return_inverse=True)
codes = codes.astype(np.uint8)
utils.format_bytes(codes.nbytes + uniques.nbytes)

'3.00 MB'

Exercise: When saving this data, it is valuable to keep the arrays together in a single file; that way, the values aren’t separated from the codes. This can be done with np.savez()!
Example: np.savez('eeg.npz', time=times, voltages=volts)
Save the dictionary-encoded brain regions you made in the previous exercise into a single “.npz” file.
Solution
np.savez("brain_regions", regions=brain_regions, codes=codes)

Exercise: Okay, we can get even smaller. Try out np.savez_compressed(), and compare the file size written to that when using np.savez().
Solution
np.savez_compressed("brain_regions2", regions=brain_regions, codes=codes)
utils.format_bytes(os.path.getsize("brain_regions.npz")), utils.format_bytes(os.path.getsize("brain_regions2.npz"))

('183.00 MB', '2.36 MB')

Exercise: Dictionary encoding is not helpful when the data is mostly made up of unique values. Let’s try it out!
The code below generates an array of random DNA sequences. In the cell below it, please use dictionary encoding on the data, and compare the sizes of the original dataset and the transformed dataset. Is there a difference?
import random
dna_seqs = np.array(["".join(random.choices("GCTA", k=60)) for _ in range(20_000)])
print(utils.format_bytes(dna_seqs.nbytes))
dna_seqs[:5]

4.80 MB
array(['CGATTCTTATGAACTACTGACGTTAGGAATTTAGTCAGGTTCGAGACTCATGCACCCCTG',
'GTGTGTTTCAAGACTAACGTGACCTGCATATTTCCAGTCGCAAGTCATTCCGGTATACGA',
'GGACAAATTGAGTATAAAAATCATGCTTGGGTCTCATGTTTAAACTTGCCAAAACACCCT',
'TGTATCGTGTGCGGCTGAGTGGCTCATGTCACAGCAAGAAGACGTCCGCTGTAACAGGCC',
'GGGGGTTGCTATGAACGCCACGAAACTCCTTACTACAACTTGCACGCGGGATACAATGTC'],
dtype='<U60')

Solution
uniques, codes = np.unique(dna_seqs, return_inverse=True)
codes = codes.astype(np.uint16)
utils.format_bytes(codes.nbytes + uniques.nbytes)

'4.84 MB'

Section 4: Saving Fewer Zeros — Sparse Arrays
In many scientific datasets, most values are zero. Yes, actually zero. Examples include:
- Spike trains binned at high temporal resolution (most of the time, no spikes are firing)
- Adjacency matrices in connectivity analyses (most things aren’t heavily connected)
- Large masks or selection matrices (most data we’re not trying to select)
If we store these arrays “densely”, we allocate memory for every element — including zeros. “Sparse” representations, on the other hand, store only the non-zero values and their positions.
Sparse arrays change representation without changing meaning. They can:
- Reduce memory usage dramatically
- Reduce disk storage
- Improve performance for certain operations
However, sparse arrays also introduce trade-offs:
- Not all NumPy operations are supported
- Converting between dense and sparse formats has a cost
- Sparse formats are beneficial only when many values are zero
In this section, we explore when sparse representations help and how to measure their impact.
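Before measuring anything large, the three arrays behind the CSR format can be inspected on a tiny hand-made matrix (the values below are made up for illustration):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 1],
                  [0, 2, 0],
                  [0, 0, 0]])
m = sparse.csr_matrix(dense)
print(m.data)     # [1 2]: only the non-zero values are stored
print(m.indices)  # [2 1]: the column of each non-zero value
print(m.indptr)   # [0 1 2 2]: row i's values sit in data[indptr[i]:indptr[i+1]]
```

Note that the all-zero last row costs nothing beyond one repeated indptr entry, which is where the savings come from.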
Reference

| Code | Description |
|---|---|
| `scipy.sparse.csr_matrix()` | Create a compressed sparse row matrix |
| `matrix.data` | Non-zero values stored in sparse format |
| `matrix.indices` | Column indices of non-zero values |
| `matrix.indptr` | Index pointer for row boundaries |
| `matrix.toarray()` | Convert sparse matrix back to dense |
| `sparse.save_npz()` | Save sparse matrix to compressed file |
| `array.nbytes` | Measure dense array memory usage |
| `%time` | Measure execution time in notebook |
| `npt.assert_equal()` | Verify equality after reconstruction |
Exercises
Exercise: Creating a Sparse Matrix. The large array below is roughly 99% zeros.
- Measure the size of the dense array.
- Convert it to a sparse representation with sparse.csr_matrix()
- Compare memory usage. (Note: you’ll need to check size in three places: data.data.nbytes, data.indices.nbytes, and data.indptr.nbytes)
- Save the file to disk with sparse.save_npz()

How large is it on disk?
n = 10_000
density = 0.01
dense = (np.random.rand(n, n) < density).astype(np.uint8)
utils.format_bytes(dense.nbytes)

'100.00 MB'

Solution
sprse = sparse.csr_matrix(dense)
sparse.save_npz('sparse.npz', sprse)
utils.format_bytes(sprse.data.nbytes + sprse.indices.nbytes + sprse.indptr.nbytes)

'5.04 MB'

Exercise: Read the sparse matrix and convert it back to a dense matrix with data.toarray(). Verify that the data is the same as the original.
Solution
npt.assert_equal(dense, sprse.toarray())