Data Representation and Disk IO: Performance Beyond RAM
import numpy as np
from numpy import testing as npt
from scipy import sparse
In this session, we extend our understanding of performance beyond CPU and memory. Many scientific workflows become slow not because of computation, but because of how we store and retrieve data. Here we examine:
- How file formats affect size and speed
- How precision choices influence storage
- How disk activity differs from memory activity
We will measure what happens when we write arrays to disk, compare text and binary formats, and explore how data types determine both memory usage and file size. Throughout, we focus on making informed design decisions rather than relying on defaults.
Setup
Utility Functions
import os
import psutil
def _format_bytes(bytes: float, precision: int = 2) -> str:
    """
    Takes a number of bytes and returns a more human-readable string (e.g. '1.60 MB').
    Looking to do this in a real project? Some alternatives:
    - `humanfriendly`: https://pypi.org/project/humanfriendly/#getting-started
    """
    if bytes < 0:
        raise ValueError("bytes must be non-negative")
    units = [("KB", 1_000), ("MB", 1_000_000), ("GB", 1_000_000_000), ("TB", 1_000_000_000_000)]
    for unit, scale in reversed(units):
        if bytes >= scale:
            return f"{bytes / scale:.{precision}f} {unit}"
    return f"{bytes} B"

def _disk_read() -> int:
    return psutil.Process(os.getpid()).io_counters().read_bytes

def _disk_write() -> int:
    return psutil.Process(os.getpid()).io_counters().write_bytes
class utils:
    format_bytes = _format_bytes
    bytes_read = _disk_read
    bytes_written = _disk_write

Section 1: Reading Should Be Simple: Text vs Binary File Formats
When we save arrays to disk, we choose a representation. That choice affects:
- File size
- Write speed
- Read speed
- CPU usage during IO (Reading and Writing)
In this section, we compare binary storage (.npy) with text-based storage (.txt). We measure disk writes directly and observe how closely file size reflects the array’s size in RAM. Our goal is to understand what is predictable, what is expensive, and why.
We will:
- Measure memory usage with .nbytes
- Measure disk writes using psutil
- Compare CPU time and wall time
- Examine how formatting affects text output
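As a quick preview of the measurements ahead, we can compare an array's .nbytes to the size of the .npy file it produces. This is a minimal sketch; the temp-file path is just a throwaway location for illustration.

```python
import os
import tempfile
import numpy as np

# A .npy file on disk should be the array's raw bytes plus a small header.
data = np.arange(10_000, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "check.npy")
np.save(path, data)
overhead = os.path.getsize(path) - data.nbytes
print(data.nbytes)  # 80000 bytes of data in RAM
print(overhead)     # header overhead, typically ~128 bytes
```

The near-zero overhead is why binary formats are so predictable: file size tracks .nbytes almost exactly.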
Reference
| Code | Description |
|---|---|
np.arange() |
Create arrays of controlled size and dtype for experiments |
array.nbytes |
Number of bytes used by the array in RAM |
np.save() |
Save array to binary .npy format |
np.savetxt() |
Save array to text file |
np.savetxt(fmt=...) |
Control formatting of values when writing text |
%time |
Measure CPU time and wall time in notebooks |
utils.bytes_written() |
Measure bytes written to disk by this process |
utils.format_bytes() |
Convert byte counts into human-readable units |
Exercises
Writing to Binary NPY files with np.save()
The np.save() function is very simple: it puts two things into an .npy file:
- Writes a text header explaining the size, shape, and dtype of the array stored.
- Writes the array data in binary (a.k.a. “bytes”) format.
That should make np.save() predictable and straightforward. Let’s try it out!
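The two-part layout can be seen directly by reading the first bytes of a saved file: a magic string identifying the format, then a small text header describing the dtype, byte order, and shape. A small sketch (the temp path is just for illustration):

```python
import os
import tempfile
import numpy as np

arr = np.arange(5, dtype=np.float64)
path = os.path.join(tempfile.mkdtemp(), "peek.npy")
np.save(path, arr)

with open(path, "rb") as f:
    start = f.read(80)

print(start[:6])   # b'\x93NUMPY' -- the .npy magic string
print(start[10:])  # text header, e.g. {'descr': '<f8', 'fortran_order': False, 'shape': (5,), ...}
```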
Example: Make an array of 200,000 float64 values and save it to disk with np.save(), and print:
- How many bytes does the array take up in RAM?
- How many bytes did the disk write when writing the file?
Is the amount of data written to these binary files about the same amount as the array stored in RAM?
data = np.arange(200_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 1.60 MB
Bytes Written: 1.61 MB

Exercise: Make an array of 1,000,000 float64 values and save it to disk with np.save(), and print:
- How many bytes does the array take up in RAM?
- How many bytes did the disk write when writing the file?
Is the amount of data written to these binary files about the same amount as the array stored in RAM?
Solution
data = np.arange(1_000_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 8.00 MB

Exercise: Does data type affect the data sizes, and/or the relative sizes between RAM and disk, when saving binary data with np.save()? Let’s try it again, this time with 1,000,000 int64 values:
Solution
data = np.arange(1_000_000, dtype=np.int64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 8.00 MB

Exercise: How much time does it take to write these files? Let’s do a basic measurement:
- Create a data array that takes up 500 MB when written to disk.
- Use %time to measure how much CPU and wall time the writing took.
How long did the data take to write? Did the CPU have to do much work during that process?
Solution
data = np.arange(62_500_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
%time np.save('data.npy', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 500.00 MB
CPU times: user 0 ns, sys: 724 ms, total: 724 ms
Wall time: 725 ms
Bytes Written: 500.00 MB

Writing to Text files with np.savetxt()
The np.savetxt() function is also very simple: for each value in an array, it writes out the value as text, with a separator character (often a newline) between values.
This is the same way that we record values on to paper, and makes reading these files in text editors to browse the data quite convenient.
That should make np.savetxt() also predictable and straightforward. Let’s try it out!
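The per-value cost of text can be predicted before touching the disk. Each float64 written with np.savetxt's default format ('%.18e') becomes a fixed-width 24-character string plus a newline, so roughly 25 bytes per value on disk versus 8 bytes per value in RAM. A quick sketch:

```python
# The default savetxt format: scientific notation, 18 digits after the decimal.
line = '%.18e' % 1.0
print(repr(line))  # '1.000000000000000000e+00'
print(len(line))   # 24 characters, plus a newline per value on disk
```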
Exercise: Make an array of 1,000,000 float64 values and save it to disk with np.savetxt(), and print:
- How many bytes does the array take up in RAM?
- How many bytes did the disk write when writing the file?
Is the amount of data written to text files about the same amount as the array stored in RAM?
Solution
data = np.arange(1_000_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.savetxt('data.txt', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 25.00 MB

Exercise: Does data type affect the data sizes, and/or the relative sizes between RAM and disk, when saving text data with np.savetxt(fmt='%d')? Let’s try it again, this time with 1,000,000 int64 values:
Note: the fmt= option changes how the text is formatted when the data is written. '%d' means “write as integers”. '%.18e' is the default; it means “write in scientific notation with 18 digits after the decimal point”.
Solution
data = np.arange(1_000_000, dtype=np.int64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.savetxt('data.txt', data, fmt='%d')
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 6.91 MB

Exercise: Does the formatting affect the data sizes when saving text data with np.savetxt()? Let’s try it again with 1,000,000 int64 values, this time using the default format:
Solution
data = np.arange(1_000_000, dtype=np.int64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
np.savetxt('data.txt', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 8.00 MB
Bytes Written: 25.02 MB

Exercise: How much time does it take to write these files? Let’s do a basic measurement:
- Create a data array that takes up roughly 300 MB when written to disk as text.
- Use %time to measure how much CPU and wall time the writing took.
How long did the data take to write? Did the CPU have to do much work during that process?
Solution
data = np.arange(12_500_000, dtype=np.float64)
print('Size in RAM:', utils.format_bytes(data.nbytes))
dr0 = utils.bytes_written()
%time np.savetxt('data.txt', data)
dr1 = utils.bytes_written()
print('Bytes Written:', utils.format_bytes(dr1 - dr0))

Size in RAM: 100.00 MB
CPU times: user 20.8 s, sys: 1.91 s, total: 22.7 s
Wall time: 22.7 s
Bytes Written: 312.50 MB

Section 2: Precision as a Design Decision
Data type is not an implementation detail — it is a design choice.
The number of bits we allocate determines:
- Range of representable values
- Precision of numerical values
- Memory usage
- Disk size
- Compression behaviour
In many scientific workflows, default data types (such as float64) are used automatically. However, this may be unnecessary and costly. In this section, we explore integer and floating-point limits and learn how to safely reduce storage size without compromising data integrity.
We will:
- Inspect value ranges of integer types
- Examine floating-point precision limits
- Use astype() to reduce storage size
- Verify correctness using NumPy testing utilities
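To preview both sides of the trade-off before the exercises, here is a minimal sketch comparing the same million values stored at two widths:

```python
import numpy as np

# Memory scales directly with the dtype's width.
a64 = np.zeros(1_000_000, dtype=np.float64)
a16 = np.zeros(1_000_000, dtype=np.float16)
print(a64.nbytes)  # 8000000
print(a16.nbytes)  # 2000000

# The cost of the narrower type is precision: float16 has an 11-bit
# significand, so integers above 2048 are no longer exactly representable.
print(np.float16(2049.0))  # rounds to 2048.0
```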
Exercises
Values, Precision, and Memory Size of Data Types
Exercise: How much space do these data types take up, in bytes? Use np.dtype().itemsize to print the byte size of:
- 16-bit floats (np.float16)
- 32-bit floats (np.float32)
- 64-bit floats (np.float64)
How many bytes do they take up? Are they different?
Code snippet: np.dtype(np.float16).itemsize
Solution
np.dtype(np.float16).itemsize, np.dtype(np.float32).itemsize, np.dtype(np.float64).itemsize

(2, 4, 8)

Exercise: How much space do these data types take up, in bytes? Use np.dtype().itemsize to print the byte size of:
- 16-bit floats (np.float16)
- 16-bit ints (np.int16)
- 16-bit unsigned ints (np.uint16)
How many bytes do they take up? Are they different?
Solution
np.dtype(np.float16).itemsize, np.dtype(np.int16).itemsize, np.dtype(np.uint16).itemsize

(2, 2, 2)

Exercise: Boolean values are sometimes surprising in how they are stored in memory by NumPy; because they only store True and False values, by rights they should only take up 1 bit (1/8 of a byte). Let’s check if that’s true. How much space do these data types take up, in bytes? Use np.dtype().itemsize to print the byte size of:
- bools (np.bool_)
- 8-bit ints (np.int8)
How many bytes do they take up? Are they different from each other?
Solution
np.dtype(np.bool_).itemsize, np.dtype(np.int8).itemsize

(1, 1)

Exercise: What values can unsigned integers hold? Is there a big difference between, say, an 8-bit integer and a 32-bit integer? Print the output of np.iinfo() to compare np.uint8, np.uint16, np.uint32, and np.uint64, and examine the minimum and maximum values for those data types.
Code Snippet: print(np.iinfo(np.uint8))
Solution
print(np.iinfo(np.uint8))
print(np.iinfo(np.uint16))
print(np.iinfo(np.uint32))
print(np.iinfo(np.uint64))

Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------
Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------
Machine parameters for uint32
---------------------------------------------------------------
min = 0
max = 4294967295
---------------------------------------------------------------
Machine parameters for uint64
---------------------------------------------------------------
min = 0
max = 18446744073709551615
---------------------------------------------------------------

Exercise: What values can signed integers hold? Print the output of np.iinfo() to compare np.int8, np.int16, np.int32, and np.int64.
What makes the values different from their corresponding unsigned values?
Solution
print(np.iinfo(np.int8))
print(np.iinfo(np.int16))
print(np.iinfo(np.int32))
print(np.iinfo(np.int64))

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------
Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

Exercise: What values can floats hold, and what is an attribute that makes them different from ints? Print the output of np.finfo() to compare np.float16, np.float32, and np.float64:
Solution
print(np.finfo(np.float16))
print(np.finfo(np.float32))
print(np.finfo(np.float64))

Machine parameters for float16
---------------------------------------------------------------
precision = 3 resolution = 0.001
machep = -10 eps = 0.000977
negep = -11 epsneg = 0.0004883
minexp = -14 tiny = 6.104e-05
maxexp = 16 max = 6.55e+04
nexp = 5 min = -max
smallest_normal = 6.104e-05 smallest_subnormal = 6e-08
---------------------------------------------------------------
Machine parameters for float32
---------------------------------------------------------------
precision = 6 resolution = 1e-06
machep = -23 eps = 1.1920929e-07
negep = -24 epsneg = 5.9604645e-08
minexp = -126 tiny = 1.1754944e-38
maxexp = 128 max = 3.4028235e+38
nexp = 8 min = -max
smallest_normal = 1.1754944e-38 smallest_subnormal = 1e-45
---------------------------------------------------------------
Machine parameters for float64
---------------------------------------------------------------
precision = 15 resolution = 1e-15
machep = -52 eps = 2.220446049250313e-16
negep = -53 epsneg = 1.1102230246251565e-16
minexp = -1022 tiny = 2.2250738585072014e-308
maxexp = 1024 max = 1.7976931348623157e+308
nexp = 11 min = -max
smallest_normal = 2.2250738585072014e-308 smallest_subnormal = 5e-324
---------------------------------------------------------------

Reducing Data Size with np.astype()
Example: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
temperature_c = np.random.randint(low=-15, high=35, size=10_000)
utils.format_bytes(temperature_c.nbytes)

temperature_c2 = temperature_c.astype(np.int8)
print(utils.format_bytes(temperature_c2.nbytes))
npt.assert_equal(temperature_c, temperature_c2)

10.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
pixel_brightness = np.random.randint(low=0, high=256, size=10_000)
utils.format_bytes(pixel_brightness.nbytes)

Solution
pixel_brightness2 = pixel_brightness.astype(np.uint8)
print(utils.format_bytes(pixel_brightness2.nbytes))
npt.assert_equal(pixel_brightness, pixel_brightness2)

10.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
binned_spike_counts = np.random.poisson(lam=4, size=10_000)
utils.format_bytes(binned_spike_counts.nbytes)

Solution
binned_spike_counts2 = binned_spike_counts.astype(np.uint8)
print(utils.format_bytes(binned_spike_counts2.nbytes))
npt.assert_equal(binned_spike_counts, binned_spike_counts2)

10.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
time_samples = np.arange(0, 10_000)
utils.format_bytes(time_samples.nbytes)

Solution
time_samples2 = time_samples.astype(np.uint16)
print(utils.format_bytes(time_samples2.nbytes))
npt.assert_equal(time_samples, time_samples2)

20.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then use npt.assert_equal() to verify that the transformed data has the same values as the original data.
head_velocities = np.random.randint(low=-100000, high=100000, size=10_000)
utils.format_bytes(head_velocities.nbytes)

Solution
head_velocities2 = head_velocities.astype(np.int32)
print(utils.format_bytes(head_velocities2.nbytes))
npt.assert_equal(head_velocities, head_velocities2)

40.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then verify the data’s integrity with npt.assert_allclose().
Note: These are floats, so you’ll instead need to use npt.assert_allclose() to verify that the transformed data has approximately the same values as the original data. Note that here you should specify an “absolute tolerance” (atol) and a “relative tolerance” (rtol) to say how different the new data is allowed to be from the old data.
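The tolerance rule can be previewed on plain scalars: assert_allclose passes when the difference is within atol + rtol * |desired| (the 1.05-vs-1.0 numbers below are made up purely for illustration):

```python
from numpy import testing as npt

# |actual - desired| must be <= atol + rtol * |desired| (atol defaults to 0).
npt.assert_allclose(1.05, 1.0, rtol=0.1)        # 0.05 <= 0.1 * 1.0: passes
try:
    npt.assert_allclose(1.05, 1.0, rtol=0.01)   # 0.05 > 0.01 * 1.0: raises
except AssertionError:
    print("outside tolerance")
```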
time_seconds = np.arange(0, 200, .001)
utils.format_bytes(time_seconds.nbytes)

Solution
time_seconds2 = time_seconds.astype(np.float16)
print(utils.format_bytes(time_seconds2.nbytes))
npt.assert_allclose(time_seconds, time_seconds2, atol=0.001, rtol=0.1)

400.00 KB

Exercise: What is the smallest data type you’d like to use to store the array below, without losing important data? Use .astype() to make a transformed version of the data, then verify the data’s integrity with npt.assert_allclose().
firing_rates = np.random.exponential(scale=4, size=10_000)
utils.format_bytes(firing_rates.nbytes)

Solution
firing_rates2 = firing_rates.astype(np.float32)
print(utils.format_bytes(firing_rates2.nbytes))
npt.assert_allclose(firing_rates, firing_rates2, atol=firing_rates.min()/2, rtol=0.001)

40.00 KB

Section 3: Reducing Repetition with Dictionary Encoding
Many scientific datasets contain repeated categorical values:
- Animal IDs
- Experimental conditions
- Trial labels
- Brain region names
- Behavioural states
Text takes up a lot of space; if we store these values as full strings (or large integers), we repeat the same information many times.
Dictionary encoding is a common solution for this situation. It reduces storage by splitting the categorical array into two arrays:
- An array of unique text values, storing each unique value only once.
- An array of integers, containing the index code of each category.
This changes the representation of the data without changing its meaning.
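On a toy array, the two-array representation looks like this (the animal names here are made up for illustration):

```python
import numpy as np

labels = np.array(["rat", "mouse", "rat", "rat", "mouse"])
uniques, codes = np.unique(labels, return_inverse=True)
print(uniques)  # ['mouse' 'rat']: each distinct string stored once
print(codes)    # [1 0 1 1 0]: small integers indexing into uniques
print((uniques[codes] == labels).all())  # True: the encoding is fully reversible
```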
In this section, we will:
- Encode repeated values using np.unique()
- Compare memory usage before and after encoding
- Downcast integer codes to reduce size further
- Reconstruct the original data
- Compare with pandas.factorize()
Reference

| Code | Description |
|---|---|
| `np.unique(return_inverse=True)` | Return unique values and integer codes mapping each element |
| `array.nbytes` | Measure memory usage of an array |
| `astype()` | Convert codes to smaller integer dtype |
| `npt.assert_equal()` | Verify exact equality after reconstruction |
| `pd.factorize()` | Encode repeated values into integer codes |
| `codes.max()` | Inspect largest encoded index |
Exercises
Exercise: Take the array of repeated animal IDs below and encode it using uniques, codes = np.unique(animal_ids, return_inverse=True). How much memory do the encoded representation and lookup table use together?
np.random.seed(42)
animal_ids = np.random.choice(["mouse", "rat", "human", "zebrafish"], size=1_000_000)
utils.format_bytes(animal_ids.nbytes)

36.00 MB

Solution
uniques, codes = np.unique(animal_ids, return_inverse=True)
utils.format_bytes(codes.nbytes + uniques.nbytes)

'8.00 MB'

Exercise: Use the encoded representation to reconstruct the original array (reconstructed = uniques[codes]), then:
- Verify that the reconstructed data matches the original (npt.assert_equal()).
- Verify that the reconstructed data has the same size as the original.
Solution
reconstructed = uniques[codes]
npt.assert_equal(animal_ids, reconstructed)
utils.format_bytes(reconstructed.nbytes)

36.00 MB

Exercise: Reducing Code Size Further with astype(). Dictionary-encode the data below, then reduce the number of bits the codes array takes up, to further save space. How small can we get the total transformed data?
np.random.seed(42)
brain_regions = np.random.choice(["visual cortex", "hippocampus", "thalamus", "auditory cortex"], size=3_000_000)
utils.format_bytes(brain_regions.nbytes)

'180.00 MB'

Solution
uniques, codes = np.unique(brain_regions, return_inverse=True)
codes = codes.astype(np.uint8)
utils.format_bytes(codes.nbytes + uniques.nbytes)

'3.00 MB'

Exercise: When saving this data, it is valuable to keep the arrays together in a single file; that way, the values aren’t separated from the codes. This can be done with np.savez()!
Example: np.savez('eeg.npz', time=times, voltages=volts)
Save the dictionary-encoded brain regions you made in the previous exercise into a single “.npz” file.
Solution
np.savez("brain_regions", regions=brain_regions, codes=codes)

Exercise: Okay, we can get even smaller. Try out np.savez_compressed(), and compare the file size written to that when using np.savez().
Solution
np.savez_compressed("brain_regions2", regions=brain_regions, codes=codes)
utils.format_bytes(os.path.getsize("brain_regions.npz")), utils.format_bytes(os.path.getsize("brain_regions2.npz"))

('183.00 MB', '2.36 MB')

Exercise: Dictionary encoding is not helpful when the data is mostly made up of unique values. Let’s try it out!
The code below generates an array of random DNA sequences. In the cell below it, please use dictionary encoding on the data, and compare the sizes of the original dataset and the transformed dataset. Is there a difference?
import random
dna_seqs = np.array(["".join(random.choices("GCTA", k=60)) for _ in range(20_000)])
print(utils.format_bytes(dna_seqs.nbytes))
dna_seqs[:5]

4.80 MB
array(['CGATTCTTATGAACTACTGACGTTAGGAATTTAGTCAGGTTCGAGACTCATGCACCCCTG',
'GTGTGTTTCAAGACTAACGTGACCTGCATATTTCCAGTCGCAAGTCATTCCGGTATACGA',
'GGACAAATTGAGTATAAAAATCATGCTTGGGTCTCATGTTTAAACTTGCCAAAACACCCT',
'TGTATCGTGTGCGGCTGAGTGGCTCATGTCACAGCAAGAAGACGTCCGCTGTAACAGGCC',
'GGGGGTTGCTATGAACGCCACGAAACTCCTTACTACAACTTGCACGCGGGATACAATGTC'],
dtype='<U60')

Solution
uniques, codes = np.unique(dna_seqs, return_inverse=True)
codes = codes.astype(np.uint16)
utils.format_bytes(codes.nbytes + uniques.nbytes)

'4.84 MB'

Section 4: Saving Fewer Zeros — Sparse Arrays
In many scientific datasets, most values are zero. Yes, actually zero. Examples include:
- Spike trains binned at high temporal resolution (most of the time, no spikes are firing)
- Adjacency matrices in connectivity analyses (most things aren’t heavily connected)
- Large masks or selection matrices (most data we’re not trying to select)
If we store these arrays “densely”, we allocate memory for every element — including zeros. “Sparse” representations, on the other hand, store only the non-zero values and their positions.
Sparse arrays change representation without changing meaning. They can:
- Reduce memory usage dramatically
- Reduce disk storage
- Improve performance for certain operations
However, sparse arrays also introduce trade-offs:
- Not all NumPy operations are supported
- Converting between dense and sparse formats has a cost
- Sparse formats are beneficial only when many values are zero
In this section, we explore when sparse representations help and how to measure their impact.
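Before measuring anything large, the three arrays behind the CSR format can be inspected on a tiny hand-made matrix (the values below are made up for illustration):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 1],
                  [0, 2, 0],
                  [0, 0, 0]])
m = sparse.csr_matrix(dense)
print(m.data)     # [1 2]: only the non-zero values are stored
print(m.indices)  # [2 1]: the column of each non-zero value
print(m.indptr)   # [0 1 2 2]: row i's values sit in data[indptr[i]:indptr[i+1]]
```

Note that the all-zero last row costs nothing beyond one repeated indptr entry, which is where the savings come from.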
Reference

| Code | Description |
|---|---|
| `scipy.sparse.csr_matrix()` | Create a compressed sparse row matrix |
| `matrix.data` | Non-zero values stored in sparse format |
| `matrix.indices` | Column indices of non-zero values |
| `matrix.indptr` | Index pointer for row boundaries |
| `matrix.toarray()` | Convert sparse matrix back to dense |
| `sparse.save_npz()` | Save sparse matrix to compressed file |
| `array.nbytes` | Measure dense array memory usage |
| `%time` | Measure execution time in notebook |
| `npt.assert_equal()` | Verify equality after reconstruction |
Exercises
Exercise: Creating a Sparse Matrix. The large array below is roughly 99% zeros.
- Measure the size of the dense array.
- Convert it to a sparse representation with sparse.csr_matrix()
- Compare memory usage. (Note: you’ll need to check size in three places: data.data.nbytes, data.indices.nbytes, and data.indptr.nbytes)
- Save the file to disk with sparse.save_npz()

How large is it on disk?
n = 10_000
density = 0.01
dense = (np.random.rand(n, n) < density).astype(np.uint8)
utils.format_bytes(dense.nbytes)

'100.00 MB'

Solution
sprse = sparse.csr_matrix(dense)
sparse.save_npz('sparse.npz', sprse)
utils.format_bytes(sprse.data.nbytes + sprse.indices.nbytes + sprse.indptr.nbytes)

'5.04 MB'

Exercise: Read the sparse matrix and convert it back to a dense matrix with data.toarray(). Verify that the data is the same as the original.
Solution
npt.assert_equal(dense, sprse.toarray())