Working with the GPU in Python: Practical CuPy

Author
Dr. Nicholas Del Grosso

# Run this cell to install cupy
# %pip install cupy-cuda12x
import numpy as np
import cupy as cp
import time

Modern GPUs can perform massive numbers of arithmetic operations in parallel. For large array computations — common in neuroscience — this can lead to substantial speedups.

However:

  • Moving data between CPU and GPU is expensive.
  • Small operations may be slower on the GPU.
  • Temporary allocations can dominate runtime.
  • Poor batching can erase any advantage.

In this notebook, we learn how to use CuPy, a NumPy-compatible GPU array library, and how to avoid common performance traps. By the end of the session, we should be able to:

  • Understand how CuPy mirrors NumPy
  • Identify when GPU acceleration helps
  • Avoid unnecessary host-device transfers
  • Batch operations efficiently
  • Write simple custom GPU kernels
  • Understand how kernel fusion reduces overhead

Setup

Installing CuPy

When working with extra hardware, getting the right drivers is essential. The official installation guide is quite helpful: https://docs.cupy.dev/en/stable/install.html . If you don’t have access to a GPU, you can upload this notebook to Google Colab, which provides one.
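Once installed, a quick way to check whether CuPy can actually see a GPU is to ask the CUDA runtime how many devices are available. A minimal sketch (it falls back to reporting zero devices when CuPy or a GPU is missing, so it runs anywhere):

```python
try:
    import cupy as cp
    n = cp.cuda.runtime.getDeviceCount()  # number of visible CUDA devices
except Exception:                          # no CuPy installed, or no usable GPU
    n = 0

print(f"usable GPUs: {n}")
```

If this prints 0, the rest of the notebook will still read fine, but the GPU cells will not run.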

Utility Functions

Please run the following code cell to get some utility functions we’ll use in this notebook:

def _format_duration(seconds: float, precision: int = 4) -> str:
    """
    Takes a time in seconds and returns a more human-readable string
    (e.g. 0.0015 -> '1.5000 ms').

    Looking to do this in a real project?  Some alternatives:
      - `humanize`: https://humanize.readthedocs.io/en/latest/
    """
    if seconds < 0:
        raise ValueError("Duration must be non-negative")

    units = [("s", 1), ("ms", 1e-3), ("µs", 1e-6)]

    for unit, scale in units:
        if seconds >= scale:
            value = seconds / scale
            return f"{value:.{precision}f} {unit}"

    return f"{seconds / 1e-9:.{precision}f} ns"

class utils:
    format_duration = _format_duration

Section 1: CuPy vs NumPy: When Does the GPU Help?

CuPy provides a NumPy-like API that runs on the GPU.

In many cases, we can simply replace:

import numpy as np

with:

import cupy as cp

and the same array code will run on the GPU. But GPUs behave differently from CPUs:

  • They have massive parallelism
  • They have separate memory
  • They benefit from large workloads
  • They suffer from frequent data transfers

Example neuroscience context: Applying sin() to 100 million samples may benefit from GPU acceleration. Applying it to 1,000 samples probably will not.

Reference

Code                               Description
---------------------------------  -----------------------------
import cupy as cp                  Import CuPy
cp.asarray(x)                      Move NumPy array to GPU
cp.asnumpy(x_gpu)                  Move GPU array to CPU
x_gpu.get()                        Same as cp.asnumpy()
cp.sin(x_gpu)                      GPU sine operation
cp.cuda.Stream.null.synchronize()  Ensure GPU timing is complete
time.perf_counter()                Measure runtime
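The full round trip from the table above looks like this in practice (a sketch that falls back to NumPy when CuPy or a GPU is unavailable, so it runs anywhere):

```python
import numpy as np

x = np.arange(4.0)

try:
    import cupy as cp
    x_gpu = cp.asarray(x)                  # CPU → GPU transfer
    y_gpu = cp.sin(x_gpu)                  # computed on the device
    cp.cuda.Stream.null.synchronize()      # wait for the GPU before trusting any timing
    y = cp.asnumpy(y_gpu)                  # GPU → CPU (x_gpu.get() is equivalent)
except Exception:                          # no CuPy / no usable GPU
    y = np.sin(x)                          # CPU fallback: same values

assert np.allclose(y, np.sin(np.arange(4.0)))
```

Note the synchronize call: GPU work is asynchronous, so without it, `time.perf_counter()` would only measure how long it took to *queue* the work.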

Exercises

In the following exercises, we’ll compare various operations using either NumPy or CuPy. Our goal is to demonstrate that:

  • GPU helps for large arrays.
  • GPU overhead dominates for small arrays.
  • Arithmetic density improves GPU benefit.

Example: Create this random array on the GPU. Is it faster than on the CPU?

t0 = time.perf_counter()
np.random.random(200)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'11.0416 ms'
t0 = time.perf_counter()
cp.random.random(200)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'822.4626 ms'

Note that much of the GPU time above is one-time startup cost (context creation and kernel compilation on first use), on top of the per-call overhead that dominates for small arrays.

Exercise: Create this random array on the GPU. Is it faster than on the CPU?

t0 = time.perf_counter()
np.random.random(20_000_000)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'19.1141 ms'
Solution
t0 = time.perf_counter()
cp.random.random(20_000_000)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'3.8420 ms'

Exercise: Does more arithmetic improve GPU benefit?

t0 = time.perf_counter()
x = np.random.random(20_000_000)
y = np.exp(x) + np.sqrt(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'412.1347 ms'
Solution
t0 = time.perf_counter()
x = cp.random.random(20_000_000)
y = cp.exp(x) + cp.sqrt(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'16.3444 ms'

Exercise: Does repeated computation favor the GPU?

t0 = time.perf_counter()
x = np.random.random(2_000_000)
for _ in range(50):
    np.sin(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'1.1261 s'
Solution
t0 = time.perf_counter()
x = cp.random.random(2_000_000)
for _ in range(50):
    cp.sin(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'190.4566 ms'

Exercise: Does combining operations change performance?

t0 = time.perf_counter()
x = np.random.random(20_000_000)
y = np.sin(x) + np.cos(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'685.1928 ms'
Solution
t0 = time.perf_counter()
x = cp.random.random(20_000_000)
y = cp.sin(x) + cp.cos(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'141.0810 ms'

Section 2: The Transfer Trap

The most common GPU mistake is moving data back and forth repeatedly. This is really slow!

Every time we call one of the following, we transfer memory between CPU and GPU:

  • cp.asarray(x)
  • cp.asnumpy(x_gpu)
  • x_gpu.get()

For example, the code below transfers data two times per trial:

for trial in trials:
    x_gpu = cp.asarray(trial)  # CPU to GPU
    y_gpu = cp.sin(x_gpu)   
    result = cp.asnumpy(y_gpu)  # GPU to CPU

A better approach:

  1. Move to GPU once
  2. Compute in batches
  3. Move back to CPU once
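The three steps above might look like the sketch below. The trial count and sizes are illustrative, and the snippet falls back to NumPy when CuPy or a GPU is unavailable, so it runs anywhere:

```python
import numpy as np

try:
    import cupy as xp
    xp.zeros(1)                # probe: raises if there is no usable GPU
    asnumpy = xp.asnumpy
except Exception:              # no CuPy / no GPU: same sketch on the CPU
    import numpy as xp
    asnumpy = xp.asarray

trials = [np.random.random(1_000) for _ in range(50)]

stacked = xp.asarray(np.stack(trials))   # 1. move to GPU once
result_dev = xp.sin(stacked)             # 2. compute on all trials in one batch
result = asnumpy(result_dev)             # 3. move back to CPU once

assert result.shape == (50, 1_000)
```

Compared to the loop above, this performs 2 transfers total instead of 100, and one large kernel launch instead of 50 small ones.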

Reference

Code               Description
-----------------  -------------------
cp.asarray(x)      CPU → GPU transfer
cp.asnumpy(x_gpu)  GPU → CPU transfer
x_gpu.get()        GPU → CPU transfer
np.mean(x_gpu)     Implicit transfer!
cp.mean(x_gpu)     GPU mean

Exercises

Exercise: Optimize the code below to reduce the number of transfers. How much performance benefit is there?

# Before (keep the same, for comparison later)

x = np.random.random(20_000_000)

t0 = time.perf_counter()

for _ in range(50):
    x_gpu = cp.asarray(x)
    y_gpu = cp.sin(x_gpu)
    y = cp.asnumpy(y_gpu)


cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'5.7390 s'
Solution
# After (transfers moved out of the loop)

x = np.random.random(20_000_000)

t0 = time.perf_counter()

x_gpu = cp.asarray(x)          # CPU → GPU, once
for _ in range(50):
    y_gpu = cp.sin(x_gpu)
y = cp.asnumpy(y_gpu)          # GPU → CPU, once

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)

Exercise: Where is the transfer in the code below? Correct it to improve the performance:

# Before (leave unchanged, for comparison later)

x_gpu = cp.random.random(20_000_000)

t0 = time.perf_counter()
np.mean(x_gpu)

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'3.3932 ms'
Solution
# After (compute the mean on the GPU instead)

x_gpu = cp.random.random(20_000_000)

t0 = time.perf_counter()
cp.mean(x_gpu)

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)

Section 3: Reducing Handover with Custom Kernels

Each GPU operation launches a kernel.

Multiple operations → multiple kernel launches.

We can fuse operations by writing a custom elementwise kernel. For example, we could apply thresholding and scaling in one pass, instead of three separate GPU calls.

This reduces:

  • Kernel launch overhead
  • Temporary arrays
  • Memory traffic
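The thresholding-and-scaling example mentioned above might be fused like this. A sketch, where the kernel name and parameters are illustrative, with a NumPy fallback for machines without a GPU:

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # probe: raises if there is no usable GPU
    # One fused pass over the data: one kernel launch, no temporary arrays.
    _threshold_scale = cp.ElementwiseKernel(
        'float64 x, float64 thresh, float64 scale',   # in_params
        'float64 y',                                  # out_params
        'y = x > thresh ? x * scale : 0.0;',          # operation (C-like)
        'threshold_scale')                            # name

    def threshold_scale(x, thresh, scale):
        return cp.asnumpy(_threshold_scale(cp.asarray(x), thresh, scale))
except Exception:  # no CuPy / no GPU: the same computation on the CPU
    def threshold_scale(x, thresh, scale):
        return np.where(x > thresh, x * scale, 0.0)

y = threshold_scale(np.array([0.1, 0.5, 0.9]), 0.3, 2.0)
assert np.allclose(y, [0.0, 1.0, 1.8])
```

Written as separate GPU calls, the same computation would launch a comparison kernel, a multiplication kernel, and a selection kernel, each reading and writing the full array.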

Reference

Code                  Description
--------------------  --------------------------------
cp.ElementwiseKernel  Create custom elementwise kernel
in_params             Input argument types
out_params            Output argument types
operation             C-like code string
name                  Kernel name

Exercise

Let’s make a kernel and check its performance when we reduce the number of kernel launches! Run the code below. Which is faster?

x = cp.random.random(20_000_000)

t0 = time.perf_counter()

add2 = x + 2 + 2 + 2 + 2 + 2

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'253.7830 ms'
Solution
add2 = cp.ElementwiseKernel(
    'float64 x',
    'float64 y',
    'y = x + 2 + 2 + 2 + 2 + 2;',
    'add2'
)

x = cp.random.random(20_000_000)

t0 = time.perf_counter()

y = add2(x)   # one fused kernel launch instead of five

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
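For simple elementwise chains like this, CuPy can also fuse operations for us with the `@cp.fuse` decorator, avoiding hand-written C entirely. A sketch (with a CPU fallback so it runs anywhere):

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # probe: raises if there is no usable GPU

    @cp.fuse()
    def add_ten(x):            # the five additions fuse into a single kernel
        return x + 2 + 2 + 2 + 2 + 2

    out = cp.asnumpy(add_ten(cp.asarray([1.0, 2.0])))
except Exception:              # no CuPy / no GPU: the same arithmetic on the CPU
    out = np.array([1.0, 2.0]) + 10

assert np.allclose(out, [11.0, 12.0])
```

Hand-written ElementwiseKernel code still gives us more control (conditionals, custom types), but for arithmetic chains, fusion gets much of the same benefit with far less effort.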