Working with the GPU in Python: Practical CuPy
Author
# Run this cell to install cupy
# %pip install cupy-cuda12x
import numpy as np
import cupy as cp
import time
Modern GPUs can perform massive numbers of arithmetic operations in parallel. For large array computations — common in neuroscience — this can lead to substantial speedups.
However:
- Moving data between CPU and GPU is expensive.
- Small operations may be slower on the GPU.
- Temporary allocations can dominate runtime.
- Poor batching can erase any advantage.
In this notebook, we learn how to use CuPy, a NumPy-compatible GPU array library, and how to avoid common performance traps.
By the end of the session, we should be able to:
- Understand how CuPy mirrors NumPy
- Identify when GPU acceleration helps
- Avoid unnecessary host-device transfers
- Batch operations efficiently
- Write simple custom GPU kernels
- Understand how kernel fusion reduces overhead
Setup
Installing CuPy
When working with extra hardware, getting the right drivers is essential. The official installation guide is quite helpful: https://docs.cupy.dev/en/stable/install.html. If you don’t have access to a GPU, you can upload this notebook to Google Colab, which provides one.
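Before going further, it can save time to confirm that CuPy actually sees a device. Here is a minimal sketch of such a check (`gpu_available` is a helper name invented here, not part of CuPy):

```python
def gpu_available() -> bool:
    """Return True if CuPy is importable and reports at least one CUDA device."""
    try:
        import cupy as cp
        # getDeviceCount() raises if the CUDA runtime/driver is missing
        return cp.cuda.runtime.getDeviceCount() > 0
    except Exception:
        return False

print(gpu_available())
```

If this prints `False`, the CuPy cells below will fail, and Colab is the easiest workaround.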
Utility Functions
Please run the following code cell to get some utility functions we’ll use in this notebook:
def _format_duration(seconds: float, precision: int = 4) -> str:
    """
    Takes a time in seconds and returns a more human-readable string (e.g. "1.5000 ms").

    Looking to do this in a real project? Some alternatives:
    - `humanize`: https://humanize.readthedocs.io/en/latest/
    """
    if seconds < 0:
        raise ValueError("Duration must be non-negative")
    units = [("s", 1), ("ms", 1e-3), ("µs", 1e-6)]
    for unit, scale in units:
        if seconds >= scale:
            value = seconds / scale
            return f"{value:.{precision}f} {unit}"
    return f"{seconds / 1e-9:.{precision}f} ns"
class utils:
    format_duration = _format_duration
Section 1: CuPy vs NumPy: When Does the GPU Help?
CuPy provides a NumPy-like API that runs on the GPU.
If we replace:
import numpy as np
with:
import cupy as cp
much of our array code runs unchanged. But GPUs behave differently:
- They have massive parallelism
- They have separate memory
- They benefit from large workloads
- They suffer from frequent data transfers
Example neuroscience context:
Applying sin() to 100 million samples may benefit from GPU acceleration. Applying it to 1,000 samples probably will not.
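As a rough sketch of how to try this on your own machine, the snippet below times `sin` on a large array. It uses the common `xp` alias pattern so it falls back to NumPy when CuPy is unavailable; the array size is an arbitrary choice for illustration:

```python
import time
import numpy as np

try:
    import cupy as xp  # use the GPU if CuPy is installed...
except ImportError:
    import numpy as xp  # ...otherwise fall back to NumPy

# a large 1-D signal (e.g. a long recording trace)
x = xp.asarray(np.random.random(1_000_000))

t0 = time.perf_counter()
y = xp.sin(x)
if xp is not np:
    # GPU kernels run asynchronously; wait before reading the clock
    xp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()

print(f"sin on {xp.__name__} took {t1 - t0:.6f} s")
```

Try varying the array size: below some problem size, the CPU version should win.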
Reference
| Code | Description |
|---|---|
| `import cupy as cp` | Import CuPy |
| `cp.asarray(x)` | Move NumPy array to GPU |
| `cp.asnumpy(x_gpu)` | Move GPU array to CPU |
| `x_gpu.get()` | Same as `cp.asnumpy()` |
| `cp.sin(x_gpu)` | GPU sine operation |
| `cp.cuda.Stream.null.synchronize()` | Ensure GPU work is complete before timing |
| `time.perf_counter()` | Measure runtime |
Exercises
In the following exercises, we’ll compare various operations using either NumPy or CuPy. Our goal is to demonstrate:
- GPU helps for large arrays.
- GPU overhead dominates for small arrays.
- Arithmetic density improves GPU benefit.
Example: Create this random array on the GPU. Is it faster than on the CPU?
t0 = time.perf_counter()
np.random.random(200)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'11.0416 ms'
t0 = time.perf_counter()
cp.random.random(200)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'822.4626 ms'
Exercise: Create this random array on the GPU. Is it faster than on the CPU?
t0 = time.perf_counter()
np.random.random(2_000_000)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'19.1141 ms'
Solution
t0 = time.perf_counter()
cp.random.random(20_000_000)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'3.8420 ms'
Exercise: Does more arithmetic improve GPU benefit?
t0 = time.perf_counter()
x = np.random.random(20_000_000)
y = np.exp(x) + np.sqrt(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'412.1347 ms'
Solution
t0 = time.perf_counter()
x = cp.random.random(20_000_000)
y = cp.exp(x) + cp.sqrt(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'16.3444 ms'
Exercise: Does repeated computation favor the GPU?
t0 = time.perf_counter()
x = np.random.random(2_000_000)
for _ in range(50):
    np.sin(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'1.1261 s'
Solution
t0 = time.perf_counter()
x = cp.random.random(2_000_000)
for _ in range(50):
    cp.sin(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'190.4566 ms'
Exercise: Does combining operations change performance?
t0 = time.perf_counter()
x = np.random.random(20_000_000)
y = np.sin(x) + np.cos(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'685.1928 ms'
Solution
t0 = time.perf_counter()
x = cp.random.random(20_000_000)
y = cp.sin(x) + cp.cos(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'141.0810 ms'
Section 2: The Transfer Trap
The most common GPU mistake is moving data back and forth repeatedly. This is really slow!
Every time we call one of the following, we transfer memory between CPU and GPU:
cp.asarray(x)
cp.asnumpy(x_gpu)
x_gpu.get()
For example, the code below transfers data two times per trial:
for trial in trials:
    x_gpu = cp.asarray(trial)   # CPU to GPU
    y_gpu = cp.sin(x_gpu)
    result = cp.asnumpy(y_gpu)  # GPU to CPU
A better approach:
- Move to GPU once
- Compute in batches
- Move back to CPU once
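A minimal sketch of that pattern, using hypothetical trial data (50 trials of 1,000 samples each) and a NumPy fallback via the `xp` alias when CuPy is absent:

```python
import numpy as np

try:
    import cupy as xp  # GPU path if CuPy is installed
except ImportError:
    import numpy as xp  # CPU fallback

# hypothetical data: 50 trials, 1,000 samples per trial
trials = np.random.random((50, 1_000))

trials_dev = xp.asarray(trials)  # one CPU -> GPU transfer (a no-op under NumPy)
result_dev = xp.sin(trials_dev)  # every trial processed without leaving the device
# one GPU -> CPU transfer at the end (CuPy arrays have .get(); NumPy arrays do not)
result = result_dev.get() if hasattr(result_dev, "get") else result_dev
```

The key design choice is batching: one 2-D array crosses the bus twice, rather than 50 small arrays crossing it 100 times.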
Reference
| Code | Description |
|---|---|
| `cp.asarray(x)` | CPU → GPU transfer |
| `cp.asnumpy(x_gpu)` | GPU → CPU transfer |
| `x_gpu.get()` | GPU → CPU transfer |
| `np.mean(x_gpu)` | Implicit transfer! |
| `cp.mean(x_gpu)` | GPU mean |
Exercises
Exercise: Optimize the code below to reduce the number of transfers. How much performance benefit is there?
# Before (keep the same, for comparison later)
x = np.random.random(20_000_000)
t0 = time.perf_counter()
for _ in range(50):
    x_gpu = cp.asarray(x)
    y_gpu = cp.sin(x_gpu)
    y = cp.asnumpy(y_gpu)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'5.7390 s'
Solution
# After (change the code below)
x = np.random.random(20_000_000)
t0 = time.perf_counter()
x_gpu = cp.asarray(x)          # one CPU -> GPU transfer
for _ in range(50):
    y_gpu = cp.sin(x_gpu)      # all computation stays on the device
y = cp.asnumpy(y_gpu)          # one GPU -> CPU transfer
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
Exercise: Where is the transfer in the code below? Correct it to improve the performance:
# Before (leave unchanged, for comparison later)
x_gpu = cp.random.random(20_000_000)
t0 = time.perf_counter()
np.mean(x_gpu)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'3.3932 ms'
Solution
# After (change the code below)
x_gpu = cp.random.random(20_000_000)
t0 = time.perf_counter()
cp.mean(x_gpu)  # stays on the GPU: no implicit transfer back to the host
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
Section 3: Reducing Handover with Custom Kernels
Each GPU operation launches a kernel.
Multiple operations → multiple kernel launches.
We can fuse operations by writing a custom elementwise kernel. For example, we could apply thresholding and scaling in one pass, instead of three separate GPU calls.
This reduces:
- Kernel launch overhead
- Temporary arrays
- Memory traffic
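As a sketch of the thresholding-and-scaling example above, the snippet below defines a fused elementwise kernel (the name `threshold_scale` and the `apply` wrapper are illustrative choices, not CuPy API), with a plain NumPy fallback of the same semantics when CuPy is unavailable:

```python
import numpy as np

try:
    import cupy as cp

    # one fused kernel: compare, scale, and write output in a single pass
    _threshold_scale = cp.ElementwiseKernel(
        'float64 x, float64 thresh, float64 scale',  # in_params
        'float64 y',                                 # out_params
        'y = (x > thresh) ? x * scale : 0.0;',       # operation (C-like)
        'threshold_scale'                            # name
    )

    def apply(x, thresh=0.5, scale=2.0):
        return cp.asnumpy(_threshold_scale(cp.asarray(x), thresh, scale))

except ImportError:
    def apply(x, thresh=0.5, scale=2.0):
        # CPU fallback with identical semantics (three separate passes)
        return np.where(x > thresh, x * scale, 0.0)
```

Without fusion, the same computation on the GPU would launch separate kernels for the comparison, the multiplication, and the selection, each reading and writing a full temporary array.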
Reference
| Code | Description |
|---|---|
| `cp.ElementwiseKernel` | Create custom elementwise kernel |
| `in_params` | Input argument types |
| `out_params` | Output argument types |
| `operation` | C-like code string |
| `name` | Kernel name |
Exercise
Let’s make a kernel, and check its performance when we have less handover! Run the code below. Which is faster?
x = cp.random.random(20_000_000)
t0 = time.perf_counter()
add2 = x + 2 + 2 + 2 + 2 + 2
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'253.7830 ms'
Solution
add2 = cp.ElementwiseKernel(
    'float64 x',
    'float64 y',
    'y = x + 2 + 2 + 2 + 2 + 2;',
    'add2'
)
x = cp.random.random(20_000_000)
add2(x)  # warm up: the kernel is compiled on its first call
t0 = time.perf_counter()
y = add2(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)