Working with the GPU in Python: Practical CuPy

Author
Dr. Nicholas Del Grosso

# Run this cell to install cupy
# %pip install cupy-cuda12x
import numpy as np
import cupy as cp
import time

Modern GPUs can perform massive numbers of arithmetic operations in parallel. For large array computations — common in neuroscience — this can lead to substantial speedups.

However:

  • Moving data between CPU and GPU is expensive.
  • Small operations may be slower on the GPU.
  • Temporary allocations can dominate runtime.
  • Poor batching can erase any advantage.

In this notebook, we learn how to use CuPy, a NumPy-compatible GPU array library, and how to avoid common performance traps. By the end of the session, we should be able to:

  • Understand how CuPy mirrors NumPy
  • Identify when GPU acceleration helps
  • Avoid unnecessary host-device transfers
  • Batch operations efficiently
  • Write simple custom GPU kernels
  • Understand how kernel fusion reduces overhead

Setup

Installing CuPy

When working with extra hardware, getting the right drivers is essential. The official installation guide is quite helpful: https://docs.cupy.dev/en/stable/install.html . If you don’t have access to a GPU, you can upload this notebook to Google Colab, which provides one.
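Once installed, a quick way to check whether CuPy can actually see a GPU is to ask the CUDA runtime how many devices are available. A minimal sketch (it falls back to reporting zero devices when CuPy or a GPU is missing, so it runs anywhere):

```python
try:
    import cupy as cp
    n = cp.cuda.runtime.getDeviceCount()  # number of visible CUDA devices
except Exception:                          # no CuPy installed, or no usable GPU
    n = 0

print(f"usable GPUs: {n}")
```

If this prints 0, the rest of the notebook will still read fine, but the GPU cells will not run.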

Utility Functions

Please run the following code cell to get some utility functions we’ll use in this notebook:

def _format_duration(seconds: float, precision: int = 4) -> str:
    """
    Takes a time in seconds and returns a more human-readable string
    (e.g. 0.0015 -> '1.5000 ms').

    Looking to do this in a real project?  Some alternatives:
      - `humanize`: https://humanize.readthedocs.io/en/latest/
    """
    if seconds < 0:
        raise ValueError("Duration must be non-negative")

    units = [("s", 1), ("ms", 1e-3), ("µs", 1e-6)]

    for unit, scale in units:
        if seconds >= scale:
            value = seconds / scale
            return f"{value:.{precision}f} {unit}"

    return f"{seconds / 1e-9:.{precision}f} ns"

class utils:
    format_duration = _format_duration

Section 1: CuPy vs NumPy: When Does the GPU Help?

CuPy provides a NumPy-like API that runs on the GPU.

In many cases, we can simply replace:

import numpy as np

with:

import cupy as cp

and the same array code will run on the GPU. But GPUs behave differently from CPUs:

  • They have massive parallelism
  • They have separate memory
  • They benefit from large workloads
  • They suffer from frequent data transfers

Example neuroscience context: Applying sin() to 100 million samples may benefit from GPU acceleration. Applying it to 1,000 samples probably will not.

Reference

Code                               Description
---------------------------------  -----------------------------
import cupy as cp                  Import CuPy
cp.asarray(x)                      Move NumPy array to GPU
cp.asnumpy(x_gpu)                  Move GPU array to CPU
x_gpu.get()                        Same as cp.asnumpy()
cp.sin(x_gpu)                      GPU sine operation
cp.cuda.Stream.null.synchronize()  Ensure GPU timing is complete
time.perf_counter()                Measure runtime
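The full round trip from the table above looks like this in practice (a sketch that falls back to NumPy when CuPy or a GPU is unavailable, so it runs anywhere):

```python
import numpy as np

x = np.arange(4.0)

try:
    import cupy as cp
    x_gpu = cp.asarray(x)                  # CPU → GPU transfer
    y_gpu = cp.sin(x_gpu)                  # computed on the device
    cp.cuda.Stream.null.synchronize()      # wait for the GPU before trusting any timing
    y = cp.asnumpy(y_gpu)                  # GPU → CPU (x_gpu.get() is equivalent)
except Exception:                          # no CuPy / no usable GPU
    y = np.sin(x)                          # CPU fallback: same values

assert np.allclose(y, np.sin(np.arange(4.0)))
```

Note the synchronize call: GPU work is asynchronous, so without it, `time.perf_counter()` would only measure how long it took to *queue* the work.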

Exercises

In the following exercises, we’ll compare various operations using either NumPy or CuPy. Our goal is to demonstrate that:

  • GPU helps for large arrays.
  • GPU overhead dominates for small arrays.
  • Arithmetic density improves GPU benefit.

Example: Create this random array on the GPU. Is it faster than on the CPU?

t0 = time.perf_counter()
np.random.random(200)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'11.0416 ms'
t0 = time.perf_counter()
cp.random.random(200)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'822.4626 ms'

Note that much of the GPU time above is one-time startup cost (context creation and kernel compilation on first use), on top of the per-call overhead that dominates for small arrays.

Exercise: Create this random array on the GPU. Is it faster than on the CPU?

t0 = time.perf_counter()
np.random.random(20_000_000)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'19.1141 ms'
Solution
t0 = time.perf_counter()
cp.random.random(20_000_000)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'3.8420 ms'

Exercise: Does more arithmetic improve GPU benefit?

t0 = time.perf_counter()
x = np.random.random(20_000_000)
y = np.exp(x) + np.sqrt(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'412.1347 ms'
Solution
t0 = time.perf_counter()
x = cp.random.random(20_000_000)
y = cp.exp(x) + cp.sqrt(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'16.3444 ms'

Exercise: Does repeated computation favor the GPU?

t0 = time.perf_counter()
x = np.random.random(2_000_000)
for _ in range(50):
    np.sin(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'1.1261 s'
Solution
t0 = time.perf_counter()
x = cp.random.random(2_000_000)
for _ in range(50):
    cp.sin(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'190.4566 ms'

Exercise: Does combining operations change performance?

t0 = time.perf_counter()
x = np.random.random(20_000_000)
y = np.sin(x) + np.cos(x)
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'685.1928 ms'
Solution
t0 = time.perf_counter()
x = cp.random.random(20_000_000)
y = cp.sin(x) + cp.cos(x)
cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'141.0810 ms'

Section 2: The Transfer Trap

The most common GPU mistake is moving data back and forth repeatedly. This is really slow!

Every time we call one of the following, we transfer memory between CPU and GPU:

  • cp.asarray(x)
  • cp.asnumpy(x_gpu)
  • x_gpu.get()

For example, the code below transfers data two times per trial:

for trial in trials:
    x_gpu = cp.asarray(trial)  # CPU to GPU
    y_gpu = cp.sin(x_gpu)   
    result = cp.asnumpy(y_gpu)  # GPU to CPU

A better approach:

  1. Move to GPU once
  2. Compute in batches
  3. Move back to CPU once
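The three steps above might look like the sketch below. The trial count and sizes are illustrative, and the snippet falls back to NumPy when CuPy or a GPU is unavailable, so it runs anywhere:

```python
import numpy as np

try:
    import cupy as xp
    xp.zeros(1)                # probe: raises if there is no usable GPU
    asnumpy = xp.asnumpy
except Exception:              # no CuPy / no GPU: same sketch on the CPU
    import numpy as xp
    asnumpy = xp.asarray

trials = [np.random.random(1_000) for _ in range(50)]

stacked = xp.asarray(np.stack(trials))   # 1. move to GPU once
result_dev = xp.sin(stacked)             # 2. compute on all trials in one batch
result = asnumpy(result_dev)             # 3. move back to CPU once

assert result.shape == (50, 1_000)
```

Compared to the loop above, this performs 2 transfers total instead of 100, and one large kernel launch instead of 50 small ones.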

Reference

Code               Description
-----------------  -------------------
cp.asarray(x)      CPU → GPU transfer
cp.asnumpy(x_gpu)  GPU → CPU transfer
x_gpu.get()        GPU → CPU transfer
np.mean(x_gpu)     Implicit transfer!
cp.mean(x_gpu)     GPU mean

Exercises

Exercise: Optimize the code below to reduce the number of transfers. How much performance benefit is there?

# Before (keep the same, for comparison later)

x = np.random.random(20_000_000)

t0 = time.perf_counter()

for _ in range(50):
    x_gpu = cp.asarray(x)
    y_gpu = cp.sin(x_gpu)
    y = cp.asnumpy(y_gpu)


cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'5.7390 s'
Solution
# After (transfers moved out of the loop)

x = np.random.random(20_000_000)

t0 = time.perf_counter()

x_gpu = cp.asarray(x)          # CPU → GPU, once
for _ in range(50):
    y_gpu = cp.sin(x_gpu)
y = cp.asnumpy(y_gpu)          # GPU → CPU, once

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)

Exercise: Where is the transfer in the code below? Correct it to improve the performance:

# Before (leave unchanged, for comparison later)

x_gpu = cp.random.random(20_000_000)

t0 = time.perf_counter()
np.mean(x_gpu)

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'3.3932 ms'
Solution
# After (compute the mean on the GPU instead)

x_gpu = cp.random.random(20_000_000)

t0 = time.perf_counter()
cp.mean(x_gpu)

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)

Section 3: Reducing Handover with Custom Kernels

Each GPU operation launches a kernel.

Multiple operations → multiple kernel launches.

We can fuse operations by writing a custom elementwise kernel. For example, we could apply thresholding and scaling in one pass, instead of three separate GPU calls.

This reduces:

  • Kernel launch overhead
  • Temporary arrays
  • Memory traffic
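The thresholding-and-scaling example mentioned above might be fused like this. A sketch, where the kernel name and parameters are illustrative, with a NumPy fallback for machines without a GPU:

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # probe: raises if there is no usable GPU
    # One fused pass over the data: one kernel launch, no temporary arrays.
    _threshold_scale = cp.ElementwiseKernel(
        'float64 x, float64 thresh, float64 scale',   # in_params
        'float64 y',                                  # out_params
        'y = x > thresh ? x * scale : 0.0;',          # operation (C-like)
        'threshold_scale')                            # name

    def threshold_scale(x, thresh, scale):
        return cp.asnumpy(_threshold_scale(cp.asarray(x), thresh, scale))
except Exception:  # no CuPy / no GPU: the same computation on the CPU
    def threshold_scale(x, thresh, scale):
        return np.where(x > thresh, x * scale, 0.0)

y = threshold_scale(np.array([0.1, 0.5, 0.9]), 0.3, 2.0)
assert np.allclose(y, [0.0, 1.0, 1.8])
```

Written as separate GPU calls, the same computation would launch a comparison kernel, a multiplication kernel, and a selection kernel, each reading and writing the full array.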

Reference

Code                  Description
--------------------  --------------------------------
cp.ElementwiseKernel  Create custom elementwise kernel
in_params             Input argument types
out_params            Output argument types
operation             C-like code string
name                  Kernel name

Exercise

Let’s make a kernel and check its performance when we reduce the number of kernel launches! Run the code below. Which is faster?

x = cp.random.random(20_000_000)

t0 = time.perf_counter()

add2 = x + 2 + 2 + 2 + 2 + 2

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
'253.7830 ms'
Solution
add2 = cp.ElementwiseKernel(
    'float64 x',
    'float64 y',
    'y = x + 2 + 2 + 2 + 2 + 2;',
    'add2'
)

x = cp.random.random(20_000_000)

t0 = time.perf_counter()

y = add2(x)   # one fused kernel launch instead of five

cp.cuda.Stream.null.synchronize()
t1 = time.perf_counter()
utils.format_duration(t1 - t0)
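For simple elementwise chains like this, CuPy can also fuse operations for us with the `@cp.fuse` decorator, avoiding hand-written C entirely. A sketch (with a CPU fallback so it runs anywhere):

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # probe: raises if there is no usable GPU

    @cp.fuse()
    def add_ten(x):            # the five additions fuse into a single kernel
        return x + 2 + 2 + 2 + 2 + 2

    out = cp.asnumpy(add_ten(cp.asarray([1.0, 2.0])))
except Exception:              # no CuPy / no GPU: the same arithmetic on the CPU
    out = np.array([1.0, 2.0]) + 10

assert np.allclose(out, [11.0, 12.0])
```

Hand-written ElementwiseKernel code still gives us more control (conditionals, custom types), but for arithmetic chains, fusion gets much of the same benefit with far less effort.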