Accelerating peridynamic simulations using Numba CUDA
This post details the acceleration of the pypd package using CUDA. The goal is to ensure that no changes to the user interface are required, while automatically detecting the presence of a CUDA-enabled device.
The majority of the simulation time is spent in two functions: particles.compute_forces() and particles.update_positions(). Both have been optimised using numba in a manner analogous to shared-memory parallelism with OpenMP. The existing implementation does not support distributed-memory architectures (e.g. MPI), and as such the maximum problem size is limited by the memory capacity of a single CPU node, typically allowing for simulations of 1 to 2 million particles.
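For context, the CPU kernels follow the pattern sketched below (a simplified, illustrative example rather than the actual pypd code): numba's prange distributes the loop over bonds across CPU threads, much like an OpenMP parallel for. The array names and shapes are assumptions for the sake of the example.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def compute_bond_stretch(x, u, bondlist):
    """Illustrative only: compute the stretch of every bond in parallel.
    The loop over bonds is distributed across CPU threads by prange,
    analogous to an OpenMP parallel for."""
    n_bonds = bondlist.shape[0]
    stretch = np.empty(n_bonds)
    for k in prange(n_bonds):
        i, j = bondlist[k, 0], bondlist[k, 1]
        xi = x[j] - x[i]                      # reference bond vector
        eta = (x[j] + u[j]) - (x[i] + u[i])   # deformed bond vector
        l0 = np.sqrt(np.sum(xi * xi))
        l = np.sqrt(np.sum(eta * eta))
        stretch[k] = (l - l0) / l0
    return stretch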
We are going to employ numba-cuda to make the CUDA implementation as easy and seamless as possible.
The CUDA programming model
Before writing any kernels, we briefly review the CUDA programming model: an abstraction of the GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware. See this post for further details.
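The output below was produced on a Tesla T4 by get_cuda_device_info(), which can be implemented roughly as follows. This is a sketch: the attribute names follow the CUDA device attributes exposed by numba and are worth checking against the numba documentation.

from numba import cuda

def get_cuda_device_info():
    """Print basic properties of the current CUDA device (sketch only)."""
    device = cuda.get_current_device()
    free, total = cuda.current_context().get_memory_info()
    print("CUDA Device Information:")
    print("-" * 40)
    print(f"CUDA Runtime Version: {cuda.runtime.get_version()}")
    print(f"Device Name: {device.name}")
    print(f"Compute Capability: {device.compute_capability}")
    print(f"Total Memory: {total / 1e9:.2f} GB")
    print(f"Free Memory: {free / 1e9:.2f} GB")
    print(f"Multiprocessors: {device.MULTIPROCESSOR_COUNT}")
    print(f"Max Threads per Block: {device.MAX_THREADS_PER_BLOCK}")
    print(f"Warp Size: {device.WARP_SIZE}")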
CUDA Device Information:
----------------------------------------
CUDA Runtime Version: 12.5
Device Name: b'Tesla T4'
Compute Capability: (7, 5)
Memory:
Total Memory: 15.83 GB
Free Memory: 15.72 GB
Compute Resources:
Multiprocessors: 40
Max Threads per Block: 1024
Grid Limitations:
Max Grid Dimensions X: 2147483647
Max Grid Dimensions Y: 65535
Max Grid Dimensions Z: 65535
Additional Characteristics:
Warp Size: 32
Clock Rate: 1.59 GHz
Memory Clock Rate: 5.00 GHz
1. Device (GPU)
- Physical hardware accelerator
- Contains multiple Streaming Multiprocessors (SMs)
- Manages global memory and computation resources
2. Streaming Multiprocessors (SMs)
- Core computational units of the GPU
- Contains:
- CUDA Cores
- Shared Memory
- Register File
- Special Function Units
- Warp Schedulers
3. Threads
- Smallest unit of execution
- Analogous to a single computational task
- Organised into blocks and grids
4. Thread Blocks
- Collection of threads that can:
- Access the same shared memory
- Synchronise with each other
- Execute on a single Streaming Multiprocessor (SM)
5. Grid
- Highest-level organisational unit
- Collection of thread blocks
- All blocks in a grid execute the same kernel
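These concepts map directly onto the indexing intrinsics available in a numba CUDA kernel. The toy kernel below (illustrative only, not part of pypd) shows how each thread works out which element it owns:

from numba import cuda

@cuda.jit
def double_elements(x):
    """Toy kernel: each thread handles one array element."""
    # Global thread index: block index * block size + thread index within the block
    # (equivalent shorthand: i = cuda.grid(1))
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.shape[0]:  # guard: the last block may contain surplus threads
        x[i] *= 2.0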
Memory Hierarchy
Global Memory
- Largest, slowest memory
- Accessible by all threads
- Resides on device DRAM
- High latency and much lower bandwidth than on-chip memory
Shared Memory
- On-chip memory within an SM
- Explicitly managed by programmer
- Much faster than global memory
- Shared by all threads in a block
Register Memory
- Fastest memory
- Private to each thread
- Limited in size
- Located directly on the SM
Constant Memory
- Read-only cache
- Optimised for broadcast scenarios
- Good for data accessed by multiple threads
Texture Memory
- Read-only cache
- Optimised for 2D spatial locality
- Hardware interpolation support
Execution Model
Warp
- Basic unit of thread execution
- Typically 32 threads
- Executed in SIMT (Single Instruction, Multiple Thread) mode
- All threads in a warp execute the same instruction
Occupancy
- Ratio of active warps per SM to the maximum supported
- Higher occupancy helps hide memory latency
- Influenced by:
- Kernel resource usage
- SM capabilities
- Available registers
- Shared memory consumption
Kernel Launch Configuration
Threads per Block
- Typically multiples of 32 (warp size)
- Maximum usually 1024 threads
- Constrained by:
- Available registers
- Shared memory
- Compute capability
Blocks per Grid
- Determined by total work items
- Limited by device capabilities
- Typically calculated as:
blocks_per_grid = (total_work_items + threads_per_block - 1) // threads_per_block
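Putting this together with the toy kernel sketched earlier, a host-side launch might look like the following (sizes are arbitrary):

import numpy as np
from numba import cuda

x = np.arange(1_000_000, dtype=np.float64)
d_x = cuda.to_device(x)                 # copy host array to device memory

threads_per_block = 256                 # a multiple of the warp size (32)
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block  # ceiling division -> 3907

double_elements[blocks_per_grid, threads_per_block](d_x)
x = d_x.copy_to_host()                  # copy the result back to the host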
pypd
Making some simplifications, setting up and running a simulation using pypd is broadly as follows:
import pypd
particles = pypd.ParticleSet(geometry)
bonds = pypd.BondSet(particles)
model = pypd.Model(particles, bonds)
simulation = pypd.Simulation(n_time_steps=5000)
simulation.run(model)
When the simulation is run, we need to detect if CUDA is available and, if so, allocate memory on the device.
from numba import cuda


class Simulation:

    def __init__(self, n_time_steps):
        self.n_time_steps = n_time_steps
        self.cuda_available = cuda.is_available()
        print(f"Is CUDA available: {self.cuda_available}")
        if self.cuda_available:
            get_cuda_device_info()

    def run(self, model):
        """
        Run the simulation
        """
        if self.cuda_available:
            model._host_to_device()

        for _ in range(self.n_time_steps):
            self._single_time_step(model)

        if self.cuda_available:
            model._device_to_host()

    def _single_time_step(self, model):
        """
        Single time step
        """
        model.particles.compute_forces(model.bonds, self.cuda_available)
        model.particles.update_positions(self)

        if model.penetrators:
            for penetrator in model.penetrators:
                penetrator.calculate_penetrator_force(model.particles, self)
Device memory allocations and data transfers should typically be done outside the CUDA kernel to:
- Minimise redundant data transfers
- Allow for more efficient memory management
- Enable potential reuse of device-allocated memory across multiple function calls
Rather than allocating separate device copies, the host arrays are replaced in place with their device counterparts:
class Model:

    def _host_to_device(self):
        """
        Copy the particle and bond arrays to the device, replacing the
        host arrays with their device counterparts
        """
        from numba import cuda

        self.particles.x = cuda.to_device(self.particles.x)
        self.particles.u = cuda.to_device(self.particles.u)
        self.particles.v = cuda.to_device(self.particles.v)
        self.particles.a = cuda.to_device(self.particles.a)
        self.particles.f = cuda.to_device(self.particles.f)
        self.particles.bc.flag = cuda.to_device(self.particles.bc.flag)
        self.particles.bc.unit_vector = cuda.to_device(self.particles.bc.unit_vector)
        self.bonds.c = cuda.to_device(self.bonds.c)
        self.bonds.d = cuda.to_device(self.bonds.d)
        self.bonds.f_x = cuda.to_device(self.bonds.f_x)
        self.bonds.f_y = cuda.to_device(self.bonds.f_y)
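run() also calls model._device_to_host() once time-stepping has finished. A minimal sketch of that counterpart, assuming the host arrays were replaced with device arrays in _host_to_device as above, might be:

class Model:

    def _device_to_host(self):
        """Copy results back from the GPU at the end of the simulation.
        Sketch only - mirrors _host_to_device above."""
        self.particles.x = self.particles.x.copy_to_host()
        self.particles.u = self.particles.u.copy_to_host()
        self.particles.v = self.particles.v.copy_to_host()
        self.particles.a = self.particles.a.copy_to_host()
        self.particles.f = self.particles.f.copy_to_host()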
class ParticleSet:

    def compute_forces(self, bonds, cuda_available):
        """
        Compute particle forces
        """
        if cuda_available:
            compute_nodal_forces_cuda()  # see CUDA Kernels below
        else:
            self.f, _ = compute_nodal_forces_cpu(
                self.x,
                self.u,
                self.cell_volume,
                bonds.bondlist,
                bonds.d,
                bonds.c,
                bonds.f_x,
                bonds.f_y,
                bonds.constitutive_law.calculate_bond_damage,
                bonds.surface_correction_factors,
            )
class EulerCromer(Integrator):

    def one_timestep(self, particles, simulation):
        """
        Update particle positions using an Euler-Cromer time integration scheme
        """
        if simulation.cuda_available:
            return euler_cromer_cuda()
        else:
            return euler_cromer(
                particles.f,
                particles.u,
                particles.v,
                particles.a,
                particles.node_density,
                particles.bc.flag,
                particles.bc.i_magnitude,
                particles.bc.unit_vector,
                simulation.damping,
                simulation.dt,
            )
CUDA Kernels
Data structure
A neighbour list is a data structure used to efficiently store particles within a specified cut-off distance of each other. There are two common representations:
- Per-particle (family list): Each particle stores a list of its neighbours:
neighbours = [[j1, j2, j3], ...] # shape: [n_particles, n_family_members]
This results in a nested loop structure: loop over particles, then loop over the neighbours of each particle.
for particle in particles:
    for neighbour in particle.neighbours:
        # do stuff
- Pairwise (bond list): Each interaction is stored as a pair of particle indices:
bonds = [[i, j], ...] # shape: [n_bonds, 2]
This format results in a single loop structure: loop over all bonds.
for bond in bonds:
    # do stuff
A pairwise list is an efficient option on a CPU as it avoids nested for loops and redundant computations (each particle pair is processed exactly once). The computation of bond forces is embarrassingly parallel, but the reduction of bond forces to particle forces must be serialised (or protected with atomic operations), as multiple threads may attempt to update the force on the same particle concurrently. See compute_nodal_forces() in this file for details.
In contrast, a per-particle neighbour list can provide better GPU performance. This approach accepts redundant computations (each particle pair is processed twice) in exchange for massively parallel bond force computation and enhanced parallelism during the reduction phase. Additionally, a per-particle neighbour list enables improved memory coalescing patterns, as threads within a warp access contiguous memory locations when processing neighbouring particles. The resulting reduction in memory bandwidth requirements further amplifies the performance advantages on modern GPU architectures.
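To make the layout concrete, below is a hedged sketch of a bond-force kernel over a per-particle neighbour list. The array names (nlist, n_neighbours), the 2D geometry and the prototype linear-elastic bond force are illustrative assumptions, not the pypd implementation. The [n_particles, n_family_members] bond force arrays it fills are then reduced to nodal forces by the kernel in the next section.

import math
from numba import cuda

@cuda.jit
def compute_bond_forces_kernel(x, u, nlist, n_neighbours, c, cell_volume,
                               bond_forces_x, bond_forces_y):
    """Illustrative bond-force kernel over a per-particle neighbour list.
    One thread per (particle, family member) entry."""
    idx = cuda.grid(1)
    n_particles = x.shape[0]
    n_family_members = nlist.shape[1]
    if idx < n_particles * n_family_members:
        i = idx // n_family_members      # particle index
        k = idx % n_family_members       # position in particle i's family
        if k < n_neighbours[i]:
            j = nlist[i, k]              # neighbouring particle index
            # Reference and deformed bond vectors (2D)
            xi_x = x[j, 0] - x[i, 0]
            xi_y = x[j, 1] - x[i, 1]
            eta_x = xi_x + u[j, 0] - u[i, 0]
            eta_y = xi_y + u[j, 1] - u[i, 1]
            l0 = math.sqrt(xi_x * xi_x + xi_y * xi_y)
            l = math.sqrt(eta_x * eta_x + eta_y * eta_y)
            stretch = (l - l0) / l0
            # Prototype linear-elastic bond force (damage omitted for brevity)
            f_mag = c * stretch * cell_volume
            bond_forces_x[i, k] = f_mag * eta_x / l
            bond_forces_y[i, k] = f_mag * eta_y / l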
compute_nodal_forces_cuda()
One of the key challenges in this work is reducing bond forces to particle forces efficiently. We implement a parallel binary (tree) reduction in shared memory, launching one thread block per particle.
The performance improvement is examined in this notebook.
from numba import cuda, float64

THREADS_PER_BLOCK = 256


@cuda.jit
def reduce_bond_forces_kernel(bond_forces, particle_forces):
    """
    Reduce bond forces to particle forces
    Employ sequential addressing; one thread block per particle
    Assumes n_family_members <= THREADS_PER_BLOCK
    """
    # Shared-memory scratch space; the dtype must be known at compile time
    shared = cuda.shared.array(THREADS_PER_BLOCK, dtype=float64)

    particle = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    n_family_members = bond_forces.shape[1]

    # Initialise shared memory (threads beyond the family size contribute zero)
    val = 0.0
    if tid < n_family_members:
        val = bond_forces[particle, tid]
    shared[tid] = val
    cuda.syncthreads()

    # Binary tree reduction with sequential addressing
    stride = THREADS_PER_BLOCK // 2
    while stride > 0:
        if tid < stride:
            shared[tid] += shared[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        particle_forces[particle] = shared[0]
The above can be written more cleanly using cuTile. In cuTile, the execution unit is a tile block and threads are implicit.
A fundamental principle of GPU programming is oversubscription: assigning more thread blocks than there are Streaming Multiprocessors (SMs) on the GPU to ensure full utilisation and hide memory latency.
Kernels are launched with an explicit configuration, kernel_function[grid_size, block_size](arguments). Here we launch one thread block per particle:

reduce_bond_forces_kernel[n_particles, THREADS_PER_BLOCK](bond_forces, particle_forces)
euler_cromer_cuda()
Updating particle positions (Euler-Cromer integration) is an embarrassingly parallel problem and is ideally suited for GPU acceleration. Each unit of work (e.g. a particle) is updated independently and no communication is needed between threads.
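A minimal sketch of such a kernel, assuming one thread per particle, [n_nodes, n_dims] arrays and the same argument order as the CPU euler_cromer() above (the damping form and boundary-condition treatment are illustrative):

from numba import cuda

@cuda.jit
def euler_cromer_kernel(f, u, v, a, density, bc_flag, bc_magnitude,
                        bc_unit_vector, damping, dt):
    """Illustrative Euler-Cromer update (not the pypd implementation).
    One thread per particle; every particle is updated independently."""
    i = cuda.grid(1)
    if i < u.shape[0]:
        for dof in range(u.shape[1]):
            # Damped equation of motion
            a[i, dof] = (f[i, dof] - damping * v[i, dof]) / density
            v[i, dof] = v[i, dof] + a[i, dof] * dt
            u[i, dof] = u[i, dof] + v[i, dof] * dt
            # Overwrite constrained degrees of freedom with the applied BC
            if bc_flag[i, dof] != 0:
                u[i, dof] = bc_magnitude * bc_unit_vector[i, dof]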
ConstitutiveLaw
We need to write a utility function (wrapper function) to create a CUDA-compatible material law. The material law is called with the signature material_law(k_bond, s, d)…
class ConstitutiveLaw:

    @staticmethod
    def create_cuda_material_law(material_law):
        """
        Convert a Python material law function to a CUDA-compatible form

        Parameters
        ----------
        material_law : callable
            A function that takes (k_bond, stretch, current_damage)
            and returns the updated damage

        Returns
        -------
        cuda_material_law : callable
            A pre-compiled CUDA device function
        """
        pass
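One possible implementation, sketched here under the assumption that the material law is written in numba-compatible Python, compiles it as a CUDA device function so that it can be called from inside the force kernel. The brittle law below is purely illustrative, not one of pypd's constitutive laws.

from numba import cuda

def create_cuda_material_law(material_law):
    """Compile a Python material law into a CUDA device function (sketch)."""
    return cuda.jit(device=True)(material_law)

# Example: wrap a simple prototype brittle law (illustrative only)
def linear_elastic_brittle(k_bond, stretch, damage):
    s_critical = 0.01
    if stretch > s_critical:
        return 1.0   # bond is broken
    return damage    # damage is unchanged

linear_elastic_brittle_cuda = create_cuda_material_law(linear_elastic_brittle)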
Spatial locality
Sorting particles spatially (for example along a space-filling curve such as a Morton/Z-order curve) improves memory access patterns: neighbouring particles end up close together in memory, so threads in a warp are more likely to touch the same cache lines.
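A sketch of one option, assuming 2D coordinates: quantise each particle onto a regular grid, interleave the bits of the grid indices to form a Morton (Z-order) code, and reorder all particle arrays by that code.

import numpy as np

def morton_sort_order(x, n_bits=16):
    """Return an ordering of particles along a 2D Morton (Z-order) curve.
    Sketch only: coordinates are quantised onto a 2**n_bits grid and the
    interleaved bits of (ix, iy) give the sort key."""
    xmin = x.min(axis=0)
    span = x.max(axis=0) - xmin
    span = np.where(span > 0, span, 1.0)          # avoid division by zero
    cells = (2**n_bits - 1) * (x - xmin) / span
    ix = cells[:, 0].astype(np.int64)
    iy = cells[:, 1].astype(np.int64)
    codes = np.zeros(len(x), dtype=np.int64)
    for bit in range(n_bits):
        codes |= ((ix >> bit) & 1) << (2 * bit)
        codes |= ((iy >> bit) & 1) << (2 * bit + 1)
    return np.argsort(codes)

# order = morton_sort_order(particles.x)
# particles.x = particles.x[order]   # apply the same permutation to u, v, a, f, ...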
Minimising data transfer overhead
Transferring data between the device (GPU) and the host (CPU) is slow relative to the GPU's compute and memory throughput, so transfers must be kept out of the simulation time-stepping routine: the arrays are copied to the device once before time-stepping begins and the results are copied back once it has finished.
Useful links
- Numba for CUDA programmers
- An even easier introduction to CUDA
- CUDA programming guide
- The CUDA programming model
- Optimising parallel reduction in CUDA
- Numba CUDA documentation: https://numba.readthedocs.io/en/stable/cuda/index.html
- NVIDIA numba-cuda documentation: https://nvidia.github.io/numba-cuda/
- GPU Programming in Pure Python - Bryce Adelstein Lelbach
- Discusses efficient parallel reduction of arrays
- CUDA Python
- CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python.
- How to Write a CUDA Program - The Parallel Programming Edition
- GTC 2025: CUDA Python