03 February 2025

Mark Hobbs

Accelerating peridynamic simulations using Numba CUDA

This post details the acceleration of the pypd package using CUDA. The goal is to accelerate the code without any changes to the user interface: the presence of a CUDA-enabled device is detected automatically.

The majority of the simulation time is spent in two functions: particles.compute_forces() and particles.update_positions(). Both have been optimised using numba in a manner analogous to shared-memory parallelism with OpenMP. The existing implementation does not support distributed-memory architectures (e.g. MPI), and as such, the maximum problem size is limited by the memory capacity of a single machine - typically allowing for simulations of 1 to 2 million particles.
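As a point of reference, the CPU implementation parallelises the loop over bonds with numba's prange. The sketch below is illustrative only - the array names and the stretch calculation are simplified, not pypd's actual code:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def bond_stretch(bondlist, x, u, stretch):
    """Compute the stretch of every bond in parallel (simplified sketch)"""
    for k in prange(bondlist.shape[0]):
        i = bondlist[k, 0]
        j = bondlist[k, 1]
        xi = 0.0   # squared reference bond length
        eta = 0.0  # squared deformed bond length
        for dim in range(x.shape[1]):
            dx = x[j, dim] - x[i, dim]
            dy = dx + u[j, dim] - u[i, dim]
            xi += dx * dx
            eta += dy * dy
        stretch[k] = (np.sqrt(eta) - np.sqrt(xi)) / np.sqrt(xi)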

We are going to employ numba-cuda to make the CUDA implementation as easy and seamless as possible.

The CUDA programming model

Before writing the CUDA kernel, we briefly review the CUDA programming model. The CUDA programming model provides an abstraction of the GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware. See this post for further details.
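When a CUDA device is detected, pypd prints its key properties via get_cuda_device_info(). A minimal sketch of such a helper, using numba's device attributes and context API (the exact implementation in pypd may differ):

from numba import cuda

def get_cuda_device_info():
    """Print the key properties of the current CUDA device"""
    device = cuda.get_current_device()
    free, total = cuda.current_context().get_memory_info()
    runtime = ".".join(str(v) for v in cuda.runtime.get_version())
    print(f"CUDA Runtime Version:     {runtime}")
    print(f"Device Name:              {device.name}")
    print(f"Compute Capability:       {device.compute_capability}")
    print(f"Total Memory:             {total / 1e9:.2f} GB")
    print(f"Free Memory:              {free / 1e9:.2f} GB")
    print(f"Multiprocessors:          {device.MULTIPROCESSOR_COUNT}")
    print(f"Max Threads per Block:    {device.MAX_THREADS_PER_BLOCK}")
    print(f"Warp Size:                {device.WARP_SIZE}")

Running this helper on the Tesla T4 used throughout this post prints: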

CUDA Device Information:
----------------------------------------
CUDA Runtime Version:     12.5
Device Name:              b'Tesla T4'
Compute Capability:       (7, 5)

Memory:
Total Memory:             15.83 GB
Free Memory:              15.72 GB

Compute Resources:
Multiprocessors:          40
Max Threads per Block:    1024

Grid Limitations:
Max Grid Dimensions X:    2147483647
Max Grid Dimensions Y:    65535
Max Grid Dimensions Z:    65535

Additional Characteristics:
Warp Size:                32
Clock Rate:               1.59 GHz
Memory Clock Rate:        5.00 GHz

1. Device (GPU)

2. Streaming Multiprocessors (SMs)

3. Threads

4. Thread Blocks

5. Grid

Memory Hierarchy

Global Memory

Shared Memory

Register Memory

Constant Memory

Texture Memory

Execution Model

Warp

Occupancy

Kernel Launch Configuration

Threads per Block

Blocks per Grid
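To make the threads-per-block / blocks-per-grid configuration concrete, here is a minimal (hypothetical) numba-cuda kernel, where the number of blocks per grid is derived from the problem size and the chosen block size:

import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(x, alpha):
    i = cuda.grid(1)  # global thread index
    if i < x.size:    # guard threads beyond the end of the array
        x[i] *= alpha

x = cuda.to_device(np.ones(1_000_000))
threads_per_block = 256
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block  # ceiling division
scale_kernel[blocks_per_grid, threads_per_block](x, 2.0)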

pypd

Simplifying slightly, setting up and running a simulation with pypd looks broadly as follows:

import pypd

particles = pypd.ParticleSet(geometry)
bonds = pypd.BondSet(particles)
model = pypd.Model(particles, bonds)
simulation = pypd.Simulation(n_time_steps=5000)
simulation.run(model)

When the simulation is run we need to detect whether CUDA is available and, if so, allocate memory on the device and transfer the model data to it.

from numba import cuda


class Simulation:

    def __init__(self, n_time_steps=5000):
        self.n_time_steps = n_time_steps
        self.cuda_available = cuda.is_available()
        print(f"Is CUDA available: {self.cuda_available}")
        if self.cuda_available:
            get_cuda_device_info()

    def run(self, model):
        """
        Run the simulation
        """
        if self.cuda_available:
            model._host_to_device()

        for _ in range(self.n_time_steps):
            self._single_time_step(model)

        if self.cuda_available:
            model._device_to_host()

    def _single_time_step(self, model):
        """
        Single time step
        """
        model.particles.compute_forces(model.bonds, self.cuda_available)
        model.particles.update_positions(self)

        if model.penetrators:
            for penetrator in model.penetrators:
                penetrator.calculate_penetrator_force(model.particles, self)

Device memory allocations and data transfers should typically be performed outside the CUDA kernels and the time-stepping loop; the kernels then modify the device arrays in place…

class Model:

    def _host_to_device(self):
        """
        Transfer arrays to the device

        cuda.to_device() returns a new device array; the returned
        references must be kept, not discarded
        """
        from numba import cuda
        self.particles.x = cuda.to_device(self.particles.x)
        self.particles.u = cuda.to_device(self.particles.u)
        self.particles.v = cuda.to_device(self.particles.v)
        self.particles.a = cuda.to_device(self.particles.a)
        self.particles.f = cuda.to_device(self.particles.f)
        self.particles.bc.flag = cuda.to_device(self.particles.bc.flag)
        self.particles.bc.unit_vector = cuda.to_device(self.particles.bc.unit_vector)
        self.bonds.c = cuda.to_device(self.bonds.c)
        self.bonds.d = cuda.to_device(self.bonds.d)
        self.bonds.f_x = cuda.to_device(self.bonds.f_x)
        self.bonds.f_y = cuda.to_device(self.bonds.f_y)

class ParticleSet:

    def compute_forces(self, bonds, cuda_available):
        """
        Compute particle forces
        """
        if cuda_available:
            compute_nodal_forces_gpu()  # operates on the device arrays in place
        else:
            self.f, _ = compute_nodal_forces_cpu(
                self.x,
                self.u,
                self.cell_volume,
                bonds.bondlist,
                bonds.d,
                bonds.c,
                bonds.f_x,
                bonds.f_y,
                bonds.constitutive_law.calculate_bond_damage,
                bonds.surface_correction_factors,
            )

class EulerCromer(Integrator):

    def one_timestep(self, particles, simulation):
        """
        Update particle positions using an Euler-Cromer time integration scheme
        """
        if simulation.cuda_available:
            return euler_cromer_cuda()  # operates on the device arrays in place
        else:
            return euler_cromer(
                particles.f,
                particles.u,
                particles.v,
                particles.a,
                particles.node_density,
                particles.bc.flag,
                particles.bc.i_magnitude,
                particles.bc.unit_vector,
                simulation.damping,
                simulation.dt
            )

CUDA Kernels

Data structure

A neighbour list is a data structure used to efficiently store particles within a specified cut-off distance of each other. There are two common representations:

  1. Per-particle (family list): Each particle stores a list of its neighbours:

    neighbours = [[j1, j2, j3], ...]  # shape: [n_particles, n_family_members]
    

    This results in a nested loop structure: loop over particles, then loop over the neighbours of each particle.

     for particle in particles:
         for neighbour in particle.neighbours:
             # do stuff 
    
  2. Pairwise (bond list): Each interaction is stored as a pair of particle indices:

    bonds = [[i, j], ...]  # shape: [n_bonds, 2]
    

    This format results in a single loop structure: loop over all bonds.

     for bond in bonds:
         # do stuff 
    

A pairwise list is an efficient option on a CPU as it avoids nested for loops and redundant computations (each particle pair is processed exactly once). The computation of bond forces is embarrassingly parallel, but the reduction of bond forces to particle forces must be performed in serial, as multiple threads may attempt to update the force on the same particle concurrently. See compute_nodal_forces() in this file for details.
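The write conflict is easiest to see in the reduction step itself. A simplified sketch (the names are illustrative):

import numpy as np

def reduce_bond_forces(bondlist, bond_forces, n_particles):
    """Serially reduce pairwise bond forces to per-particle forces"""
    f = np.zeros(n_particles)
    for k in range(bondlist.shape[0]):
        i, j = bondlist[k]
        f[i] += bond_forces[k]  # two bonds can share particle i, so these
        f[j] -= bond_forces[k]  # updates would race if run concurrently
    return f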

In contrast, a per-particle neighbour list can provide better GPU performance. This approach accepts redundant computations (each particle pair is processed twice) in exchange for massively parallel bond force computation and enhanced parallelism during the reduction phase. Additionally, a per-particle neighbour list enables improved memory coalescing patterns, as threads within a warp access contiguous memory locations when processing neighbouring particles. The resulting reduction in memory bandwidth requirements further amplifies the performance advantages on modern GPU architectures.

compute_nodal_forces_cuda()

One of the key challenges in this work is implementing an efficient parallel reduction of bond forces to particle forces. We achieve this with a parallel binary (tree) reduction in shared memory.

The performance improvement is examined in this notebook.

from numba import cuda, float64

THREADS_PER_BLOCK = 256

@cuda.jit
def reduce_bond_forces_kernel(bond_forces, particle_forces):
    """
    Reduce bond forces to particle forces (one block per particle)

    Employ sequential addressing
    """

    # Shared memory size and dtype must be known at compile time
    shared = cuda.shared.array(THREADS_PER_BLOCK, dtype=float64)

    particle = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    n_family_members = bond_forces.shape[1]

    # Initialise shared memory: each thread accumulates a strided subset
    # of the bond forces, so families larger than the block size are
    # handled correctly
    val = 0.0
    k = tid
    while k < n_family_members:
        val += bond_forces[particle, k]
        k += THREADS_PER_BLOCK
    shared[tid] = val

    cuda.syncthreads()

    # Binary tree reduction in shared memory (sequential addressing)
    stride = THREADS_PER_BLOCK // 2
    while stride > 0:
        if tid < stride:
            shared[tid] += shared[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        particle_forces[particle] = shared[0]

The above can be written more cleanly using cuTile, where the execution unit is a tile block and threads are implicit.

A fundamental principle of GPU programming is oversubscription: assigning more thread blocks than there are Streaming Multiprocessors (SMs) on the GPU to ensure full utilisation and hide memory latency.

kernel_function[grid_size, block_size](arguments)

reduce_bond_forces_kernel[n_particles, THREADS_PER_BLOCK](bond_forces, particle_forces)
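With one block per particle, the launch above creates n_particles blocks - far more than the 40 SMs on the Tesla T4 - satisfying the oversubscription principle.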

euler_cromer_cuda()

Updating particle positions (Euler-Cromer integration) is an embarrassingly parallel problem and is ideally suited for GPU acceleration. Each unit of work (e.g. a particle) is updated independently and no communication is needed between threads.
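A minimal sketch of such a kernel, ignoring boundary conditions and assuming flattened one-dimensional arrays (the full pypd version also applies the boundary condition flags, magnitudes and unit vectors):

from numba import cuda

@cuda.jit
def euler_cromer_kernel(f, u, v, a, density, damping, dt):
    """Euler-Cromer integration: one thread per degree of freedom"""
    i = cuda.grid(1)
    if i < f.size:
        a[i] = (f[i] - damping * v[i]) / density[i]  # acceleration from force
        v[i] += a[i] * dt  # update velocity first...
        u[i] += v[i] * dt  # ...then position (Euler-Cromer)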

ConstitutiveLaw

We need to write a utility (wrapper) function to create a CUDA-compatible material law. The material law is called with the signature material_law(k_bond, s, d).

class ConstitutiveLaw:

    @staticmethod
    def create_cuda_material_law(material_law):
        """
        Convert a Python material law function to a CUDA-compatible form

        Parameters
        ----------
        material_law : callable
            A function that takes (k_bond, stretch, current_damage)
            and returns the updated damage

        Returns
        -------
        cuda_material_law : callable
            A compiled CUDA device function that can be called from
            within kernels
        """
        # A minimal implementation: compile the pure-Python law as a
        # CUDA device function
        return cuda.jit(device=True)(material_law)
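For example, a user-defined law (the law below is purely illustrative) can be compiled once and then called from inside any @cuda.jit kernel:

def linear_damage(k_bond, s, d):
    # Illustrative only: damage grows once the stretch exceeds s0
    s0 = 1e-4
    if s > s0:
        return min(1.0, max(d, (s - s0) / s0))
    return d

material_law_gpu = ConstitutiveLaw.create_cuda_material_law(linear_damage)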

Spatial locality

Sort particles spatially to improve memory access patterns… space-filling curves
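A minimal illustrative sketch of ordering particles along a 2-D Morton (Z-order) curve, assuming particle coordinates x of shape (n_particles, 2):

import numpy as np

def morton_order(x, cell_size):
    """Return an ordering of particles along a 2-D Z-order curve"""
    ij = np.floor(x / cell_size).astype(np.uint32)  # grid cell indices

    def spread_bits(v):
        # Interleave zeros between the lower 16 bits of v
        v = v & np.uint32(0x0000FFFF)
        v = (v | (v << 8)) & np.uint32(0x00FF00FF)
        v = (v | (v << 4)) & np.uint32(0x0F0F0F0F)
        v = (v | (v << 2)) & np.uint32(0x33333333)
        v = (v | (v << 1)) & np.uint32(0x55555555)
        return v

    codes = spread_bits(ij[:, 0]) | (spread_bits(ij[:, 1]) << np.uint32(1))
    return np.argsort(codes)

# Reorder all per-particle arrays with the same permutation, e.g.
# order = morton_order(particles.x, cell_size)
# particles.x = particles.x[order]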

Minimising data transfer overhead

Transferring data between the device (GPU) and the host (CPU) is slow relative to the rate at which the GPU can process that data, and such transfers must therefore be kept out of the simulation time-stepping routine to maintain performance.
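This is why Simulation.run() transfers the model to the device once before the time-stepping loop and copies the results back once after it, rather than transferring data at every step.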