Accelerating peridynamic simulations using Numba CUDA
This post details the acceleration of the pypd package using CUDA. The goal is to ensure that no changes to the user interface are required, while automatically detecting the presence of a CUDA-enabled device.
The majority of the simulation time is spent in two functions: particles.compute_forces() and particles.update_positions(). Both have been optimised using numba in a manner analogous to shared-memory parallelism with OpenMP. The existing implementation does not support distributed-memory architectures (e.g. MPI), and as such the maximum problem size is limited by the memory capacity of a single CPU node, typically allowing for simulations of 1 to 2 million particles.
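For context, the CPU kernels follow the pattern sketched below (a simplified, illustrative example rather than the actual pypd code): numba's prange distributes the loop over bonds across CPU threads, much like an OpenMP parallel for. The array names and shapes are assumptions for the sake of the example.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def compute_bond_stretch(x, u, bondlist):
    """Illustrative only: compute the stretch of every bond in parallel.
    The loop over bonds is distributed across CPU threads by prange,
    analogous to an OpenMP parallel for."""
    n_bonds = bondlist.shape[0]
    stretch = np.empty(n_bonds)
    for k in prange(n_bonds):
        i, j = bondlist[k, 0], bondlist[k, 1]
        xi = x[j] - x[i]                      # reference bond vector
        eta = (x[j] + u[j]) - (x[i] + u[i])   # deformed bond vector
        l0 = np.sqrt(np.sum(xi * xi))
        l = np.sqrt(np.sum(eta * eta))
        stretch[k] = (l - l0) / l0
    return stretch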
We are going to employ numba-cuda to make the CUDA implementation as easy and seamless as possible.
The CUDA programming model
Before writing any kernels, we briefly review the CUDA programming model: an abstraction of the GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware. See this post for further details.
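The output below was produced on a Tesla T4 by get_cuda_device_info(), which can be implemented roughly as follows. This is a sketch: the attribute names follow the CUDA device attributes exposed by numba and are worth checking against the numba documentation.

from numba import cuda

def get_cuda_device_info():
    """Print basic properties of the current CUDA device (sketch only)."""
    device = cuda.get_current_device()
    free, total = cuda.current_context().get_memory_info()
    print("CUDA Device Information:")
    print("-" * 40)
    print(f"CUDA Runtime Version: {cuda.runtime.get_version()}")
    print(f"Device Name: {device.name}")
    print(f"Compute Capability: {device.compute_capability}")
    print(f"Total Memory: {total / 1e9:.2f} GB")
    print(f"Free Memory: {free / 1e9:.2f} GB")
    print(f"Multiprocessors: {device.MULTIPROCESSOR_COUNT}")
    print(f"Max Threads per Block: {device.MAX_THREADS_PER_BLOCK}")
    print(f"Warp Size: {device.WARP_SIZE}")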
CUDA Device Information:
----------------------------------------
CUDA Runtime Version: 12.5
Device Name: b'Tesla T4'
Compute Capability: (7, 5)
Memory:
Total Memory: 15.83 GB
Free Memory: 15.72 GB
Compute Resources:
Multiprocessors: 40
Max Threads per Block: 1024
Grid Limitations:
Max Grid Dimensions X: 2147483647
Max Grid Dimensions Y: 65535
Max Grid Dimensions Z: 65535
Additional Characteristics:
Warp Size: 32
Clock Rate: 1.59 GHz
Memory Clock Rate: 5.00 GHz
1. Device (GPU)
- Physical hardware accelerator
- Contains multiple Streaming Multiprocessors (SMs)
- Manages global memory and computation resources
2. Streaming Multiprocessors (SMs)
- Core computational units of the GPU
- Contains:
- CUDA Cores
- Shared Memory
- Register File
- Special Function Units
- Warp Schedulers
3. Threads
- Smallest unit of execution
- Analogous to a single computational task
- Organised into blocks and grids
4. Thread Blocks
- Collection of threads that can:
- Access the same shared memory
- Synchronise with each other
- Execute on a single Streaming Multiprocessor (SM)
5. Grid
- Highest-level organisational unit
- Collection of thread blocks
- All blocks in a grid execute the same kernel
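These concepts map directly onto the indexing intrinsics available in a numba CUDA kernel. The toy kernel below (illustrative only, not part of pypd) shows how each thread works out which element it owns:

from numba import cuda

@cuda.jit
def double_elements(x):
    """Toy kernel: each thread handles one array element."""
    # Global thread index: block index * block size + thread index within the block
    # (equivalent shorthand: i = cuda.grid(1))
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.shape[0]:  # guard: the last block may contain surplus threads
        x[i] *= 2.0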
Memory Hierarchy
Global Memory
- Largest, slowest memory
- Accessible by all threads
- Resides on device DRAM
- High latency and much lower bandwidth than on-chip memory
Shared Memory
- On-chip memory within an SM
- Explicitly managed by programmer
- Much faster than global memory
- Shared by all threads in a block
Register Memory
- Fastest memory
- Private to each thread
- Limited in size
- Located directly on the SM
Constant Memory
- Read-only cache
- Optimised for broadcast scenarios
- Good for data accessed by multiple threads
Texture Memory
- Read-only cache
- Optimised for 2D spatial locality
- Hardware interpolation support
Execution Model
Warp
- Basic unit of thread execution
- Typically 32 threads
- Executed in SIMT (Single Instruction, Multiple Thread) mode
- All threads in a warp execute the same instruction
Occupancy
- Ratio of active warps per SM to the maximum supported
- Higher occupancy helps hide memory latency
- Influenced by:
- Kernel resource usage
- SM capabilities
- Available registers
- Shared memory consumption
Kernel Launch Configuration
Threads per Block
- Typically multiples of 32 (warp size)
- Maximum usually 1024 threads
- Constrained by:
- Available registers
- Shared memory
- Compute capability
Blocks per Grid
- Determined by total work items
- Limited by device capabilities
- Typically calculated as:
blocks_per_grid = (total_work_items + threads_per_block - 1) // threads_per_block
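Putting this together with the toy kernel sketched earlier, a host-side launch might look like the following (sizes are arbitrary):

import numpy as np
from numba import cuda

x = np.arange(1_000_000, dtype=np.float64)
d_x = cuda.to_device(x)                 # copy host array to device memory

threads_per_block = 256                 # a multiple of the warp size (32)
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block  # ceiling division -> 3907

double_elements[blocks_per_grid, threads_per_block](d_x)
x = d_x.copy_to_host()                  # copy the result back to the host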
pypd
Making some simplifications, setting up and running a simulation using pypd is broadly as follows:
import pypd
particles = pypd.ParticleSet(geometry)
bonds = pypd.BondSet(particles)
model = pypd.Model(particles, bonds)
simulation = pypd.Simulation(n_time_steps=5000)
simulation.run(model)
When the simulation is run, we need to detect if CUDA is available and, if so, allocate memory on the device.
from numba import cuda


class Simulation:

    def __init__(self, n_time_steps):
        self.n_time_steps = n_time_steps
        self.cuda_available = cuda.is_available()
        print(f"Is CUDA available: {self.cuda_available}")
        if self.cuda_available:
            get_cuda_device_info()

    def run(self, model):
        """
        Run the simulation
        """
        if self.cuda_available:
            model._host_to_device()

        for _ in range(self.n_time_steps):
            self._single_time_step(model)

        if self.cuda_available:
            model._device_to_host()

    def _single_time_step(self, model):
        """
        Single time step
        """
        model.particles.compute_forces(model.bonds, self.cuda_available)
        model.particles.update_positions(self)

        if model.penetrators:
            for penetrator in model.penetrators:
                penetrator.calculate_penetrator_force(model.particles, self)
Device memory allocations and data transfers should typically be done outside the CUDA kernel to:
- Minimise redundant data transfers
- Allow for more efficient memory management
- Enable potential reuse of device-allocated memory across multiple function calls
Rather than allocating separate device copies, the host arrays are replaced in place with their device counterparts:
class Model:

    def _host_to_device(self):
        """
        Copy the particle and bond arrays to the device, replacing the
        host arrays with their device counterparts
        """
        from numba import cuda

        self.particles.x = cuda.to_device(self.particles.x)
        self.particles.u = cuda.to_device(self.particles.u)
        self.particles.v = cuda.to_device(self.particles.v)
        self.particles.a = cuda.to_device(self.particles.a)
        self.particles.f = cuda.to_device(self.particles.f)
        self.particles.bc.flag = cuda.to_device(self.particles.bc.flag)
        self.particles.bc.unit_vector = cuda.to_device(self.particles.bc.unit_vector)
        self.bonds.c = cuda.to_device(self.bonds.c)
        self.bonds.d = cuda.to_device(self.bonds.d)
        self.bonds.f_x = cuda.to_device(self.bonds.f_x)
        self.bonds.f_y = cuda.to_device(self.bonds.f_y)
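run() also calls model._device_to_host() once time-stepping has finished. A minimal sketch of that counterpart, assuming the host arrays were replaced with device arrays in _host_to_device as above, might be:

class Model:

    def _device_to_host(self):
        """Copy results back from the GPU at the end of the simulation.
        Sketch only - mirrors _host_to_device above."""
        self.particles.x = self.particles.x.copy_to_host()
        self.particles.u = self.particles.u.copy_to_host()
        self.particles.v = self.particles.v.copy_to_host()
        self.particles.a = self.particles.a.copy_to_host()
        self.particles.f = self.particles.f.copy_to_host()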
class ParticleSet:

    def compute_forces(self, bonds, cuda_available):
        """
        Compute particle forces
        """
        if cuda_available:
            compute_nodal_forces_cuda()  # see CUDA Kernels below
        else:
            self.f, _ = compute_nodal_forces_cpu(
                self.x,
                self.u,
                self.cell_volume,
                bonds.bondlist,
                bonds.d,
                bonds.c,
                bonds.f_x,
                bonds.f_y,
                bonds.constitutive_law.calculate_bond_damage,
                bonds.surface_correction_factors,
            )
class EulerCromer(Integrator):

    def one_timestep(self, particles, simulation):
        """
        Update particle positions using an Euler-Cromer time integration scheme
        """
        if simulation.cuda_available:
            return euler_cromer_cuda()
        else:
            return euler_cromer(
                particles.f,
                particles.u,
                particles.v,
                particles.a,
                particles.node_density,
                particles.bc.flag,
                particles.bc.i_magnitude,
                particles.bc.unit_vector,
                simulation.damping,
                simulation.dt,
            )
CUDA Kernels
Data structure
A neighbour list is a data structure used to efficiently store particles within a specified cut-off distance of each other. There are two common representations:
- Per-particle (family list): Each particle stores a list of its neighbours:
neighbours = [[j1, j2, j3], ...] # shape: [n_particles, n_family_members]
This results in a nested loop structure: loop over particles, then loop over the neighbours of each particle.
for particle in particles:
    for neighbour in particle.neighbours:
        # do stuff
- Pairwise (bond list): Each interaction is stored as a pair of particle indices:
bonds = [[i, j], ...] # shape: [n_bonds, 2]
This format results in a single loop structure: loop over all bonds.
for bond in bonds:
    # do stuff
A pairwise list is an efficient option on a CPU as it avoids nested for loops and redundant computations (each particle pair is processed exactly once). The computation of bond forces is embarrassingly parallel, but the reduction of bond forces to particle forces must be serialised (or protected with atomic operations), as multiple threads may attempt to update the force on the same particle concurrently. See compute_nodal_forces() in this file for details.
In contrast, a per-particle neighbour list can provide better GPU performance. This approach accepts redundant computations (each particle pair is processed twice) in exchange for massively parallel bond force computation and enhanced parallelism during the reduction phase. Additionally, a per-particle neighbour list enables improved memory coalescing patterns, as threads within a warp access contiguous memory locations when processing neighbouring particles. The resulting reduction in memory bandwidth requirements further amplifies the performance advantages on modern GPU architectures.
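To make the layout concrete, below is a hedged sketch of a bond-force kernel over a per-particle neighbour list. The array names (nlist, n_neighbours), the 2D geometry and the prototype linear-elastic bond force are illustrative assumptions, not the pypd implementation. The [n_particles, n_family_members] bond force arrays it fills are then reduced to nodal forces by the kernel in the next section.

import math
from numba import cuda

@cuda.jit
def compute_bond_forces_kernel(x, u, nlist, n_neighbours, c, cell_volume,
                               bond_forces_x, bond_forces_y):
    """Illustrative bond-force kernel over a per-particle neighbour list.
    One thread per (particle, family member) entry."""
    idx = cuda.grid(1)
    n_particles = x.shape[0]
    n_family_members = nlist.shape[1]
    if idx < n_particles * n_family_members:
        i = idx // n_family_members      # particle index
        k = idx % n_family_members       # position in particle i's family
        if k < n_neighbours[i]:
            j = nlist[i, k]              # neighbouring particle index
            # Reference and deformed bond vectors (2D)
            xi_x = x[j, 0] - x[i, 0]
            xi_y = x[j, 1] - x[i, 1]
            eta_x = xi_x + u[j, 0] - u[i, 0]
            eta_y = xi_y + u[j, 1] - u[i, 1]
            l0 = math.sqrt(xi_x * xi_x + xi_y * xi_y)
            l = math.sqrt(eta_x * eta_x + eta_y * eta_y)
            stretch = (l - l0) / l0
            # Prototype linear-elastic bond force (damage omitted for brevity)
            f_mag = c * stretch * cell_volume
            bond_forces_x[i, k] = f_mag * eta_x / l
            bond_forces_y[i, k] = f_mag * eta_y / l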
compute_nodal_forces_cuda()
One of the key challenges in this work is reducing bond forces to particle forces efficiently. We implement a parallel binary (tree) reduction in shared memory, launching one thread block per particle.
The performance improvement is examined in this notebook.
from numba import cuda, float64

THREADS_PER_BLOCK = 256


@cuda.jit
def reduce_bond_forces_kernel(bond_forces, particle_forces):
    """
    Reduce bond forces to particle forces
    Employ sequential addressing; one thread block per particle
    Assumes n_family_members <= THREADS_PER_BLOCK
    """
    # Shared-memory scratch space; the dtype must be known at compile time
    shared = cuda.shared.array(THREADS_PER_BLOCK, dtype=float64)

    particle = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    n_family_members = bond_forces.shape[1]

    # Initialise shared memory (threads beyond the family size contribute zero)
    val = 0.0
    if tid < n_family_members:
        val = bond_forces[particle, tid]
    shared[tid] = val
    cuda.syncthreads()

    # Binary tree reduction with sequential addressing
    stride = THREADS_PER_BLOCK // 2
    while stride > 0:
        if tid < stride:
            shared[tid] += shared[tid + stride]
        cuda.syncthreads()
        stride //= 2

    if tid == 0:
        particle_forces[particle] = shared[0]
The above can be written more cleanly using cuTile. In cuTile, the execution unit is a tile block and threads are implicit.
A fundamental principle of GPU programming is oversubscription: assigning more thread blocks than there are Streaming Multiprocessors (SMs) on the GPU to ensure full utilisation and hide memory latency.
Kernels are launched with an explicit configuration, kernel_function[grid_size, block_size](arguments). Here we launch one thread block per particle:

reduce_bond_forces_kernel[n_particles, THREADS_PER_BLOCK](bond_forces, particle_forces)
euler_cromer_cuda()
Updating particle positions (Euler-Cromer integration) is an embarrassingly parallel problem and is ideally suited for GPU acceleration. Each unit of work (e.g. a particle) is updated independently and no communication is needed between threads.
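A minimal sketch of such a kernel, assuming one thread per particle, [n_nodes, n_dims] arrays and the same argument order as the CPU euler_cromer() above (the damping form and boundary-condition treatment are illustrative):

from numba import cuda

@cuda.jit
def euler_cromer_kernel(f, u, v, a, density, bc_flag, bc_magnitude,
                        bc_unit_vector, damping, dt):
    """Illustrative Euler-Cromer update (not the pypd implementation).
    One thread per particle; every particle is updated independently."""
    i = cuda.grid(1)
    if i < u.shape[0]:
        for dof in range(u.shape[1]):
            # Damped equation of motion
            a[i, dof] = (f[i, dof] - damping * v[i, dof]) / density
            v[i, dof] = v[i, dof] + a[i, dof] * dt
            u[i, dof] = u[i, dof] + v[i, dof] * dt
            # Overwrite constrained degrees of freedom with the applied BC
            if bc_flag[i, dof] != 0:
                u[i, dof] = bc_magnitude * bc_unit_vector[i, dof]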
ConstitutiveLaw
We need to write a utility function (wrapper function) to create a CUDA-compatible material law. The material law is called with the signature material_law(k_bond, s, d)…
class ConstitutiveLaw:

    @staticmethod
    def create_cuda_material_law(material_law):
        """
        Convert a Python material law function to a CUDA-compatible form

        Parameters
        ----------
        material_law : callable
            A function that takes (k_bond, stretch, current_damage)
            and returns the updated damage

        Returns
        -------
        cuda_material_law : callable
            A pre-compiled CUDA device function
        """
        pass
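One possible implementation, sketched here under the assumption that the material law is written in numba-compatible Python, compiles it as a CUDA device function so that it can be called from inside the force kernel. The brittle law below is purely illustrative, not one of pypd's constitutive laws.

from numba import cuda

def create_cuda_material_law(material_law):
    """Compile a Python material law into a CUDA device function (sketch)."""
    return cuda.jit(device=True)(material_law)

# Example: wrap a simple prototype brittle law (illustrative only)
def linear_elastic_brittle(k_bond, stretch, damage):
    s_critical = 0.01
    if stretch > s_critical:
        return 1.0   # bond is broken
    return damage    # damage is unchanged

linear_elastic_brittle_cuda = create_cuda_material_law(linear_elastic_brittle)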
Spatial locality
Sorting particles spatially (for example along a space-filling curve such as a Morton/Z-order curve) improves memory access patterns: neighbouring particles end up close together in memory, so threads in a warp are more likely to touch the same cache lines.
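A sketch of one option, assuming 2D coordinates: quantise each particle onto a regular grid, interleave the bits of the grid indices to form a Morton (Z-order) code, and reorder all particle arrays by that code.

import numpy as np

def morton_sort_order(x, n_bits=16):
    """Return an ordering of particles along a 2D Morton (Z-order) curve.
    Sketch only: coordinates are quantised onto a 2**n_bits grid and the
    interleaved bits of (ix, iy) give the sort key."""
    xmin = x.min(axis=0)
    span = x.max(axis=0) - xmin
    span = np.where(span > 0, span, 1.0)          # avoid division by zero
    cells = (2**n_bits - 1) * (x - xmin) / span
    ix = cells[:, 0].astype(np.int64)
    iy = cells[:, 1].astype(np.int64)
    codes = np.zeros(len(x), dtype=np.int64)
    for bit in range(n_bits):
        codes |= ((ix >> bit) & 1) << (2 * bit)
        codes |= ((iy >> bit) & 1) << (2 * bit + 1)
    return np.argsort(codes)

# order = morton_sort_order(particles.x)
# particles.x = particles.x[order]   # apply the same permutation to u, v, a, f, ...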
Minimising data transfer overhead
Transferring data between the device (GPU) and the host (CPU) is slow relative to the GPU's compute and memory throughput, so transfers must be kept out of the simulation time-stepping routine: the arrays are copied to the device once before time-stepping begins and the results are copied back once it has finished.
Useful links
- Numba for CUDA programmers
- An even easier introduction to CUDA
- CUDA programming guide
- The CUDA programming model
- Optimising parallel reduction in CUDA
- Numba CUDA documentation: https://numba.readthedocs.io/en/stable/cuda/index.html
- NVIDIA numba-cuda documentation: https://nvidia.github.io/numba-cuda/
- GPU Programming in Pure Python - Bryce Adelstein Lelbach
- Discusses efficient parallel reduction of arrays
- CUDA Python
- CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python.
- How to Write a CUDA Program - The Parallel Programming Edition
- GTC 2025: CUDA Python