Roofline-Driven Optimization of Fully Non-Uniform Octree and AMR on GPU

  • Latt, Jonas (University of Geneva)
  • Coreixas, Christophe (University of Geneva)


The lattice Boltzmann method (LBM) is a memory-bound CFD approach that maps efficiently onto GPUs. For uniform, matrix-based domain representations, roofline analyses commonly report sustained bandwidth utilization above 80% of peak [1]. However, advanced industrial solvers rely on non-uniform structured meshes, adaptive mesh refinement (AMR), robust collision models, and complex boundary treatments, all of which disrupt regular memory access and challenge these performance levels. This contribution focuses on single-GPU efficiency as a prerequisite for meaningful exascale scaling. We investigate the intrinsic efficiency of a fully non-uniform LBM octree solver which, in contrast to common block-structured implementations, is entirely cell-based. Its performance is assessed against the roofline peak of an equivalent matrix-based implementation, and a systematic decomposition quantifies the losses due to irregular memory access, neighbor access, boundary handling, enhanced collision models, and tree reconstruction. Despite the structural irregularity of a cell-based octree AMR formulation, we sustain performance near 50% of the theoretical memory bandwidth peak of a reference block-structured GPU implementation. All kernels are formulated as compositions of parallel algorithmic primitives using C++ standard parallelism and related high-level frameworks, an approach that has previously proven useful in porting large-scale numerical simulation software [2]. This strategy is applied throughout the solver stack, from collide–stream to dynamic remeshing, octree management, and pre-/post-processing. Fully GPU-resident octree reconstruction enables dynamic AMR without noticeable throughput loss, while the abstraction separates data layout from algorithmic intent, ensuring portability and making algorithmic complexity explicit. Adaptive refinement substantially reduces resolution requirements and time-to-solution.
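To illustrate what "compositions of parallel algorithmic primitives" can look like in a remeshing step, the following is a minimal sketch, not the solver's actual implementation. It assumes a hypothetical flat array of per-cell refinement flags and a hypothetical helper `refined_offsets`, and uses a map followed by a scan to compute where each cell's output begins in the refined cell array. Both standard algorithms used here also have overloads that take an execution policy such as `std::execution::par_unseq` as first argument; the policy is omitted so the sketch builds without a parallel backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (an assumption for illustration only).
// refine_flag[i] != 0 means cell i is split into 8 children; otherwise it is kept.
// Returns n + 1 offsets: offsets[i] is the first output slot of cell i in the
// refined cell array, and offsets[n] is the total number of cells after refinement.
std::vector<std::size_t> refined_offsets(const std::vector<int>& refine_flag) {
    const std::size_t n = refine_flag.size();

    // Primitive 1 (map): a refined cell emits 8 children, an unrefined cell emits itself.
    std::vector<std::size_t> counts(n);
    std::transform(refine_flag.begin(), refine_flag.end(), counts.begin(),
                   [](int f) { return f ? std::size_t{8} : std::size_t{1}; });

    // Primitive 2 (scan): an inclusive scan written one slot to the right yields
    // the exclusive offsets in [0, n) and the grand total in slot n.
    std::vector<std::size_t> offsets(n + 1, 0);
    std::inclusive_scan(counts.begin(), counts.end(), offsets.begin() + 1);

    // Every cell emits at least one output cell, so the total cannot shrink.
    assert(offsets.back() >= n);
    return offsets;
}
```

With the offsets known, each cell can write its output independently of all others, which is what makes this pattern suitable for a GPU-resident reconstruction: the cost is one linear map, one scan, and one scatter.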
The framework supports simulations of industrially relevant configurations with up to O(10⁹) cells, including dynamic AMR, within an overnight run on a single GPU. Overall, careful optimization of irregular data structures combined with portable parallel abstractions provides a systematic path toward exascale-ready advanced LBM solvers.
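For a sense of scale, the bandwidth-bound roofline ceiling of a collide–stream update can be estimated with a few lines of arithmetic. The sketch below rests on assumptions that are not stated in the abstract: a D3Q19 lattice, double-precision populations, each population read once and written once per update, and a purely hypothetical memory bandwidth of 1000 GB/s. The function names are illustrative only.

```cpp
// Bytes moved per cell update when each of the q populations is read once and
// written once (assumption: no additional traffic for indices or flags).
constexpr double bytes_per_cell_update(int q, int bytes_per_value) {
    return 2.0 * q * bytes_per_value;
}

// Bandwidth-bound throughput ceiling in million lattice updates per second (MLUPS).
constexpr double roofline_mlups(double bandwidth_gb_per_s, int q, int bytes_per_value) {
    return bandwidth_gb_per_s * 1e9 / bytes_per_cell_update(q, bytes_per_value) / 1e6;
}

// Wall-clock seconds for one update of `cells` cells when the solver sustains
// `fraction` (between 0 and 1) of the roofline ceiling.
constexpr double seconds_per_step(double cells, double mlups_ceiling, double fraction) {
    return cells / (fraction * mlups_ceiling * 1e6);
}
```

Under these assumptions, one cell update moves 304 bytes, so the hypothetical 1000 GB/s device has a ceiling of roughly 3289 MLUPS, and `seconds_per_step(1e9, roofline_mlups(1000.0, 19, 8), 0.5)` gives about 0.6 s for one pass over 10⁹ cells at half the ceiling. A real octree solver adds index and neighbor-lookup traffic and advances different refinement levels at different rates, so these figures are only an order-of-magnitude guide.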