WCCM ECCOMAS 2026

Performance and Scalability of GPU-Initiated Communication Backends for High-Fidelity CFD

Jansson, Niclas (KTH Royal Institute of Technology)

In session: MS330A - Heterogeneous Computing and Algorithmic Advances for Large-Scale and Scale-Resolving Simulations I


Computational Fluid Dynamics (CFD) is a natural driver for exascale computing, with a virtually unbounded need for computational resources for accurate simulation of turbulent fluid flow in both academic and engineering use. However, with exascale computing capabilities on the horizon, we have seen a transition to more heterogeneous computer architectures with various accelerators. While offering high theoretical peak performance and high memory bandwidth, these systems require complex programming models and significant programming investment to exploit efficiently. In this context, as HPC architectures increasingly rely on GPUs for computational acceleration, efficient inter-node communication becomes a key challenge for scalability. We detail our work on improving the performance and scalability of key numerical methods in the high-fidelity spectral element code Neko on accelerated exascale machines. Efficient nearest-neighbour communication is an essential component of a scalable solver; however, traditional message-passing models often require CPU involvement to mediate communication, introducing latency and limiting the potential of GPU-centric workflows. Emerging models, such as those based on SHMEM, enable direct GPU-initiated communication, allowing GPUs to transfer data between nodes without CPU intervention. This reduces overhead and can improve performance and scalability, especially for applications with fine-grained, irregular communication patterns, which are common in unstructured CFD simulations. We present our development of non-message-passing communication backends in Neko, comparing the performance and scalability of CCL- and SHMEM-based backends across various accelerated architectures.