WCCM ECCOMAS 2026

Communication-Hiding Strategies for Enhanced Parallel Performance of the CFD Solver SOD2D

Muela, Jordi (Barcelona Supercomputing Center (BSC-CNS))
Lehmkuhl, Oriol (Barcelona Supercomputing Center (BSC-CNS))

In session: STS399B - Advances in scale-resolving simulation for turbomachinery II

SOD2D is a high-order computational fluid dynamics (CFD) solver designed for large-scale simulations of both compressible and incompressible flows on modern high-performance computing systems [1]. The code follows a performance-portable approach based on OpenACC, enabling efficient execution on both GPU- and CPU-based architectures. Prior work has analysed the parallel performance and scalability of SOD2D across a range of workloads, reporting near-ideal weak scaling and very good strong scaling while also identifying communication costs as a limiting factor at high concurrency [2]. Recent developments have focused on preparing SOD2D for exascale-class platforms, achieving excellent parallel efficiencies on leadership systems with up to thousands of compute nodes. Although the code exhibits good parallel performance, profiling indicates that inter-partition communication becomes a dominant cost at high concurrency, in both strong- and weak-scaling regimes. In the current implementation, the evaluation of spatial operators and the MPI halo exchanges are performed sequentially, which prevents communication from overlapping with computation and limits scalability as the communication-to-computation ratio increases. This work presents a redesigned operator-evaluation workflow that hides communication costs through asynchronous execution. The elements of the computational domain are split into two sets: MPI-boundary elements and interior elements. Operator kernels associated with boundary elements are evaluated first, after which non-blocking MPI halo exchanges are initiated immediately. While communication progresses, the remaining interior-element computations are executed concurrently on the target architecture. Performance results demonstrate that this communication-hiding strategy improves parallel efficiency and speedup across a wide range of core and accelerator counts. The proposed approach enhances both strong- and weak-scaling behaviour, contributing to the sustained scalability of high-order CFD solvers on current and future exascale systems.
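The ordering described above (boundary elements first, exchange started, interior elements evaluated while the exchange is in flight) can be illustrated with a small toy sketch. It is not taken from SOD2D and makes several simplifying assumptions: a 1D mesh of linear elements stands in for the high-order discretisation, two Python threads stand in for two MPI ranks, and `queue.Queue.put`/`get` stand in for a non-blocking send and the matching wait. All function names (`element_residual`, `partition_step`, `run_overlapped`, `run_serial`) are hypothetical.

```python
import queue
import threading


def element_residual(u, r, e):
    """Add the contribution of 1D linear element e (nodes e and e+1) to r."""
    d = u[e] - u[e + 1]
    r[e] += d
    r[e + 1] -= d


def partition_step(u, shared, inbox, outbox):
    """Evaluate the operator on one partition with the overlapped ordering.

    u      : nodal values local to this partition
    shared : local index of the node shared with the neighbour (first or last)
    inbox  : queue this partition receives from (stand-in for a receive)
    outbox : queue of the neighbour (stand-in for a non-blocking send)
    """
    n_elem = len(u) - 1
    r = [0.0] * len(u)
    # The only element touching the shared node is the "MPI-boundary" element.
    boundary = 0 if shared == 0 else n_elem - 1

    # 1) Boundary element first: its result is what the neighbour needs.
    element_residual(u, r, boundary)

    # 2) Start the exchange. put() on an unbounded queue returns immediately,
    #    which mimics posting a non-blocking send.
    outbox.put(r[shared])

    # 3) Interior elements are evaluated while the message is "in flight".
    for e in range(n_elem):
        if e != boundary:
            element_residual(u, r, e)

    # 4) Complete the exchange and accumulate the neighbour's contribution.
    r[shared] += inbox.get()
    return r


def run_overlapped(u_global, split):
    """Split the mesh at node `split` (0 < split < len(u_global) - 1)."""
    parts = [u_global[: split + 1], u_global[split:]]
    shared = [split, 0]
    boxes = [queue.Queue(), queue.Queue()]
    results = [None, None]

    def worker(rank):
        results[rank] = partition_step(
            parts[rank], shared[rank], boxes[rank], boxes[1 - rank]
        )

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The shared node appears in both partitions; keep a single copy.
    return results[0] + results[1][1:]


def run_serial(u_global):
    """Reference: the same operator evaluated on the unpartitioned mesh."""
    r = [0.0] * len(u_global)
    for e in range(len(u_global) - 1):
        element_residual(u_global, r, e)
    return r
```

Calling `run_overlapped(u, split)` and `run_serial(u)` on the same nodal values should give the same residual, since the reordering changes only when each element is evaluated, not what is computed. In a real solver the benefit would come from step 3 running on the accelerator while the network transfer from step 2 proceeds; this toy only reproduces the ordering, not the timing.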