Alya on GPU: Breakthroughs in Time-to-Solution and Energy-to-Solution

  • Ould Rouis, Yacine (Barcelona Supercomputing Center)
  • Houzeaux, Guillaume (Barcelona Supercomputing Center)
  • Spiga, Filippo (NVIDIA)
  • Owen, Herbert (Barcelona Supercomputing Center)

ALYA is a European flagship simulation code for high-performance computational mechanics, developed and maintained by BSC. The code addresses coupled multi-physics problems, including incompressible and compressible flow, solid mechanics, chemistry, particle transport, heat transfer, turbulence modeling, and electrical propagation. Originally optimized for CPU architectures through cache blocking and vectorized instructions, ALYA has undergone an extensive GPU porting effort based on OpenACC, exploiting the massive parallelism of modern accelerators while keeping a single codebase for both CPU and GPU execution.

The GPU porting strategy centered on adding OpenACC directives with minimal code refactoring. The port covered multiple ALYA modules, with particular focus on NASTIN, the incompressible Navier–Stokes module, which features a semi-implicit formulation, Large Eddy Simulation (LES) turbulence modeling, explicit momentum integration with a third-order Runge–Kutta scheme, and fractional-step pressure–velocity coupling. Challenges included OpenACC limitations with type-bound procedures in Fortran’s object-oriented structures, complex data types with pointers and dynamic allocation, and the inherently sequential nature of GMRES preconditioning (Krylov orthogonalization), which scales poorly on GPUs. The team successfully offloaded matrix assembly and various solver operations (CG, BiCGSTAB, GMRES), as well as GPU-aware MPI communication patterns.

Performance evaluations on BSC’s MareNostrum5 (MN5) supercomputer demonstrated substantial improvements in both execution time and energy efficiency. For the NASTIN module on the Bolund Hill benchmark with 256M elements (43M nodes), the reference run finished in 25 minutes on 4 NVIDIA H100 GPUs, consuming 2.4 MJ of energy in total at an average power of 1600 W.
The CPU baseline (8 dual-socket nodes with 56-core Intel Sapphire Rapids CPUs) required 39 minutes and consumed 16.6 MJ at an average power of 890 W per node. This yields a roughly 7x energy-to-solution improvement and an approximately 1.5x time-to-solution speedup for the full simulation. Multi-GPU scaling shows strong parallel efficiency, with the solver maintaining approximately 80% efficiency at 2.7M mesh nodes per GPU. Measurements were performed with EAR, the energy management framework developed at BSC.
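
The quoted energy and speedup figures can be cross-checked from the reported run times and average power draws. The short sketch below does that arithmetic. It assumes that the 1600 W figure is the aggregate draw of the 4-GPU run and that the 890 W figure applies to each of the 8 CPU nodes; the variable and function names are illustrative and are not taken from ALYA or EAR.

```python
# Figures quoted in the abstract (Bolund Hill benchmark on MareNostrum5).
GPU_MINUTES = 25.0            # wall-clock time on 4 NVIDIA H100 GPUs
GPU_POWER_W = 1600.0          # assumption: aggregate average power of the GPU run
CPU_MINUTES = 39.0            # wall-clock time of the CPU baseline
CPU_NODES = 8                 # dual-socket Sapphire Rapids nodes
CPU_POWER_PER_NODE_W = 890.0  # average power per CPU node


def energy_joules(avg_power_w, minutes):
    """Energy-to-solution: average power (W) times elapsed time (s)."""
    return avg_power_w * minutes * 60.0


gpu_energy = energy_joules(GPU_POWER_W, GPU_MINUTES)
cpu_energy = energy_joules(CPU_POWER_PER_NODE_W * CPU_NODES, CPU_MINUTES)

energy_ratio = cpu_energy / gpu_energy  # energy-to-solution improvement
speedup = CPU_MINUTES / GPU_MINUTES     # time-to-solution speedup

print(f"GPU energy: {gpu_energy / 1e6:.2f} MJ")
print(f"CPU energy: {cpu_energy / 1e6:.2f} MJ")
print(f"Energy-to-solution ratio: {energy_ratio:.2f}x")
print(f"Time-to-solution speedup: {speedup:.2f}x")
```

Under these assumptions the GPU run comes out at 2.4 MJ and the CPU baseline at about 16.7 MJ, which agrees with the reported 2.4 MJ and 16.6 MJ to within rounding and gives ratios close to the quoted 7x and 1.5x.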