WCCM ECCOMAS 2026

Towards Parallel Compiled GPU-Accelerated Reduced Order Models

Eiximeno, Benet (NVIDIA)
Miró, Arnau (Universitat Politècnica de Catalunya)
Kutz, J Nathan (Autodesk Research)
Rodríguez, Ivette (Universitat Politècnica de Catalunya)
Lehmkuhl, Oriol (Barcelona Supercomputing Center)

In session: MS292C - Large-Scale Applications in Scientific Machine Learning III

The new generation of high-fidelity, GPU-accelerated computational fluid dynamics codes, such as SOD2D [1], can resolve transient computations on meshes with hundreds of millions and even billions of degrees of freedom (DoF) in a matter of hours. To fully exploit such computationally intensive simulations, an efficient set of tools for postprocessing, data analysis and reduced order modelling is needed. To address this need, Eiximeno et al. [2] introduced pyLOM, an open-source Python library able to compute any singular value decomposition (SVD) based reduced order model in parallel. The parallel SVD in pyLOM, built on a parallel QR factorization algorithm, is also the core component of the newly introduced geometry agnostic variational autoencoders integration (GAVI) [3]. The source code of pyLOM comprises both a non-compiled version written in pure Python and a compiled version based on Cython wrappers of C functions. All linear algebra operations in C are performed by linking against the LAPACK and Intel MKL libraries. Although the compiled version requires manual compilation of the source code, it is the preferred option on a supercomputer because it performs better and more efficiently than the pure Python implementation. Recently, Miró et al. [4] showed how the non-compiled version of pyLOM can be ported to GPU using CuPy and mpi4py CUDA-aware communications, reporting a speedup of 83 times over the CPU version and decomposing the results on a 1.3 billion DoF mesh of the Windsor body in less than 20 seconds using 100 NVIDIA H100 GPUs. The next step towards a more efficient and scalable reduced order modelling strategy is therefore the compilation of the GPU port of pyLOM. The adopted strategy uses the NVIDIA Warp library, which takes regular Python functions and JIT compiles them to efficient kernel code that can run on either the CPU or the GPU.
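As an illustration of the QR-based parallel SVD mentioned above, the following is a minimal serial sketch in NumPy of a tall-skinny QR (TSQR) approach. It is not pyLOM's code. The function name `tsqr_svd` and the row-block layout are assumptions made for this example. Each row block stands in for the portion of the snapshot matrix owned by one MPI rank, and stacking the small R factors stands in for the reduction across ranks.

```python
import numpy as np

def tsqr_svd(blocks):
    """SVD of a tall matrix given as row blocks (hypothetical helper, serial stand-in for MPI ranks).

    Assumes every block has at least as many rows as columns.
    """
    # Step 1: independent reduced QR on each row block (Q_i: m_i x n, R_i: n x n).
    local = [np.linalg.qr(b) for b in blocks]
    # Step 2: stack the small R factors and factorize again; in a distributed
    # setting this is the only step that needs communication.
    r_stack = np.vstack([r for _, r in local])
    q2, r = np.linalg.qr(r_stack)
    # Step 3: SVD of the small n x n triangular factor.
    ur, s, vt = np.linalg.svd(r)
    # Step 4: each block recovers its rows of U as Q_i @ Q2_i @ Ur.
    n = r.shape[0]
    u_blocks = [q @ q2[i * n:(i + 1) * n] @ ur for i, (q, _) in enumerate(local)]
    return u_blocks, s, vt

# Usage: 400 "degrees of freedom", 8 snapshots, split over 4 blocks.
rng = np.random.default_rng(0)
a = rng.standard_normal((400, 8))
u_blocks, s, vt = tsqr_svd(np.array_split(a, 4))
u = np.vstack(u_blocks)
```

The large left singular vectors never leave their block. Only the n x n factors are gathered, which is why this kind of scheme suits matrices with far more rows (DoF) than columns (snapshots).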
At the conference, we will discuss the benefits of JIT compilation with NVIDIA Warp, focusing on code efficiency and easier portability across different machines, and present a detailed profiling analysis of the parallel SVD computation. We will also detail how this acceleration can improve the implementation of GAVI for dimensionality reduction of turbulent flows.

[1] 10.1016/j.cpc.2023.108995
[2] 10.1016/j.cpc.2024.109459
[3] 10.1016/j.compfluid.2024.106797
[4] rs-7678279