WCCM ECCOMAS 2026

Prehandling and Related Hardware-Oriented Finite Element PDE Solvers Enabling Lower Precision and Tensor Core Computations

Ruda, Dustin (TU Dortmund University)
Turek, Stefan (TU Dortmund University)

In session: MS214A - Recent Trends in Scientific Computing for Computational Fluid Dynamics and Solid Mechanics in the Exascale Range I

The overarching theme of the presented work is how accelerator hardware, in the form of NVIDIA Tensor Core GPUs that promise a peak performance of several hundred TFLOPS, can be used to solve linear systems in finite element PDE computing. This is challenging because conventional solvers and leading hardware manufacturers follow different paradigms. On the one hand, standard iterative solvers such as multigrid methods are mostly based on sparse matrix-vector multiplication and require double precision to achieve sufficient accuracy for ill-conditioned problems. On the other hand, current accelerators achieve their high performance only in low precision and for computationally intensive operations such as dense matrix multiplication, which is crucial for their intended purpose of AI training. To obtain sufficient accuracy for finite element simulations in lower precision, we developed the concept of prehandling, i.e., explicit preconditioning, which improves the conditioning of the linear system. Two approaches, namely hierarchical finite elements and generating systems, have proven suitable for this purpose. They can be applied in two and three spatial dimensions and to different PDEs, such as Poisson's equation and convection-diffusion equations with moderate convection. Based on this, different variants of hardware-oriented solvers that are able to exploit Tensor Cores can be constructed. The basic idea is to exploit recurring mesh cells and to apply a Schur complement in addition to the prehandling step. In this way, the solution of the large, sparse linear system is transformed, to a large extent, into multiplications of small, primarily dense matrices. The focus is on the algorithmics of a semi-iterative method, including estimates of storage requirements, complexity, and performance on Tensor Core GPUs.
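The effect of prehandling on the condition number can be illustrated with a minimal sketch. It is not the authors' implementation, and it assumes the simplest possible setting: the 1D Poisson problem on (0, 1) with linear finite elements and a right-hand side of 1. The function names `nodal_stiffness` and `hierarchical_basis`, the level parameter `L`, and the diagonal scaling are hypothetical choices made for this example. In 1D the hierarchical hat functions are orthogonal in the energy inner product, so the transformed and scaled matrix is essentially the identity. This is special to 1D: in two and three dimensions the transformed matrix is no longer diagonal, so the gain is less extreme.

```python
import numpy as np

def nodal_stiffness(L):
    """Stiffness matrix of -u'' = f on (0, 1), linear FEM, 2**L - 1 interior nodes."""
    n = 2**L - 1
    h = 1.0 / 2**L
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def hierarchical_basis(L):
    """Columns hold the values of the hierarchical hat functions at the fine nodes."""
    n = 2**L - 1
    x = np.arange(1, n + 1) / 2**L
    cols = []
    for level in range(1, L + 1):
        h_level = 1.0 / 2**level
        for k in range(1, 2**level, 2):  # odd nodes are the new nodes on this level
            cols.append(np.maximum(0.0, 1.0 - np.abs(x - k * h_level) / h_level))
    return np.column_stack(cols)

L = 6                                # assumed refinement level (63 unknowns)
A = nodal_stiffness(L)
B = hierarchical_basis(L)
h = 1.0 / 2**L
b = h * np.ones(A.shape[0])          # load vector for f = 1

# Explicit preconditioning: change of basis plus diagonal scaling.
A_hier = B.T @ A @ B
T = B / np.sqrt(np.diag(A_hier))     # scale each basis function (column of B)
A_pre = T.T @ A @ T

cond_nodal = np.linalg.cond(A)
cond_pre = np.linalg.cond(A_pre)

# Solve both systems in single precision and compare with a double precision reference.
x_ref = np.linalg.solve(A, b)
x_nodal32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
T32 = T.astype(np.float32)
y32 = np.linalg.solve(A_pre.astype(np.float32), T32.T @ b.astype(np.float32))
x_pre32 = T32 @ y32

err_nodal = np.linalg.norm(x_nodal32 - x_ref) / np.linalg.norm(x_ref)
err_pre = np.linalg.norm(x_pre32 - x_ref) / np.linalg.norm(x_ref)

print(f"condition number, nodal basis:      {cond_nodal:.3e}")
print(f"condition number, after prehandling: {cond_pre:.3e}")
print(f"float32 relative error, nodal basis:      {err_nodal:.3e}")
print(f"float32 relative error, after prehandling: {err_pre:.3e}")
```

The condition number of the nodal matrix grows as the mesh is refined, whereas the prehandled matrix in this 1D sketch stays at a condition number of about 1 for every `L`. The sketch uses a dense direct solve only for brevity; it does not model the Schur complement step or Tensor Core execution.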