Energy Usage of Matrix-Free Finite Element Methods on Modern GPU Architectures
The widening gap between computational throughput and memory bandwidth has redefined the constraints on algorithm design in high-performance computing. For finite element methods (FEM), this imbalance necessitates a shift from traditional matrix-based kernels toward matrix-free approaches that minimize data movement, a primary driver of power consumption in modern heterogeneous systems. In this work, we systematically investigate the energy-to-solution of high-order matrix-free FEM across modern GPU architectures and evaluate the impact of different programming models and abstractions (including Kokkos, CUDA, and OpenMP) on performance and energy overhead. Furthermore, we explore the use of TinyTC, a domain-specific tensor language from Intel, to assess the benefits of hardware-specific optimizations for the small-scale tensor contractions central to matrix-free FEM. A central challenge in such cross-platform studies is the reliable measurement of energy across heterogeneous hardware. We employ a non-intrusive, script-driven measurement framework that keeps applications agnostic of vendor-specific APIs and energy measurement tools. This approach enables a transparent comparison of energy-to-solution metrics across platforms with minimal modification of the underlying source code. Our results contribute to the field of algorithm-hardware co-design by quantifying the trade-offs between computational intensity and energy consumption. We demonstrate that matrix-free methods not only offer a path to higher performance on modern accelerators but also serve as a critical tool for reducing the total energy footprint of large-scale scientific simulations.
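To illustrate the kind of small tensor contraction referred to above, the following is a minimal sketch in plain Python. It is not taken from the work itself and does not reflect the Kokkos, CUDA, OpenMP, or TinyTC kernels studied there. It assumes a 2D tensor-product operator built from two small square matrices `A` and `B`; the function names `apply_dense` and `apply_matrix_free` are hypothetical. The sketch shows that applying the Kronecker product of `A` and `B` to a vector can be replaced by two small contractions, so the large matrix is never stored or read from memory.

```python
def apply_dense(A, B, u):
    """Reference path: assemble the Kronecker product of A and B, then multiply."""
    n, m = len(A), len(B)
    # K[(i, j), (k, l)] = A[i][k] * B[j][l], with row-major flattening of index pairs.
    K = [[A[i][k] * B[j][l] for k in range(n) for l in range(m)]
         for i in range(n) for j in range(m)]
    return [sum(K[r][c] * u[c] for c in range(n * m)) for r in range(n * m)]


def apply_matrix_free(A, B, u):
    """Same result via two small contractions; the (n*m) x (n*m) matrix is never formed."""
    n, m = len(A), len(B)
    # View u as an n x m array: U[k][l] = u[k * m + l].
    U = [[u[k * m + l] for l in range(m)] for k in range(n)]
    # Contract along the second index: T[k][j] = sum_l B[j][l] * U[k][l].
    T = [[sum(B[j][l] * U[k][l] for l in range(m)) for j in range(m)]
         for k in range(n)]
    # Contract along the first index: V[i][j] = sum_k A[i][k] * T[k][j].
    V = [[sum(A[i][k] * T[k][j] for k in range(n)) for j in range(m)]
         for i in range(n)]
    # Flatten back to a vector in the same row-major ordering.
    return [V[i][j] for i in range(n) for j in range(m)]


# Illustrative inputs (assumed values, chosen only to keep the arithmetic exact).
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
u = [1, 2, 3, 4]

print(apply_matrix_free(A, B, u))
print(apply_dense(A, B, u) == apply_matrix_free(A, B, u))
```

In this toy setting the dense path stores (n·m)² entries, while the factorized path reads only the n² + m² entries of `A` and `B` and performs on the order of n·m·(n + m) multiply-adds, which is the data-movement saving that motivates matrix-free kernels.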
