Maximizing Available Hardware Parallelism on MI300A for Finite Element Applications

  • Mosby, Matthew (Sandia National Laboratories)
  • Parmar, Krishen (Sandia National Laboratories)
  • Shelton, Timothy (Sandia National Laboratories)
  • Glaze, David (Sandia National Laboratories)


The current fastest supercomputer in the world, El Capitan, is based on the AMD MI300A architecture [1]. While the vast majority of its computing horsepower comes from the integrated GPU devices, the CPUs remain powerful and should not be neglected for the portions of finite element applications that have not yet been ported to the GPU. The unified memory architecture of the MI300A allows zero-cost synchronization between CPU and GPU, but at the expense of lower CPU performance, because data layouts chosen to favor GPU performance are suboptimal for the CPU. In this presentation, we describe how outer-loop vectorization using an explicit Single Instruction Multiple Data (SIMD) abstraction [3,4] can regain, or even improve, the performance of finite element algorithms executed on the CPU. We compare the performance of nodal force computations for a uniform gradient hexahedral element [5], implemented with Kokkos [2] and with our SIMD abstraction, when executed on the CPU and GPU portions of the MI300A. Finally, we compare the performance of a complex large-deformation solid mechanics simulation with concurrent SIMD execution on the CPU enabled versus disabled alongside GPU execution.

REFERENCES
[1] TOP500.org, TOP500 list, November 2025. URL: https://top500.org/lists/top500/list/2025/11/
[2] C. R. Trott, et al., "Kokkos 3: Programming model extensions for the exascale era," IEEE Transactions on Parallel and Distributed Systems 33(4) (2021) 805–817.
[3] M. R. Tupek, "Sierra's SIMD vector-math library for element tensor and material calculations," (2015). URL: https://www.osti.gov/biblio/1504182
[4] "Data-parallel types (SIMD)," ISO C++26. URL: https://en.cppreference.com/w/cpp/numeric/simd
[5] D. P. Flanagan and T. Belytschko, "A uniform strain hexahedron and quadrilateral with orthogonal hourglass control," International Journal for Numerical Methods in Engineering 17 (1981) 679–706. doi:10.1002/nme.1620170504
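To make "outer-loop vectorization" concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. It assumes a hypothetical fixed-width `Pack` type in place of a real SIMD abstraction such as [3] or std::simd [4], a hypothetical element-major `Elements` layout standing in for a GPU-oriented data layout, and an assumed width `W = 4`. Rather than vectorizing the short inner loops of one element, the loop over elements advances `W` elements at a time and each SIMD lane carries one element. The kernel computes only the constant-stress part of the uniform gradient nodal force, f_iI = V σ_ij B_jI, with the gradient operator B assumed precomputed; hourglass control [5] is omitted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed SIMD width: 4 doubles (e.g. one 256-bit register). Hypothetical choice.
constexpr std::size_t W = 4;

// Hypothetical fixed-width pack standing in for an explicit SIMD abstraction
// such as [3] or std::simd [4]. Fixed-trip-count lane loops like these are
// straightforward for compilers to map onto vector instructions.
struct Pack {
  std::array<double, W> lane{};
  friend Pack operator+(Pack a, const Pack& b) {
    for (std::size_t l = 0; l < W; ++l) a.lane[l] += b.lane[l];
    return a;
  }
  friend Pack operator*(Pack a, const Pack& b) {
    for (std::size_t l = 0; l < W; ++l) a.lane[l] *= b.lane[l];
    return a;
  }
};

// Hypothetical element-major ("one element per GPU thread") layout.
struct Elements {
  std::size_t n = 0;
  std::vector<double> B;       // n x 3 x 8 uniform gradient operator, B[e][j][I]
  std::vector<double> volume;  // n element volumes
  std::vector<double> stress;  // n x 3 x 3 Cauchy stress, row-major
  std::vector<double> force;   // n x 3 x 8 internal nodal forces (output)
};

// Collect component k of elements e0..e0+W-1 into one pack; lanes past the
// last element stay zero (masked tail).
Pack gather(const std::vector<double>& data, std::size_t n, std::size_t e0,
            std::size_t stride, std::size_t k) {
  assert(e0 < n);
  Pack p;
  for (std::size_t l = 0; l < W && e0 + l < n; ++l) p.lane[l] = data[(e0 + l) * stride + k];
  return p;
}

// Scalar reference: f_iI = V * sum_j sigma_ij * B_jI (constant-stress part only).
void nodal_forces_scalar(Elements& el) {
  for (std::size_t e = 0; e < el.n; ++e)
    for (std::size_t i = 0; i < 3; ++i)
      for (std::size_t I = 0; I < 8; ++I) {
        double f = 0.0;
        for (std::size_t j = 0; j < 3; ++j)
          f += el.stress[e * 9 + 3 * i + j] * el.B[e * 24 + 8 * j + I];
        el.force[e * 24 + 8 * i + I] = f * el.volume[e];
      }
}

// Outer-loop vectorization: the element loop advances W elements at a time and
// each lane carries one element, so the per-element math below is written once
// but executed on W elements per pack operation.
void nodal_forces_simd(Elements& el) {
  for (std::size_t e0 = 0; e0 < el.n; e0 += W) {
    const Pack V = gather(el.volume, el.n, e0, 1, 0);
    std::array<Pack, 9> sig;
    for (std::size_t k = 0; k < 9; ++k) sig[k] = gather(el.stress, el.n, e0, 9, k);
    std::array<Pack, 24> grad;
    for (std::size_t k = 0; k < 24; ++k) grad[k] = gather(el.B, el.n, e0, 24, k);
    for (std::size_t i = 0; i < 3; ++i)
      for (std::size_t I = 0; I < 8; ++I) {
        Pack f{};
        for (std::size_t j = 0; j < 3; ++j) f = f + sig[3 * i + j] * grad[8 * j + I];
        f = f * V;
        for (std::size_t l = 0; l < W && e0 + l < el.n; ++l)  // store valid lanes only
          el.force[(e0 + l) * 24 + 8 * i + I] = f.lane[l];
      }
  }
}

// Largest componentwise difference, used to compare the two kernels.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
  double m = 0.0;
  for (std::size_t k = 0; k < x.size(); ++k) m = std::max(m, std::fabs(x[k] - y[k]));
  return m;
}
```

In this sketch the gather and store loops stand in for the transpose between an element-major layout and SIMD registers, which is one place where a GPU-oriented layout can cost CPU performance; a production SIMD library would typically use vector load and gather instructions there instead of lane-by-lane copies.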