Performance Portability for the Shallow Water Equations on Multi-Accelerator Systems with SYCL

  • Büttner, Markus (University of Bayreuth)
  • Alt, Christoph (Paderborn University)
  • Kenter, Tobias (Paderborn University)
  • Aizinger, Vadym (University of Bayreuth)

SYCL is an open, vendor-independent standard for writing C++ code that targets a wide variety of computing devices. Developers can write parallel code directly in C++ without needing to learn vendor-specific programming models. A SYCL compiler (e.g., Intel's DPC++ compiler or the open-source AdaptiveCpp project) then translates the C++ code to CUDA, OpenCL, or other parallel programming models. We present our work on a SYCL/C++ code for solving the 2D shallow water equations on unstructured triangular meshes on multiple CPUs, GPUs, or FPGAs. The numerical discretization is based on the discontinuous Galerkin method with an explicit time-stepping scheme. Previous work demonstrated that this implementation can run on consumer laptops, workstations, and data-center hardware. This talk focuses on an implementation for multi-node systems with multiple accelerators per node. We benchmark our implementation on a tidal simulation of the Mediterranean Sea at different spatial resolutions and present weak-scaling results on up to 64 GPUs as well as strong-scaling results on up to 22 FPGAs. GPUs, with their high level of parallelism, show their strengths on large meshes with high spatial resolution, yielding an aggregate performance of up to 1.5 trillion degrees of freedom per second. FPGAs complement GPUs by providing super-linear speedups in strong-scaling scenarios for meshes with 100,000 to 600,000 elements. We will also discuss how the execution and communication modes of GPUs and FPGAs differ, and what this implies for mesh partitioning and for a portable implementation.
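To make the term "super-linear speedup" concrete, the following minimal C++ sketch computes speedup and parallel efficiency for a strong-scaling test, where the mesh is fixed and only the device count grows. The function names (`speedup`, `efficiency`, `is_super_linear`) are hypothetical helpers introduced here for illustration and are not taken from the authors' code. Any timings passed to them are placeholders, not results from the talk.

```cpp
#include <cassert>

// Strong scaling: the total problem size (mesh) stays fixed while the
// number of devices increases. All names below are illustrative assumptions.

// Speedup of a run relative to a baseline run of the same problem.
double speedup(double base_time_s, double time_s) {
    assert(base_time_s > 0.0 && time_s > 0.0);
    return base_time_s / time_s;
}

// Parallel efficiency: measured speedup divided by the ideal speedup,
// which is the ratio of device counts (devices / base_devices).
double efficiency(int base_devices, double base_time_s,
                  int devices, double time_s) {
    assert(base_devices > 0 && devices > 0);
    const double ideal = static_cast<double>(devices) / base_devices;
    return speedup(base_time_s, time_s) / ideal;
}

// A run scales super-linearly when its efficiency exceeds 1, i.e. the
// runtime drops by more than the factor by which devices were added.
bool is_super_linear(int base_devices, double base_time_s,
                     int devices, double time_s) {
    return efficiency(base_devices, base_time_s, devices, time_s) > 1.0;
}
```

As a usage example with made-up numbers: if a fixed mesh took 100 s on 1 device and 20 s on 4 devices, `speedup(100.0, 20.0)` gives 5 against an ideal of 4, so `efficiency(1, 100.0, 4, 20.0)` is 1.25 and the run counts as super-linear.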