Student

A Scalable Multigrid Solver with Reduced Synchronization and Enhanced Parallelism

  • Toprak, Teoman (Technical University of Darmstadt)
  • Hollmann, Paul (Technical University of Darmstadt)

The efficient numerical solution of large sparse linear systems arising from PDE discretizations remains a major challenge in scientific computing. While multigrid methods offer optimal complexity for elliptic problems, their performance can deteriorate on anisotropic or heterogeneous systems, and their parallel scalability is often limited by synchronization and communication bottlenecks. The K-cycle multigrid method improves robustness by incorporating small Krylov solves within the multigrid hierarchy, allowing multiple levels to contribute more effectively to error reduction. However, standard implementations of both classical multigrid and K-cycles rely on global synchronization at several stages, which becomes increasingly expensive on large-scale systems, particularly on coarse levels where communication dominates computation. To address these challenges, we propose a task-parallel, semi-asynchronous orthonormalization K-cycle multigrid method (OrthoMG). The approach augments the classical K-cycle with a residual orthonormalization Krylov accelerator and reorganizes the cycle into two largely independent task streams, smoothing and coarse-grid correction, that run concurrently under a temporally overlapped coupling scheme. By allowing each stream to operate on slightly outdated information and synchronizing only once per cycle, OrthoMG reduces global communication, increases concurrency, and preserves the robustness of K-cycles on modern many-core and heterogeneous architectures. The improved parallel scaling of the proposed approach is demonstrated within an extended discontinuous Galerkin framework for multiphase Poisson and Stokes problems. In these tests, the proposed method exhibits better strong and weak scaling than traditional multigrid and K-cycle implementations. The decoupled execution of the smoother and coarse-grid paths significantly reduces idle times on coarse levels, while the orthonormalization-based residual minimization improves robustness. As a result, the method remains efficient at large core counts, confirming its suitability for large-scale multiphase simulations.
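
To make the idea of a residual orthonormalization accelerator concrete, the following is a minimal serial sketch and not the OrthoMG implementation. It assumes a 1D Poisson model problem, a Jacobi smoother, and a two-grid Galerkin coarse correction; the names `add_correction`, `cycle`, and `model_problem` are hypothetical. Both corrections are computed from the same residual taken at the start of the cycle, which loosely mimics two streams working on slightly outdated information; here they simply run one after the other. Each correction z is then combined GCR-style: its image Az is orthonormalized against the stored images, and the step length is chosen to minimize the residual norm.

```python
import numpy as np

def add_correction(A, z, Z, W, x, r):
    """Orthonormalize w = A z against stored images W, then minimize ||r|| along w."""
    w = A @ z
    z = z.copy()
    for zj, wj in zip(Z, W):
        h = wj @ w
        w = w - h * wj          # Gram-Schmidt on the images A z
        z = z - h * zj          # keep z consistent so that w == A z still holds
    nrm = np.linalg.norm(w)
    if nrm < 1e-14:
        return x, r             # correction is (numerically) dependent; skip it
    w = w / nrm
    z = z / nrm
    Z.append(z)
    W.append(w)
    alpha = w @ r               # optimal step since ||w|| == 1
    return x + alpha * z, r - alpha * w

def cycle(A, b, x, P, Ac):
    """One cycle: both corrections use the residual from the start of the cycle."""
    r = b - A @ x
    Z, W = [], []
    z_smooth = r / np.diag(A)                      # Jacobi smoother (assumption)
    z_coarse = P @ np.linalg.solve(Ac, P.T @ r)    # Galerkin coarse correction (assumption)
    for z in (z_smooth, z_coarse):
        x, r = add_correction(A, z, Z, W, x, r)
    return x, r

def model_problem(nc=31):
    """1D Poisson matrix on n = 2*nc + 1 points with linear-interpolation prolongation."""
    n = 2 * nc + 1
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    P = np.zeros((n, nc))
    for j in range(nc):
        P[2 * j, j] = 0.5
        P[2 * j + 1, j] = 1.0
        P[2 * j + 2, j] = 0.5
    return A, P, P.T @ A @ P

A, P, Ac = model_problem()
b = np.ones(A.shape[0])
x = np.zeros_like(b)
norms = [np.linalg.norm(b - A @ x)]
for _ in range(20):
    x, r = cycle(A, b, x, P, Ac)
    norms.append(np.linalg.norm(r))
```

Because every update subtracts the projection of the residual onto a unit vector, the residual norm cannot increase from one correction to the next, regardless of how good the individual corrections are. This is the property that makes such an accelerator tolerant of corrections computed from stale data.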