WCCM ECCOMAS 2026

Tradeoffs for Expedited Data Generation for Learning Interatomic Potentials

Actor, Jonas (Sandia National Laboratories)
Johansson, Anders (Sandia National Laboratories)
McCarthy, Megan (Sandia National Laboratories)
Higgins, Andrew (Sandia National Laboratories)
Modine, Normand (Sandia National Laboratories)
Barry, Matthew (Sandia National Laboratories)
Goff, James (Sandia National Laboratories)

In session: MS303A - Datasets for Science: Large and Small I

Foundation-model surrogates for interatomic potentials have been developed, and have attracted substantial investment, over the last few years. For certain applications and atomic configurations, these models, once trained, accurately predict structures and properties with orders-of-magnitude speedup. However, building a universal foundation model that covers many potentials for many-atom systems remains difficult, not only because of the scale of the problem but also because of the very large amount of compute time needed to generate a substantial corpus of training configurations that the learned models must match. In this talk, we highlight some new methods for fast generation of training data, in particular methods based on randomized linear algebra, and explore the speed-accuracy-size tradeoffs involved in generating and compressing that data. We conclude by showing what these tradeoffs look like for a handful of realistic systems.