Identifying tractable probabilistic models through uncertainty-aware symbolic regression
Symbolic regression (SR) constructs interpretable models from data in the form of analytic expressions. Expressions with comparable prediction error or model complexity, however, can differ greatly in their robustness to input noise and in their ability to extrapolate. Most existing approaches do not consider expression robustness during model selection, so the expressions returned by many SR workflows vary widely in robustness. This is particularly relevant in applications with small data sets and/or relatively high noise. We present a workflow that combines symbolic regression with Bayesian inference (BI) for post-hoc assessment of tractability (convergence of inference) and uncertainty quantification of discovered equations. Candidate expressions derived by SR are interpreted as probabilistic graphical models and assessed, in addition to fit quality, by tractability, parameter identifiability, predictive uncertainty, and noise sensitivity. This allows competing symbolic models to be distinguished under noisy and data-limited conditions. The workflow is demonstrated on a representative materials science problem and highlights the importance of uncertainty analysis for reliable equation discovery in computational mechanics. Figure 1 illustrates a high-noise, small-sample dataset (left) for which splitting off a test set is infeasible. While SR yields ever smaller reconstruction error with increasing complexity, because parameters are tuned to individual data points (overfitting), the posterior predictive MSE of post-hoc BI stays constant from complexity 20 onward, because BI marginalizes (averages) over parameters. Consequently, post-hoc BI avoids overfitting even where model complexity would permit it, making it more robust to noise.
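
The contrast between a point-estimate reconstruction error and a posterior predictive MSE can be illustrated with a minimal sketch. It is not the workflow presented here. It assumes that polynomials of increasing degree stand in for SR candidates of increasing complexity, that the parameters carry a zero-mean Gaussian prior, and that the noise level is known, so that the posterior over parameters is available in closed form. The function names, the synthetic data, and the values of prior_precision and noise_std are all hypothetical choices made for illustration.

```python
import numpy as np
from numpy.polynomial.polynomial import polyvander


def point_estimate_mse(x, y, degree):
    """Training MSE of a least-squares fit (parameters tuned directly to the data)."""
    Phi = polyvander(x, degree)                 # design matrix [1, x, ..., x^degree]
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ w - y) ** 2))


def posterior_predictive(x, y, degree, prior_precision=1.0, noise_std=0.3):
    """Posterior predictive mean and variance at the training inputs.

    Assumes weights ~ N(0, I / prior_precision) and Gaussian noise with a
    known standard deviation (both are illustrative assumptions).
    """
    Phi = polyvander(x, degree)
    beta = 1.0 / noise_std**2                   # noise precision
    A = prior_precision * np.eye(degree + 1) + beta * Phi.T @ Phi
    S = np.linalg.inv(A)                        # posterior covariance of the weights
    m = beta * S @ Phi.T @ y                    # posterior mean of the weights
    mean = Phi @ m                              # prediction averaged over the posterior
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi, S, Phi)
    return mean, var


# Small, noisy synthetic data set (hypothetical, not the data of Figure 1).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.3, x.size)

results = []
for degree in range(1, 10):
    mse_point = point_estimate_mse(x, y, degree)
    mean, var = posterior_predictive(x, y, degree)
    mse_bayes = float(np.mean((mean - y) ** 2))
    results.append((degree, mse_point, mse_bayes, float(var.mean())))
    print(f"degree={degree}  point MSE={mse_point:.4f}  "
          f"posterior predictive MSE={mse_bayes:.4f}  mean pred. var={var.mean():.4f}")
```

In this sketch the least-squares error cannot increase as the degree grows, since each model contains the previous one, whereas the posterior predictive mean is an average over all parameter values weighted by the posterior and therefore never fits the data more closely than the least-squares optimum. For the nonlinear expressions that SR typically produces, no closed form exists and the average would have to be approximated, for example by sampling.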
