WCCM ECCOMAS 2026

Using Description Length to Guide Genetic Programming Improves Symbolic Regression Solutions

Kronberger, Gabriel (University of Applied Sciences Upper Austria)
Bartlett, Deaglan (Astrophysics, University of Oxford)
Desmond, Harry (University of Protsmouth)
Ferreira, Pedro (Astrophysics, University of Oxford)
Olivetti de Franca, Fabricio (Federal University of ABC, Santo Andre)

In session: MS233A - Advancing Computational Mechanics with Equation Discovery and Symbolic Regression I

Please login to view abstract download link

Genetic programming (GP), an evolutionary algorithm for symbolic regression, remains a top-performing method for model discovery and part of this performance can be attributed to the use of gradient-based optimization of parameters. However, this also introduces a tendency of overfitting. This is traditionally addressed by a multi-objective approach, identifying a Pareto-front trading off accuracy and complexity. The disadvantage of this is that it necessitates a further subjective choice of the overall best model or a 1D ranking. This can be overcome with metrics such as description length (DL) that combine accuracy and complexity and are grounded in information theory. In the past this has been applied only on the final population, potentially discarding successful functions during the search. Employing such criteria for the fitness itself could improve GP performance. We show this to be the case -- accuracy, size and computation time all improve when optimizing for DL directly for synthetic, real-world, and SRBench benchmark datasets. We recommend using FBF and DL for SR. These criteria enable GP to evolve correctly-sized expressions with optimal test error while automatically penalizing complexity. Higher signal-to-noise ratios and larger datasets permit more complex solutions.