Using Description Length to Guide Genetic Programming Improves Symbolic Regression Solutions
Please login to view abstract download link
Genetic programming (GP), an evolutionary algorithm for symbolic regression, remains a top-performing method for model discovery and part of this performance can be attributed to the use of gradient-based optimization of parameters. However, this also introduces a tendency of overfitting. This is traditionally addressed by a multi-objective approach, identifying a Pareto-front trading off accuracy and complexity. The disadvantage of this is that it necessitates a further subjective choice of the overall best model or a 1D ranking. This can be overcome with metrics such as description length (DL) that combine accuracy and complexity and are grounded in information theory. In the past this has been applied only on the final population, potentially discarding successful functions during the search. Employing such criteria for the fitness itself could improve GP performance. We show this to be the case -- accuracy, size and computation time all improve when optimizing for DL directly for synthetic, real-world, and SRBench benchmark datasets. We recommend using FBF and DL for SR. These criteria enable GP to evolve correctly-sized expressions with optimal test error while automatically penalizing complexity. Higher signal-to-noise ratios and larger datasets permit more complex solutions.
