ICLR 2026

EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph

Nan Jiang, Ziyi Wang, Yexiang Xue

Cited by 1

Abstract

Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, an important task in AI-driven scientific discovery. Yet the exponential growth of the expression search space makes the task computationally challenging. A promising yet underexplored direction for reducing the search space and accelerating training lies in symbolic equivalence: many expressions, although syntactically different, define the same function. For example, $\log(x_1^2 x_2^3)$, $\log(x_1^2) + \log(x_2^3)$, and $2\log(x_1) + 3\log(x_2)$ are equivalent. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce Egg-SR, a unified framework that integrates symbolic equivalence into a class of modern symbolic regression methods, including Monte Carlo Tree Search (MCTS), Deep Reinforcement Learning (DRL), and Large Language Models (LLMs). Egg-SR compactly represents equivalent expressions through the proposed Egg module (via equality graphs), accelerating learning by: (1) pruning redundant subtree exploration in Egg-MCTS, (2) aggregating rewards across equivalent generated sequences in Egg-DRL, and (3) enriching feedback prompts in Egg-LLM. Theoretically, we show the benefit of embedding Egg into learning: it tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, Egg-SR consistently enhances a class of symbolic regression models across several benchmarks, discovering more accurate expressions within the same time limit. Project page: https://nan-jiang-group.github.io/egg-sr

The expressions $\log(x_1^2 x_2^3)$, $\log(x_1^2) + \log(x_2^3)$, and $2\log(x_1) + 3\log(x_2)$ all represent the same mathematical function and are therefore symbolically equivalent.
Ideally, a well-trained model would recognize such equivalence and assign identical goodness-of-fit, rewards, or losses to the corresponding predicted expressions (Allamanis et al., 2017), since these expressions produce identical functional outputs and attain the same prediction error on the dataset. Existing SR algorithms, however, treat these expressions as distinct outputs, leading to redundant exploration of the search space and slow training. The main challenge in this direction is: how can symbolically equivalent expressions be represented and embedded into modern learning frameworks in a unified and scalable manner?
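
To make the notion of symbolic equivalence concrete, the following is a minimal sketch, not the Egg module itself. It assumes a hypothetical tuple encoding of expressions and two hand-written logarithm rewrite rules, and it enumerates the equivalent forms naively. An equality graph would store the same set compactly rather than listing every variant. All names here (`rewrite_once`, `equivalent_forms`, `evaluate`) are illustrative assumptions.

```python
import math

# Hypothetical encoding: an expression is a variable name (str), a number,
# or a tuple ("log", e), ("add", a, b), ("mul", a, b), ("pow", base, exponent).

def rewrite_once(expr):
    """Yield every expression obtained by applying one rewrite rule once."""
    if not isinstance(expr, tuple):
        return
    op = expr[0]
    # Rule 1: log(a * b) -> log(a) + log(b)   (valid for positive a, b)
    if op == "log" and isinstance(expr[1], tuple) and expr[1][0] == "mul":
        _, a, b = expr[1]
        yield ("add", ("log", a), ("log", b))
    # Rule 2: log(a ** n) -> n * log(a)       (valid for positive a)
    if op == "log" and isinstance(expr[1], tuple) and expr[1][0] == "pow":
        _, a, n = expr[1]
        yield ("mul", n, ("log", a))
    # Apply the rules inside each child, keeping the rest of the tree fixed.
    for i in range(1, len(expr)):
        for new_child in rewrite_once(expr[i]):
            yield expr[:i] + (new_child,) + expr[i + 1:]

def equivalent_forms(expr):
    """Naive closure of the rewrite rules (an e-graph avoids this enumeration)."""
    seen = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for nxt in rewrite_once(current):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def evaluate(expr, env):
    """Evaluate an expression at the variable assignment env."""
    if isinstance(expr, str):
        return env[expr]
    if isinstance(expr, (int, float)):
        return float(expr)
    op, *args = expr
    vals = [evaluate(a, env) for a in args]
    if op == "log":
        return math.log(vals[0])
    if op == "add":
        return vals[0] + vals[1]
    if op == "mul":
        return vals[0] * vals[1]
    if op == "pow":
        return vals[0] ** vals[1]
    raise ValueError(f"unknown operator: {op}")

# log(x1^2 * x2^3), the first expression in the example above.
start = ("log", ("mul", ("pow", "x1", 2), ("pow", "x2", 3)))
# 2*log(x1) + 3*log(x2), the last expression in the example above.
target = ("add", ("mul", 2, ("log", "x1")), ("mul", 3, ("log", "x2")))

forms = equivalent_forms(start)
point = {"x1": 1.5, "x2": 2.0}
values = [evaluate(f, point) for f in forms]
```

Every form in `forms` evaluates to the same number at `point` (up to floating-point error), so any goodness-of-fit score computed from these values is identical across the variants. A learner that scores each variant separately repeats work that a shared representation would do once.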