ICLR 2026
TIPS: Turn-level Information-Potential Reward Shaping for Search-Augmented LLMs
Yutao Xie, Nathaniel Thomas, Nicklas Hansen, Yang Fu, Li Erran Li, Xiaolong Wang
9 citations
Abstract
Search-augmented large language models (LLMs) trained with reinforcement learning (RL) achieve strong results on open-domain question answering (QA), but training remains brittle: rewards are sparse, credit assignment across reasoning and tool calls is difficult, and optimization often collapses on long-horizon tasks. We introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple RL framework that assigns a dense reward to each reasoning and tool-call segment based on how much it increases a teacher model's log-likelihood of the correct answer. This potential is computed by a frozen or periodically refreshed copy of the policy, so TIPS only requires checkpoints of the model being trained (no separate reward model, verifier, or human process labels), making it practical for scaling to frontier models. We show that this turn-level information reward is a form of potential-based shaping, preserving the task's optimal policy while providing fine-grained guidance beyond outcome-only supervision. In a search-augmented QA setting spanning seven in-domain and out-of-domain benchmarks, TIPS consistently outperforms PPO/GRPO baselines and substantially improves training stability; for example, on Qwen-2.5-7B Instruct it improves average Exact Match by 11.8% and F1 by 13.6% over PPO. These results suggest that information-potential shaping is a viable general mechanism for stabilizing long-horizon RL on large, tool-using LLMs. The code base for TIPS is available at https://github.com/ucsd-wang-lab-lm/tips.
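To make the turn-level reward concrete, the following is a minimal sketch of generic potential-based shaping, r_t = gamma * Phi(s_{t+1}) - Phi(s_t), where Phi stands for the teacher's log-likelihood of the correct answer given the context so far. It is not taken from the TIPS code base: the function name `turn_level_shaping_rewards`, the `gamma` parameter, and the potential values are hypothetical, and the abstract does not specify the paper's discounting, normalization, or how the shaping term is combined with the outcome reward.

```python
from typing import List, Sequence


def turn_level_shaping_rewards(
    potentials: Sequence[float], gamma: float = 1.0
) -> List[float]:
    """Return one shaping reward per turn from a sequence of potentials.

    potentials[0] is the (hypothetical) teacher log-likelihood of the correct
    answer before any turn; potentials[t] is the value after turn t.
    The reward for turn t is gamma * potentials[t + 1] - potentials[t].
    """
    if len(potentials) < 2:
        return []  # no completed turn, so nothing to reward
    return [
        gamma * potentials[t + 1] - potentials[t]
        for t in range(len(potentials) - 1)
    ]


# Illustrative numbers only (assumed, not results from the paper):
# log-likelihood of the gold answer before any turn and after each of 3 turns.
potentials = [-6.0, -4.5, -4.75, -1.0]
rewards = turn_level_shaping_rewards(potentials)
total = sum(rewards)
```

With `gamma = 1.0` the per-turn rewards telescope, so their sum equals the final potential minus the initial one. This is the usual intuition for why potential-based shaping adds dense per-turn guidance without changing which policy is optimal; a turn that lowers the teacher's confidence in the correct answer (the second turn above) receives a negative reward.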