ICLR 2026
Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
Cited by 9
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between entropy collapse and entropy explosion. We trace both hazards to the mean baseline used in value-free RL (GRPO/DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose Quantile Advantage Estimation (QAE), replacing the mean with a group-wise $K$-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (success rate $p \le 1-K$) it reinforces rare successes, while on easy queries ($p > 1-K$) it targets remaining failures. Under first-order softmax updates, we prove two-sided entropy safety, giving lower and upper bounds on the one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned $K$, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME'24/'25 and AMC'23. These results identify baseline design, rather than token-level heuristics, as the primary mechanism for scaling RLVR.
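The two-regime gate can be illustrated with a minimal sketch, assuming binary (0/1) verifiable rewards and an inverse-CDF definition of the quantile. The function names `quantile_baseline` and `quantile_advantages`, the value `k=0.4`, and the omission of any standard-deviation normalization are illustrative assumptions, not details taken from the abstract.

```python
import math
from typing import List, Sequence


def quantile_baseline(rewards: Sequence[float], k: float) -> float:
    """Return the k-quantile of one group's rewards (inverse-CDF definition).

    This is the smallest reward r whose empirical CDF satisfies F(r) >= k.
    Assumption: the abstract does not specify the quantile convention.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must lie in (0, 1]")
    ordered = sorted(rewards)
    # Index of the first order statistic whose cumulative fraction reaches k.
    idx = max(math.ceil(k * len(ordered)) - 1, 0)
    return ordered[idx]


def quantile_advantages(rewards: Sequence[float], k: float) -> List[float]:
    """Advantage of each response relative to the group-wise k-quantile.

    Assumption: no variance normalization is applied in this sketch.
    """
    baseline = quantile_baseline(rewards, k)
    return [r - baseline for r in rewards]


# Hard query: 1 success out of 8, so p = 0.125 <= 1 - k = 0.6.
# The baseline is 0; only the rare success gets a nonzero (positive) advantage.
hard = quantile_advantages([0, 0, 0, 0, 0, 0, 0, 1], k=0.4)
print(hard)  # [0, 0, 0, 0, 0, 0, 0, 1]

# Easy query: 7 successes out of 8, so p = 0.875 > 0.6.
# The baseline is 1; only the remaining failure gets a nonzero (negative) advantage.
easy = quantile_advantages([1, 1, 1, 1, 1, 1, 1, 0], k=0.4)
print(easy)  # [0, 0, 0, 0, 0, 0, 0, -1]
```

In both groups seven of the eight responses receive zero advantage, which is the kind of sparsified credit assignment the abstract describes. A mean baseline would instead give every response in these groups a nonzero advantage.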