ICLR 2026

RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents

Zijing Zhang, Ziyang Chen, Mingxiao Li, Zhaopeng Tu, Xiaolong Li

Cited by 38

Abstract

We demonstrate the effectiveness of RLVMR on two challenging long-horizon benchmarks, ALFWorld and ScienceWorld. Our experiments show that RLVMR achieves new state-of-the-art results across all settings. Notably, on the hardest unseen task split (L2), our 7B model achieves an 83.6% success rate, surpassing much larger models. In-depth analysis reveals that these gains are driven by a tangible improvement in reasoning quality: RLVMR-trained agents exhibit significant reductions in repetitive and invalid actions. This confirms that rewarding the process of good reasoning yields agents that are not only more successful but also more robust, efficient, and generalizable.

In summary, our contributions are as follows:

1. We identify and formulate the inefficient exploration problem in long-horizon agents, showing how optimizing for final outcomes alone reinforces flawed reasoning and leads to brittle policies that fail to generalize.
2. We propose RLVMR, a novel RL framework that provides dense, process-level supervision by rewarding verifiable meta-reasoning behaviors (e.g., planning, exploration, reflection) using lightweight, programmatic rules.
3. We achieve state-of-the-art performance on the challenging ALFWorld and ScienceWorld benchmarks, with significant improvements in generalization to unseen tasks.
4. We provide in-depth analysis confirming that RLVMR's gains stem directly from improved reasoning quality, evidenced by measurable reductions in repetitive actions and enhanced error recovery, thereby improving both agent robustness and efficiency.
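To make the idea of rule-based, process-level rewards more concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the actual rules, tag names, or reward magnitudes, so the `Step` structure, the three tag strings, and the `bonus` and `penalty` values are all illustrative assumptions. The sketch also computes the two trajectory statistics the abstract refers to, the rates of repetitive and invalid actions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    tag: str      # hypothetical meta-reasoning label: "planning", "exploration", "reflection"
    action: str   # action text sent to the environment
    valid: bool   # whether the environment accepted the action


def repetition_rate(steps: List[Step]) -> float:
    """Fraction of steps whose action repeats an earlier action in the episode."""
    seen, repeats = set(), 0
    for s in steps:
        if s.action in seen:
            repeats += 1
        seen.add(s.action)
    return repeats / len(steps) if steps else 0.0


def invalid_rate(steps: List[Step]) -> float:
    """Fraction of steps the environment rejected."""
    return sum(not s.valid for s in steps) / len(steps) if steps else 0.0


def process_reward(steps: List[Step], bonus: float = 0.1, penalty: float = 0.1) -> float:
    """Toy programmatic process reward (assumed rules, not taken from the paper).

    + bonus   for planning on the first step
    + bonus   for a valid exploration step that tries a new action
    + bonus   for reflecting right after an invalid action
    - penalty for any invalid or repeated action
    """
    seen, total = set(), 0.0
    for i, s in enumerate(steps):
        if s.tag == "planning" and i == 0:
            total += bonus
        elif s.tag == "exploration" and s.valid and s.action not in seen:
            total += bonus
        elif s.tag == "reflection" and i > 0 and not steps[i - 1].valid:
            total += bonus
        if not s.valid or s.action in seen:
            total -= penalty
        seen.add(s.action)
    return total


# A small hand-written episode (made-up data, for illustration only).
episode = [
    Step("planning", "look", True),
    Step("exploration", "open drawer", True),
    Step("exploration", "open drawer", True),   # repeated action
    Step("exploration", "take key", False),     # invalid action
    Step("reflection", "go to desk", True),     # reflection after a failure
]
```

In an RL setup of this kind, a dense term such as `process_reward(episode)` would presumably be combined with the sparse task-success reward; how RLVMR weights the two is not stated in the abstract.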