ICLR 2026

PROS: Towards Compute-Efficient RLVR via Rollout Prefix Reuse

Baizhou Huang, Xiaojun Wan

Abstract

Large reasoning models (LRMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) have achieved remarkable progress on complex reasoning tasks. However, RLVR relies heavily on on-policy rollout generation, whose cost grows rapidly with rollout length and model size and eventually becomes the training bottleneck. Our empirical analysis reveals that independent rollouts for the same query often share similar early steps, indicating substantial redundancy. To address this, we propose PROS (Prefix Reuse for On-policy Sampling), a paradigm that reuses promising prefixes of historical rollouts in RLVR training. PROS appends these self-generated partial rollouts to the original queries to form Augmented Queries, which are then used as regular training inputs in subsequent iterations, thereby reducing redundant computation. To select training batches from the augmented queries, PROS adopts a hierarchical Bayesian model to estimate their pass rates and prioritizes those with the highest reward uncertainty. Experiments across diverse settings show that PROS consistently improves training efficiency and achieves higher accuracy than strong baselines. These results highlight PROS as a practical path toward scalable and compute-efficient RLVR.

Published as a conference paper at ICLR 2026

Jen enters a lottery by picking 4 distinct numbers from $S = \{1, 2, 3, \cdots, 9, 10\}$. 4 numbers are randomly chosen from $S$. She wins a prize if at least two of her numbers were 2 of the randomly chosen numbers, and wins the grand prize if all four of her numbers were the randomly chosen numbers. The probability of her winning the grand prize given that she won a prize is $\tfrac{m}{n}$, where $m$ and $n$ are relatively prime positive integers. Find $m+n$.
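As a sanity check on the lottery query above, its conditional probability can be computed by brute-force enumeration. The sketch below is not part of PROS or of the paper's pipeline. It only illustrates the kind of verifiable answer such a query admits. The fixed pick `{1, 2, 3, 4}` is an assumption that is harmless by symmetry, and all variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# The set S = {1, ..., 10} from the problem statement.
S = range(1, 11)

# Assumption: Jen's pick is fixed to {1, 2, 3, 4}. By symmetry, any
# other choice of 4 distinct numbers gives the same counts.
jen = {1, 2, 3, 4}

prize = 0  # draws that share at least two numbers with Jen's pick
grand = 0  # draws that match all four of Jen's numbers

for draw in combinations(S, 4):
    matches = len(jen & set(draw))
    if matches >= 2:
        prize += 1
    if matches == 4:
        grand += 1

# P(grand prize | prize) = grand / prize; Fraction reduces to lowest terms.
p = Fraction(grand, prize)
answer = p.numerator + p.denominator

print(f"prize draws: {prize}, grand-prize draws: {grand}")
print(f"P(grand | prize) = {p}, m + n = {answer}")
```

The same counts follow by hand: of the $\binom{10}{4} = 210$ possible draws, $\binom{4}{2}\binom{6}{2} + \binom{4}{3}\binom{6}{1} + \binom{4}{4} = 90 + 24 + 1 = 115$ win a prize and exactly one wins the grand prize, so the probability is $\tfrac{1}{115}$ and $m+n = 116$.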