ICML 2025

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar

Abstract

[Figure 1 panel titles: "Scaling only test-time compute"; "Scaling training data and test-time compute"]

Figure 1: Scaling test-time compute. (Top) Given a set of problems, verifier-free (VF) methods query expert traces, whereas verifier-based (VB) methods collect reward annotations for rollouts from the base LLM. Crucially, VF methods aim to mimic "good" traces, while VB methods seek to improve via access to verification. We prove a $\sqrt{H}$ gap between a simple VB method and any VF method as we scale data $n$ and compute $H$. (Bottom) Fixing $n$ and scaling $H$, we verify the gap between VF and VB in practice by comparing the recently released S1 model [43], trained with a VF approach (supervised distillation), against a simple VB method, best-of-N search (left). For the models we train, we also compare VF and VB when we scale both test-time compute and verifier training data; there the gap between VF and VB grows, matching our theoretical result (right).
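To make the verifier-based baseline named in the caption concrete, the following is a minimal sketch of generic best-of-N search, not the authors' implementation. The names `best_of_n`, `sample_response`, and `verifier_score` are hypothetical: the sketch assumes the base LLM is exposed as a zero-argument sampling function and the verifier as a function that maps a response to a scalar score.

```python
from typing import Callable, List, Tuple


def best_of_n(
    sample_response: Callable[[], str],      # hypothetical: draws one rollout from the base LLM
    verifier_score: Callable[[str], float],  # hypothetical: scores a rollout (higher is better)
    n: int,
) -> Tuple[str, float]:
    """Draw n rollouts and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")

    # Sample n candidate responses independently from the base policy.
    candidates: List[str] = [sample_response() for _ in range(n)]

    # Score every candidate with the verifier.
    scores: List[float] = [verifier_score(c) for c in candidates]

    # Pick the index of the highest score; ties resolve to the earliest sample.
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]
```

In this sketch the test-time budget is spent on drawing more rollouts, while the verifier only ranks them. How the rollouts are produced and how the verifier is trained are left outside the sketch.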