ICML2025

Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment

Audrey Huang, Adam Block, Qinghua Liu, Nan Jiang, Akshay Krishnamurthy, Dylan J. Foster

Abstract

Inference-time computation offers a powerful axis for scaling the performance of language models. However, naively increasing computation in techniques like Best-of-N sampling can lead to performance degradation due to reward hacking. Toward a theoretical understanding of how to best leverage additional computation, we focus on inference-time alignment, which we formalize as the problem of improving the quality of responses drawn from a pre-trained policy, given a prompt of interest and access to an imperfect reward model. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality, and (ii) compute, and provide new results that highlight the importance of the pre-trained policy's coverage over high-quality responses for performance and compute scaling: 1. We show that Best-of-N alignment with an ideal choice for N can achieve optimal performance under stringent notions of coverage, but provably suffers from reward hacking when N is large, and fails to achieve tight guarantees under more realistic coverage conditions. 2. We introduce InferenceTimePessimism, a new algorithm which mitigates reward hacking through more sophisticated use of inference-time compute, implementing the principle of pessimism in the face of uncertainty via rejection sampling; we prove that its performance is optimal and does not degrade with N , a property that we call scaling-monotonic. We complement our theoretical results with an experimental evaluation that demonstrate the benefits of InferenceTimePessimism across a variety of tasks and models.