ICLR 2026

All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning

Gokul Swamy, Sanjiban Choudhury, Wen Sun, Steven Wu, Drew Bagnell

Cited by 56

Abstract

From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, (1) it is relatively easy to learn the relatively simple RM (verifier) from the preference data. Then, (2) the downstream RL procedure only returns policies (generators) that are optimal for such relatively simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.

INTRODUCTION

Whether one refers to it as reinforcement learning from human feedback (RLHF, Christiano et al. (2017)), preference fine-tuning (PFT), or even "alignment," the last step in the training pipeline of a wide variety of foundation models (FMs) is fundamentally concerned with raising the generation likelihood of preferred completions of a prompt relative to those of dis-preferred completions. From this perspective, a natural question may be why anything other than maximum likelihood estimation (MLE), i.e., standard supervised learning, is needed for the PFT problem.
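As a rough illustration of what "raising the likelihood of preferred completions relative to dis-preferred ones" can look like as a supervised objective, the sketch below computes a pairwise logistic negative log-likelihood on a score gap. The logistic (Bradley-Terry-style) form and the names `pairwise_nll` and `sequence_log_prob` are assumptions made for illustration; the text above does not fix a particular preference model. The same loss applies whether the scores come from a separate reward model or from a policy's own sequence log-probabilities.

```python
import math


def pairwise_nll(score_preferred: float, score_dispreferred: float) -> float:
    """Negative log-likelihood that the preferred completion wins.

    Assumes a logistic model on the score gap (an illustrative choice):
        P(preferred wins) = sigmoid(score_preferred - score_dispreferred)
    """
    gap = score_preferred - score_dispreferred
    # -log(sigmoid(gap)) = log(1 + exp(-gap)), written to avoid overflow.
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))


def sequence_log_prob(token_log_probs: list[float]) -> float:
    """Log-probability of a completion as the sum of its token log-probs."""
    return sum(token_log_probs)


# Hypothetical per-token log-probabilities for two completions of one prompt.
preferred = sequence_log_prob([-0.2, -0.1, -0.3])
dispreferred = sequence_log_prob([-0.9, -1.2, -0.4])

# The loss shrinks as the preferred completion becomes relatively more likely.
loss = pairwise_nll(preferred, dispreferred)
```

Minimizing this quantity over preference pairs is a classification-style MLE problem: only the gap between the two scores matters, not their absolute values.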
Indeed, a plethora of offline approaches to PFT that directly optimize policy parameters via solving a (regularized) classification problem on preference data have been proposed in the literature (e.g., DPO (Rafailov et al., 2023), IPO (Azar et al., 2023), SLiC-HF (Zhao et al., 2023)). However, when one looks at the training procedure of today's most capable models (Achiam et al., 2023; Team et al., 2024; Dubey et al., 2024), one almost always sees a relatively complex two-stage procedure adopted instead. First, one learns a reward model (RM), i.e., a classifier, on the preference data, before using it to provide labels for a downstream online reinforcement learning (RL) procedure that ultimately optimizes the policy's parameters (