ICLR 2026

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao

75 citations


Abstract

Large language models (LLMs) have achieved remarkable progress in reasoning tasks, yet optimally integrating Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) remains a fundamental challenge. Through a comprehensive analysis of token distributions, learning dynamics, and integration mechanisms from an entropy-based perspective, we reveal key differences between these paradigms: SFT induces coarse-grained, global shifts to policy distributions, while RL performs fine-grained, selective optimizations. Our analysis further establishes entropy as a critical indicator of training efficacy. Building on these observations, we introduce Supervised Reinforcement Fine-Tuning (SRFT), a single-stage framework that unifies both fine-tuning paradigms through entropy-aware weighting mechanisms. SRFT applies SFT and RL simultaneously, optimizing the LLM directly on both demonstrations and self-exploration rollouts instead of relying on a two-stage sequential pipeline. Extensive experiments show that SRFT outperforms zero-RL baselines by 9.0% on five mathematical reasoning benchmarks and by 10.9% on three out-of-distribution benchmarks. Moreover, by leveraging demonstration data, SRFT maintains a more stable policy entropy, facilitating sustained policy improvement.
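To make the idea of a single-stage, entropy-aware objective concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the weighting functions, so the choice of `exp(-entropy)` as the SFT weight, the REINFORCE-style RL term, and all function names (`mean_entropy`, `sft_loss`, `policy_gradient_loss`, `combined_loss`) are assumptions made purely for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(token_dists):
    """Average per-token entropy over a list of next-token distributions."""
    return sum(entropy(d) for d in token_dists) / len(token_dists)

def sft_loss(token_dists, target_ids):
    """Mean negative log-likelihood of the demonstration tokens."""
    total = sum(math.log(d[t]) for d, t in zip(token_dists, target_ids))
    return -total / len(target_ids)

def policy_gradient_loss(token_dists, sampled_ids, advantage):
    """REINFORCE-style surrogate: -advantage * mean log-prob of sampled tokens."""
    total = sum(math.log(d[t]) for d, t in zip(token_dists, sampled_ids))
    return -advantage * total / len(sampled_ids)

def combined_loss(demo_dists, demo_ids, roll_dists, roll_ids, advantage):
    """Single-stage objective: both terms are summed in the same update."""
    # ASSUMPTION: down-weight the SFT term when policy entropy is high.
    # In an autodiff framework this weight would be treated as a constant.
    w_sft = math.exp(-mean_entropy(demo_dists))
    rl_term = policy_gradient_loss(roll_dists, roll_ids, advantage)
    return w_sft * sft_loss(demo_dists, demo_ids) + rl_term

# Toy example: one demonstration token and one rollout token.
loss = combined_loss(
    demo_dists=[[0.5, 0.5]], demo_ids=[0],
    roll_dists=[[0.25, 0.75]], roll_ids=[1],
    advantage=1.0,
)
```

The point of the sketch is only the structure: the supervised and reinforcement terms enter one loss in one training step, and a scalar derived from policy entropy modulates how strongly the demonstration data pulls on the policy.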