NeurIPS 2025

Behavior Injection: Preparing Language Models for Reinforcement Learning

Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao

Abstract

Reinforcement learning (RL) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RL finetuning: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model. Website: https://bridge-llm-reasoning.github.io/

Our key insights stem from two perspectives: (1) Analysis of the RL learning objective: we identify two critical factors that influence model improvement during RL tuning: rollout accuracy and the data co-influence coefficient, which quantifies how strongly RL training data affects generalization to the target domain. (2) Desirable behaviors for RL: while exploration and exploitation are central in low-dimensional RL [31], they are often underexplored in the context of LLM post-training.

Motivated by these findings, we introduce BRIDGE (BehavioR Injection Data auGmEntation), a data-centric augmentation strategy applied during the SFT stage. BRIDGE injects desired behaviors into the model before RL, enabling it to generate more informative trajectories during RL rollout and leading to greater final performance improvements.
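To make the notion of RL-informative rollout accuracy concrete, the following minimal sketch shows why a query whose rollouts are all correct or all wrong contributes no learning signal. It rests on two assumptions that this section does not state: rewards are binary (1 for a correct rollout, 0 otherwise), and advantages are normalized within the group of rollouts sampled for one query, as in GRPO-style methods. The function names `group_advantages` and `reward_variance` are hypothetical and chosen only for illustration.

```python
import math


def group_advantages(rewards):
    """Group-normalized advantages (r - mean) / std for one query's rollouts.

    Assumption: GRPO-style normalization within the group. Returns all zeros
    when every rollout receives the same reward (zero variance).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    if var == 0.0:
        # All rollouts agree: no rollout is preferred over another.
        return [0.0] * n
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]


def reward_variance(accuracy):
    """Variance of a Bernoulli(accuracy) reward, p * (1 - p)."""
    return accuracy * (1.0 - accuracy)


if __name__ == "__main__":
    # Four rollouts per query with binary correctness rewards.
    for rewards in ([0, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]):
        accuracy = sum(rewards) / len(rewards)
        print(accuracy, reward_variance(accuracy), group_advantages(rewards))
```

Under these assumptions, the reward variance is largest at an accuracy of 0.5 and vanishes at 0 and 1, so queries the model always solves or never solves produce zero advantages and leave the policy unchanged.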
Our contributions are summarized as follows:

1. In-depth analysis of LLM reinforcement learning. We provide a detailed examination of the RL training process, highlighting two key factors that drive learning efficiency: rollout accuracy distribution and the data co-influence coefficient.
2. Introduction of the BRIDGE augmentation algorithm. We propose BRIDGE, which prepares the model for RL by explicitly injecting exploration and exploitation behaviors during SFT.
3. Comprehensive empirical evaluation. We evaluate BRIDGE across diverse tasks from iGSM and PromptBench. Extensive experiments and ablation studies demonstrate that BRIDGE enhances data co-influence and significantly improves performance in the RL stage.

Related Work

RL-based post-training for LLMs. Reinforcement learning (RL) has become a central post-training approach for aligning and extending large language models. Large-scale efforts such as OpenAI o1 [32] and DeepSeek-R1 [33] illustrated the gains obtainable from reward optimization on general-purpose models. Since then, RL fine-tuning has been pushed into a variety of domain-specialized settings. In mathematics, verifier-guided or programmatically graded rewards help models master challenging problems [34, 35, 24, 36, 37, 38], while logic benchmarks likewise benefit from RL-driven reasoning refinement [39, 40]. Interactive agents leverage RL for textual tool use and multistep planning [41, 42, 43, 44, 45, 46], mobile-app control [25], device manipulation [47, 48], and web navigation [49]. Additional applications include medical visual QA [50], software-engineering assistance [51], social reasoning [52], and tool-centric instruction following [53].

Analysis of LLM finetuning. To understand how model performance changes during finetuning, researchers study the SFT learning dynamics of LLMs [54] in terms of data influence [55, 56] or likelihood analysis [57, 58].
In the context of RL finetuning, OpenAI o1 [32] showed that RL significantly improves reasoning by encouraging the generation of longer CoT. Subsequent studies [59, 60, 61, 62] validate this effect and show that RL enables inference-time scaling by favoring more expressive reasoning traces. This finding is aligned with theoretical analyses that characterize the expressivity of CoT [63, 64, 65, 66]. Furthermore, researchers [67, 68, 69] observe that RL fine-tuning often amplifies behaviors already accessible in the base model rather than introducing entirely new ones; these behaviors act as cognitive operations that are crucial to performance growth in RL [70]. Distinct from these works, we investigate the tuning dynamics through the RL learning objective, and we identify behaviors from the perspective of exploration and exploitation to prepare models for RL tuning.