ICLR 2025

ParamΔ for Direct Mixing: Post-Train Large Language Model At Zero Cost

Sheng Cao, Mingrui Wu, Karthik Prasad, Yuandong Tian, Zechun Liu

Abstract

The post-training phase of large language models is essential for enhancing capabilities such as instruction-following, reasoning, and alignment with human preferences. However, it demands extensive high-quality data, risks overfitting, and incurs significant computational cost because post-training and evaluation must be repeated after each base model update. This paper introduces ParamΔ, a novel method that streamlines post-training by transferring knowledge from an existing post-trained model to a newly updated base model with zero additional training. By computing the difference between the post-trained model weights ($\Theta_{\text{post}}$) and the base model weights ($\Theta_{\text{base}}$), and adding this difference to the updated base model weights ($\Theta'_{\text{base}}$), we define the ParamΔ Model as $\Theta_{\text{Param}\Delta} = \Theta_{\text{post}} - \Theta_{\text{base}} + \Theta'_{\text{base}}$. Surprisingly, this approach equips the new base model with post-trained capabilities, achieving performance comparable to direct post-training. We analyze Llama3, Llama3.1, Qwen, and DeepSeek-distilled models. The results indicate that the ParamΔ Model effectively replicates traditional post-training. For example, the ParamΔ Model obtained from the 70B Llama3-inst, Llama3-base, and Llama3.1-base models attains approximately 95% of the Llama3.1-inst model's performance on average. ParamΔ offers a new perspective on how to fully leverage models in the open-weight community, where checkpoints for base and instruct models are readily available and frequently updated, by providing a cost-free framework that accelerates the iterative cycle of model development.
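To make the weight arithmetic in the definition concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the three checkpoints share identical parameter names and shapes, and it represents each checkpoint as a plain dictionary of flat float lists. The `Weights` alias and the function name `param_delta` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def param_delta(post: Weights, base: Weights, new_base: Weights) -> Weights:
    """Return post - base + new_base, computed element-wise per parameter."""
    # Assumption: all three checkpoints expose exactly the same parameter names.
    if not (post.keys() == base.keys() == new_base.keys()):
        raise ValueError("checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name in post:
        p, b, n = post[name], base[name], new_base[name]
        # Assumption: matching parameters have the same number of elements.
        if not (len(p) == len(b) == len(n)):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # (post - base) is the post-training delta; add it to the updated base.
        mixed[name] = [pi - bi + ni for pi, bi, ni in zip(p, b, n)]
    return mixed


# Toy usage with a single two-element parameter.
base = {"w": [1.0, 2.0]}
post = {"w": [1.5, 1.0]}      # post-training delta is [0.5, -1.0]
new_base = {"w": [2.0, 4.0]}
print(param_delta(post, base, new_base))  # {'w': [2.5, 3.0]}
```

With real checkpoints, the same element-wise operation would presumably be applied tensor by tensor over the models' weight dictionaries. No gradient computation is involved, which is consistent with the zero-training claim.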