ICML 2025

Scalable Model Merging with Progressive Layer-wise Distillation

Jing Xu, Jiazheng Li, Jingzhao Zhang

Abstract

Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation and introduce a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. Extensive experiments show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with improvements of up to 6.14% and 6.61% on vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.

Preliminaries

We consider model merging in a pretrain-to-finetune setup. Let $\theta_0$ denote the weights of a pre-trained model. Consider a set of $T$ tasks, each with a model $\theta_i$ fine-tuned from $\theta_0$. Model merging aims to combine the knowledge learned by the task-specific models $\theta_i$ into a unified model $\theta$ that preserves the generalization ability of the pre-trained model and incorporates the specialized knowledge from each task.
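To make the pretrain-to-finetune setup concrete, the following minimal sketch merges T fine-tuned weight sets into one model using task arithmetic, a common data-agnostic baseline. It is not ProDistill, whose procedure is not specified in this section. The dictionary-of-lists weight representation, the parameter names, and the scaling coefficient `lam` are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def task_vector(theta_i: Weights, theta_0: Weights) -> Weights:
    """Return the per-parameter difference theta_i - theta_0."""
    return {
        name: [w_i - w_0 for w_i, w_0 in zip(theta_i[name], theta_0[name])]
        for name in theta_0
    }


def merge_task_arithmetic(
    theta_0: Weights, finetuned: List[Weights], lam: float = 1.0
) -> Weights:
    """Baseline merge: theta = theta_0 + lam * sum_i (theta_i - theta_0)."""
    # Start from a copy of the pre-trained weights.
    merged = {name: list(values) for name, values in theta_0.items()}
    for theta_i in finetuned:
        tau = task_vector(theta_i, theta_0)
        for name in merged:
            merged[name] = [m + lam * t for m, t in zip(merged[name], tau[name])]
    return merged


# Toy example with T = 2 tasks (all values are made up for illustration).
theta_0 = {"layer0.weight": [1.0, 2.0], "layer1.weight": [0.0]}
theta_1 = {"layer0.weight": [1.5, 2.0], "layer1.weight": [1.0]}
theta_2 = {"layer0.weight": [1.0, 1.0], "layer1.weight": [-1.0]}

merged = merge_task_arithmetic(theta_0, [theta_1, theta_2], lam=0.5)
print(merged)
```

This rule uses no task data at all, which places it in the data-agnostic class that the abstract argues can perform arbitrarily badly in the worst case. In the toy example, the two task vectors cancel on `layer1.weight`, so the merged value falls back to the pre-trained one.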