ICML 2025
TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs
Felipe Pinto Coelho Nuti, Tim Franzmeyer, João F. Henriques
Abstract
Past work has studied the effects of fine-tuning on the overall performance of large language models (LLMs) on certain tasks. However, a way to quantitatively analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, using the model's intermediate hidden states and assuming access to the original pre-trained model. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by scaling the fine-tuning component up or down during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) in terms of the ratio of the fine-tuning component to the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that TuCo is consistently lower on prompts where the attacks succeed than on prompts where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
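To make the decomposition concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a residual architecture in which each layer adds an update to the hidden state, takes the pre-training component of a layer to be what the pre-trained layer would add at the fine-tuned model's current hidden state, and takes the fine-tuning component to be the remainder. The helper names (`make_update`, `decomposed_forward`, `contribution_ratio`), the toy `tanh` layers, the use of the L2 norm, and the normalization `||FTC|| / (||PTC|| + ||FTC||)` are all assumptions made for illustration; the abstract only states that TuCo is defined in terms of the ratio of the two components.

```python
import math
import random


def make_update(dim, rng, scale):
    """Build a toy layer update f(x) = scale * tanh(W x) with random weights W."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def update(x):
        return [scale * math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return update


def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def plain_forward(layers, x):
    """Ordinary residual forward pass: x <- x + layer(x) for each layer."""
    for layer in layers:
        x = [xi + u for xi, u in zip(x, layer(x))]
    return x


def decomposed_forward(pt_layers, ft_layers, x, alpha=1.0):
    """Residual forward pass that splits each fine-tuned update into a
    pre-training part (PTC) and a fine-tuning part (FTC) scaled by alpha.

    Returns the final hidden state and the summed PTC and scaled FTC vectors.
    """
    ptc_sum = [0.0] * len(x)
    ftc_sum = [0.0] * len(x)
    for pt, ft in zip(pt_layers, ft_layers):
        ptc = pt(x)                                # what the pre-trained layer would add
        ftc = [f - p for f, p in zip(ft(x), ptc)]  # remainder attributed to fine-tuning
        x = [xi + p + alpha * f for xi, p, f in zip(x, ptc, ftc)]
        ptc_sum = [s + p for s, p in zip(ptc_sum, ptc)]
        ftc_sum = [s + alpha * f for s, f in zip(ftc_sum, ftc)]
    return x, ptc_sum, ftc_sum


def contribution_ratio(ptc_sum, ftc_sum):
    """Assumed normalization (illustrative): ||FTC|| / (||PTC|| + ||FTC||)."""
    ptc_norm, ftc_norm = l2_norm(ptc_sum), l2_norm(ftc_sum)
    total = ptc_norm + ftc_norm
    return ftc_norm / total if total > 0 else 0.0


rng = random.Random(0)
DIM, DEPTH = 4, 3
pt_layers = [make_update(DIM, rng, 1.0) for _ in range(DEPTH)]
deltas = [make_update(DIM, rng, 0.1) for _ in range(DEPTH)]
# Each toy fine-tuned layer is its pre-trained layer plus a small perturbation.
ft_layers = [
    (lambda x, pt=pt, d=d: [p + q for p, q in zip(pt(x), d(x))])
    for pt, d in zip(pt_layers, deltas)
]
x0 = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

h_ft, ptc_sum, ftc_sum = decomposed_forward(pt_layers, ft_layers, x0, alpha=1.0)
h_pt, _, _ = decomposed_forward(pt_layers, ft_layers, x0, alpha=0.0)
ratio = contribution_ratio(ptc_sum, ftc_sum)
print(f"contribution ratio: {ratio:.3f}")
```

In this sketch, `alpha=1.0` reproduces the toy fine-tuned model's forward pass and `alpha=0.0` reproduces the toy pre-trained model's, which is the sense in which the split is exact; intermediate or larger values of `alpha` correspond to down- or up-scaling the fine-tuning component.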