ICML 2025

TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Felipe Pinto Coelho Nuti, Tim Franzmeyer, João F. Henriques

Abstract

Past work has studied the effects of fine-tuning on the overall performance of large language models (LLMs) on certain tasks. However, a way to quantitatively analyze its effect on individual outputs is still lacking. In this work, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, using the model's intermediate hidden states and assuming access to the original pre-trained model. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that one can steer model behavior and performance by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) in terms of the ratio of the fine-tuning component to the pre-training component. We find that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces the Tuning Contribution, and that TuCo is consistently lower on prompts where the attacks succeed than on prompts where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of these attacks. In short, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.
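To make the idea of the decomposition concrete, the following is a minimal sketch on a toy residual stack, not the implementation used in this work. It assumes that each layer adds an update to a running hidden state, that the pre-training component of a layer is the pre-trained layer's update at the current hidden state, and that the fine-tuning component is the difference between the fine-tuned and pre-trained updates. The function name `tuning_contribution`, the scaling parameter `alpha`, the toy layers, and the particular normalization of the ratio are all illustrative assumptions.

```python
import math

def norm(v):
    """Euclidean norm of a plain Python vector."""
    return math.sqrt(sum(a * a for a in v))

def tuning_contribution(x0, pt_layers, ft_layers, alpha=1.0):
    """Toy forward pass that splits each layer update into two parts.

    pt_layers / ft_layers are lists of functions mapping a hidden state
    (list of floats) to a residual update. alpha rescales the fine-tuning
    component; alpha=1.0 reproduces the fine-tuned model.
    Returns the final hidden state and a ratio in [0, 1] (assumed
    normalization: ||FTC|| / (||PTC|| + ||FTC||)).
    """
    x = list(x0)
    ptc_sum = [0.0] * len(x)
    ftc_sum = [0.0] * len(x)
    for f_pt, f_ft in zip(pt_layers, ft_layers):
        ptc = f_pt(x)                                      # pre-training component
        ftc = [alpha * (a - b) for a, b in zip(f_ft(x), ptc)]  # scaled fine-tuning component
        ptc_sum = [s + p for s, p in zip(ptc_sum, ptc)]
        ftc_sum = [s + f for s, f in zip(ftc_sum, ftc)]
        x = [h + p + f for h, p, f in zip(x, ptc, ftc)]    # residual update
    pt_mag, ft_mag = norm(ptc_sum), norm(ftc_sum)
    denom = pt_mag + ft_mag
    ratio = ft_mag / denom if denom > 0 else 0.0
    return x, ratio

# Hypothetical toy layers: fine-tuning adds a small constant shift.
pt_layers = [lambda x: [0.5 * a for a in x]] * 2
ft_layers = [lambda x: [0.5 * a + 0.1 for a in x]] * 2

final_state, tuco = tuning_contribution([1.0, -2.0], pt_layers, ft_layers)
print(final_state, tuco)
```

With `alpha=1.0` the loop reproduces the fine-tuned stack, with `alpha=0.0` it reduces to the pre-trained stack, and intermediate or larger values correspond to down- or up-scaling the fine-tuning component during the forward pass.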