EMNLP 2024
Dual-Space Knowledge Distillation for Large Language Models
Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, Jinan Xu
Abstract
Knowledge distillation (KD) is known as a promising solution to compress large language models (LLMs) via transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the two models so that more knowledge can be transferred. However, in the current white-box KD framework, the output distributions are from the respective output spaces of the two models, using their own prediction heads. We argue that the space discrepancy will lead to low similarity between the teacher model and the student model on both representation and distribution levels. Furthermore, this discrepancy also hinders the KD process between models with different vocabularies, which is common for current LLMs. To address these issues, we propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. On the basis of DSKD, we further develop a cross-model attention mechanism, which can automatically align the representations of the two models with different vocabularies. Thus, our framework is not only compatible with various distance functions for KD (e.g., KL divergence) like the current framework, but also supports KD between any two LLMs regardless of their vocabularies. Experiments on task-agnostic instruction-following benchmarks show that DSKD significantly outperforms the current white-box KD framework with various distance functions, and also surpasses existing KD methods for LLMs with different vocabularies.¹
* Yufeng Chen is the corresponding author.
¹ Our code is publicly available at https://github.com/songmzhang/DSKD.
vocabulary, which, however, is rarely satisfied by the various LLMs of this era (§2.2.2).
To address these limitations, we then propose a new framework for white-box KD, named dual-space knowledge distillation (DSKD), which is as simple as the current white-box KD framework but addresses the issues caused by the space discrepancy. Specifically, DSKD unifies the output spaces of the two models by projecting the output hidden states² of the teacher/student to the representation spaces of the student/teacher, where we can use the shared prediction heads to produce the two distributions in the same output spaces. In particular, for models with different vocabularies, we further develop a cross-model attention (CMA) mechanism to automatically align the tokens in two differently tokenized sequences. Like the current framework, DSKD is also compatible with existing distance functions for distributions, including KL divergence, JS divergence, and so on. Meanwhile, with CMA, we can transform the distributions of the two LLMs into the same shape, which makes our framework more general and applicable to any two LLMs regardless of their vocabularies.
We evaluate our framework on instruction-following benchmarks under two settings: the two LLMs share the same vocabulary, or they use different vocabularies. Experimental results show that for LLMs with the same vocabulary, our DSKD framework significantly outperforms the current white-box KD framework on various distance functions. Moreover, DSKD with CMA surpasses all existing KD methods for LLMs with different vocabularies.
To sum up, the contributions are as follows:
We empirically reveal that the current white-box KD framework limits the similarity between the student and the teacher due to their different output spaces.