NeurIPS 2023
SLaM: Student-Label Mixing for Distillation with Unlabeled Examples
Vasilis Kontonis, Fotis Iliopoulos, Khoa Trinh, Cenk Baykal, Gaurav Menghani, Erik Vee
Cited by 10
Abstract
Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates "soft" pseudo-labels for the unlabeled dataset, which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM), and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called "forward loss-adjustment" methods.

Noise in the teacher's pseudo-labels impairs student performance, a well-known phenomenon that has been observed and studied in many papers, e.g., [6, 36, 44, 51, 53, 8, 27]. In this work, we propose Student-Label Mixing (SLaM), a principled method for knowledge distillation with unlabeled examples that accounts for the teacher's noise and consistently improves over prior approaches. At the heart of our method lies the observation that the noise introduced by the teacher is neither random nor adversarial, in the sense that it correlates well with metrics of "confidence" such as the margin score or the entropy of the teacher's predictions. We exploit this empirical fact to introduce a model for the teacher's noise, which we use to appropriately modify the student's loss function.
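To make the two confidence metrics named above concrete, the following is a minimal sketch of how a margin score and an entropy could be computed from a teacher's soft pseudo-label. The function names and the example probability vectors are illustrative assumptions, not taken from the paper, and the paper's exact definitions may differ (for instance in the logarithm base).

```python
import math

def margin_score(probs):
    """Gap between the two largest class probabilities (larger = more confident)."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2

def entropy(probs):
    """Shannon entropy in nats (smaller = more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher soft pseudo-labels for two unlabeled examples (3 classes).
confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, "margin:", margin_score(probs), "entropy:", entropy(probs))
```

Under this sketch, a prediction with a large margin and a low entropy is one the teacher is confident about; the observation in the text is that the teacher's errors concentrate on the low-confidence examples.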
At a high level, for any given example during the student's training process, we evaluate the student's loss function on a convex combination of the student's current prediction and another (soft-)label that we estimate using our model for the teacher's noise (hence the name "student-label mixing").

Our contributions can be summarized as follows:

1. We propose SLaM, a principled method for improving knowledge distillation with unlabeled examples. The method is efficient, data-agnostic, and simple to implement.
2. We provide extensive experimental evidence and comparisons which show that our method consistently outperforms previous approaches on standard benchmarks. Moreover, we show that SLaM can be combined with standard distillation techniques such as temperature scaling and confidence-based weighting schemes.
3. We give theoretical guarantees for SLaM under standard assumptions. As a byproduct of our analysis, we obtain a simple "forward loss-adjustment" iteration that provably learns halfspaces with $\gamma$-margin under Random Classification Noise with $O(1/(\epsilon^2 \gamma^2))$ samples, improving over prior works that had worse dependence on either the margin $\gamma$ or the generalization error $\epsilon$ (see Theorem 5.1 and Remark 5.2).

Related Work

Knowledge Distillation. Most of the literature on knowledge distillation has focused on the fully supervised/labeled setting, i.e., when distillation is performed on the labeled training data of the teacher model rather than on new, unlabeled data; see, e.g., the original paper [26]. Naturally, in this setting the pseudo-labels generated by the teacher are almost always accurate, and so many follow-up works [2, 14, 15, 41, 52] have developed advanced distillation techniques that aim to enforce greater consistency between the teacher's and the student's predictions, or even between the intermediate representations learned by the two models.
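As a point of reference for the techniques discussed here, the sketch below shows one common formulation of the basic soft-label distillation objective with temperature scaling: the cross-entropy between the teacher's tempered soft label and the student's tempered prediction. It is an illustrative assumption, not necessarily the exact variant used in the paper's experiments; the function names and example logits are hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def softmax(logits, temperature=1.0):
    """Tempered softmax; a higher temperature gives a flatter distribution."""
    return [math.exp(v) for v in log_softmax(logits, temperature)]

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's prediction against the teacher's soft label."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * ls for t, ls in zip(teacher_probs, student_log_probs))

# Hypothetical logits for a single example with three classes.
teacher_logits = [2.0, 0.0, -1.0]
agreeing_student = [2.0, 0.0, -1.0]
disagreeing_student = [0.0, 2.0, -1.0]

print("loss, student agrees:   ", distillation_loss(agreeing_student, teacher_logits))
print("loss, student disagrees:", distillation_loss(disagreeing_student, teacher_logits))
print("teacher soft label at T=4:", softmax(teacher_logits, temperature=4.0))
```

Because this objective pulls the student toward whatever the teacher predicts, it implicitly treats the teacher's soft label as correct, which is reasonable when the pseudo-labels are almost always accurate.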
Applying such methods in our setting, where the training dataset contains mainly unlabeled examples, is still possible. In this case, however, it is known [51, 27] that fully trusting the teacher model can actually be harmful to the student model, making these methods less effective. (In fact, when the teacher is highly noisy, these methods even underperform vanilla distillation with unlabeled examples.) In Section 4.2 we present results that show the improved effectiveness of SLaM relative to state-of-the-art supervised knowledge distillation methods such as the Variational Information Distillation for Knowledge Transfer (VID) framework [2]. Moreover, in Appendix D.5 we show that our method can be combined with (i.e., provide an additional improvement over) one of the simplest, yet surprisingly effective, methods of improving knowledge distillation, namely the temperature-scaling idea introduced by [26].

For distillation with unlabeled examples, many approaches [17, 33, 29] propose filtering out or reweighting the teacher's pseudo-labels based on measures of the teacher's uncertainty, such as dropout variance, entropy, margin score, or the cut-statistic. These methods are i