ICML 2021

A statistical perspective on distillation

Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

97 citations

Abstract

Knowledge distillation is a technique for improving a "student" model by replacing its one-hot training labels with a label distribution obtained from a "teacher" model. Despite its broad success, several basic questions (e.g., Why does distillation help? Why do more accurate teachers not necessarily distill better?) have received limited formal study. In this paper, we present a statistical perspective on distillation which sheds light on these questions. Our core observation is that a "Bayes teacher" providing the true class-probabilities can lower the variance of the student objective, and thus improve performance. We then establish a bias-variance tradeoff that quantifies the utility of teachers that approximate the Bayes class-probabilities. This provides a formal criterion as to what constitutes a "good" teacher, namely, the quality of its probability estimates. Finally, we illustrate how our statistical perspective facilitates novel applications of distillation to bipartite ranking and multiclass retrieval.
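
The variance claim in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experiment: the toy class-probabilities in `P_STAR`, the fixed student predictions in `STUDENT`, the choice of log-loss, and all function names are assumptions made for illustration. It compares the empirical risk computed from sampled one-hot labels with the risk computed from the true class-probabilities, which is what a "Bayes teacher" would supply.

```python
import math
import random
import statistics

# Hypothetical toy problem: three instance types, each with true
# class-probabilities p*(y | x) and the predictions q(y | x) of one fixed student.
P_STAR = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
STUDENT = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.2, 0.4, 0.4]]


def log_loss(q, y):
    """Log-loss of predicted distribution q on class y (an assumed loss choice)."""
    return -math.log(q[y])


def empirical_risks(rng, n):
    """Return (one-hot risk, Bayes-distilled risk) on one sample of n instances."""
    one_hot, distilled = 0.0, 0.0
    for _ in range(n):
        x = rng.randrange(len(P_STAR))
        p, q = P_STAR[x], STUDENT[x]
        # One-hot objective: the loss on a single label drawn from p*(. | x).
        y = rng.choices(range(len(p)), weights=p)[0]
        one_hot += log_loss(q, y)
        # Distilled objective: the loss averaged over p*(. | x) itself.
        distilled += sum(p[c] * log_loss(q, c) for c in range(len(p)))
    return one_hot / n, distilled / n


def compare(trials=2000, n=50, seed=0):
    """Estimate the mean and variance of both objectives over repeated samples."""
    rng = random.Random(seed)
    pairs = [empirical_risks(rng, n) for _ in range(trials)]
    one_hot = [a for a, _ in pairs]
    distilled = [b for _, b in pairs]
    return {
        "one_hot_mean": statistics.fmean(one_hot),
        "distilled_mean": statistics.fmean(distilled),
        "one_hot_var": statistics.variance(one_hot),
        "distilled_var": statistics.variance(distilled),
    }
```

Calling `compare()` returns the four summary statistics. Both objectives estimate the same population risk, so the two means should be close, while the distilled variance should be smaller: averaging over the true class-probabilities removes the noise from sampling a label, as the law of total variance implies. An imperfect teacher would replace `P_STAR` in the distilled term with its own estimates, which introduces the bias side of the tradeoff the abstract mentions.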