NeurIPS 2022

What Makes a "Good" Data Augmentation in Knowledge Distillation - A Statistical Perspective

Huan Wang, Suhas Lohit, Michael J. Jones, Yun Fu

62 citations

Abstract

[Figure 1 placeholder. Panel (a): "Apply additional stronger DA in KD". Diagram labels: Teacher (fixed), Student, raw input, Standard DA, input x, Stronger DA, input x', KL div. loss, and the question "How to define 'stronger' DA?". Panel (b): "S. test loss vs. T. stddev with different DA schemes" (wrn_40_2/wrn_16_2, CIFAR100). x-axis: T. stddev; y-axis: S. test loss. Schemes plotted: Identity, Flip, Crop+Flip, Cutout, AutoAugment, Mixup, CutMix, CutMixPick (S. ent.), CutMixPick (T. ent.). Reported correlations: Pearson 0.9581 (p-value: 0.00%), Spearman 0.9667 (p-value: 0.00%), Kendall 0.8889 (p-value: 0.02%).]

Figure 1: (a) Illustration of applying a stronger data augmentation (DA) in addition to the standard DA (random crop and flip) in knowledge distillation (KD). We ask: what makes a "good" DA when it is applied to KD in the manner of (a)? (b) We present a proven proposition (Proposition 3.1) that answers this question rigorously, along with a practical metric for evaluating the "goodness" of a DA. The proposed metric is the stddev of the teacher's mean probability (abbreviated as T. stddev). As seen in (b), there is a strong positive correlation between the student's test loss (S. test loss) and T. stddev (a p-value < 5% is typically considered statistically significant); that is, a DA scheme with a lower T. stddev tends to yield a lower student test loss. This shows that T. stddev captures the "goodness" of different DA schemes in KD well. Perhaps the most striking fact in this plot is that T. stddev is calculated purely from the teacher (no student is involved), yet it can "predict" the relative order of the student's performance, implying that the "goodness" of a DA in KD is probably student-invariant.
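The three correlation coefficients reported in panel (b) (Pearson, Spearman, Kendall) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' evaluation code. The function names are hypothetical, the Kendall variant shown is tau-a (which assumes no tied values), and the numbers in `t_stddev` and `s_test_loss` are made-up placeholders chosen only to fall in the plotted axis ranges; they are not values read from the figure.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ranks(vals):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    out = [0.0] * len(vals)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical placeholder values, one entry per DA scheme (NOT from the paper).
t_stddev = [0.0055, 0.0052, 0.0049, 0.0047, 0.0043]
s_test_loss = [1.10, 1.07, 1.08, 1.03, 1.00]

print("Pearson: ", pearson(t_stddev, s_test_loss))
print("Spearman:", spearman(t_stddev, s_test_loss))
print("Kendall: ", kendall_tau_a(t_stddev, s_test_loss))
```

Pearson measures linear association, while Spearman and Kendall depend only on the ordering of the DA schemes, which is the property the caption emphasizes when it says T. stddev can "predict" the relative order of student performance. The sketch does not compute p-values; a statistics library would normally be used for that.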