NeurIPS 2021
Do Input Gradients Highlight Discriminative Features?
Harshay Shah, Prateek Jain, Praneeth Netrapalli
Abstract
Post-hoc gradient-based interpretability methods [1, 2] that provide instance-specific explanations of model predictions are often based on assumption (A): the magnitude of input gradients (gradients of logits with respect to the input) noisily highlights discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach:

1. We develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A) reasonably well.
2. We then introduce BlockMNIST, an MNIST-based semi-real dataset that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate as well as characterize differences between input gradient attributions of standard and robust models.
3. Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific "signal" coordinates, thus grossly violating (A).

Our findings motivate the need to formalize and test common assumptions in interpretability in a falsifiable manner [3]. We believe that the DiffROAR framework and BlockMNIST datasets serve as sanity checks to audit interpretability methods; code and data are available at https://github.com/harshays/inputgradients.

* Part of the work completed after joining Google Research India.
2 In Appendix C, we show that our results also hold for input gradients taken w.r.t. the loss.
35th Conference on Neural Information Processing Systems (NeurIPS 2021).
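To make the quantity behind assumption (A) concrete, the following is a minimal sketch of an input-gradient attribution for a one-hidden-layer ReLU MLP: the gradient of a single logit with respect to the input, with coordinates ranked by gradient magnitude. It is written in plain Python and is not taken from the authors' repository. The function names (`relu_mlp_logit`, `input_gradient`, `top_k_coordinates`) and the toy weights are hypothetical and chosen only for illustration.

```python
def relu_mlp_logit(x, W1, b1, w2, b2):
    """Return (logit, pre_activations) for logit = w2 . relu(W1 x + b1) + b2."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    hidden = [max(0.0, p) for p in pre]
    logit = sum(v * h for v, h in zip(w2, hidden)) + b2
    return logit, pre


def input_gradient(x, W1, b1, w2, b2):
    """Gradient of the logit w.r.t. x: sum over active hidden units of w2[i] * W1[i]."""
    _, pre = relu_mlp_logit(x, W1, b1, w2, b2)
    grad = [0.0] * len(x)
    for row, p, v in zip(W1, pre, w2):
        if p > 0:  # the ReLU passes gradient only through active units
            for j, w in enumerate(row):
                grad[j] += v * w
    return grad


def top_k_coordinates(grad, k):
    """Indices of the k input coordinates with the largest gradient magnitude."""
    return sorted(range(len(grad)), key=lambda j: abs(grad[j]), reverse=True)[:k]


# Toy example (hypothetical weights): only the first hidden unit is active.
x = [1.0, 0.5, -2.0, 0.0]
W1 = [[2.0, 0.0, 0.0, 0.1],
      [0.0, -1.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 3.0]
b2 = 0.0

grad = input_gradient(x, W1, b1, w2, b2)
ranked = top_k_coordinates(grad, k=2)
```

Under assumption (A), the coordinates returned by `top_k_coordinates` would be expected to coincide with the discriminative features of the input. The abstract's claim is that this expectation can fail for standard models.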