ACL2022

A Comparative Study of Faithfulness Metrics for Model Interpretability Methods

Chun Sik Chan, Huanqi Kong, Guanqing Liang

Abstract

Interpretable methods to reveal the internal reasoning processes behind machine learning models have attracted increasing attention in recent years. To quantify the extent to which the identified interpretations truly reflect the intrinsic decision-making mechanisms, various faithfulness evaluation metrics have been proposed. However, we find that different faithfulness metrics show conflicting preferences when comparing different interpretations. Motivated by this observation, we aim to conduct a comprehensive and comparative study of the widely adopted faithfulness metrics. In particular, we introduce two assessment dimensions, namely diagnosticity and complexity. Diagnosticity refers to the degree to which the faithfulness metric favors relatively faithful interpretations over randomly generated ones, and complexity is measured by the average number of model forward passes. According to the experimental results, we find that sufficiency and comprehensiveness metrics have higher diagnosticity and lower complexity than the other faithfulness metrics.

1 Introduction

NLP has made tremendous progress in recent years. However, the increasing complexity of the models makes their behavior difficult to interpret. To disclose the rationale behind the models, various interpretable methods have been proposed.

Interpretable methods can be broadly classified into two categories: model-based methods and post-hoc methods. Model-based approaches design simple, white-box machine learning models whose internal decision logic can be easily interpreted, such as linear regression models and decision trees. Post-hoc methods are applied after model training and aim to disclose the relationship between feature values and predictions.
As pre-trained language models (Devlin et al., 2019a; Liu et al., 2019; Brown et al., 2020) become more [...]

[...] "interpretation" of a classification instance is a sequence of scores where each score quantifies the importance of the input token at the corresponding position. An "interpretation pair" is a pair of interpretations of the same classification instance. An "interpretation method" is a function that generates an interpretation from a classification instance with its associated classification model.

Notations Let $x$ be the input tokens. Denote the number of tokens of $x$ as $l_x$. Denote the predicted class of $x$ as $c(x)$, and the predicted probability corresponding to class $j$ as $p_j(x)$.

Assume an interpretation is given. Denote the $k$-th important token as $x_k$. Denote the input sequence containing only the top $k$ (or top $q\%$) important tokens as $x_{:k}$ (or $x_{:q\%}$). Denote the modified input sequence from which a token sub-sequence $x'$ is removed as $x \setminus x'$.

Let $(x, y)$ be a classification instance associated with classification model $m$, and $g$ be an interpretation method. Denote the interpretation of $(x, y)$ generated by $g$ as $g(x, y, m)$. Let $u$ be an interpretation, $(u, v)$ be an interpretation pair, and $F$ be a faithfulness metric. Denote the importance score that $u$ assigns to the $i$-th input token as $[u]_i$. Denote the statement "$u$ is more faithful than $v$" as "$u \succ v$", and the statement "$F$ considers $u$ as more faithful than $v$" as "$u \succ_F v$".

3 Faithfulness Metrics

An interpretation is called faithful if the identified important tokens truly contribute to the decision-making process of the model. Mainstream faithfulness metrics are removal-based metrics, which measure the changes in model outputs after removing important tokens.

We compare the most widely adopted faithfulness metrics, introduced as follows.
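Before the individual metrics, the removal notation from the Notations paragraph can be made concrete with a minimal sketch. The helper names (`ranked_positions`, `top_k`, `top_percent`, `remove_top_k`) and the toy tokens and scores are hypothetical and not from the paper. The sketch also assumes that selected tokens keep their original order, that ties are broken by position, and that the top $q\%$ is rounded up to a whole number of tokens; the paper does not specify these details.

```python
import math
from typing import List, Sequence


def ranked_positions(scores: Sequence[float]) -> List[int]:
    # Token positions sorted from most to least important.
    # sorted() is stable, so tied scores keep their input order (an assumption).
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def top_k(tokens: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    # x_{:k}: keep only the k most important tokens, in their original order.
    keep = set(ranked_positions(scores)[:k])
    return [t for i, t in enumerate(tokens) if i in keep]


def top_percent(tokens: Sequence[str], scores: Sequence[float], q: float) -> List[str]:
    # x_{:q%}: keep the top q percent of tokens; rounding up is an assumption.
    k = math.ceil(len(tokens) * q / 100)
    return top_k(tokens, scores, k)


def remove_top_k(tokens: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    # x minus x_{:k}: drop the k most important tokens, keep the rest in order.
    drop = set(ranked_positions(scores)[:k])
    return [t for i, t in enumerate(tokens) if i not in drop]


# Toy instance: x is a list of tokens, u is an interpretation (one score per token).
tokens = ["the", "movie", "was", "great"]
u = [0.05, 0.30, 0.10, 0.90]

print(top_k(tokens, u, 2))         # ['movie', 'great']
print(remove_top_k(tokens, u, 1))  # ['the', 'movie', 'was']
```

A removal-based metric would then compare the model output on `tokens` with its output on one of these modified sequences.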
Decision Flip - Most Informative Token (DFMIT) Introduced by Chrysostomou and Aletras (2021), this metric focuses on only the most important token. It assumes that the interpretation is faithful only if the prediction label is changed after removing the most important token, i.e.

$$\mathrm{DFMIT} = \begin{cases} 1 & \text{if } c(x) \neq c(x \setminus x_{:1}) \\ 0 & \text{if } c(x) = c(x \setminus x_{:1}) \end{cases}$$

A score of 1 implies that the interpretation is faithful.

Decision Flip - Fraction of Tokens (DFFOT) This metric measures faithfulness as the minimum fraction of important tokens needed to be erased in order to change the model decision (Serrano and Smith, 2019), i.e.

$$\mathrm{DFFOT} = \begin{cases} \min_k \dfrac{k}{l_x} & \text{s.t. } c(x) \neq c(x \setminus x_{:k}) \\ 1 & \text{if } c(x) = c(x \setminus x_{:k}) \text{ for any } k \end{cases}$$

If the predicted class change never occurs even if all tokens are deleted, then the score will be 1. A lower value of DFFOT means the interpretation is more faithful.

Comprehensiveness (COMP) As proposed by DeYoung et al. (2020), comprehensiveness assumes that an interpretation is faithful if the important token