NeurIPS 2020

How does This Interaction Affect Me? Interpretable Attribution for Feature Interactions

Michael Tsang, Sirisha Rambhatla, Yan Liu

Cited by 101

Abstract

Machine learning transparency calls for interpretable explanations of how inputs relate to predictions. Feature attribution is a way to analyze the impact of features on predictions. Feature interactions are the contextual dependence between features that jointly impact predictions. There are a number of methods that extract feature interactions in prediction models; however, the methods that assign attributions to interactions are either uninterpretable, model-specific, or non-axiomatic. We propose an interaction attribution and detection framework called Archipelago which addresses these problems and is also scalable in real-world settings. Our experiments on standard annotation labels indicate our approach provides significantly more interpretable explanations than comparable methods, which is important for analyzing the impact of interactions on predictions. We also provide accompanying visualizations of our approach that give new insights into deep neural networks.

To this end, we propose a novel framework called Archipelago, which consists of an interaction attribution method, ArchAttribute, and a corresponding interaction detector, ArchDetect, to address the challenges of being interpretable, axiomatic, and scalable. Archipelago is named after its ability to provide explanations by isolating feature interactions, or feature "islands". The inputs to Archipelago are a black-box model f and data instance x, and its outputs are a set of interactions and individual features I as well as an attribution score φ(I) for each of the feature sets I. ArchAttribute satisfies attribution axioms by making relatively mild assumptions: a) disjointness of interaction sets, which is easily obtainable, and b) the availability of a generalized additive [...]
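To make the interface described above concrete (a black-box model f and an instance x go in; disjoint feature sets I, each with a score φ(I), come out), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the excerpt does not define φ, so the sketch assumes the score of a set is the change in f when only the features in I are moved from a baseline input to their values in x. The names `attribute_set`, `attribute_all`, `toy_model`, and `baseline` are hypothetical, and the feature sets are supplied by hand rather than found by a detector.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

Model = Callable[[Sequence[float]], float]


def attribute_set(f: Model, x: Sequence[float], baseline: Sequence[float],
                  feature_set: Iterable[int]) -> float:
    """Assumed score for one feature set I: f with only the features in I
    taken from x (all others left at the baseline), minus f at the baseline."""
    mixed: List[float] = list(baseline)
    for i in feature_set:
        mixed[i] = x[i]
    return f(mixed) - f(baseline)


def attribute_all(f: Model, x: Sequence[float], baseline: Sequence[float],
                  feature_sets: Sequence[Set[int]]) -> Dict[Tuple[int, ...], float]:
    """Score every feature set, enforcing the disjointness assumption."""
    seen: Set[int] = set()
    for s in feature_sets:
        if seen & s:
            raise ValueError("feature sets must be disjoint")
        seen |= s
    return {tuple(sorted(s)): attribute_set(f, x, baseline, s)
            for s in feature_sets}


def toy_model(v: Sequence[float]) -> float:
    # Hypothetical black box: features 0 and 1 interact, feature 2 acts alone.
    return v[0] * v[1] + 2.0 * v[2]


x = [3.0, 4.0, 5.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference input
scores = attribute_all(toy_model, x, baseline, [{0, 1}, {2}])
print(scores)
```

In this toy case the model is a sum of terms over the disjoint sets {0, 1} and {2}, so the two scores add up to f(x) minus f(baseline). That is the kind of property an additive-structure assumption makes possible; for a model whose terms cut across the chosen sets, the scores from this sketch need not sum this way.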