ACL2022

Input-specific Attention Subnetworks for Adversarial Detection

Emil Biju, Anirudh Sriram, Pratyush Kumar, Mitesh M. Khapra

Abstract

Self-attention heads are characteristic of Transformer models and have been well studied for interpretability and pruning. In this work, we demonstrate an altogether different utility of attention heads, namely for adversarial detection. Specifically, we propose a method to construct input-specific attention subnetworks (IAS) from which we extract three features to discriminate between authentic and adversarial inputs. The resultant detector significantly improves (by over 7.5%) the state-of-the-art adversarial detection accuracy for the BERT encoder on 10 NLU datasets with 11 different adversarial attack types. We also demonstrate that our method (a) is more accurate for larger models which are likely to have more spurious correlations and thus vulnerable to adversarial attack, and (b) performs well even with modest training sets of adversarial examples.

1 Introduction

Self-attention heads are characteristic of Transformer models. Individual attention heads are interpretable in different ways. One, for a token in an input sentence, we can visualize the attention paid by a head to all other tokens. Such attention patterns are attractive linguistically and have come to define roles for attention heads (Pande et al., 2021). Two, the output of attention heads from various layers can be probed for their ability to encode information related to the "NLP pipeline" (Jawahar