ICML2022

Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

Florian Tramèr

82 citations

Abstract

Making classifiers robust to adversarial examples is challenging. Thus, many works tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance (cid:15) (in some met-ric), we show how to build a similarly robust (but computationally inefficient) classifier for attacks at distance (cid:15)/ 2 . Our reduction is computationally inefficient , but preserves the sample complexity of the original detector. The reduction thus cannot be directly used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated (namely a highly robust and data-efficient classifier ). To illustrate, we revisit 14 empirical detector defenses published over the past years. For 12 / 14 defenses, we show that the claimed detection results imply an inefficient classifier with robustness far beyond the state-of-the-art.