ICML 2022

Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

Florian Tramèr

82 citations

Abstract

Making classifiers robust to adversarial examples is challenging. Thus, many works tackle the seemingly easier task of detecting perturbed inputs. We show a barrier towards this goal. We prove a hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance ε (in some metric), we show how to build a similarly robust (but computationally inefficient) classifier for attacks at distance ε/2. Our reduction is computationally inefficient, but preserves the sample complexity of the original detector. The reduction thus cannot be directly used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated (namely a highly robust and data-efficient classifier). To illustrate, we revisit 14 empirical detector defenses published over the past years. For 12/14 defenses, we show that the claimed detection results imply an inefficient classifier with robustness far beyond the state-of-the-art.
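
To make the direction of the reduction concrete, the following is a minimal sketch of one natural way such a classifier could be built from a detector. The abstract does not spell out the construction, so this is an illustration under assumptions: the detector either returns a label or rejects, and the classifier brute-forces over every point within ε/2 of its input and returns the first label the detector does not reject. All names (`REJECT`, `toy_detector`, `integer_ball`, `classifier_from_detector`) and the one-dimensional integer input space are hypothetical choices for this toy example.

```python
import math
from typing import Callable, Iterable, Optional

REJECT = None  # hypothetical sentinel: the detector flags the input as adversarial


def toy_detector(x: int) -> Optional[int]:
    """Toy detector on the integer line (assumed for illustration).

    Natural inputs are -10 (label 0) and 10 (label 1). Any point within
    distance 4 of a natural input is either given that input's label or rejected.
    """
    if x <= -8:
        return 0
    if x >= 8:
        return 1
    return REJECT


def integer_ball(x: int, radius: float) -> Iterable[int]:
    """All integers within `radius` of x; exhaustive search stands in for the inefficient step."""
    r = math.floor(radius)
    return range(x - r, x + r + 1)


def classifier_from_detector(
    detector: Callable[[int], Optional[int]],
    x: int,
    eps: float,
    fallback: int = 0,
) -> int:
    """Return the label of any point within eps/2 of x that the detector accepts."""
    for x_prime in integer_ball(x, eps / 2):
        label = detector(x_prime)
        if label is not REJECT:
            return label
    return fallback  # arbitrary output when every nearby point is rejected


# The detector rejects 7, but the derived classifier never rejects.
print(toy_detector(7), classifier_from_detector(toy_detector, 8, eps=4))
```

The intuition in this sketch: if the input lies within ε/2 of a natural example, the search ball contains that natural example (which the detector accepts), and every point in the ball lies within ε of it, so any accepted point carries the correct label. The exhaustive search is what makes the classifier computationally inefficient, while no additional training data is needed.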