NeurIPS2023

Adversarial Examples Are Not Real Features

Ang Li, Yifei Wang, Yiwen Guo, Yisen Wang

被引用 22 次

摘要

The existence of adversarial examples has been a mystery for years and attracted much interest. A well-known theory by Ilyas et al. [18] explains adversarial vulnerability from a data perspective by showing that one can extract non-robust features from adversarial examples and these features alone are useful for classification. However, the explanation remains quite counter-intuitive since non-robust features are mostly noise features to humans. In this paper, we re-examine the theory from a larger context by incorporating multiple learning paradigms. Notably, we find that contrary to their good usefulness under supervised learning, nonrobust features attain poor usefulness when transferred to other self-supervised learning paradigms, such as contrastive learning, masked image modeling, and diffusion models. It reveals that non-robust features are not really as useful as robust or natural features that enjoy good transferability between these paradigms. Meanwhile, for robustness, we also show that naturally trained self-supervised encoders from robust features are largely non-robust under AutoAttack. Our crossparadigm examination suggests that the non-robust features are not really useful but more like paradigm-wise shortcuts, and robust features alone might be insufficient to attain reliable model robustness across paradigms. Code is available at https://github.com/PKU-ML/AdvNotRealFeatures .