NeurIPS 2023

A Theory of Transfer-Based Black-Box Attacks: Explanation and Implications

Yanbo Chen, Weiwei Liu

Abstract

Transfer-based attacks [1] are a practical method of black-box adversarial attack in which the attacker crafts adversarial examples on a source model that are transferable to the target model. Many empirical works [2–6] have tried to explain the transferability of adversarial examples from different angles. However, these works provide only ad hoc explanations without quantitative analyses, and the theory behind transfer-based attacks remains a mystery. This paper studies transfer-based attacks under a unified theoretical framework. We propose an explanatory model, called the manifold attack model, that formalizes popular beliefs and explains the existing empirical results. Our model explains why adversarial examples are transferable even when the source model is inaccurate, as observed in Papernot et al. [7]. Moreover, our model implies that the existence of transferable adversarial examples depends on the “curvature” of the data manifold, which further explains why the success rates of transfer-based attacks are hard to improve. We also discuss our model’s expressive power and applicability.