ICLR 2026

Nasty Adversarial Training: A Probability Sparsity Perspective for Robustness Enhancement

Yuhang Zhou, Zhongyun Hua, Zhaoquan Gu, Keke Tang, Rushi Lan, Yushu Zhang, Qing Liao, Leo Yu Zhang

Abstract

The vulnerability of deep neural networks to adversarial examples poses significant challenges to their reliable deployment. Among empirical defenses, adversarial training and robust distillation remain the most effective. In this paper, we identify a property originally studied in the context of model intellectual property protection, namely the probability sparsity induced by nasty training, and reveal its potential to enhance adversarial robustness in an interpretable manner. We first analyze how nasty training drives models toward sparse probability distributions and qualitatively explore the spatial metric preferences introduced by such sparsity. Building on these insights, we propose nasty adversarial training (NAT), a simple yet effective adversarial training framework that incorporates probability sparsity as a regularization mechanism to strengthen robustness. Both theoretical analysis and extensive experiments demonstrate the effectiveness of NAT, showing that probability sparsity not only improves adversarial resilience but also makes the robustness gains interpretable.