ICLR2026

Why Adversarially Train Diffusion Models?

Maria Rosaria Briglia, Mujtaba Hussain Mirza, Giuseppe Lisanti, Iacopo Masi

Abstract

Adversarial Training (AT) is a powerful, well-established technique for improving classifier robustness to input perturbations, yet its applicability beyond discriminative settings remains limited. Motivated by the widespread use of score-based generative models and their need to operate robustly on substantially noisy or corrupted input data, we propose an adaptation of AT for these models and provide a thorough empirical assessment. We introduce a principled formulation of AT for Diffusion Models (DMs) that replaces the conventional invariance objective with an equivariance constraint aligned with the denoising dynamics of score matching. Our method integrates seamlessly into diffusion training by adding either random perturbations, similar to randomized smoothing, or adversarial ones, akin to AT. Our approach offers several advantages: (a) tolerance to heavy noise and corruption, (b) reduced memorization, (c) robustness to outliers and extreme data variability, and (d) resilience to iterative adversarial attacks. We validate these claims on proof-of-concept low- and high-dimensional datasets with known ground-truth distributions, enabling precise error analysis. We further evaluate on standard benchmarks (CIFAR-10, CelebA, and LSUN Bedroom), where our approach shows improved robustness and preserved sample fidelity under severe noise, data corruption, and adversarial evaluation. Code is available at github.com/OmnAI-Lab/Adversarial-Training-DM

Sample generation is then performed by solving the probability flow ODE (PF-ODE) (Song et al., 2021b) from $t = T$ to $0$, starting from $x_T \simeq \mathcal{N}(0, \sigma_{\max}^2 I)$, whose solution is learned by the DM. For a given $x_0$, the training objective $\mathcal{L}_{\text{DM}}$ reported in Ho et al. (2020) is thus defined as:

$$\mathcal{L}_{\text{DM}} = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\; t \sim \mathcal{U}(0, T)} \left\| \epsilon - \epsilon_\theta\big(x_t(x_0, \epsilon),\, t\big) \right\|_2^2$$
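As an illustration of the training objective above, the following is a minimal NumPy sketch of a Monte Carlo estimate of the noise-prediction loss. It is not the authors' implementation. Several pieces are assumptions made only for this example: the forward perturbation is taken to be $x_t = x_0 + \sigma(t)\,\epsilon$, the geometric schedule `sigma` and its constants are hypothetical, time is sampled uniformly on $[0, T]$, and `zero_model` and `make_oracle` are toy stand-ins for the learned network $\epsilon_\theta$.

```python
import numpy as np

def sigma(t, sigma_min=0.01, sigma_max=10.0, T=1.0):
    """Hypothetical geometric noise schedule (an assumption, not from the paper)."""
    return sigma_min * (sigma_max / sigma_min) ** (t / T)

def perturb(x0, eps, t):
    """Assumed forward perturbation x_t(x_0, eps) = x_0 + sigma(t) * eps."""
    return x0 + sigma(t)[:, None] * eps

def dm_loss(eps_model, x0, rng, T=1.0):
    """Monte Carlo estimate of E || eps - eps_model(x_t(x_0, eps), t) ||_2^2."""
    n = x0.shape[0]
    eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I)
    t = rng.uniform(0.0, T, size=n)       # t ~ U(0, T), one time per sample
    x_t = perturb(x0, eps, t)
    residual = eps - eps_model(x_t, t)
    # Squared L2 norm per sample, averaged over the batch.
    return float(np.mean(np.sum(residual ** 2, axis=1)))

def zero_model(x_t, t):
    """Toy predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

def make_oracle(x0):
    """Toy predictor that knows x0 and inverts the assumed perturbation exactly."""
    def oracle(x_t, t):
        return (x_t - x0) / sigma(t)[:, None]
    return oracle

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4096, 2))       # toy 2-D dataset
loss_zero = dm_loss(zero_model, x0, rng)
loss_oracle = dm_loss(make_oracle(x0), x0, rng)
print(loss_zero, loss_oracle)
```

Under these assumptions, the zero predictor incurs a loss close to the data dimension, since $\mathbb{E}\|\epsilon\|_2^2 = d$, while the oracle that recovers $\epsilon$ exactly drives the loss to numerical zero. These two cases bracket what a trained $\epsilon_\theta$ would achieve.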