CVPR 2025

STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer, Vishal M. Patel, Karthik Nandakumar

Abstract

[Figure 1 image. Panel labels: RECE (ECCV'24), AdvUnlearn (NeurIPS'24), STEREO (Ours). Column labels: Erased, Attack.]

Figure 1. Vulnerability of "robust" concept erasure methods to concept inversion attacks. Even state-of-the-art concept erasure methods such as RECE [13], RACE [19], and AdvUnlearn [42] that claim to be "adversarially robust" are still vulnerable to concept inversion attacks [27] that regenerate the erased concept by operating in the embedding space. Examples of this vulnerability across diverse categories such as Artistic-Style, Nudity, and Object are shown in this figure. Our STEREO method achieves superior robustness through its two-stage framework: thorough vulnerability identification via adversarial training followed by anchor-concept guided erasure.