ICLR 2025

ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

Ruchika Chavhan, Da Li, Timothy M. Hospedales

Cited by 2

Abstract

While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or utilize various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pretrained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, object erasure, and gender debiasing demonstrate that target concepts can be efficiently erased by pruning a tiny fraction, approximately 0.12% of total weights, enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks. Our code is available at https://github.com/ruchikachavhan/concept-prune.git

Introduction

In recent years, text-to-image generation has witnessed significant advances driven by the development and adoption of diffusion models (DMs) [24, 43, 45, 46, 35, 60, 33, 39] across industries and real-world scenarios. However, this swift advancement presents a substantial risk. Diffusion models can threaten artists' livelihoods through style replication [11], generate convincing deepfakes and NSFW content [40, 14], and perpetuate societal biases [32]. The risks associated with large-scale text-to-image models arise from billion-sized web-scraped datasets used in training, comprising public datasets like LAION [48], COYO [4], and CC12M [5], that often lack human-level quality assurance.
A naive solution to mitigate these risks is to fine-tune the model on datasets that exclude the undesired content; however, this approach can be computationally expensive. Several efforts to address the risks of diffusion models have been made from the perspective of Concept Editing [26, 18, 19, 58, 36] and Model Unlearning (MU) [23, 65, 30, 56, 12], both aimed at eliminating undesired prompts, albeit with differing objectives. Concept editing methods seek to eliminate undesired prompts by aligning latent representations of the target concept with a concept to be retained, via methods such as maximizing similarity [26, 18] and token remapping [58, 19]. Conversely, Model Unlearning formulates an objective that penalizes forgetting desired concepts while promoting the elimination of undesired ones, but this requires expensive computation and fine-tuning. Moreover, as most concept editing approaches rely on some form of token blacklisting or resteering [58], adversarial attacks based on textual inversion [61, 38, 57, 53] have demonstrated the