AAAI 2025

An Item Is Worth a Prompt: Versatile Image Editing with Disentangled Control

Aosong Feng, Weikang Qiu, Jinbin Bai, Zhen Dong, Kaicheng Zhou, Xiao Zhang, Rex Ying, Leandros Tassiulas

Abstract

Building on the success of text-to-image diffusion probabilistic models (DPMs), image editing has emerged as a crucial application for enabling human interaction with AI-generated content. Among various editing techniques, prompt-based editing has garnered significant attention for its capacity to simplify semantic control. However, because diffusion models are typically pretrained on descriptive text captions, directly modifying words in a text prompt often produces an entirely different image, which defeats the purpose of image editing. In addition, existing editing methods often employ spatial masks to preserve unedited regions, but DPMs frequently disregard these masks, leading to disharmonious editing outcomes. To address these two challenges, we propose a method that disentangles the comprehensive image-prompt interaction into multiple item-prompt interactions, with each item associated with its own uniquely learned prompt. The resulting framework, named D-Edit, leverages pretrained diffusion models with disentangled cross-attention layers and employs a two-step optimization process to establish item-prompt associations. This approach allows for versatile image editing by enabling targeted manipulation of specific items through their corresponding prompts. We demonstrate state-of-the-art results in four types of editing operations (image-based editing, text-based editing, mask-based editing, and item removal), which cover most editing applications, all within a single unified framework. Notably, D-Edit is the first framework that can (1) achieve item editing through mask editing and (2) combine image-based and text-based editing. We demonstrate the quality and versatility of the editing results on a diverse collection of images through both qualitative and quantitative evaluations.
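
The idea of splitting one image-prompt interaction into several item-prompt interactions can be illustrated with a masked cross-attention layer. The following is a minimal NumPy sketch and not the D-Edit implementation: it assumes that each image location carries an item label (for example from a segmentation mask), that each prompt token carries the label of the item it describes, and that a location may attend only to tokens of its own item. The function name `disentangled_cross_attention` and the arguments `pixel_item` and `token_item` are hypothetical names introduced for illustration.

```python
import numpy as np


def disentangled_cross_attention(queries, keys, values, pixel_item, token_item):
    """Cross-attention restricted to matching item labels (illustrative sketch).

    queries:    (N, d) image-location features
    keys:       (T, d) prompt-token features
    values:     (T, d) prompt-token features
    pixel_item: (N,) integer item label of each image location (assumed given)
    token_item: (T,) integer item label of each prompt token (assumed given)
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (N, T) scaled dot products

    # A location may only attend to tokens that belong to the same item.
    allowed = pixel_item[:, None] == token_item[None, :]
    if not allowed.any(axis=1).all():
        raise ValueError("every image location needs at least one prompt token")

    # Masked softmax over the token axis: disallowed pairs get zero weight.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(6, 8))   # 6 image locations
    k = rng.normal(size=(4, 8))   # 4 prompt tokens
    v = rng.normal(size=(4, 8))
    pixel_item = np.array([0, 0, 0, 1, 1, 1])  # two items in the image
    token_item = np.array([0, 0, 1, 1])        # two tokens per item

    out, w = disentangled_cross_attention(q, k, v, pixel_item, token_item)

    # Editing only item 1's prompt leaves the locations of item 0 untouched.
    v_edit = v.copy()
    v_edit[token_item == 1] += 5.0
    out_edit, _ = disentangled_cross_attention(q, k, v_edit, pixel_item, token_item)
    print(np.allclose(out[:3], out_edit[:3]))
```

Under these assumptions, changing the tokens tied to one item changes the attention output only at the locations of that item, which is the property that allows an item to be edited through its own prompt.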