AAAI 2026

Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh

Abstract

[Figure 1 panels, left to right: Complex user instruction, X-Planner (Ours), SmartEdit, MGIE. The example instructions and their X-Planner decompositions shown in the figure are listed below.]

Complex user instruction: Could you add a spaceship in the sky, and make tree in cyberpunk, and change the style to sci-fi style.
    Insertion: <box> Add a spaceship in the sky
    Local texture: Make tree to be in cyberpunk
    Style: Change the style to sci-fi style

Complex user instruction: Could you make all animals look like they are celebrating Christmas?
    Insertion: Add <box> Christmas ornaments around the cat
    Local texture: Change the dog to have a red and white Christmas suit
    Background: Make the background look like a cozy snowy Christmas setting

Complex user instruction: Could you make this image look like the season when ice cream is a daily need?
    Local color change: Turn the grass into a lush green
    Insertion: <box> Add a picnic blanket with a basket on the ground
    Background: Change the sky to a bright, sunny day

Figure 1. Left. Given a source image and a complex instruction, our MLLM-based X-Planner decomposes the complex instruction into simpler sub-instructions (each with an edit type), together with auto-generated segmentation masks indicating the editing regions (shown at the bottom left of each edited image), and hallucinates an additional bounding box for the object in the insertion case. We iteratively perform localized editing by providing X-Planner's editing instruction and region (mask and box) to a compatible editing model for each edit type. Right. The recent SmartEdit [16] and MGIE [11], which also use an MLLM, struggle with complex instruction understanding and identity preservation.
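The plan-then-edit loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `SubInstruction`, `plan`, `make_stub_editor`, `EDITORS`, and `run_edits` are hypothetical. The planner is a stub that returns the first Figure 1 decomposition verbatim instead of querying an MLLM. The mask label and box coordinates are made up. The "image" is only a list of strings recording which edits were applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x0, y0, x1, y1) pixel box


@dataclass
class SubInstruction:
    edit_type: str                # e.g. "insertion", "local texture", "style"
    instruction: str              # simple, single-step edit instruction
    mask: Optional[str] = None    # stand-in for a segmentation mask
    box: Optional[Box] = None     # stand-in for the box predicted for insertions


def plan(complex_instruction: str) -> List[SubInstruction]:
    """Hypothetical planner stub: returns the first Figure 1 decomposition.

    A real planner would query an MLLM; the mask label and box values
    here are invented purely for illustration.
    """
    return [
        SubInstruction("insertion", "Add a spaceship in the sky",
                       box=(300, 20, 480, 120)),
        SubInstruction("local texture", "Make tree to be in cyberpunk",
                       mask="tree"),
        SubInstruction("style", "Change the style to sci-fi style"),
    ]


# An "image" here is just a list of strings recording the edits applied so far.
Image = List[str]
Editor = Callable[[Image, SubInstruction], Image]


def make_stub_editor(name: str) -> Editor:
    """Build a placeholder editor that logs the call instead of editing pixels."""
    def edit(image: Image, step: SubInstruction) -> Image:
        region = step.box if step.box is not None else step.mask
        return image + [f"{name}: {step.instruction} (region={region})"]
    return edit


# Hypothetical routing table: one compatible editor per edit type.
EDITORS: Dict[str, Editor] = {
    "insertion": make_stub_editor("insert_editor"),
    "local texture": make_stub_editor("local_editor"),
    "style": make_stub_editor("global_editor"),
}


def run_edits(image: Image, complex_instruction: str) -> Image:
    """Apply the planned sub-instructions one after another."""
    for step in plan(complex_instruction):
        editor = EDITORS[step.edit_type]
        image = editor(image, step)  # the output of one edit feeds the next
    return image


if __name__ == "__main__":
    result = run_edits(
        ["source image"],
        "Could you add a spaceship in the sky, and make tree in cyberpunk, "
        "and change the style to sci-fi style.",
    )
    for line in result:
        print(line)
```

The routing table is the point of the sketch: each sub-instruction carries its edit type, so the loop can hand the instruction and its region to whichever editor handles that type, and each edit operates on the result of the previous one.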