Published as a conference paper at ICLR 2025

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Peng Gao, Hongsheng Li

Abstract

This paper presents PixWizard, a versatile image-to-image visual assistant designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we cover a large set of diverse vision tasks, including text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, which lets the model dynamically process images according to the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images of diverse resolutions but also generalizes to unseen tasks and human instructions.

* Equal Contribution † Corresponding Authors ‡ Project Lead

[Figure: example instruction templates for text-to-image generation: (1) "Drawing: caption."; (2) "Text-to-image generation: caption"; (3) "Please create a photo that includes the following elements: caption".]

[...] in most tasks, delivering strong overall multi-task performance. More importantly, our model handles tasks and instruction prompts it has not encountered during training, demonstrating promising generalization capabilities. This further highlights PixWizard's strength as a powerful interactive image-to-image visual assistant.
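As a rough illustration of the any-resolution idea mentioned in the abstract, the sketch below picks a working resolution whose aspect ratio is closest to that of the input image while staying within a fixed pixel budget. The abstract does not specify how the mechanism is implemented, so everything here is an assumption made for illustration: the function names `candidate_resolutions` and `pick_resolution`, the 512 x 512 pixel budget, and the 64-pixel grid are all hypothetical.

```python
import math
from typing import List, Tuple


def candidate_resolutions(max_pixels: int, step: int = 64) -> List[Tuple[int, int]]:
    """Enumerate (height, width) pairs on a `step` grid with height * width <= max_pixels.

    For each height, only the widest width that still fits the budget is kept.
    """
    sizes = []
    height = step
    while height * step <= max_pixels:
        # Largest width on the grid that keeps the area within the budget.
        width = (max_pixels // height) // step * step
        if width >= step:
            sizes.append((height, width))
        height += step
    return sizes


def pick_resolution(height: int, width: int,
                    max_pixels: int = 512 * 512, step: int = 64) -> Tuple[int, int]:
    """Return the candidate (height, width) whose aspect ratio best matches the input.

    Ratios are compared in log space so that 2:1 and 1:2 are treated symmetrically.
    """
    target = math.log(width / height)
    return min(
        candidate_resolutions(max_pixels, step),
        key=lambda hw: abs(math.log(hw[1] / hw[0]) - target),
    )


if __name__ == "__main__":
    print(pick_resolution(512, 512))    # square input
    print(pick_resolution(1080, 1920))  # landscape input
```

Under these assumptions, a square input keeps a square working size, while a wide input is mapped to a wide candidate with roughly the same pixel count, so the image is not squashed into a fixed square.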
OMNI PIXEL-TO-PIXEL INSTRUCTION-TUNING DATASET

To equip our image-to-image visual assistant with comprehensive capabilities in image generation, manipulation, and translation, we compiled a multi-task training dataset for visual instruction tuning, consisting of 30 million instances across seven primary domains, as illustrated in Fig. 1. It is a user-friendly image-instruction-image triplet dataset, built from both open-source and in-house data and filtered with the help of MLLMs and manual review.

Image Restoration. We incorporate low-level data to restore images degraded by various environmental or technical factors. This part draws on a wide range of open-source datasets covering key restoration tasks, including (1) Denoising, (2) Deraining, (3) Demoireing, (4) Dehazing, (5) Deblurring, (6) Desnowing, (7) Deshadowing, (8) Low-Light Enhancement, (9) Face Restoration, (10) Watermark Removal, and (11) Super Resolution. Since both the inputs and outputs are inherently defined in the RGB space, these tasks can be seamlessly unified by our PixWizard model without any extra transformations. All open-source datasets we use are listed in Sec. B.1.
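To make the image-instruction-image triplet format concrete, here is a minimal sketch of how a restoration sample could be assembled from a degraded/clean image pair and a natural-language instruction. The task list comes from the text above; the `Triplet` class, the `make_restoration_triplet` helper, and the three instruction templates are hypothetical and are not the templates used to build the actual dataset.

```python
import random
from dataclasses import dataclass

# The eleven restoration tasks listed in the text.
RESTORATION_TASKS = (
    "denoising", "deraining", "demoireing", "dehazing", "deblurring",
    "desnowing", "deshadowing", "low-light enhancement", "face restoration",
    "watermark removal", "super resolution",
)

# Hypothetical instruction templates, for illustration only.
TEMPLATES = (
    "Please perform {task} on this image.",
    "Apply {task} to the input image.",
    "Restore this image using {task}.",
)


@dataclass(frozen=True)
class Triplet:
    source_image: str  # path to the degraded RGB input
    instruction: str   # natural-language instruction
    target_image: str  # path to the clean RGB output


def make_restoration_triplet(task: str, degraded_path: str, clean_path: str,
                             rng: random.Random) -> Triplet:
    """Build one training triplet by filling a randomly chosen template."""
    if task not in RESTORATION_TASKS:
        raise ValueError(f"unknown restoration task: {task!r}")
    instruction = rng.choice(TEMPLATES).format(task=task)
    return Triplet(degraded_path, instruction, clean_path)


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_restoration_triplet("deraining", "rainy.png", "clean.png", rng))
```

Because the input and the target are both plain RGB images, the same triplet structure serves every restoration task; only the instruction text changes.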