ICML 2025

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu

Abstract

[Figure 1 panels: (a) Multimodal in-context reasoning generation, with the instruction "Predict what the next image is" over the interleaved inputs "monkey", "cat", and "zebra" and their images; ground truth reasoning answer: flying zebra. (b) Multimodal composing, with the text prompt "The girl holds a board showing 'Stop copy & paste. Let's Think DIFFERENT.'"]

Figure 1. (a) Our ThinkDiff reasons over interleaved images (a flying monkey and a flying cat) and text prompts (monkey, cat, and zebra) to generate a logically correct and high-quality image (a flying zebra). The ground truth reasoning answer is provided as a reference for readers. (b) ThinkDiff composes images and texts into a coherent and reasonable image.