CVPR 2025

Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis

Woojung Han, Yeonkyung Lee, Chanyoung Kim, Kwanghyun Park, Seong Jae Hwang

Abstract

[Figure 1 image. Panel labels: Missing Objects, Mismatched Attributes. Example prompts: "A car to the right of a bicycle", "Super Mario above a gray car", "a black cake above a red stand", "a blue suitcase to the left of a red vase", "a dog to the right of a tv", "a backpack below a donut".]

Figure 1. Three main challenges in training-free text-to-image (T2I) generation: (1) missing objects, (2) mismatched attributes, and (3) mislocated objects. While existing approaches address missing objects and mismatched attributes, effectively controlling object positioning remains an open problem. Our proposed model, STORM, introduces a dynamic approach to aligning relative object positions throughout the denoising process, enabling precise spatial control without additional spatial templates. Red underline highlights errors made by SD.