CVPR 2025

VODiff: Controlling Object Visibility Order in Text-to-Image Generation

Dong Liang, Jinyuan Jia, Yuhao Liu, Zhanghan Ke, Hongbo Fu, Rynson W. H. Lau

Abstract

[Figure 1 image. Prompts: "A picture of a living room with a table and a plant in front of a sofa, and a dog on the sofa." "A picture of a table, with a mug and a cup on it, the mug is on a dish." "A suitcase with a banana in front of it and a dog behind it." Panels: (a) Layout+Order, (b) SD 1.5 [54], (c) GLIGEN [30], (d) DenseDiffusion [27], (e) MIGC [82], (f) VODiff (Ours).]

Figure 1. Existing T2I methods that rely on text prompts (b) and those that combine text and layout conditioning (c-e) often struggle to produce images with accurate occlusion relationships. In this work, we propose a new framework called VODiff, which enhances control over object arrangement by introducing their visibility orders (indicated by numbers above their bounding box in the layout map in (a)) as auxiliary constraints. VODiff enables the generation of images with correct spatial arrangements and occlusion relationships.
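To make the idea of a visibility order concrete, the following minimal Python sketch shows how an order attached to each bounding box determines which object is visible where boxes overlap. It is an illustration only, not the VODiff method: the `Box` class, the `visible_masks` function, the box coordinates, and the convention that a smaller order value means closer to the viewer are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical layout entry: an axis-aligned box plus a visibility order."""
    name: str
    x0: int
    y0: int
    x1: int  # exclusive
    y1: int  # exclusive
    order: int  # ASSUMPTION: smaller value = closer to the viewer


def visible_masks(boxes, width, height):
    """Return (owner grid, visible pixel count per object).

    owner[y][x] is the name of the front-most object covering that pixel,
    or None if no box covers it.
    """
    owner = [[None] * width for _ in range(height)]
    # Paint from the farthest object to the nearest so that nearer objects
    # overwrite the ones they occlude.
    for box in sorted(boxes, key=lambda b: b.order, reverse=True):
        for y in range(max(0, box.y0), min(height, box.y1)):
            for x in range(max(0, box.x0), min(width, box.x1)):
                owner[y][x] = box.name
    counts = {b.name: 0 for b in boxes}
    for row in owner:
        for name in row:
            if name is not None:
                counts[name] += 1
    return owner, counts


# Made-up layout for the prompt "A suitcase with a banana in front of it
# and a dog behind it." on an 8x8 grid.
boxes = [
    Box("suitcase", 2, 2, 6, 6, order=2),
    Box("banana", 3, 4, 5, 7, order=1),  # in front of the suitcase
    Box("dog", 0, 1, 4, 5, order=3),     # behind the suitcase
]
owner, counts = visible_masks(boxes, width=8, height=8)
print(counts)
```

With the same three boxes but no order, the overlapping regions would be ambiguous; the order is what resolves which object occludes which.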