CVPR 2025

VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim

Abstract

[Figure 1 image. Panels: Stable Diffusion, GLIGEN, InteractDiffusion, VerbDiff (Ours), Real. Prompts: "A photo of a man walking a bicycle on the street, carrying an umbrella" and "A basketball player jumping and throwing a basketball and a man blocking the ball".]

Figure 1. Generated samples illustrating multiple human-object interactions. Each color represents distinct humans, objects, and interaction words. GLIGEN [15] and InteractDiffusion [10] use grounding boxes as additional conditions, whereas Stable Diffusion [25] and VerbDiff rely solely on text.