ICLR 2025

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell

Abstract

SG-I2V builds on pre-trained image-to-video diffusion models (Stable Video Diffusion).

• Previous work: requires computationally expensive fine-tuning and the collection of motion-annotated datasets.
• This work: ✓ No fine-tuning. ✓ Relies solely on the knowledge present in the pre-trained image-to-video diffusion model.

Optimization process

Challenge I: How to obtain semantically aligned feature maps?

Issue: naively extracted feature maps are not semantically aligned. Pixels belonging to the same object have different feature vectors in different frames.

Key finding: semantically aligned feature maps can be produced by modifying the computation of the self-attention layers.

• Original spatial self-attention: the query, key, and value tokens all come from the same frame, so self-attention is computed independently for each frame.
• Modified spatial self-attention: the key and value tokens of every frame are replaced with those of the first frame, so the resulting feature maps are weighted sums of the value tokens from the first frame.

[Figure: original feature maps vs. semantically aligned feature maps obtained with the modified self-attention computation, shown for Frames 1 to 3.]

Key observation: only the low-frequency components of the optimized latents significantly influence motion.
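The two ideas above can be illustrated with a minimal NumPy sketch. It is not the SG-I2V implementation. The function names (`spatial_self_attention`, `split_frequencies`), the tensor layout `(frames, tokens, dim)`, the ideal circular low-pass mask, and the `cutoff` value are all assumptions chosen for illustration. The first function shows how sharing the first frame's key/value tokens makes every frame's output a weighted sum of first-frame values. The second shows one simple way to separate a latent into low- and high-frequency parts.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(q, k, v, share_first_frame=False):
    """Per-frame spatial self-attention on arrays of shape (frames, tokens, dim).

    share_first_frame=False: each frame attends to its own keys/values.
    share_first_frame=True:  every frame attends to the keys/values of frame 0
                             (hypothetical flag, for illustration only).
    Returns the attention output and the attention weights.
    """
    if share_first_frame:
        # Replace the key/value tokens of every frame with those of frame 0.
        k = np.broadcast_to(k[:1], k.shape)
        v = np.broadcast_to(v[:1], v.shape)
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax(q @ np.swapaxes(k, -1, -2) * scale)  # (frames, tokens, tokens)
    return weights @ v, weights


def split_frequencies(latent, cutoff=0.25):
    """Split a latent of shape (..., H, W) into low- and high-frequency parts.

    Uses an ideal circular low-pass mask in the 2D FFT domain; the mask shape
    and the cutoff (in cycles per sample) are assumptions of this sketch.
    """
    h, w = latent.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mask = np.sqrt(fy**2 + fx**2) <= cutoff
    low = np.fft.ifft2(np.fft.fft2(latent) * mask).real
    return low, latent - low


rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4, 8))  # 3 frames, 4 spatial tokens, 8 channels
k = rng.normal(size=(3, 4, 8))
v = rng.normal(size=(3, 4, 8))

aligned, weights = spatial_self_attention(q, k, v, share_first_frame=True)
latent = rng.normal(size=(2, 8, 8))
low, high = split_frequencies(latent)
```

With `share_first_frame=True`, the output for each frame equals `weights @ v[0]`, which is the "weighted sum of the value tokens from the first frame" described above. Frame 0 is unaffected by the flag, since it already attends to its own tokens.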