CVPR 2025

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin

Abstract

[Figure 1 image. Recoverable labels: Baseline; L2 loss; Condition leakage; Re-weighting. Example prompts: "Common vole drags sunflower seeds in the hole."; "A bear jumping high on a meadow."; "A bear running to the left on a meadow."]

Figure 1. Motivation and results of MotiF. (a) Example video frames and the corresponding motion heatmaps calculated from optical flow. In this example, 97% of the pixels are static while only 3% have meaningful motion. (b) In a standard TI2V training pipeline, the model may learn to over-rely on the conditional image to optimize the L2 loss. This issue was identified in [53] and termed conditional image leakage. We propose MotiF, which uses motion heatmap re-weighting to guide the model to focus on regions with more motion. (c) Qualitative results comparing MotiF to the baseline on examples from our proposed TI2V-Bench evaluation set.
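The re-weighting idea in panel (b) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the heatmap is the optical-flow magnitude normalized to [0, 1], and that a heatmap-weighted squared-error term is added to the plain L2 loss. The function names `motion_heatmap` and `reweighted_l2` and the parameter `motion_weight` are hypothetical, and pixels are flattened into plain lists to keep the example dependency-free.

```python
import math


def motion_heatmap(flow):
    """Map per-pixel flow vectors (dx, dy) to weights in [0, 1].

    Assumption: the heatmap is the flow magnitude divided by its maximum.
    """
    magnitudes = [math.hypot(dx, dy) for dx, dy in flow]
    peak = max(magnitudes)
    if peak == 0.0:
        # A fully static clip carries no motion signal to emphasize.
        return [0.0] * len(magnitudes)
    return [m / peak for m in magnitudes]


def reweighted_l2(pred, target, heatmap, motion_weight=1.0):
    """Plain mean squared error plus a heatmap-weighted squared-error term.

    Assumption: the two terms are summed, with motion_weight (hypothetical)
    scaling the motion-focused term.
    """
    sq_err = [(p - t) ** 2 for p, t in zip(pred, target)]
    n = len(sq_err)
    plain = sum(sq_err) / n
    focal = sum(w * e for w, e in zip(heatmap, sq_err)) / n
    return plain + motion_weight * focal


# Four pixels, only the third one moves.
flow = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
heatmap = motion_heatmap(flow)
target = [0.0, 0.0, 0.0, 0.0]
loss_static_error = reweighted_l2([1.0, 0.0, 0.0, 0.0], target, heatmap)
loss_moving_error = reweighted_l2([0.0, 0.0, 1.0, 0.0], target, heatmap)
```

Under these assumptions, the same unit error is penalized more heavily on the moving pixel than on a static one, so a model cannot minimize the loss simply by copying the static content of the conditioning image.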