CVPR 2025

Mind the Time: Temporally-Controlled Multi-Event Video Generation

Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov

Abstract

Prompts: "A cat is on a table" → "jumps to the floor" → "jumps to the sofa" → "walks to the table again" → "sits down and looks around"
Prompts: "A man is typing on a laptop" → "touches his headphone with his right hand" → "closes the laptop with his left hand" → "stands up"
Prompts: "A man is smiling" → "looks to his left with a surprised face" → "lowers his head with a sad face" → "smiles to the camera again"
Prompts: "An old lady waves her right hand" → "makes a thumbs-up gesture" → "makes a heart gesture" → "gives a blow kiss"

Figure 1. Time-controlled multi-event video generation with MinT. Given a sequence of event text prompts and their desired start and end timestamps, MinT synthesizes smoothly connected events with consistent subjects and backgrounds. In addition, it can control the time span of each event flexibly. Here, we show the results of sequential gestures, daily activities, facial expressions, and cat movements.
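To make the input described in the Figure 1 caption concrete, the sketch below shows one way a sequence of event prompts with start and end timestamps could be represented and sanity-checked. It is not MinT's actual interface: the `TimedEvent` class, the `validate_events` helper, the use of seconds as the time unit, the non-overlap requirement, and all timestamp values are assumptions made for illustration. Only the prompt strings come from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedEvent:
    """One event prompt and its time span in seconds (hypothetical schema)."""
    prompt: str
    start: float
    end: float


def validate_events(events, video_length):
    """Check that events are ordered, non-overlapping, and fit in the video.

    The non-overlap rule is an assumption of this sketch, not a stated
    requirement of the method.
    """
    prev_end = 0.0
    for ev in events:
        if not (0.0 <= ev.start < ev.end <= video_length):
            raise ValueError(f"bad time span for {ev.prompt!r}")
        if ev.start < prev_end:
            raise ValueError(f"{ev.prompt!r} overlaps the previous event")
        prev_end = ev.end
    return True


# Prompts are from Figure 1; the timestamps are illustrative only.
events = [
    TimedEvent("A cat is on a table", 0.0, 1.0),
    TimedEvent("jumps to the floor", 1.0, 2.0),
    TimedEvent("jumps to the sofa", 2.0, 3.5),
    TimedEvent("walks to the table again", 3.5, 5.0),
    TimedEvent("sits down and looks around", 5.0, 6.0),
]
validate_events(events, video_length=6.0)
```

Changing the `start` and `end` values of an event while keeping its prompt fixed corresponds to the flexible time-span control mentioned in the caption.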