CVPR 2025

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Frédo Durand, Eli Shechtman, Xun Huang

Abstract

Project page: https://causvid.github.io/

[Figure 1 image. Prompt: "Macro shot of a man wearing an antique diving helmet with dark glass and a jetpack walking on the veins of a leaf. Realistic style". Top row: bidirectional teacher, latency to generate the full 128-frame video: 219 s. Bottom row: causal student, obtained by asymmetric distillation with DMD, initial latency 1.3 s, on-the-fly generation at 9.4 FPS.]

Figure 1. Traditional bidirectional diffusion models (top) deliver high-quality outputs but suffer from significant latency, taking 219 seconds to generate a 128-frame video. Users must wait for the entire sequence to complete before viewing any results. In contrast, we distill the bidirectional diffusion model into a few-step autoregressive generator (bottom), dramatically reducing computational overhead. Our model (CausVid) achieves an initial latency of only 1.3 seconds, after which frames are generated continuously in a streaming fashion at approximately 9.4 FPS, facilitating interactive workflows for video content creation.
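To put the latency figures from the Figure 1 caption side by side, the following back-of-the-envelope sketch estimates how long the streaming model would take to emit all 128 frames. It is an illustration, not a measurement from the paper. It assumes a constant throughput of 9.4 FPS and that the 1.3 s initial latency elapses before any of the 128 frames is counted. The function name `streaming_total_seconds` is hypothetical.

```python
def streaming_total_seconds(num_frames: int, initial_latency_s: float, fps: float) -> float:
    """Estimate wall-clock time until the last frame is available.

    Assumption (not from the paper): throughput is constant and the
    initial latency is paid once, before any of the frames are counted.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return initial_latency_s + num_frames / fps


# Figures quoted in the Figure 1 caption.
BIDIRECTIONAL_TOTAL_S = 219.0  # full 128-frame video, bidirectional teacher
INITIAL_LATENCY_S = 1.3        # causal student, time to first output
STREAMING_FPS = 9.4            # causal student, approximate streaming rate
NUM_FRAMES = 128

# Derived estimates under the assumptions stated above.
causal_total_s = streaming_total_seconds(NUM_FRAMES, INITIAL_LATENCY_S, STREAMING_FPS)
first_output_ratio = BIDIRECTIONAL_TOTAL_S / INITIAL_LATENCY_S

print(f"Estimated streaming time for {NUM_FRAMES} frames: {causal_total_s:.1f} s")
print(f"Wait before first output, teacher vs. student: {first_output_ratio:.0f}x")
```

The more relevant comparison for interactive use is the wait before the first output: the bidirectional model shows nothing until the full 219 s have passed, whereas the streaming model starts displaying frames after 1.3 s.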