CVPR 2025
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
ByteDance
videodepthanything.github.io
[Figure 1, left panel: video frames at 3s, 22s, 42s, 59s, 79s, 99s, 122s, 140s, 161s, 175s]
Figure 1. Left: Our model generates consistent depth predictions for long videos with rich actions. The demo video shows a 196-second (4690 frames) long take of pair skating, sourced from [14]. Right: Comparison to baselines in terms of accuracy (δ1), consistency, and latency on an NVIDIA A100 GPU (latency is indicated by circle size). Consistency is defined as the maximum Temporal Alignment Error (TAE) among all models minus the TAE of each individual model, so higher is better. Our model achieves the best performance in all three aspects.
Abstract