CVPR 2025

Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors

Zhengfei Kuang, Tianyuan Zhang, Kai Zhang, Hao Tan, Sai Bi, Yiwei Hu, Zexiang Xu, Milos Hasan, Gordon Wetzstein, Fujun Luan

Abstract

[Figure 1 image; panel labels: Input Frames, Time]

Figure 1. Buffer Anytime improves temporal consistency in video geometry estimation without paired training data. Top: Comparison of depth estimation between Depth Anything V2 [55] and our method on a challenging dynamic scene with lighting variations. While the original model shows inconsistent depth predictions across frames, our approach maintains stable depth estimates. Bottom: Surface normal estimation comparison between Marigold-E2E-FT [20] and our method on an outdoor scene with complex geometry. Our method preserves consistent normal maps across frames while maintaining accurate geometric details. In both cases, our method achieves better temporal consistency without requiring video-geometry paired training data.