CVPR 2025

Can Generative Video Models Help Pose Estimation?

Ruojin Cai, Jason Y. Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, Ricardo Martin-Brualla

Abstract

[Figure 1 panel labels. Left: Image A, Image B; Pose Model (e.g. DUSt3R); Cameras + Scene from DUSt3R. Right: Image A, Image B; Video Model; Interpolated Frames from a Video Model; Pose Model (e.g. DUSt3R); Cameras + Scene from DUSt3R.]

Figure 1. Improving pose estimation by interpolating frames using a video model. Given two images of a scene with almost no overlap, we aim to recover their relative camera pose. Without being able to rely on visual correspondences, existing methods struggle in this setting (left). We propose to use an off-the-shelf video generation model to interpolate a video connecting the two images. Augmented with the frames generated by the video model, existing pose estimators (e.g. DUSt3R [59]) are able to more accurately recover the correct pose (right).
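The data flow described in Figure 1 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: `video_model` and `pose_model` are hypothetical callables standing in for a video interpolation model and a pose estimator such as DUSt3R, and do not correspond to any real library API. The sketch also assumes the pose estimator returns one 3x3 world-to-camera rotation per view, and it considers only the rotational part of the relative pose.

```python
import math


def matmul3(A, B):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(A):
    """Transpose a 3x3 matrix stored as nested lists."""
    return [[A[j][i] for j in range(3)] for i in range(3)]


def relative_rotation(R_a, R_b):
    """Rotation taking camera-A coordinates to camera-B coordinates.

    Assumes R_a and R_b are world-to-camera rotations (an assumed convention).
    """
    return matmul3(R_b, transpose3(R_a))


def rotation_angle_deg(R):
    """Angle of a rotation matrix in degrees, from its trace."""
    cos_theta = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def relative_rotation_via_video(image_a, image_b, video_model, pose_model,
                                num_frames=8):
    """Estimate the A-to-B rotation using generated in-between frames.

    video_model(image_a, image_b, num_frames) -> list of frames (hypothetical)
    pose_model(views) -> list of 3x3 rotations, one per view (hypothetical)
    """
    frames = video_model(image_a, image_b, num_frames)
    # The original pair brackets the generated frames.
    views = [image_a] + list(frames) + [image_b]
    rotations = pose_model(views)
    # Only the poses of the two real images are reported.
    return relative_rotation(rotations[0], rotations[-1])
```

In this sketch the generated frames serve only as extra views for the pose estimator; the returned quantity is still the relative rotation between the two original images, which `rotation_angle_deg` turns into a single angle for comparison against a reference.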