CVPR2025

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

Khiem Vuong, Anurag Ghosh, Deva Ramanan, Srinivasa Narasimhan, Shubham Tulsiani

Abstract

Figure 1. First row: Examples of our generated cross-view (aerial-ground) geometry data, including co-registered pseudo-synthetic (i.e., mesh-rendered) aerial and real ground-level images, with corresponding depth maps, point clouds, and camera intrinsics/extrinsics in a unified coordinate system, for a variety of scenes. Second row: Leveraging such data curated over 137 landmarks and 132K geo-registered images, we show significant improvements in learning-based methods on real unseen ground-aerial scenarios across two representative tasks: 1) multi-view geometry prediction using DUSt3R [64] finetuned on our data, and 2) novel view synthesis from a single image conditioned on a target pose by fine-tuning ZeroNVS [45] that was originally trained on MegaScenes [60].