WWW2026
Augmenting Cross-View Geo-Localization with Spatial Semantics from Vision Foundation Models
Ji Shen, Lixing Chen, Yang Bai, Zhongqi Miao, Zhe Qu, Pan Zhou, Jianhua Li
摘要
Cross-view geo-localization (CVGL) establishes correspondences between ground-level and satellite images of the same geographic location, serving as a fundamental technology for smart city applications, including autonomous navigation, urban planning, and location-based services. Current CVGL approaches fall into two categories: feature-based methods achieve superior performance through 2D representation learning but lack interpretability. Spatial-based methods provide geometric understanding and interpretable matching but suffer from limited spatial modeling and weak cross-view alignment, leading to lower performance. We reformulate CVGL from a spatial perspective and propose an auxiliary task-enhanced network. The network captures spatial semantics and provides explicit alignment processes with visualizable results. We introduce an auxiliary spatial semantic alignment (SSA) task that learns spatial structure via vision foundation models (VFM) and BEV transformation to enhance the primary CVGL task. The primary task captures visual semantics, including texture and appearance. Within this unified framework with shared encoders, the primary task enriches the learned embeddings by fusing spatial structure with visual semantics, yielding spatially complete representations. Extensive experiments on three standard CVGL benchmarks demonstrate that our method significantly surpasses previous spatial-based approaches while maintaining competitive performance with state-of-the-art (SOTA) feature-based methods, achieving 98.48% R@1 on CVUSA and 71.05% R@1 on CVACT_test. We provide comprehensive analyses through pixel-level activation maps and feature-space UMAP visualizations to validate both effectiveness and interpretability.