ICLR 2026

SatDreamer360: Multiview-Consistent Generation of Ground-Level Scenes from Satellite Imagery

Xianghui Ze, Beiyi Zhu, Zhenbo Song, Jianfeng Lu, Yujiao Shi

Cited by 1

Abstract

Generating multiview-consistent 360° ground-level scenes from satellite imagery is a challenging task with broad applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view panoramas, often relying on auxiliary inputs like height maps or handcrafted projections, and struggle to produce multiview-consistent sequences. In this paper, we propose SatDreamer360, a framework that generates geometrically consistent multi-view ground-level panoramas from a single satellite image, given a predefined pose trajectory. To address the large viewpoint discrepancy between ground and satellite images, we adopt a triplane representation to encode scene features and design a ray-based pixel attention mechanism that retrieves view-specific features from the triplane. To maintain multi-frame consistency, we introduce a panoramic epipolar-constrained attention module that aligns features across frames based on known relative poses. To support the evaluation, we introduce VIGOR++, a large-scale dataset for generating multi-view ground panoramas from a satellite image, by augmenting the original VIGOR dataset with more ground-view images and their pose annotations. Experiments show that SatDreamer360 outperforms existing methods in both satellite-to-ground alignment and multiview consistency.

INTRODUCTION

Generating ground-level scenes from satellite imagery has attracted significant attention due to the broad coverage and low acquisition cost of satellite images. This task shows promising applications in autonomous driving (Villalonga Pineda (2021); Lu et al. (2024)), 3D reconstruction (Liu et al. (2024); Yan et al. (2024)) and data augmentation (Yang et al. (2023); Gao et al. (2023)) for downstream tasks. Many existing works (Li et al. (2024a); Lin et al. (2024); Xu & Qin (2024); Ze et al. (2025)) focus on generating individual ground images from satellite views, leaving the continuity across multiple ground views largely unaddressed. In this paper, we aim to synthesize multiple ground-view images from a single satellite image, controlled by a predefined trajectory. This introduces new challenges in maintaining both geometric consistency with the top-down satellite image and multiview coherence across the sequence of generated frames.

Early approaches (Isola et al. (2017a); Regmi & Borji (2018); Shi et al. (2022); Lu et al. (2020); Qian et al. (2023)) formulate cross-view synthesis as a one-to-one mapping problem, often implemented with Conditional Generative Adversarial Networks (cGANs). These methods focus on aligning representations at the pixel or perceptual level. However, the extreme viewpoint disparity between top-down satellite views and street-level images leads to limited field-of-view overlap. Satellite images inherently miss key elements such as building facades, tree trunks, and other occluded details, making the ground-view generation task highly under-constrained and naturally one-to-many. Recent advances leverage latent diffusion models (LDMs) (Rombach et al. (2022)) to better handle this uncertainty (Li et al. (2024a); Lin et al. (2024); Deng et al. (2024); Xu & Qin (2024); Ze et al. (2025)). These methods introduce probabilistic modeling to produce diverse and high-fidelity ground