ICLR 2026

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li



Abstract

Spatial embodied intelligence often operates under partial observability, where agents must act to acquire missing information rather than passively consume complete observations. In such settings, progress depends on actively selecting informative actions that reduce uncertainty and support the construction of spatial understanding. While multimodal foundation models have shown strong performance on passive multimodal perception and reasoning tasks, their ability to support active, self-directed exploration under partial observability has not been systematically studied. In particular, it remains unclear whether and how these models can decide what to observe next in order to build and maintain a coherent spatial belief over time. We therefore propose THEORY OF SPACE, defined as an agent's ability to acquire information through self-directed, active exploration and to construct, revise, and exploit a spatial belief from sequential, partial observations. We implement THEORY OF SPACE using a benchmark with textual and visual environments. Rather than solving specific tasks, agents pursue curiosity-driven exploration to build a complete, accurate spatial belief. A core innovation is spatial belief probing: we prompt the model to reveal its internal spatial belief as a cognitive map at each step, letting us measure the quality of its underlying spatial model. Our evaluation of state-of-the-art models on a suite of downstream tasks reveals critical bottlenecks: (1) The Active-Passive Gap: Performance degrades when agents must autonomously gather information (e.g., GPT-5.2: 0.57→0.46); (2) Inefficiency: Models explore unsystematically and with high redundancy, failing to match the efficiency of program-based proxies while producing no better results. Through belief probing, we diagnose that perception acts as an initial bottleneck, yet global beliefs suffer further from instability that causes spatial knowledge to degrade over time.
Finally, using a false belief paradigm to test belief revision, we uncover Belief Inertia where agents fail to overwrite obsolete priors. This issue exists in text agents but is notably severe in vision-based models.