CVPR 2025

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield

Abstract

Generalist robot policies require strong spatial priors to operate reliably across diverse environments, enabling them to perceive, reason, and act within 3D space from multiple perspectives. Vision-language models (VLMs) are promising backbones for such policies but are limited by training on generic web-scale image-text datasets that lack rich, multi-frame spatial cues for manipulation. One example is reference frame comprehension (deciding whether to reason in egocentric, world-centric, or object-centric coordinates), which is critical for precise, context-aware actions. We introduce ROBOSPATIAL, a large-scale dataset built from real indoor and tabletop 3D scans paired with egocentric RGB views, containing 1M images, 5k scans, and 3M annotated spatial relations spanning object-object, object-space, and object-compatibility reasoning. Its 2D/3D-ready design supports learning priors that generalize across viewpoints, scales, and task contexts. Models trained on ROBOSPATIAL achieve significant gains in spatial reasoning benchmarks and robot manipulation, demonstrating how targeted spatial priors enhance the generalization and reliability of robot policies.