ICLR 2026

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Jinhui Yin, Qi Wei, Tianshuo Yang, Min Dou, Tong He, Wenqi Shao, Kaipeng Zhang, Yi Wang, Botian Shi, Yanting Zhang, Jifeng Dai, Yu Qiao, Wenhai Wang, Hongjie Zhang

Cited by 22

Abstract

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain constrained by limited scale, narrow visual diversity, and restricted instruction expressiveness. To address these gaps, we present InternSpatial, the largest open-source dataset for spatial reasoning in VLMs, alongside InternSpatial-Bench, a comprehensive evaluation benchmark designed to assess spatial understanding across diverse instruction formats. InternSpatial contains 12 million question-answer (QA) pairs covering both single-view and multi-view scenarios, sourced from varied visual environments and supporting 19 distinct instruction formats that mirror real-world query patterns. InternSpatial-Bench targets single-view assessment and also extends multi-view reasoning evaluation through a novel rotation estimation task. Experiments show that models trained on InternSpatial achieve substantial performance improvements of 12.1% on InternSpatial-Bench and 10.7% on VSI-Bench, while preserving competitive performance on general-purpose benchmarks. We expect these resources to advance the development of spatially capable VLMs for practical applications in robotics and embodied AI systems. Our code and datasets are publicly available at https://github.com/dengnianchen/intern-spatial.