ICLR 2026
Spatially Guided Training for Vision-Language-Action Model
Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Zhu Yangkun, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang
Cited by 6
Abstract
Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatially Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data and models are released at https://internrobotics.github.io/internvla-m1.github.io.
INTRODUCTION
Large multimodal foundation models (Li et al., 2024b; Chen et al., 2024; Bai et al., 2025b; Ye et al., 2025a; Radford et al., 2021; Zhai et al., 2023; Liu et al., 2025b) have demonstrated remarkable generalization capabilities by learning from web-scale vision-language data. However, a critical gap remains when transferring these capabilities to the physical domain, because robots must not only understand what an instruction means but also determine where and how to act in the 3D world.
This gap is fundamental, as real-world robotic tasks must align textual instructions with embodiment-specific motor actions. Textual instructions are sparse, whereas real-world actions demand continuous, embodied interaction, and such text-to-action pairs are inherently scarce in standard VLM training data. Core spatial priors, such as object recognition, affordance grounding, visual trajectory reasoning, and relative localization, provide transferable and generalizable knowledge for robotic manipulation. Once these spatial priors are established, embodiment-specific learning can focus on concrete control strategies (e.g., manipulator joints, end-effector trajectories, humanoid locomotion, or mobile navigation). This division establishes spatial priors as general-purpose foundations while leaving embodiment-specific details to downstream adaptation, thereby bridging the gap between abstract linguistic instruction and grounded physical execution.