EMNLP 2025

ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model

Zhongyi Zhou, Yichen Zhu, Minjie Zhu, Junjie Wen, Ning Liu, Zhiyuan Xu, Weibin Meng, Yaxin Peng, Chaomin Shen, Feifei Feng, Yi Xu

Abstract

Visual Grounding
Q: Provide the bounding box coordinate of the police vehicle.
A: [0.26, 0.56, 0.44, 0.71]

Image Captioning
Q: Provide a one-sentence caption for the image.
A: A vintage-style street clock stands prominently at a city intersection, with a historic brick building in the background and several cars, including a police car, navigating the crosswalk.