CVPR2024

Situational Awareness Matters in 3D Vision Language Reasoning

Yunze Man, Liang-Yan Gui, Yu-Xiong Wang

9 citations

Abstract

Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant mile-stone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a criti-cal and distinct challenge in 3D vision language reasoning is the situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation, and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situational estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation accuracy). Subsequent analysis corrobo-rates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question-answering. Project page is available at htt ps: //yunzeman.github.io/situation3d.