AAAI2025

LIBA: Language Instructed Multi-granularity Bridge Assistant for 3D Visual Grounding

Yuan Wang, Yali Li, Eastman Z. Y. Wu, Shengjin Wang

被引用 11 次

摘要

3D Vision Grounding (3D-VG) seeks to unravel referential language and identify targets in 3D physical world. Prevailing methods align with the 2D-VG's pipeline to pinpoint the referred object in a categorical multi-modal reasoning manner. However, the geometric complexities of 3D scenes and the nuanced syntactic structures of language, exacerbates the granularity inconsistency of point cloud and text features, hindering the development of 3D-VG systems in complex scenarios. Towards this issue, we propose LIBA, a Language-Instructed multi-granularity Bridge Assistant tailored for 3D-VG task. LIBA tackles this issue as follows. (1) How to establish a multi-granularity 3D vision-text feature alignment in a unified model? We advance a bilateral Dynamic Bridge Adapter (DBA) build multi-granularity interaction of 3D vision and language backnones during feature extraction. We further develop the Language-aware Cross-scale Object Modulation (LCOM) module to integrate multi-scale point cloud features modulated by language information. (2) After aligning multi-modal features, how to fully harness language model's knowledge to bolster vision concepts understanding? A LLM-guided Hierarchical Query Selection (LLM-HQS) module incorporates world knowledge of Large Language Model (LLM) to ground the target referral via an Attribute-then-Relation reasoning process. In this manner, our LIBA inherits reasoning prowess and world knowledge of LLM to bridge point clouds and texts at multiple granularities. Experiments on ScanRefer and Nr3D/Sr3D benchmarks substantiate the superiority of our LIBA, trumping state-of-the-arts by a considerable margin.