CVPR2024

Towards CLIP-Driven Language-Free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency

Yuqi Zhang, Han Luo, Yinjie Lei

5 citations

Abstract

3D visual grounding plays a crucial role in scene understanding, with extensive applications in AR/VR. Despite the significant progress made in recent methods, the re-quirement of dense textual descriptions for each individ-ual object, which is time-consuming and costly, hinders their scalability. To mitigate reliance on text annotations during training, researchers have explored language-free training paradigms in the 2D field via explicit text gen-eration or implicit feature substitution. Nevertheless, un-like 2D images, the complexity of spatial relations in 3D, coupled with the absence of robust 3D visual language pre-trained models, makes it challenging to directly trans-fer previous strategies. To tackle the above issues, in this paper, we introduce a language-free training framework for 3D visual grounding. By utilizing the visual-language joint embedding in 2D large cross-modality model as a bridge, we can expediently produce the pseudo-language features by leveraging the features of 2D images which are equivalent to that of real textual descriptions. We fur-ther develop a relation injection scheme, with a Neighboring Relation-aware Modeling module and a Cross-modality Relation Consistency module, aiming to enhance and pre-serve the complex relationships between the 2D and 3D embedding space. Extensive experiments demonstrate that our proposed language-free 3D visual grounding approach can obtain promising performance across three widely used datasets - ScanRefer, Nr3D and Sr3D. Our codes are avail-able at https://github.com/xibi777/3DLFVG