CVPR 2025

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang

Abstract

[Figure 1: six panels, each comparing Previous SoTA with Ours on one query]
(a) Texture: "The laptop beside the floral-patterned chair."
(b) Shape: "Select the couch that has an L shape."
(c) Viewpoint: "The window to the right of the oven hood."
(d) Orientation: "The chair backed to the window."
(e) State: "The closed door. Not the bathroom door."
(f) Order: "The bookshelf second from the right."

Figure 1. Effectiveness of SeeGround. Unlike the previous SoTA, our method associates 2D visual cues (color, texture, viewpoint, spatial position, orientation, and state) with 3D spatial text descriptions to achieve precise scene understanding. Specifically, our method: (a) identifies the floral chair by recognizing unique color and texture cues; (b) recognizes the couch by interpreting geometric shape; (c) determines the right window by interpreting spatial relationships and perspective; (d) identifies the chair by discerning directional alignment; (e) detects the closed door by visually interpreting its state; and (f) selects the bookshelf by understanding relative positioning.