CVPR 2025

GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

Yawen Shao, Wei Zhai, Yuhang Yang, Hongchen Luo, Yang Cao, Zheng-Jun Zha

Abstract

A. Implementation Details

A.1. Method Details

The dimensions and meanings of the tensors in the GREAT pipeline are listed in Tab. 1.

For the image branch, ResNet18 [4] is chosen as the feature extractor. The input image is randomly cropped and resized to 224 × 224, producing image features F_i ∈ R^{512×7×7}. A 1×1 convolutional layer is applied to reduce the feature dimension, and the feature is flattened to F_i ∈ R^{512×49}.

For the point branch, each input point cloud contains 2048 points. We employ PointNet++ [11], which consists of three set abstraction (SA) layers, to progressively extract multi-scale point cloud features. Within the SA layers, Farthest Point Sampling (FPS) is used to sample points, with the sampling counts set to 512, 128, and 64, respectively. Ultimately, this branch outputs point features F_p ∈ R^{512×2048}.

The detailed prompts for Multi-Head Affordance Chain-of-Thought (MHACoT) reasoning are presented below.

- Prompt One: "Point out which part of the object in the image interacts with the person. If this part is different from the part of the object shown in the image that performs the main function, point out the part of the object that performs the main function shown in the image."
- Prompt Two: "Explain why this part can interact from the geometric structure of the object. Just give the final result in one sentence."
- Prompt Three: "Describe the interaction between object and the person in the image, including the interaction