CVPR2020

HOnnotate: A Method for 3D Annotation of Hand and Object Poses

Shreyas Hampali, Mahdi Rad, Markus Oberweger, Vincent Lepetit

摘要

Achieving 3D hand-object pose estimation in interaction scenarios is challenging due to the severe occlusion generated during the interaction. Existing methods address this issue by utilizing the correlation between the hand and object poses as additional cues. They usually first extract the hand and object features from their respective regions and then refine them with each other. However, this paradigm disregards the role of a broad range of image context. To address this problem, we propose a novel and robust approach that learns a broad range of context by imposing priors. First, we build this approach using stacked transformer decoder layers. These layers are required for extracting image-wide context and regional hand or object features by constraining cross-attention operations. We share the context decoder layer parameters between the hand and object pose estimations to avoid interference in the context-learning process. This imposes a prior, indicating that the hand and object are mutually the most important context for each other, significantly enhancing the robustness of obtained context features. Second, since they play different roles, we provide customized feature maps for the context, hand, and object decoder layers. This strategy facilitates the disentanglement of these layers, reducing the feature learning complexity. Finally, we conduct extensive experiments on the popular HO3D and Dex-YCB databases. The experimental results indicate that our method significantly outperforms state-of-the-art approaches and can be applied to other hand pose estimation tasks. The code will be released. CCS CONCEPTS • Computing methodologies → Computer vision tasks.