CVPR 2023

Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos

Yilin Wen, Hao Pan, Lei Yang, Jia Pan, Taku Komura, Wenping Wang

Abstract

• Frequent self-occlusions between hands and objects.
• Severe ambiguity of action types judged from individual frames.
➢ Build a hierarchical temporal transformer with two cascaded blocks, to:
  ✓ leverage different time spans for pose and action estimation.
  ✓ model the semantic correlation by deriving the high-level action from the low-level hand motion and manipulated object label.
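
The cascaded two-block idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that the low-level (pose) block attends within short windows of frames and that the high-level (action) block attends over the low-level outputs of the whole clip. The names `self_attention`, `split_windows`, `mean_pool`, `hierarchical_forward`, and `short_span` are hypothetical, and the weight-free single-head attention is only a stand-in for a real transformer block with learned projections.

```python
import math
from typing import List, Tuple

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention without learned weights.

    Stand-in for a transformer block (assumption for illustration only).
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def split_windows(frames: List[Vector], span: int) -> List[List[Vector]]:
    """Cut the clip into consecutive windows of at most `span` frames."""
    return [frames[i:i + span] for i in range(0, len(frames), span)]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average a list of tokens into one feature vector."""
    d = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]


def hierarchical_forward(frames: List[Vector], short_span: int) -> Tuple[List[Vector], Vector]:
    # Low-level block: temporal attention restricted to a short window.
    # Its per-frame outputs stand in for hand motion / object label features.
    per_frame: List[Vector] = []
    for window in split_windows(frames, short_span):
        per_frame.extend(self_attention(window))

    # High-level block: attention over the low-level outputs of the whole clip,
    # pooled into a single clip-level feature for action classification.
    action_feature = mean_pool(self_attention(per_frame))
    return per_frame, action_feature
```

Calling `hierarchical_forward(frames, short_span=3)` on a clip of 8 per-frame feature vectors returns 8 low-level tokens (one per frame) and one clip-level feature. In a real model, separate heads would map the former to hand pose and object label and the latter to the action class.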