CVPR 2023
Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
Yilin Wen, Hao Pan, Lei Yang, Jia Pan, Taku Komura, Wenping Wang
Abstract
Challenges:
➢ Frequent self-occlusions between hands and objects.
➢ Severe ambiguity of action types when judged from individual frames.
Approach:
➢ Build a hierarchical temporal transformer with two cascaded blocks, to:
✓ leverage different time spans for pose and action estimation;
✓ model the semantic correlation by deriving the high-level action from the low-level hand motion and the manipulated object label.
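
The cascaded two-block design can be illustrated with a minimal data-flow sketch. This is not the authors' implementation. It assumes identity projections instead of learned weights, mean pooling to summarize each window, and a hypothetical window length. The names `self_attention`, `split_windows`, and `hierarchical_forward` are invented for illustration. The sketch only shows how a short-span block can feed a long-span block.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head dot-product self-attention.

    Assumption: identity query/key/value projections (no learned weights),
    so each output token is a convex combination of the input tokens.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


def split_windows(frames, window):
    """Split a frame sequence into consecutive, non-overlapping windows."""
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def mean_pool(tokens):
    """Average a list of equal-length feature vectors."""
    dim = len(tokens[0])
    return [sum(tok[j] for tok in tokens) / len(tokens) for j in range(dim)]


def hierarchical_forward(frame_feats, window):
    """Hypothetical two-block cascade: short span first, long span second."""
    pose_feats = []      # per-frame outputs of the short-span block
    window_tokens = []   # one summary token per window for the long-span block
    for segment in split_windows(frame_feats, window):
        attended = self_attention(segment)        # block 1: attends within one window
        pose_feats.extend(attended)
        window_tokens.append(mean_pool(attended))  # assumption: mean pooling as summary
    action_context = self_attention(window_tokens)  # block 2: attends across all windows
    action_feat = mean_pool(action_context)
    return pose_feats, action_feat


# Toy input: 8 frames with 4-dimensional features, hypothetical window of 4 frames.
frames = [[0.1 * (i + j) for j in range(4)] for i in range(8)]
pose_feats, action_feat = hierarchical_forward(frames, window=4)
```

In this sketch the first block sees only a few neighbouring frames, which suits per-frame pose, while the second block sees one token per window across the whole clip, which suits the action label. In a real model the window summary would also carry the predicted object label, which is omitted here.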