CVPR 2023
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition
Xiaoyu Zhu, Po-Yao Huang, Junwei Liang, Celso M. de Melo, Alexander G. Hauptmann
Abstract
Figure 1. Current state-of-the-art MoCap-based action recognition methods first convert body markers into a human body mesh, which is then used to predict a standardized 3D skeleton. The 3D skeleton serves as the input to action recognition models (dashed line). We propose a method that directly models the dynamics of raw mesh sequences (solid line). Our method avoids the manual effort of deriving a skeleton representation and achieves superior recognition performance by leveraging surface motion and body shape knowledge from meshes.