Multi-Scale Spatial-Temporal Feature Fusion For 3D Human Pose Estimation
Abstract
To address the problems of inaccurate feature representation, unsmooth results, and high computational cost in video-based single-person three-dimensional human pose estimation, a multi-scale spatial-temporal feature fusion method is proposed. First, joint, limb, and upper/lower-body tokens with positional embeddings are defined in the spatial domain to represent the spatial multi-scale features of the human body. Second, a spatial multi-scale feature fusion module, built on the self-attention mechanism and a multilayer perceptron, fuses the joint, limb, and upper/lower-body features to obtain an initial pose feature sequence. Finally, temporal multi-scale encoding performs temporal feature fusion to acquire the final pose feature sequence, and temporal decoding generates the refined three-dimensional human pose. Experimental results on the Human3.6M dataset show that the mean per joint position error under Protocol 2 and the joint velocity error are 33.6 mm and 2.4 mm respectively, reductions of 2.3% and 4%. The proposed method improves three-dimensional human pose estimation accuracy and generates precise, smooth results while reducing computational cost. Experimental results on the HumanEva-I dataset further show that the method has a certain degree of generalization ability.
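The spatial multi-scale tokenization described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: joint features are mean-pooled into limb and upper/lower-body tokens, and all scales are fused with a single self-attention step. The joint groupings, feature dimension, and the projection-free attention are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
J, D = 17, 32  # Human3.6M uses 17 joints; the feature dimension D is an assumption

# Illustrative joint groups for limbs and body halves (indices are assumptions).
LIMBS = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9, 10], [11, 12, 13], [14, 15, 16]]
UPPER, LOWER = list(range(7, 17)), list(range(0, 7))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head scaled dot-product self-attention (no learned projections)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

joints = rng.standard_normal((J, D))                  # per-joint features for one frame
limbs = np.stack([joints[g].mean(axis=0) for g in LIMBS])
body = np.stack([joints[UPPER].mean(axis=0), joints[LOWER].mean(axis=0)])

# Multi-scale token sequence: 17 joint + 5 limb + 2 body tokens.
tokens = np.concatenate([joints, limbs, body])        # shape (24, D)
fused = self_attention(tokens)
print(fused.shape)                                    # (24, 32)
```

In the paper's pipeline this fused sequence would then feed the temporal multi-scale encoder; here the pooling merely shows how coarser-scale tokens can be derived from joint tokens before attention-based fusion.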