Zhang Yu, Liu Li, Fu Xiaodong, Liu Lijun, Peng Wei. Multi-Scale Spatial-Temporal Feature Fusion for 3D Human Pose Estimation[J]. Journal of Computer-Aided Design & Computer Graphics, 2025, 37(1): 75-88. DOI: 10.3724/SP.J.1089.2023-00041

Multi-Scale Spatial-Temporal Feature Fusion for 3D Human Pose Estimation

  • To address inaccurate representation, inadequate fusion, and non-smooth results in video-based single-person three-dimensional (3D) human pose estimation, a multi-scale spatial-temporal feature fusion method is proposed. First, joint, limb, and upper/lower-body tokens are defined in the spatial domain and given positional embeddings to represent the spatial multi-scale features of the human body. Second, a spatial multi-scale feature fusion module, built on the self-attention mechanism and a multilayer perceptron, fuses the joint, limb, and upper/lower-body features into an initial pose feature sequence. Finally, a temporal multi-scale encoding performs temporal feature fusion to obtain the final pose feature sequence, and temporal decoding refines the generated 3D human pose (a sketch of this pipeline is given below). On the Human3.6M dataset, the method achieves a mean per-joint position error of 33.6 mm and a joint velocity error of 2.4 mm, reductions of 2.3% and 4.0%, respectively. It improves 3D human pose estimation accuracy and generates precise, smooth results while reducing computational cost. Results on the HumanEva-I dataset further show a degree of generalization ability.
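The three steps described in the abstract can be pictured with a short PyTorch sketch. Everything below is an illustrative reconstruction from the abstract alone: the token counts (17 joints, 10 limbs, 2 body halves), the embedding width, the 81-frame window, and module names such as SpatialFusion and TemporalFusion are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: reconstructs the abstract's three steps
# (multi-scale spatial tokens -> spatial fusion -> temporal fusion/decoding).
# Token counts, dimensions, and layer settings are assumed, not the paper's.
import torch
import torch.nn as nn

J, L, B = 17, 10, 2      # assumed joint, limb, and upper/lower-body token counts
D, T = 64, 81            # assumed embedding width and frame window

class SpatialFusion(nn.Module):
    """Steps 1-2: positional embeddings over joint/limb/body tokens,
    then self-attention + MLP fusion into per-frame pose features."""
    def __init__(self, dim=D, heads=4):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, J + L + B, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):                 # (batch, J+L+B, dim)
        x = tokens + self.pos
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # fuse features across scales
        x = x + self.mlp(self.norm2(x))
        return x[:, :J]                        # joint tokens = initial pose features

class TemporalFusion(nn.Module):
    """Step 3: temporal encoding over the pose feature sequence,
    then decoding to a 3D pose per frame."""
    def __init__(self, dim=D * J, frames=T):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, frames, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(dim, J * 3)

    def forward(self, seq):                    # (batch, frames, J*D)
        x = self.encoder(seq + self.pos[:, :seq.shape[1]])
        return self.decode(x).unflatten(-1, (J, 3))

# Usage: embedded 2D-pose tokens for a clip -> a 3D pose sequence.
spatial, temporal = SpatialFusion(), TemporalFusion()
clip = torch.randn(2, T, J + L + B, D)         # (batch, frames, tokens, dim)
feats = torch.stack([spatial(clip[:, t]) for t in range(T)], dim=1)
pose3d = temporal(feats.flatten(2))            # (2, 81, 17, 3)
```

Keeping only the joint tokens after spatial fusion matches the abstract's notion of an "initial pose feature sequence"; how the limb and body-half tokens are formed from 2D detections is not specified in the abstract, so this sketch simply takes all tokens as given inputs.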
