
Multi-Scale Spatial-Temporal Feature Fusion for 3D Human Pose Estimation

Abstract: To address inaccurate representation, insufficient feature fusion, and unsmooth results in video-based single-person three-dimensional (3D) human pose estimation, a multi-scale spatial-temporal feature fusion method is proposed. First, joint, limb, and upper/lower-body tokens are defined in the spatial domain, and positional embeddings are used to represent the spatial multi-scale features of the human body. Next, a spatial multi-scale feature fusion module is built from a self-attention mechanism and a multilayer perceptron to fuse the joint, limb, and upper/lower-body features, yielding an initial pose feature sequence. Finally, a temporal multi-scale encoding is established for temporal feature fusion to obtain the final pose feature sequence, and a refined 3D human pose is generated through temporal decoding. Experimental results on the Human3.6M dataset show that the method achieves a mean per-joint position error (P-MPJPE) of 33.6 and a mean per-joint velocity error (MPJVE) of 2.4, which are 2.3% and 4.0% lower than those of the compared methods; the method reduces computational complexity, improves 3D human pose estimation accuracy, and produces accurate, smooth estimation results. Furthermore, results on the HumanEva-I dataset indicate that the proposed method also generalizes to a certain degree.
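The abstract's first step, building joint, limb, and upper/lower-body tokens at three spatial scales, can be illustrated with a minimal pooling sketch. This is an assumption-laden illustration, not the paper's implementation: the 17-joint Human3.6M-style indices, the limb grouping, and mean pooling as the aggregation are all hypothetical choices made here for clarity (the paper uses learned embeddings and attention, not simple averaging).

```python
# Hypothetical sketch of spatial multi-scale token construction:
# joint-level features are pooled into limb tokens, and limb tokens
# into upper/lower-body tokens. The 17-joint grouping below is an
# illustrative Human3.6M-style layout, not the paper's exact scheme.

def mean_pool(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Illustrative limb groups over 17 joint indices (assumed layout).
LIMBS = {
    "torso":     [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg":  [4, 5, 6],
    "left_arm":  [11, 12, 13],
    "right_arm": [14, 15, 16],
}

# Upper/lower-body tokens are pooled from the limb tokens.
BODY = {
    "upper": ["torso", "left_arm", "right_arm"],
    "lower": ["left_leg", "right_leg"],
}

def build_multiscale_tokens(joint_feats):
    """joint_feats: list of 17 feature vectors.
    Returns the three spatial scales: joint, limb, and body tokens."""
    limb_tokens = {name: mean_pool([joint_feats[i] for i in idx])
                   for name, idx in LIMBS.items()}
    body_tokens = {name: mean_pool([limb_tokens[l] for l in limbs])
                   for name, limbs in BODY.items()}
    return joint_feats, limb_tokens, body_tokens
```

In the paper these three token sets are then fed, with positional embeddings, into the self-attention/MLP fusion module; the sketch only shows how the coarser scales can be derived from the joint scale.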
