Multi-Scale Spatial-Temporal Feature Fusion for 3D Human Pose Estimation
Abstract: To address the problems of inaccurate representation, unsmooth results, and high computational cost in video-based single-person 3D human pose estimation, a multi-scale spatial-temporal feature fusion method is proposed. First, joint, limb, and upper/lower-body tokens are defined in the spatial domain, and the spatial multi-scale features of the human body are represented through positional embedding. Then, a spatial multi-scale feature fusion module combining a self-attention mechanism with a multilayer perceptron is constructed to fuse the joint-, limb-, and upper/lower-body-scale features, producing an initial pose feature sequence. Finally, temporal multi-scale encoding is established for temporal feature fusion to obtain the final pose feature sequence, and refined 3D human poses are generated through temporal decoding. Experimental results on the Human3.6M dataset show that the proposed method achieves a Procrustes-aligned mean per-joint position error (P-MPJPE, Protocol 2) of 33.6 mm and a mean per-joint velocity error (MPJVE) of 2.4 mm, reductions of 2.3% and 4%, respectively, over existing methods; it lowers computational complexity, improves 3D human pose estimation accuracy, and yields accurate, smooth pose estimates. In addition, test results on the HumanEva-I dataset indicate that the proposed method also generalizes to a certain degree.
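To make the spatial stage of the pipeline concrete, the following is a minimal PyTorch sketch of spatial multi-scale token fusion: joint tokens are lifted from 2D keypoints, limb and upper/lower-body tokens are pooled from them, positional embeddings are added, and a self-attention + MLP block fuses the three scales. This is not the authors' implementation; the token groupings, dimensions, and module structure are assumptions made purely for illustration.

```python
# Illustrative sketch (not the paper's code) of spatial multi-scale token fusion.
# Joint/limb/body groupings, dims, and heads below are hypothetical choices.
import torch
import torch.nn as nn

class SpatialMultiScaleFusion(nn.Module):
    """Fuses joint-, limb-, and half-body-level tokens via self-attention + MLP."""

    def __init__(self, num_joints=17, num_limbs=5, num_parts=2, dim=64, heads=4):
        super().__init__()
        # Lift detected 2D joint coordinates to token embeddings.
        self.joint_embed = nn.Linear(2, dim)
        # Learnable positional embeddings for all joint + limb + body tokens.
        n_tokens = num_joints + num_limbs + num_parts
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        # Transformer-style block: self-attention followed by an MLP.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, joints_2d, limb_groups, part_groups):
        # joints_2d: (B, num_joints, 2) 2D keypoints for one frame.
        j = self.joint_embed(joints_2d)  # fine-scale joint tokens
        # Coarser tokens are built by pooling joint tokens over index groups.
        limbs = torch.stack([j[:, g].mean(dim=1) for g in limb_groups], dim=1)
        parts = torch.stack([j[:, g].mean(dim=1) for g in part_groups], dim=1)
        x = torch.cat([j, limbs, parts], dim=1) + self.pos_embed
        # Self-attention lets every scale attend to every other scale.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Keep only the refined joint tokens as the initial pose features.
        return x[:, : j.shape[1]]

# Toy usage: 17 COCO-style joints with hypothetical limb/half-body groupings.
model = SpatialMultiScaleFusion()
limb_groups = [[5, 7, 9], [6, 8, 10], [11, 13, 15], [12, 14, 16], [0, 1, 2, 3, 4]]
part_groups = [list(range(0, 11)), list(range(11, 17))]
feats = model(torch.randn(8, 17, 2), limb_groups, part_groups)
print(feats.shape)  # torch.Size([8, 17, 64])
```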
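For the temporal stage, one plausible reading of "temporal multi-scale encoding" followed by "temporal decoding" is a set of strided temporal convolutions over the per-frame pose feature sequence whose outputs are fused and decoded into refined 3D joint positions. The sketch below follows that reading; the strides, fusion scheme, and decoder are assumptions, not the authors' design.

```python
# Hypothetical sketch of temporal multi-scale encoding/decoding: the initial
# pose feature sequence is encoded at several temporal strides, fused, and
# decoded into a smooth 3D pose per frame. All hyperparameters are assumed.
import torch
import torch.nn as nn

class TemporalMultiScaleEncoderDecoder(nn.Module):
    def __init__(self, dim=64, strides=(1, 3, 9)):
        super().__init__()
        # One temporal conv branch per scale; larger stride = coarser scale.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=s, stride=s) for s in strides
        )
        self.fuse = nn.Conv1d(dim * len(strides), dim, kernel_size=1)
        # Decoder regresses a 3D position per joint token for every frame.
        self.decode = nn.Linear(dim, 3)

    def forward(self, x):
        # x: (B, T, J, C) initial pose feature sequence from spatial fusion.
        B, T, J, C = x.shape
        seq = x.mean(dim=2).transpose(1, 2)  # (B, C, T) per-frame feature
        multi = []
        for branch in self.branches:
            h = branch(seq)  # encode at a coarser temporal scale
            # Upsample every scale back to T frames before fusion.
            multi.append(nn.functional.interpolate(h, size=T, mode="linear"))
        fused = self.fuse(torch.cat(multi, dim=1)).transpose(1, 2)  # (B, T, C)
        # Broadcast fused temporal context onto each joint token, then decode.
        refined = x + fused[:, :, None, :]
        return self.decode(refined)  # (B, T, J, 3) refined 3D poses

poses = TemporalMultiScaleEncoderDecoder()(torch.randn(2, 27, 17, 64))
print(poses.shape)  # torch.Size([2, 27, 17, 3])
```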