Spatial-Temporal Feature Reinforcement Learning for 3D Human Pose and Shape Estimation
Abstract: To address insufficient spatial-temporal modeling, complex local dependencies, and weak estimation robustness in three-dimensional human pose and shape estimation from single-view video, a spatial-temporal feature reinforcement learning method for 3D human pose and shape estimation is proposed. First, a global spatial-temporal feature reinforcement module is constructed to extract static features from the input video sequence and to perform global correlation modeling and global temporal feature fusion on two sub-sequences that both contain the intermediate frame, yielding the fused temporal features. Second, a spatial-temporal dual-branch encoder composed of graph convolution and a self-attention mechanism is designed to model the local dependencies of the human body for local spatial-temporal feature reinforcement learning, producing a refined three-dimensional pose. Finally, a global-local spatial-temporal feature fusion method based on a dual-attention mechanism is proposed to fuse the temporal, pose, and shape features and obtain the final estimated three-dimensional human mesh. Experimental results on the Human3.6M dataset show that the proposed method achieves PA-MPJPE and MPJPE of 36.0 mm and 49.7 mm, respectively, which are 0.6 mm and 1.9 mm lower than the comparison methods; the method improves the accuracy of three-dimensional human pose and shape estimation and generates accurate and smooth three-dimensional human bodies. Test results on the 3DPW dataset and Internet videos further show that the method remains reasonably robust under challenges such as limb occlusion and varying backgrounds and scenes.
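To make the dual-branch encoder described above more concrete, the following is a minimal PyTorch-style sketch of one way such a structure could be assembled: a graph-convolution branch models spatial dependencies between body joints while a self-attention branch models temporal dependencies between frames, and the two are fused before regressing a refined 3D pose. All class names, tensor shapes, the identity adjacency placeholder, and the additive fusion are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' implementation): a spatial-temporal dual-branch
# encoder combining a graph-convolution branch over joints with a self-attention
# branch over frames. Module names, feature sizes, and the additive fusion are
# assumptions made for illustration only.
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """Graph convolution over joints: aggregate neighbors with adjacency A, then project."""

    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("adj", adjacency)          # (J, J) normalized adjacency
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                # x: (B, T, J, C)
        x = torch.einsum("ij,btjc->btic", self.adj, x)   # aggregate neighboring joints
        return self.linear(x)


class DualBranchEncoder(nn.Module):
    """Spatial branch (graph conv over joints) + temporal branch (self-attention over frames)."""

    def __init__(self, num_joints=17, feat_dim=64, num_heads=4):
        super().__init__()
        adjacency = torch.eye(num_joints)                # placeholder; a real skeleton graph is assumed
        self.spatial = GraphConv(feat_dim, feat_dim, adjacency)
        self.temporal = nn.MultiheadAttention(num_joints * feat_dim, num_heads, batch_first=True)
        self.refine = nn.Linear(feat_dim, 3)             # regress refined 3D coordinates per joint

    def forward(self, x):                                # x: (B, T, J, C) per-joint features
        b, t, j, c = x.shape
        spatial = self.spatial(x)                        # local spatial dependencies between joints
        seq = x.reshape(b, t, j * c)
        temporal, _ = self.temporal(seq, seq, seq)       # local temporal dependencies between frames
        fused = spatial + temporal.reshape(b, t, j, c)   # simple additive fusion of the two branches
        return self.refine(fused)                        # (B, T, J, 3) refined 3D pose


if __name__ == "__main__":
    frames = torch.randn(2, 9, 17, 64)                   # 2 clips, 9 frames, 17 joints, 64-d features
    print(DualBranchEncoder()(frames).shape)             # torch.Size([2, 9, 17, 3])
```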