A 3D Human Pose Estimation Method Based on Time-Frequency Interaction Fusion

  • Abstract: Transformer-based 3D human pose estimation methods waste substantial computation on redundant video frames and lose accuracy when the input 2D poses are unreliable. To address both problems, this paper proposes a 3D human pose estimation method based on time-frequency interaction fusion. First, for the spatial module of the network, a frequency-domain-enhanced spatial Transformer is designed, together with a frequency-domain multi-layer perceptron that processes frequency-domain features extracted by the discrete cosine transform. This design reduces the computational complexity of the network while using frequency-domain features to better capture the spatial dependencies among joints within a frame, which improves accuracy on noisy input. Second, for the temporal module, a temporal Transformer with time-frequency interaction fusion is designed. By interactively fusing time-domain and frequency-domain features, it reduces the computation spent on redundant frames, improving efficiency while better capturing complex changes in the sequence and making the model more robust. Finally, a deep convolutional regression module processes the output features of the spatial and temporal modules to map 2D human poses to 3D human poses accurately. On the Human3.6M dataset, compared with the mainstream 3D human pose estimation methods P-STMO and MHFormer, the proposed method reduces MFLOPs by 19% and 61% and MPJPE by 1.1% and 1.7%, respectively, striking a balance between computational cost and accuracy and demonstrating the feasibility of the method and its design.
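To illustrate why discrete cosine transform features can help with redundant frames and noisy 2D input, the following is a minimal sketch and not the authors' implementation. It assumes the DCT is taken along the frame axis of a single joint coordinate; the 27-frame window, the signal, the jitter amplitude, and the helper names `dct_ii`, `idct_ii`, and `low_pass` are all hypothetical choices made for this example. It keeps only the low-frequency coefficients and compares the error before and after.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence (plain Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_ii(c):
    """Inverse of dct_ii (the orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def low_pass(x, keep):
    """Zero every DCT coefficient with index >= keep, then reconstruct."""
    coeffs = dct_ii(x)
    coeffs = [v if k < keep else 0.0 for k, v in enumerate(coeffs)]
    return idct_ii(coeffs)


def mean_abs_err(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


# Hypothetical data: one joint's x-coordinate over 27 frames.
frames = 27
# Smooth motion, chosen here as a single low-frequency cosine.
clean = [math.cos(math.pi * (t + 0.5) * 2 / frames) for t in range(frames)]
# Frame-to-frame jitter standing in for unreliable 2D detections.
noisy = [v + 0.05 * (-1) ** t for t, v in enumerate(clean)]

# Keep 9 of 27 coefficients: a third of the original sequence length.
smoothed = low_pass(noisy, keep=9)

print("error before:", mean_abs_err(noisy, clean))
print("error after: ", mean_abs_err(smoothed, clean))
```

In this toy setting the smooth motion lives in the low-frequency coefficients while the jitter is concentrated in the high-frequency ones, so a shorter frequency representation both shrinks the sequence and suppresses noise. How the paper's modules actually select and fuse frequency features is not specified in the abstract.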

     
