A 3D Human Pose Estimation Method Based on Time-frequency Interaction Fusion
Graphical Abstract
Abstract
To address two problems in Transformer-based 3D human pose estimation — the resource waste caused by redundant video frames and the low accuracy obtained under unreliable 2D pose input — a 3D human pose estimation method based on time-frequency interaction fusion is proposed. First, a spatial module is proposed in the form of a frequency-domain-enhanced spatial Transformer, in which a frequency-domain multi-layer perceptron built on the discrete cosine transform (DCT) extracts frequency-domain features. This perceptron reduces the computational complexity of the network while using frequency-domain enhancement to capture the intra-frame spatial dependencies among joints, improving accuracy on noisy input data. Second, a temporal module is proposed in the form of a time-frequency interactive fusion temporal Transformer, which reduces the computational burden of redundant frames through the interactive fusion of time-domain and frequency-domain features; this not only improves efficiency but also better captures complex changes in the sequence, enhancing the robustness of the model. Finally, a deep convolutional regression module is proposed to process the output features of the spatial and temporal modules, achieving an accurate mapping from 2D human pose to 3D human pose. On the Human3.6M dataset, the proposed method is compared with the mainstream 3D human pose estimation methods P-STMO and MHFormer: MFLOPs are reduced by 19% and 61%, respectively, while MPJPE is reduced by 1.1% and 1.7%, achieving a balance between efficiency and accuracy and verifying the feasibility of the method and its design.
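The core idea of the frequency-domain multi-layer perceptron described above can be illustrated with a minimal sketch: features are moved into the frequency domain with a DCT along the joint axis, a lightweight learned per-frequency map is applied there, and an inverse DCT returns them to the joint domain. The function name, tensor shapes, and the simple element-wise affine map below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.fft import dct, idct

def freq_domain_mlp(x, w, b):
    """Hypothetical frequency-domain MLP sketch.
    x: (frames, joints, channels) per-frame 2D-pose features.
    The DCT along the joint axis moves features to the frequency
    domain, where a cheap per-frequency affine map (w, b) acts,
    then the inverse DCT returns to the joint domain."""
    xf = dct(x, axis=1, norm="ortho")       # joints -> frequency bins
    yf = xf * w + b                          # learned per-frequency map
    return idct(yf, axis=1, norm="ortho")   # back to the joint domain

rng = np.random.default_rng(0)
frames, joints, channels = 9, 17, 32         # Human3.6M uses 17 joints
x = rng.standard_normal((frames, joints, channels))

# Identity weights for the demo: DCT followed by inverse DCT (both
# with norm="ortho") reconstructs the input exactly.
w = np.ones((1, joints, 1))
b = np.zeros((1, joints, 1))
y = freq_domain_mlp(x, w, b)
print(np.allclose(x, y))                     # True
```

Because the per-frequency map is element-wise, its cost grows linearly in the number of frequency bins, which is one way such a design can reduce computation relative to full spatial attention.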