Spatiotemporal Neural Network for Video-Based Pose Estimation
Abstract: When applied to video sequences, existing 2D human pose estimation methods suffer from degraded accuracy and temporal discontinuity caused by unstable image quality. To mitigate this problem, a spatiotemporal information perception network, STNet, is proposed. STNet first uses convolution modules to extract a 2D spatial heatmap of the human joints in each frame, and then uses recurrent convolution modules to encode the temporal information between the heatmaps of consecutive frames. This decoupled learning of spatial and temporal information improves both the temporal coherence and the spatial accuracy of the estimated poses, and reduces the difficulty of extracting spatiotemporal features from video sequences. The ConvGRU structure in the recurrent convolution module effectively reduces the computational cost of the model while maintaining estimation accuracy. Experiments are conducted on two 2D joint datasets, Penn Action and Sub-JHMDB, with comparisons against existing methods. The results show that STNet achieves a better trade-off between prediction accuracy and computational complexity, and is thus of greater practical value.
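To make the decoupled design described above concrete, the following is a minimal PyTorch sketch: a per-frame convolutional backbone produces joint heatmaps, and a ConvGRU cell then encodes temporal information across consecutive frames. The class names (STNetSketch, ConvGRUCell), the two-layer backbone, and all layer sizes are illustrative assumptions, since the abstract does not specify the paper's actual architecture or hyperparameters; the gating follows the standard convolutional GRU formulation.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Standard convolutional GRU cell: gates are computed with
    convolutions, so the hidden state keeps its spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)
        self.hid_ch = hid_ch

    def forward(self, x, h):
        if h is None:  # zero-initialize the hidden state on the first frame
            h = torch.zeros(x.size(0), self.hid_ch, x.size(2), x.size(3),
                            device=x.device)
        zr = torch.sigmoid(self.gates(torch.cat([x, h], dim=1)))
        z, r = zr.chunk(2, dim=1)          # update and reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde   # gated update of the hidden state

class STNetSketch(nn.Module):
    """Hypothetical decoupled spatiotemporal estimator: a per-frame CNN
    produces joint heatmaps, then a ConvGRU refines them across time."""
    def __init__(self, n_joints=13, feat_ch=32):
        super().__init__()
        # Assumed lightweight spatial backbone; the paper's actual
        # backbone is not given in the abstract.
        self.spatial = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, n_joints, 3, padding=1),
        )
        self.temporal = ConvGRUCell(n_joints, n_joints)

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        h, outs = None, []
        for t in range(frames.size(1)):
            heatmap = self.spatial(frames[:, t])  # per-frame 2D heatmaps
            h = self.temporal(heatmap, h)         # temporal encoding
            outs.append(h)
        return torch.stack(outs, dim=1)           # (B, T, J, H, W)

# Usage: Penn Action annotates 13 joints per person.
model = STNetSketch(n_joints=13)
out = model(torch.randn(2, 5, 3, 64, 64))
print(out.shape)  # torch.Size([2, 5, 13, 64, 64])
```

Running the spatial backbone independently per frame and reserving all temporal modeling for the ConvGRU mirrors the decoupling strategy in the abstract: the recurrent module only has to smooth and refine heatmaps over time rather than learn joint spatiotemporal features from raw video.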