
多尺度门控时空增强的唇语识别方法

Multi-scale Gated Spatio-temporal Enhancement for Lip Recognition

  • Abstract: Ordinary convolutions in lip recognition models are not robust to lip deformation and extract temporal information poorly, which limits recognition accuracy. To address both problems, this paper proposes a lip recognition method that combines a spatio-temporal enhancement network with a multi-scale temporal convolutional network (MSTCN). First, an hourglass convolution block (FCB) is designed to make the network more robust to lip deformation. Second, a gate-shift-fuse (GSF) module is used to improve the front-end network's ability to extract temporal information. Third, FCB and GSF are combined into STABNet, a spatio-temporal augmented block network that mixes 3D and 2D convolutions. Finally, the lip recognition model is built with STABNet as the front-end network and MSTCN as the back-end network. On the LRW dataset, the proposed method reaches 89.45% accuracy, 4.15 percentage points above the baseline model, while the parameter count increases by only 3.17 M. On the GRID dataset it reaches 97.45% accuracy, surpassing most of the compared models.
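The multi-scale idea behind the back-end network can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a single-channel sequence of per-frame features, three parallel temporal branches with kernel sizes 3, 5 and 7 (an assumed choice), and uniform averaging weights as a stand-in for learned weights. The function names `temporal_conv_same` and `multi_scale_temporal_conv` are hypothetical.

```python
def temporal_conv_same(seq, kernel):
    """Sliding weighted sum over time, zero-padded so the output
    has the same length as the input (odd kernel sizes only)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(seq))
    ]


def multi_scale_temporal_conv(seq, kernel_sizes=(3, 5, 7)):
    """Run one temporal branch per kernel size and concatenate the
    branch outputs at every time step (assumed fusion scheme)."""
    branches = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # placeholder for learned weights
        branches.append(temporal_conv_same(seq, kernel))
    # One feature vector per frame, one entry per temporal scale.
    return [[branch[t] for branch in branches] for t in range(len(seq))]


# Toy input: a single "event" in the middle of five frames.
features = [0.0, 0.0, 1.0, 0.0, 0.0]
out = multi_scale_temporal_conv(features)
```

Each frame ends up with one value per temporal scale, so short kernels respond only near the event while longer kernels spread it over more frames. A real model would apply this per channel with learned weights and nonlinearities.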

     
