Abstract:
Ordinary convolution is not robust to lip deformation and cannot effectively extract temporal information, which limits the accuracy of lip recognition. To address these problems, this paper proposes a lip recognition method based on a spatio-temporal enhancement network and a multi-scale temporal convolution network. First, an hourglass convolution block (FCB) is designed to make the network more robust to lip deformation. Second, a gate-shift-fuse (GSF) module is used to improve the front-end network's ability to extract temporal information. Third, building on FCB and GSF, a spatio-temporal augmented block network (STABNet) that mixes 3D and 2D convolutions is designed. Finally, the lip recognition model is built with STABNet as the front-end network and a multi-scale temporal convolution network (MSTCN) as the back-end network. Experiments show that the proposed method reaches 89.45% accuracy on the LRW dataset, 4.15 percentage points higher than the baseline model, while the number of model parameters increases by only 3.17 M. On the GRID dataset, it achieves 97.45% accuracy, surpassing most existing models.