Zhao Ya, Feng Zunlei, Wang Huiqiong, Song Mingli. Context Correlation Distillation for Lip Reading[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(10): 1559-1566. DOI: 10.3724/SP.J.1089.2022.19723


Context Correlation Distillation for Lip Reading


Abstract: A cross-modal knowledge distillation method, C2KD (context correlation knowledge distillation), is proposed to address the problem that the performance of lip reading models is limited by the size of available datasets. C2KD distills multi-scale context correlation knowledge from a speech recognition model into a lip reading model. Firstly, the self-attention module of the Transformer model is used to obtain the context correlation knowledge. Secondly, a layer mapping strategy is used to decide which layers of the speech recognition model to extract knowledge from. Finally, an adaptive training strategy is used to dynamically transfer the speech recognition model's knowledge according to the lip reading model's performance. C2KD achieves strong results on the LRS2 and LRS3 datasets, with word error rates 2.0% and 2.7% lower than the baseline, respectively.
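The abstract names three components: context correlations taken from Transformer self-attention, a layer mapping between the speech recognition (teacher) model and the lip reading (student) model, and an adaptive weighting driven by the student's performance. As a rough illustration only, the PyTorch-style sketch below shows one way such a distillation term could be computed; the tensor shapes, the KL-divergence objective, the contents of layer_map, and the WER-gap weighting are assumptions made for illustration and are not taken from the paper.

    import torch
    import torch.nn.functional as F

    def context_correlation_loss(student_attns, teacher_attns, layer_map,
                                 student_wer, teacher_wer=0.10):
        """Hypothetical context-correlation distillation term.

        student_attns / teacher_attns: lists of per-layer self-attention maps,
            each of shape (B, H, T, T) and softmax-normalized over the last dim,
            serving as the "context correlation" between time steps.
        layer_map: dict {student_layer_index: teacher_layer_index}; the actual
            layer mapping strategy of the paper is not reproduced here.
        student_wer / teacher_wer: current word error rates, used for an
            assumed adaptive weighting of the distillation loss.
        """
        loss = torch.zeros((), device=student_attns[0].device)
        for s_idx, t_idx in layer_map.items():
            s = student_attns[s_idx]            # student context correlations
            t = teacher_attns[t_idx].detach()   # teacher correlations are fixed targets
            # KL divergence between the attention distributions over time steps.
            loss = loss + F.kl_div((s + 1e-8).log(), t, reduction="batchmean")
        # Assumed adaptive weighting: lean on the teacher while the student still
        # lags far behind it, and fade the term out as the gap closes.
        gap = max(student_wer - teacher_wer, 0.0)
        return gap * loss

    # Hypothetical usage inside a training step (attention maps collected from
    # each Transformer encoder layer, e.g. via forward hooks):
    # layer_map = {0: 0, 1: 2, 2: 4, 3: 5}   # assumed student-to-teacher mapping
    # kd_loss = context_correlation_loss(student_attns, teacher_attns, layer_map,
    #                                    student_wer=0.55)
    # total_loss = ce_loss + kd_loss

The mapping and weighting shown here are placeholders; the paper's layer mapping strategy and adaptive training schedule should be consulted for the actual formulation.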

     
