Global-Local Combined Semantic Generation Network for Video Captioning

Mao Lin (毛琳); Gao Hang (高航); Yang Dawei (杨大伟)

doi:10.3724/SP.J.1089.2023.19619

Abstract

Semantic features used in video captioning often fail to capture both global summary information and local detail, which degrades caption quality. To address this problem, a global-local combined semantic generation network for video captioning (GLS-Net) is proposed. First, exploiting the complementarity of global and local information, a global semantic extraction unit and a local semantic extraction unit are designed; both units use a residual multi-layer perceptron (r-MLP) to strengthen feature extraction. Second, the summarizing global semantics and the detailed local semantics are combined to strengthen the expressive power of the semantic features. Finally, the combined semantic features are used as the encoding of the video content to improve the performance of the captioning model. Experiments on the MSR-VTT and MSVD datasets, built on the semantics-assisted video captioning (SAVC) network, show that GLS-Net outperforms existing comparable algorithms and improves accuracy by 6.2% on average over the SAVC network.
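The abstract does not give the layer sizes, pooling operations, or fusion rule of GLS-Net, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the global unit mean-pools the frame features before an r-MLP, that the local unit applies an r-MLP to each frame and then max-pools, and that the two semantic vectors are combined by concatenation. All function names, dimensions, and the random weights are hypothetical.

```python
import random


def linear(x, weights, bias):
    # Dense layer: one output per weight row.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_layer(rng, n_in, n_out, scale=0.1):
    # Hypothetical initialisation: small uniform weights, zero bias.
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_rmlp(rng, dim, hidden):
    # Two layers (dim -> hidden -> dim) so the output can be added to the input.
    return make_layer(rng, dim, hidden), make_layer(rng, hidden, dim)


def residual_mlp(x, params):
    # r-MLP block: output = input + MLP(input).
    (w1, b1), (w2, b2) = params
    update = linear(relu(linear(x, w1, b1)), w2, b2)
    return [xi + ui for xi, ui in zip(x, update)]


def global_semantics(frames, params):
    # Assumption: summarise the whole clip by mean-pooling, then refine.
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return residual_mlp(mean, params)


def local_semantics(frames, params):
    # Assumption: refine each frame separately, then keep the strongest response.
    per_frame = [residual_mlp(f, params) for f in frames]
    return [max(p[i] for p in per_frame) for i in range(len(frames[0]))]


def combined_semantics(frames, global_params, local_params):
    # Assumption: combine by concatenation (the abstract does not name the rule).
    return global_semantics(frames, global_params) + local_semantics(frames, local_params)


rng = random.Random(0)
frames = [[rng.random() for _ in range(8)] for _ in range(5)]  # 5 frames, 8-dim features
g_params = make_rmlp(rng, 8, 16)
l_params = make_rmlp(rng, 8, 16)
semantic = combined_semantics(frames, g_params, l_params)
print(len(semantic))  # 16: an 8-dim global part followed by an 8-dim local part
```

In a captioning pipeline such as SAVC, a vector like `semantic` would stand in for the semantic encoding passed to the caption decoder; the residual connection keeps the input features intact when the learned update is small.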
