毛琳, 高航, 杨大伟. Global-Local Combined Semantic Generation Network for Video Captioning[J]. Journal of Computer-Aided Design & Computer Graphics.

Global-Local Combined Semantic Generation Network for Video Captioning

  • Abstract:
    To address the problem that semantic features in video captioning fail to capture both global summary information and local detail information, which degrades captioning quality, a global-local combined semantic generation network, GLS-Net, is proposed. Exploiting the complementarity of global and local information, the network designs a global semantic extraction unit and a local semantic extraction unit; both units adopt a residual multi-layer perceptron (r-MLP) structure to strengthen feature extraction. The network then combines summary-level global semantics with detail-level local semantics to enhance the expressive power of the semantic features, and uses these features as the video content encoding to improve captioning performance. In simulations on the MSR-VTT and MSVD datasets built on the semantics-assisted video captioning network (SAVC), GLS-Net outperforms existing comparable algorithms, improving the CIDEr accuracy metric by 6.2% on average over SAVC.
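The abstract names a residual multi-layer perceptron (r-MLP) as the core of both semantic extraction units. The paper's exact layer sizes, depth, and activations are not given here, so the following is only a minimal sketch of the general residual-MLP pattern (output = input + MLP(input)); the hidden width, single hidden layer, and ReLU activation are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ResidualMLP:
    """Sketch of a residual MLP block: y = x + W2 @ relu(W1 @ x).

    The hidden width and single-hidden-layer depth are assumptions
    for illustration, not the configuration used in the paper.
    """
    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights so the block starts near the identity map.
        self.w1 = rng.standard_normal((hidden, dim)) * 0.01
        self.w2 = rng.standard_normal((dim, hidden)) * 0.01

    def __call__(self, x):
        # The residual (skip) connection lets the block refine the
        # input features without discarding the original signal.
        return x + self.w2 @ relu(self.w1 @ x)

# Example: a feature vector passes through the block with its shape preserved.
feat = np.ones(8)
block = ResidualMLP(dim=8, hidden=16)
out = block(feat)
```

The residual connection is what distinguishes the r-MLP from a plain MLP: gradients can flow through the skip path, and the block only has to learn a correction to its input rather than a full re-encoding.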

     

