AGTNet: Asymmetric Dual-level Guidance and Gated Multimodal Transformer for Image Captioning
Abstract: In recent years, Transformer-based models have become the mainstream approach to image captioning owing to their powerful multimodal representation and parallel processing capabilities. However, most existing methods perform only simple interaction and fusion of multiple features in either the encoding or the decoding stage, failing to fully exploit the complementarity within modalities and the correlation between modalities, which limits the quality of the generated captions. To address this issue, we propose AGTNet, an image captioning method based on an asymmetric dual-level guidance and gated multimodal Transformer. In the encoding stage, an asymmetric dual-level guidance module exploits the complementary semantic information in two kinds of visual features to obtain a comprehensive feature representation, and a gated mask suppresses the semantic noise introduced by direct interaction between the two feature sources, enabling more fine-grained captions. In the decoding stage, a gated multimodal module preserves useful visual information while exploiting the correlation among multimodal features to measure the contribution of each visual feature to word generation, thereby promoting dynamic interaction between modalities and improving the accuracy of the generated captions. Experimental results on the MS-COCO dataset show that the proposed method effectively improves captioning quality: compared with the baseline model, it improves the word-matching metric BLEU-1 by 1.2 percentage points and the semantic-relevance metric CIDEr by 3.9 percentage points.
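The abstract describes the two gating mechanisms only at a high level and gives no formulas. Purely as a rough illustration of the general idea, the PyTorch-style sketch below shows (a) a sigmoid-gated cross-attention between two visual feature streams, in the spirit of the encoder-side gated mask, and (b) a relevance-weighted combination of several visual contexts conditioned on the word state, in the spirit of the decoder-side gated multimodal module. All class names, tensor shapes, and operations here are assumptions for illustration, not AGTNet's actual design.

```python
# A minimal, hypothetical sketch of the two gating ideas mentioned in the
# abstract; all class names, shapes, and operations are assumptions for
# illustration and are NOT taken from the AGTNet paper.
from typing import List

import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Encoder-side sketch: one visual stream attends to the other, and a
    sigmoid gate filters the cross-source context before it is added back,
    loosely in the spirit of a gated mask that suppresses semantic noise."""

    def __init__(self, d_model: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x, y: (B, N, D) -- e.g., region features and grid features.
        ctx, _ = self.cross_attn(query=x, key=y, value=y)
        g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))  # per-dimension gate
        return x + g * ctx


class GatedMultimodalWeighting(nn.Module):
    """Decoder-side sketch: weight several visual contexts by their relevance
    to the current language (word) state before fusing them."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(2 * d_model, 1)

    def forward(self, word_state: torch.Tensor, contexts: List[torch.Tensor]) -> torch.Tensor:
        # word_state: (B, T, D); each context in `contexts`: (B, T, D).
        scores = torch.stack(
            [self.score(torch.cat([word_state, c], dim=-1)) for c in contexts], dim=0
        )                                        # (S, B, T, 1)
        weights = torch.softmax(scores, dim=0)   # contribution of each visual source
        return (weights * torch.stack(contexts, dim=0)).sum(dim=0)  # (B, T, D)


if __name__ == "__main__":
    B, N, T, D = 2, 36, 12, 512
    regions, grids = torch.randn(B, N, D), torch.randn(B, N, D)
    words = torch.randn(B, T, D)
    ctx_a, ctx_b = torch.randn(B, T, D), torch.randn(B, T, D)

    fused_visual = GatedFusion(D)(regions, grids)                       # (B, N, D)
    fused_context = GatedMultimodalWeighting(D)(words, [ctx_a, ctx_b])  # (B, T, D)
    print(fused_visual.shape, fused_context.shape)
```

The sigmoid gate and the softmax weighting are only generic stand-ins for "filtering noisy cross-source interactions" and "measuring each visual feature's contribution to word generation"; the paper's actual modules may be structured quite differently.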