Zhang Sulan, Lian Ying, Hu Lihua, Zhang Jifu. AGTNet: Asymmetric Dual-level Guidance and Gated Multimodal Transformer for Image Captioning[J]. Journal of Computer-Aided Design & Computer Graphics. DOI: 10.3724/SP.J.1089.2024-00533

AGTNet: Asymmetric Dual-level Guidance and Gated Multimodal Transformer for Image Captioning

  • In recent years, Transformer-based models have become the mainstream approach to image captioning thanks to their powerful multimodal representation and parallel processing capabilities. In most cases, however, these frameworks perform the interaction and fusion of multiple features only in the encoding or the decoding stage, which fails to fully exploit the complementarity within modalities and the correlation between modalities, thereby limiting the quality of generated captions. To address this issue, we propose AGTNet, an asymmetric dual-level guidance and gated multimodal Transformer for image captioning. In the encoding stage, we design an asymmetric dual-level guidance module that leverages the complementary semantic information of two visual feature sources to represent the image more comprehensively; a gated mask suppresses the semantic noise that direct fusion of the two sources would introduce, enabling more fine-grained captions. In the decoding stage, a gated multimodal module preserves valuable visual information while using the correlation between multimodal features to measure the contribution of each visual feature to word generation, facilitating dynamic interaction between modalities and improving caption accuracy. Experimental results on the MS-COCO dataset demonstrate that the proposed method significantly improves caption quality, raising the word-matching metric BLEU-1 and the semantic-relevance metric CIDEr by 1.2 and 3.9 percentage points, respectively, over the baseline model.
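
To make the two gating ideas concrete, the sketch below shows one plausible PyTorch realization. It is a minimal illustration under stated assumptions, not the authors' implementation: the module names (GatedMaskFusion, GatedMultimodalDecoding), the choice of region and grid features as the two visual sources, the dimensions, and the sigmoid-gate formulas are all hypothetical.

```python
import torch
import torch.nn as nn


class GatedMaskFusion(nn.Module):
    # Encoding-stage sketch: one visual stream guides the other, and a
    # learned sigmoid gate masks noisy components instead of fusing the
    # two feature sources directly.
    def __init__(self, d_model: int):
        super().__init__()
        self.proj_a = nn.Linear(d_model, d_model)
        self.proj_b = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, tokens, d_model), assumed pre-aligned
        a, b = self.proj_a(feat_a), self.proj_b(feat_b)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b  # gated mask over the fused features


class GatedMultimodalDecoding(nn.Module):
    # Decoding-stage sketch: the partial caption attends to each visual
    # source separately; a gate conditioned on the text state weighs how
    # much each source contributes to generating the next word.
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(3 * d_model, d_model)

    def forward(self, text, vis_a, vis_b):
        ctx_a, _ = self.attn_a(text, vis_a, vis_a)
        ctx_b, _ = self.attn_b(text, vis_b, vis_b)
        g = torch.sigmoid(self.gate(torch.cat([text, ctx_a, ctx_b], dim=-1)))
        return text + g * ctx_a + (1.0 - g) * ctx_b


# Usage with hypothetical shapes: two visual sources and a caption prefix.
vis_a = torch.randn(2, 36, 512)  # e.g. region features
vis_b = torch.randn(2, 36, 512)  # e.g. grid features, aligned to the same length
text = torch.randn(2, 10, 512)   # embedded partial caption

fused = GatedMaskFusion(512)(vis_a, vis_b)             # encoding-stage fusion
out = GatedMultimodalDecoding(512)(text, vis_a, vis_b)  # decoding-stage gating
print(fused.shape, out.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 10, 512])
```

In both modules the complementary form g * a + (1 - g) * b makes the gate a soft mask: wherever one source is judged noisy or redundant, its contribution is attenuated rather than summed in wholesale, which matches the abstract's description of suppressing semantic noise and weighing each source's contribution per generated word.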
