Zhang Sulan, Lian Ying, Hu Lihua, Zhang Jifu. AGTNet: Asymmetric Dual-level Guidance and Gated Multimodal Transformer for Image Captioning[J]. Journal of Computer-Aided Design & Computer Graphics. DOI: 10.3724/SP.J.1089.2024-00533

AGTNet: Asymmetric Dual-level Guidance and Gated Multimodal Transformer for Image Captioning

  • In recent years, Transformer-based models have become the mainstream approach to image captioning thanks to their powerful multimodal representation and parallel processing capabilities. In most cases, however, these frameworks perform the interaction and fusion of multiple features only in the encoding or the decoding stage, which fails to fully exploit the complementarity within modalities and the correlation between modalities, thereby limiting the quality of generated captions. To address this issue, we propose AGTNet, an asymmetric dual-level guidance and gated multimodal Transformer for image captioning. In the encoding stage, we design an asymmetric dual-level guidance module that leverages the complementary semantic information of two visual feature sources to represent the image more comprehensively; a gated mask suppresses the semantic noise that direct fusion of the two sources would introduce, enabling more fine-grained captions. In the decoding stage, a gated multimodal module preserves valuable visual information while using the correlation between multimodal features to measure the contribution of each visual feature to word generation, facilitating dynamic interaction between modalities and improving caption accuracy. Experimental results on the MS-COCO dataset demonstrate that the proposed method significantly improves caption quality, raising the word-matching metric BLEU-1 and the semantic-relevance metric CIDEr by 1.2 and 3.9 percentage points, respectively, over the baseline model.
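
To make the two gating ideas concrete, the sketch below shows one plausible PyTorch realization. It is a minimal illustration under stated assumptions, not the authors' implementation: the module names (GatedMaskFusion, GatedMultimodalDecoding), the choice of region and grid features as the two visual sources, the dimensions, and the sigmoid-gate formulas are all hypothetical.

```python
import torch
import torch.nn as nn


class GatedMaskFusion(nn.Module):
    # Encoding-stage sketch: one visual stream guides the other, and a
    # learned sigmoid gate masks noisy components instead of fusing the
    # two feature sources directly.
    def __init__(self, d_model: int):
        super().__init__()
        self.proj_a = nn.Linear(d_model, d_model)
        self.proj_b = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (batch, tokens, d_model), assumed pre-aligned
        a, b = self.proj_a(feat_a), self.proj_b(feat_b)
        g = torch.sigmoid(self.gate(torch.cat([a, b], dim=-1)))
        return g * a + (1.0 - g) * b  # gated mask over the fused features


class GatedMultimodalDecoding(nn.Module):
    # Decoding-stage sketch: the partial caption attends to each visual
    # source separately; a gate conditioned on the text state weighs how
    # much each source contributes to generating the next word.
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(3 * d_model, d_model)

    def forward(self, text, vis_a, vis_b):
        ctx_a, _ = self.attn_a(text, vis_a, vis_a)
        ctx_b, _ = self.attn_b(text, vis_b, vis_b)
        g = torch.sigmoid(self.gate(torch.cat([text, ctx_a, ctx_b], dim=-1)))
        return text + g * ctx_a + (1.0 - g) * ctx_b


# Usage with hypothetical shapes: two visual sources and a caption prefix.
vis_a = torch.randn(2, 36, 512)  # e.g. region features
vis_b = torch.randn(2, 36, 512)  # e.g. grid features, aligned to the same length
text = torch.randn(2, 10, 512)   # embedded partial caption

fused = GatedMaskFusion(512)(vis_a, vis_b)             # encoding-stage fusion
out = GatedMultimodalDecoding(512)(text, vis_a, vis_b)  # decoding-stage gating
print(fused.shape, out.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 10, 512])
```

In both modules the complementary form g * a + (1 - g) * b makes the gate a soft mask: wherever one source is judged noisy or redundant, its contribution is attenuated rather than summed in wholesale, which matches the abstract's description of suppressing semantic noise and weighing each source's contribution per generated word.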
