Iterative Matching with Text Generation for Cross-Modal Image-Text Retrieval
Abstract
Cross-modal image-text retrieval suffers from modal heterogeneity: images and texts have different feature representations, and traditional public embedding space methods struggle to measure image-text similarity. To address this, a cross-modal image-text retrieval framework based on text generation and iterative matching is proposed, consisting of a feature fusion module and a text generation module. The feature fusion module, which optimizes the local public embedding space, aligns images and texts repeatedly through iterative fusion, aggregates fine-grained information across iteration steps, and captures local association information between images and texts. The text generation module adopts the idea of feature transformation: it maps features in the image modality to sentence features in the text modality, enhancing the overall semantic correlation between images and texts through cross-modal information interaction. To improve retrieval performance, the text generation module not only optimizes the global public embedding space but also mines deeper semantic information from images and texts. Compared with existing models, experiments on the Flickr30K and COCO datasets show that the overall performance of the framework improves by 0.7% and 1.2%, respectively. On the recall metric for text retrieval, the gain reaches up to 3.4%; on the recall metric for image retrieval, up to 4.6%. Ablation experiments further verify the effectiveness of the feature fusion module and the text generation module.
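The sketch below illustrates, at a high level, the two-module idea described above: an iterative fusion step that repeatedly aligns region and word features and aggregates the local similarities, and a text generation (feature transformation) step that maps image features into the sentence-feature space for a global similarity. It is a minimal, hypothetical illustration only; all class names, feature dimensions, the number of fusion steps, and the choice of a GRU cell for query refinement are assumptions and do not reproduce the authors' implementation.

```python
# Minimal, hypothetical sketch of the two-module framework; names and
# dimensions are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeFusion(nn.Module):
    """Aligns region and word features over several fusion steps and
    aggregates the fine-grained (local) similarity from each step."""

    def __init__(self, dim=1024, steps=3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(dim, dim)  # refines the text query across steps

    def forward(self, regions, words):
        # regions: (n_regions, dim), words: (n_words, dim) for one image-text pair
        query = words.mean(dim=0)                      # initial global text query
        step_sims = []
        for _ in range(self.steps):
            attn = F.softmax(regions @ query, dim=0)   # attend over image regions
            attended = attn @ regions                  # (dim,) attended visual context
            step_sims.append(F.cosine_similarity(attended, query, dim=0))
            # refine the query with the attended visual context for the next step
            query = self.update(attended.unsqueeze(0), query.unsqueeze(0)).squeeze(0)
        # aggregate local association information across all iteration steps
        return torch.stack(step_sims).mean()


class TextGeneration(nn.Module):
    """Maps image features into the sentence-feature space (feature
    transformation) so a global image-text correlation can be optimized."""

    def __init__(self, dim=1024):
        super().__init__()
        self.img_to_sent = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, regions, sentence):
        # regions: (n_regions, dim), sentence: (dim,) global sentence feature
        generated = self.img_to_sent(regions.mean(dim=0))  # "generated" sentence feature
        return F.cosine_similarity(generated, sentence, dim=0)


if __name__ == "__main__":
    regions = torch.randn(36, 1024)   # e.g. region features from an object detector
    words = torch.randn(12, 1024)     # word features from a text encoder
    local_sim = IterativeFusion()(regions, words)
    global_sim = TextGeneration()(regions, words.mean(dim=0))
    print(float(local_sim), float(global_sim))
```

In this reading, the two similarities would be combined (and trained with a ranking loss) so that the local iterative-matching score and the global generated-sentence score jointly drive retrieval.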