Iterative Matching with Text Generation for Cross-Modal Image-Text Retrieval
Abstract
Cross-modal image-text retrieval suffers from modal heterogeneity: images and texts have different feature representations, and traditional public embedding space methods struggle to measure image-text similarity. To address this, a cross-modal image-text retrieval framework based on text generation and iterative matching is proposed, consisting of a feature fusion module and a text generation module. The feature fusion module, which optimizes the local public embedding space, aligns images and texts repeatedly through iterative fusion, aggregates fine-grained information across iteration steps, and captures local association information between images and texts. The text generation module adopts the idea of feature transformation: it maps features in the image modality to sentence features in the text modality, enhancing the overall semantic correlation between images and texts through cross-modal information interaction. To improve retrieval performance, the text generation module not only optimizes the global public embedding space but also mines deeper semantic information from images and texts. Compared with existing models, experiments on the Flickr30K and COCO datasets show that the overall performance of the framework improves by 0.7% and 1.2%, respectively. On the recall metric for text retrieval, the gain reaches up to 3.4%; on the recall metric for image retrieval, up to 4.6%. Ablation experiments further verify the effectiveness of the feature fusion module and the text generation module.
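The sketch below illustrates, at a high level, the two-module idea described above: an iterative fusion step that repeatedly aligns region and word features and aggregates the local similarities, and a text generation (feature transformation) step that maps image features into the sentence-feature space for a global similarity. It is a minimal, hypothetical illustration only; all class names, feature dimensions, the number of fusion steps, and the choice of a GRU cell for query refinement are assumptions and do not reproduce the authors' implementation.

```python
# Minimal, hypothetical sketch of the two-module framework; names and
# dimensions are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IterativeFusion(nn.Module):
    """Aligns region and word features over several fusion steps and
    aggregates the fine-grained (local) similarity from each step."""

    def __init__(self, dim=1024, steps=3):
        super().__init__()
        self.steps = steps
        self.update = nn.GRUCell(dim, dim)  # refines the text query across steps

    def forward(self, regions, words):
        # regions: (n_regions, dim), words: (n_words, dim) for one image-text pair
        query = words.mean(dim=0)                      # initial global text query
        step_sims = []
        for _ in range(self.steps):
            attn = F.softmax(regions @ query, dim=0)   # attend over image regions
            attended = attn @ regions                  # (dim,) attended visual context
            step_sims.append(F.cosine_similarity(attended, query, dim=0))
            # refine the query with the attended visual context for the next step
            query = self.update(attended.unsqueeze(0), query.unsqueeze(0)).squeeze(0)
        # aggregate local association information across all iteration steps
        return torch.stack(step_sims).mean()


class TextGeneration(nn.Module):
    """Maps image features into the sentence-feature space (feature
    transformation) so a global image-text correlation can be optimized."""

    def __init__(self, dim=1024):
        super().__init__()
        self.img_to_sent = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, regions, sentence):
        # regions: (n_regions, dim), sentence: (dim,) global sentence feature
        generated = self.img_to_sent(regions.mean(dim=0))  # "generated" sentence feature
        return F.cosine_similarity(generated, sentence, dim=0)


if __name__ == "__main__":
    regions = torch.randn(36, 1024)   # e.g. region features from an object detector
    words = torch.randn(12, 1024)     # word features from a text encoder
    local_sim = IterativeFusion()(regions, words)
    global_sim = TextGeneration()(regions, words.mean(dim=0))
    print(float(local_sim), float(global_sim))
```

In this reading, the two similarities would be combined (and trained with a ranking loss) so that the local iterative-matching score and the global generated-sentence score jointly drive retrieval.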