Iterative Matching with Text Generation for Cross-Modal Image-Text Retrieval
Abstract: Cross-modal image-text retrieval suffers from modal heterogeneity because images and texts have different feature representations, and traditional common-space methods struggle to measure image-text similarity. To address this, a cross-modal image-text retrieval framework based on text generation and iterative matching is proposed, comprising a feature fusion module and a text generation module. The feature fusion module aligns images and texts repeatedly through iterative fusion, aggregating fine-grained information at each iteration step and capturing local correlations between images and texts, thereby optimizing the local common embedding space. The text generation module adopts a feature-transformation approach, mapping features from the image modality to sentence features in the text modality; through image-text information interaction it strengthens the overall semantic correlation between images and texts, optimizes the global common embedding space, and mines deeper semantic information, improving the performance of the cross-modal image-text retrieval model. Experiments on the Flickr30K and COCO datasets, compared against existing models, show that the framework improves overall performance by 0.7% on Flickr30K and 1.2% on COCO. On recall metrics, it improves text retrieval by up to 3.4% and image retrieval by up to 4.6%. Ablation experiments further confirm the effectiveness of both the feature fusion module and the text generation module.
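As a rough illustration of the two modules summarized above, the PyTorch sketch below pairs an iterative fusion block (repeated cross-modal alignment between image region features and word features) with a text generation head (mapping pooled image features to a sentence-level feature). All names (IterativeFusion, TextGenerationHead), dimensions, and design choices (cross-attention with a GRU state update, an MLP generation head) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of iterative fusion + text-feature generation (hypothetical).
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeFusion(nn.Module):
    """Aligns image region features with word features over several
    iteration steps, aggregating fine-grained cross-modal cues each step."""
    def __init__(self, dim: int = 512, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, D) image region features; words: (B, W, D) word features.
        query = regions
        for _ in range(self.steps):
            # Cross-attend regions to words to collect locally aligned context.
            context, _ = self.attn(query, words, words)
            b, r, d = query.shape
            # Fuse the attended context back into the region states.
            query = self.update(context.reshape(b * r, d),
                                query.reshape(b * r, d)).reshape(b, r, d)
        # Pool to one fused vector per image-text pair (local embedding).
        return query.mean(dim=1)

class TextGenerationHead(nn.Module):
    """Maps a pooled image feature to a sentence-level feature so the generated
    text feature can be matched against the real sentence feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image_global: torch.Tensor) -> torch.Tensor:
        # image_global: (B, D) -> predicted sentence feature: (B, D).
        return F.normalize(self.mlp(image_global), dim=-1)
```

In this reading, the fused local embedding would drive a local matching score, while cosine similarity between the generated sentence feature and the encoded ground-truth sentence feature would supply the global semantic constraint; how the two scores are weighted is left open here.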