Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval

Yang Chen; Liu Libo

doi:10.3724/SP.J.1089.2022.18990

Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval

Yang Chen, Liu Libo

Abstract

Most existing image-text cross-modal retrieval methods map each sample to a single point feature, which relates the sample to only one specific point in the semantic space and cannot characterize its relation to all points in that space. As a result, the semantic complexity of samples and the local similarities between samples are not expressed adequately. To address this problem, a probability distribution representation learning method for image-text cross-modal retrieval is proposed. First, the label information of the samples is used to learn their salient semantic features; on this basis, following the variational information bottleneck idea, a semantic space is constructed for each modality and a semantic distribution is learned for each sample, so that the probability density of the distribution at each point of the space directly reflects the correlation between that point and the given sample. Next, a hinge triplet loss is introduced to align the semantic distributions of samples from different modalities, which ensures that similar image-text pairs have similar semantic distributions. Finally, the semantic distributions are used as sample features, and the Bhattacharyya distance between distributions is used to measure the semantic similarity between samples, which improves the ability to model the semantic complexity of samples and to express local similarities between samples. Experiments on the Wikipedia and Pascal Sentence datasets, comparing against 9 existing methods, show that the proposed method outperforms all of them. Averaged over the image-to-text and text-to-image retrieval tasks, the mAP improves over the compared methods by 15.0 percentage points on Wikipedia and by 13.6 percentage points on Pascal Sentence.
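To make the retrieval-side steps of the abstract concrete, the following is a minimal sketch of how a Bhattacharyya distance between two sample distributions and a hinge triplet loss built on it could be computed. It is not the authors' implementation. It assumes that each sample distribution is a Gaussian with diagonal covariance, which the abstract does not state; the function names, the `(mean, variance)` tuple layout, and the margin value of 0.2 are all hypothetical choices made for illustration.

```python
import math


def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians.

    ASSUMPTION: each sample is represented by a Gaussian with diagonal
    covariance, given as a list of means and a list of variances.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v_avg = 0.5 * (v1 + v2)                      # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v_avg       # mean-separation term
        dist += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))  # variance-mismatch term
    return dist


def hinge_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss over Bhattacharyya distances.

    Each argument is a (mean, variance) tuple. The margin value is a
    hypothetical default, not a setting taken from the paper.
    """
    d_pos = bhattacharyya_distance(*anchor, *positive)
    d_neg = bhattacharyya_distance(*anchor, *negative)
    return max(0.0, margin + d_pos - d_neg)


# Toy two-dimensional example with made-up numbers.
image = ([0.0, 1.0], [1.0, 1.0])
matching_text = ([0.1, 0.9], [1.0, 1.0])
unrelated_text = ([3.0, -2.0], [1.0, 1.0])

loss_ok = hinge_triplet_loss(image, matching_text, unrelated_text)
loss_bad = hinge_triplet_loss(image, unrelated_text, matching_text)
print(loss_ok, loss_bad)
```

In this toy example the loss is zero when the matching text is already closer to the image than the unrelated text by more than the margin, and positive when the roles are swapped. At retrieval time, candidates from the other modality would be ranked by increasing distance to the query distribution.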
