Yang Chen, Liu Libo. Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(5): 751-759. DOI: 10.3724/SP.J.1089.2022.18990

Probability Distribution Representation Learning for Image-Text Cross-Modal Retrieval

In image-text cross-modal retrieval, most existing methods map each sample to a point representation, which ties the sample to a single point in semantic space and cannot reflect its correlations with all the other points; as a result, the semantic complexity of the samples and the local similarities between them are not fully exploited. To address these issues, a probability distribution representation learning approach for image-text cross-modal retrieval is presented. First, the method incorporates rich label information to capture the salient features of the samples, guiding the model to construct semantic spaces for the different modalities and, based on the variational information bottleneck, to learn a distribution for each sample. The probability density of the learned distribution naturally reflects the correlations between the corresponding sample and all points in the semantic space. Second, a hinge triplet loss is introduced to align the distributions of samples from different modalities at the semantic level, so that matching image-text pairs have similar distributions. Finally, the learned distributions are used to represent the samples, and the Bhattacharyya distance between distributions is used to measure sample similarity, which models semantic complexity and exploits local similarities. Experiments on Wikipedia and Pascal Sentence show that the proposed method outperforms all 9 compared methods: averaged over the image-to-text and text-to-image retrieval tasks, mAP improves by 15.0 percentage points on Wikipedia and 13.6 percentage points on Pascal Sentence. A minimal sketch of the two quantities named above is given after this paragraph.
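To make the two loss ingredients concrete, the sketch below implements the closed-form Bhattacharyya distance between Gaussians and a hinge triplet loss over that distance. This is an illustration, not the paper's code: the diagonal-Gaussian parameterization (mean plus log-variance, as is common for variational information bottleneck models), the use of PyTorch, the margin value, and all function names are assumptions.

```python
import torch

def bhattacharyya_distance(mu1, logvar1, mu2, logvar2):
    """Bhattacharyya distance between diagonal Gaussians N(mu1, S1), N(mu2, S2).

    Closed form for Gaussians, with S = (S1 + S2) / 2:
        D_B = 1/8 (mu1 - mu2)^T S^{-1} (mu1 - mu2)
              + 1/2 ln( det S / sqrt(det S1 * det S2) )
    Inputs are (..., d) tensors; variances are passed as log-variances.
    """
    var1, var2 = logvar1.exp(), logvar2.exp()
    s = 0.5 * (var1 + var2)  # averaged (diagonal) covariance
    term1 = 0.125 * ((mu1 - mu2) ** 2 / s).sum(-1)
    term2 = 0.5 * (torch.log(s) - 0.5 * (logvar1 + logvar2)).sum(-1)
    return term1 + term2

def hinge_triplet_loss(anchor, pos, neg, margin=0.2):
    """Hinge triplet loss on distribution distances: pull the distributions of a
    matching image-text pair together, push a mismatched pair's apart.
    Each argument is a (mu, logvar) tuple; margin is an illustrative value.
    """
    d_pos = bhattacharyya_distance(*anchor, *pos)
    d_neg = bhattacharyya_distance(*anchor, *neg)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```

With batched tensors of shape (B, d) for each mean and log-variance, the loss is computed directly, e.g. `hinge_triplet_loss((mu_img, lv_img), (mu_txt_pos, lv_txt_pos), (mu_txt_neg, lv_txt_neg))`; how negatives are selected (random vs. hardest-in-batch) is not specified by the abstract.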