Clothing Hierarchical Feature Representation and Association Learning for Cross-Modal Fashion Retrieval

姜爱萍; 刘骊; 付晓东; 刘利军; 彭玮

doi:10.3724/SP.J.1089.2023-00263

Abstract

Images and texts of fashion clothing are typically matched from a single perspective, carry fine-grained clothing information, and are only weakly associated across modalities, which makes image-text matching in cross-modal fashion retrieval inaccurate. To address this, a clothing hierarchical feature representation and association learning method for cross-modal fashion retrieval is proposed. First, taking paired clothing images and texts together with their labels as input, a clothing hierarchical feature representation module builds hierarchical visual and textual representations: global, style, and structure features for the clothing image, and description, subject, and label features for the clothing text. Then, hierarchical association computation based on cross attention and vector similarity yields three levels of initial relations for each clothing image-text pair, and hierarchical association learning that combines relation reasoning and aggregation produces the final three-level relations: global with description, style with subject, and structure with label. Finally, the association scores of the three levels are computed and the image-text matching result for the clothing is output. Experiments on the cross-modal fashion retrieval benchmark dataset Fashion-gen show that the proposed method improves retrieval accuracy: compared with the baseline methods in the paper, top-1 recall (R@1) in the two retrieval directions increases by 10.26 and 14.22 percentage points, respectively.
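The R@1 figures quoted in the abstract can be made concrete with a minimal sketch of bidirectional Recall@K. This is not the paper's evaluation code: it assumes a square image-by-text similarity matrix in which image i is paired with text i, and the names `recall_at_k`, `bidirectional_recall`, and `toy_sim` are hypothetical, as are the toy scores. The candidate sampling protocol used on Fashion-gen is not stated in the abstract and is not modelled here.

```python
from typing import List, Sequence, Tuple


def recall_at_k(sim: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose true match ranks within the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the true match: candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


def transpose(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap the query and candidate roles (images <-> texts)."""
    return [list(col) for col in zip(*sim)]


def bidirectional_recall(sim: Sequence[Sequence[float]], k: int = 1) -> Tuple[float, float]:
    """Return (image-to-text, text-to-image) Recall@K, both in percent."""
    i2t = 100.0 * recall_at_k(sim, k)
    t2i = 100.0 * recall_at_k(transpose(sim), k)
    return i2t, t2i


# Toy image-by-text similarity scores (made up, not taken from the paper).
toy_sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.5, 0.6],
]
i2t_r1, t2i_r1 = bidirectional_recall(toy_sim, k=1)
print(f"image-to-text R@1 = {i2t_r1:.2f}%, text-to-image R@1 = {t2i_r1:.2f}%")
```

An improvement stated in percentage points is the plain difference between two such percentages, for example a (hypothetical) rise from 30.00% to 40.26% is a gain of 10.26 percentage points.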
