Abstract:
Cross-modal fashion retrieval often produces inaccurate image-text matching because clothing information is fine-grained, the association between the two modalities is weak, and matching is usually performed from a single perspective. To address these problems, a hierarchical clothing feature representation learning method for cross-modal fashion retrieval is proposed. First, taking paired clothing images, texts, and labels as input, a hierarchical clothing feature representation module composed of a CNN, Faster R-CNN, a cascaded pyramid network, and a Bi-GRU extracts the global, style, and structure features of the clothing image and the description, subject, and label features of the clothing text. Then, cross-attention, correlation calculation, graph reasoning, and relation fusion are combined to perform association learning on the three feature levels: global-description, style-subject, and structure-label. Finally, the matching score is computed through hierarchical association and fusion to obtain the image-text matching result. Experimental results on the cross-modal fashion retrieval benchmark dataset Fashion-Gen show that the proposed method improves retrieval accuracy: compared with the latest baseline method, the top-1 recall (R@1) in the two retrieval directions increases by 10.26% and 14.22%, respectively.
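To make the hierarchical matching idea concrete, the following is a minimal sketch (not the paper's implementation) of how per-level image-text similarities could be fused into a single matching score. The cosine-similarity scoring, the learnable fusion weights, the class name, and all tensor shapes are illustrative assumptions; the paper's actual association learning uses cross-attention, correlation calculation, graph reasoning, and relation fusion at each level.

```python
# Minimal sketch: fuse similarities from three corresponding feature levels
# (global-description, style-subject, structure-label) into one matching score.
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMatcher(nn.Module):
    """Combine level-wise image-text similarities into a single score."""
    def __init__(self, num_levels: int = 3):
        super().__init__()
        # Learnable weights for combining the per-level scores (assumed fusion scheme).
        self.level_weights = nn.Parameter(torch.ones(num_levels) / num_levels)

    def forward(self, image_feats, text_feats):
        # image_feats / text_feats: lists of (batch, dim) tensors, one per level,
        # e.g. [global, style, structure] vs. [description, subject, label].
        scores = [F.cosine_similarity(img, txt, dim=-1)        # (batch,) per level
                  for img, txt in zip(image_feats, text_feats)]
        scores = torch.stack(scores, dim=-1)                   # (batch, num_levels)
        weights = torch.softmax(self.level_weights, dim=0)     # normalized fusion weights
        return (scores * weights).sum(dim=-1)                  # (batch,) matching scores

# Toy usage: three feature levels, a batch of 4 image-text pairs.
if __name__ == "__main__":
    matcher = HierarchicalMatcher()
    image_feats = [torch.randn(4, 256) for _ in range(3)]
    text_feats = [torch.randn(4, 256) for _ in range(3)]
    print(matcher(image_feats, text_feats))  # one score per image-text pair
```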