Abstract:
Cross-modal retrieval takes data from one modality as a query and retrieves semantically relevant data from another modality. Most existing cross-modal retrieval methods are designed for scenarios in which all modalities are complete. In real-world applications, however, data with missing modalities is common, and these methods struggle to handle it effectively. In this paper, we propose a typical-concept-driven deep cross-modal retrieval model for missing-modality data. Specifically, we first design a multi-modal Transformer integrated with multi-modal pretraining networks. It fully captures the fine-grained cross-modal semantic interactions in modality-incomplete data, extracts multi-modal fusion semantics, constructs a cross-modal subspace, and simultaneously supervises the learning process that generates typical concepts. The typical concepts are then used as the keys and values of a cross-attention module that drives the training of the modality mapping network, so that the network adaptively preserves the implicit multi-modal semantic concepts of the query-modality data, generates cross-modal retrieval features, and fully retains the pre-extracted multi-modal fusion semantics. Experimental results on four benchmark cross-modal retrieval datasets, Wikipedia, Pascal-Sentence, NUS-WIDE, and XmediaNet, show that the proposed method outperforms existing state-of-the-art models, with average precision improvements of 1.7%, 5.1%, 1.6%, and 5.4%, respectively. The source code of our method is available at: https://gitee.com/MrSummer123/CPCMR.
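To make the concept-driven mapping step concrete, the following PyTorch sketch shows one way a query-modality feature could attend over a bank of typical concept embeddings used as cross-attention keys and values, producing a retrieval feature in the shared subspace. The class name, feature dimension, number of concepts, and attention configuration are illustrative assumptions and do not reflect the released implementation (see the repository linked above for the authors' code).

```python
import torch
import torch.nn as nn

class ConceptDrivenMapper(nn.Module):
    """Illustrative modality-mapping sketch: a query-modality feature attends
    over a learnable bank of typical concept embeddings via cross-attention,
    yielding a retrieval feature in a shared cross-modal subspace.
    All names and sizes here are assumptions for illustration."""

    def __init__(self, feat_dim=512, num_concepts=32, num_heads=8):
        super().__init__()
        # Learnable bank of "typical concept" embeddings (assumed shape)
        self.concepts = nn.Parameter(torch.randn(num_concepts, feat_dim))
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim,
                                          num_heads=num_heads,
                                          batch_first=True)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, query_feat):
        # query_feat: (B, feat_dim) feature extracted from the query modality
        q = query_feat.unsqueeze(1)                                  # (B, 1, D)
        kv = self.concepts.unsqueeze(0).expand(q.size(0), -1, -1)    # concepts as key/value
        fused, _ = self.attn(q, kv, kv)                              # cross-attention over concepts
        return self.proj(fused.squeeze(1))                           # retrieval feature in shared subspace

# Toy usage: map a batch of image-side features into the shared subspace
mapper = ConceptDrivenMapper()
img_feat = torch.randn(4, 512)
retrieval_feat = mapper(img_feat)
print(retrieval_feat.shape)  # torch.Size([4, 512])
```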