Advances in Theory and Technology of Cross-Media Intelligent Association Analysis and Semantic Understanding

于俊清; 王鑫; 况琨; 刘偲; 张新峰; 宋子恺

doi:10.3724/SP.J.1089.2023.19296

Advances in Theory and Technology of Cross-Media Intelligent Association Analysis

Abstract

This paper reviews recent research progress in the theory and technology of cross-media intelligent association analysis and semantic understanding. It covers unified representation of multi-modal data, knowledge-guided data fusion, cross-media association analysis, knowledge-graph-based cross-media representation, and multi-modal intelligent applications. Unified representation of multi-modal data is a prerequisite for analysis and inference over cross-media information: the semantic consistency among modalities is used to eliminate redundant information, and cross-modal conversion is used to obtain a unified representation and learn more comprehensive features. Cross-media association analysis builds on cross-modal association analysis and understanding techniques for image-language, video-language, and audio-video-language data, aiming to bridge the semantic gap among vision, audio, and language and to fully establish semantic associations between modalities. Knowledge-graph-based cross-media representation introduces cross-media knowledge graphs and is studied from three aspects: cross-media knowledge graph construction, cross-media knowledge graph embedding, and cross-media knowledge inference. It improves the reliability of cross-media data representation as well as the efficiency and accuracy of subsequent inference tasks. With the rapid development of cross-modal analysis, multi-modal intelligent applications have gained broader technical support. According to the domain knowledge these applications require, the paper selects representative cross-modal applications (multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, multi-modal recommendation, cross-modal intelligent inference, and cross-modal medical image prediction) and reviews their research progress in multi-modal data fusion and cross-media analysis and inference.
