Video Anomaly Detection based on Feature Enhancement and Modal Interaction

Abstract: The contrastive language-image pre-training (CLIP) model, a neural network trained by multimodal contrastive learning, extracts discriminative image features by pre-training on a large number of language-image pairs. To capture the temporal relationships between consecutive frames and to eliminate the discrepancies in information distribution between features of different modalities, we propose a video anomaly detection algorithm based on feature enhancement and modality interaction. First, to address the poor temporal dependency of the CLIP model when extracting features from consecutive video frames, we construct a temporal correlation enhancement module from local and global temporal adapters, which attend to temporal information at the local and global attention layers, respectively. Second, to address the domain-level information discrepancies between features of different modalities, we design a multimodal feature interaction module based on window partition shifting, which controls the interaction within features through a sliding window and thereby eliminates the discrepancies in information distribution. Finally, by aligning the visual features with the textual features, we obtain frame-level anomaly confidence scores. On the UCF-Crime dataset, the proposed algorithm achieves an accuracy of 87.20%, validating its effectiveness.
