Abstract:
The contrastive language-image pre-training (CLIP) model is a neural network trained by multimodal contrastive learning that extracts discriminative image features by pre-training on large collections of language-image pairs. To capture the temporal relationships between consecutive frames and to eliminate discrepancies in the information distributions of different modality features, we propose a video anomaly detection algorithm based on feature enhancement and modality interaction. First, to address the weak temporal dependency of the CLIP model when extracting features from consecutive video frames, we construct a temporal correlation enhancement module built from local and global temporal adapters, which attend to temporal information at the local and global attention layers, respectively. Second, to address the discrepancy in domain information between different modality features, we design a multimodal feature interaction module based on shifted window partitioning; this module controls internal feature interaction through a sliding window, eliminating discrepancies in information distribution. Finally, by aligning the visual and textual features, we obtain frame-level anomaly confidence scores. On the UCF-Crime dataset, the proposed algorithm achieves an accuracy of 87.20%, validating its effectiveness.
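The local and global temporal adapters described above can be sketched as residual attention layers over per-frame CLIP features. This is a minimal illustrative sketch, not the paper's implementation: the function names, the windowed-masking scheme, and the toy feature dimensions are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_temporal_adapter(feats):
    """Hypothetical global adapter: self-attention over all frames.

    feats: (T, D) array of per-frame features.
    """
    attn = softmax(feats @ feats.T / np.sqrt(feats.shape[1]), axis=-1)
    return feats + attn @ feats  # residual connection

def local_temporal_adapter(feats, window=3):
    """Hypothetical local adapter: attention restricted to a temporal window."""
    T, D = feats.shape
    scores = feats @ feats.T / np.sqrt(D)
    # Mask out frame pairs farther apart than the local window radius.
    dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    scores[dist > window // 2] = -1e9
    return feats + softmax(scores, axis=-1) @ feats

# Toy example: 8 consecutive frames with 16-dim features (real CLIP
# features would be 512-dim or larger).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
y = global_temporal_adapter(local_temporal_adapter(x))
print(y.shape)  # shape is preserved: (8, 16)
```

Because both adapters are residual and shape-preserving, they can be inserted between existing CLIP layers without altering the rest of the feature extraction pipeline.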