No-Reference Video Quality Assessment Based on Convolution and Self-Attention Combined Network and Spatial-Temporal Fusion
Abstract: As video sharing on the Internet becomes increasingly common, video quality assessment grows ever more important. In existing video quality assessment research, convolution is widely used to extract spatial and temporal features from video, but its inherently local receptive field leaves it unable to perceive global video information; moreover, existing methods do not fully fuse spatial and temporal features. To address these problems, this paper proposes a no-reference video quality assessment method that combines convolution with a self-attention mechanism and spatiotemporal fusion. To extract more effective spatial and temporal features, a spatial feature extraction module and a temporal feature extraction module are designed. The spatial module uses a feature perception and enhancement block that combines convolution and self-attention, enriching the local features extracted by convolution with global information captured through long-range dependencies. The temporal module jointly learns temporal features from the video frame sequence and the frame-difference sequence using a Video Swin Transformer and 3D convolution. A spatiotemporal feature fusion and enhancement module then fuses and enhances the spatial and temporal features through self-attention, and the video quality is finally predicted from the fused features.
Experimental results on the KoNVid-1k and YouTube-UGC datasets show that the proposed method outperforms the best of the compared video quality assessment methods, and ablation experiments verify the effectiveness of each proposed module.
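As a rough illustration of two operations the abstract names but does not detail, the sketch below shows (a) computing a frame-difference sequence, the auxiliary input of the temporal module, and (b) scaled dot-product self-attention over the concatenation of spatial and temporal tokens, as a minimal stand-in for the spatiotemporal feature fusion and enhancement module. All function names and shapes here are hypothetical assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def frame_differences(frames):
    """Frame-difference sequence: subtract each frame from its successor.

    frames: array of shape (T, H, W, C) -> returns (T-1, H, W, C).
    """
    return frames[1:] - frames[:-1]

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(spatial, temporal):
    """Fuse spatial (Ns, d) and temporal (Nt, d) feature tokens with
    scaled dot-product self-attention over their concatenation
    (a simplified sketch of attention-based spatiotemporal fusion).
    """
    tokens = np.concatenate([spatial, temporal], axis=0)   # (Ns + Nt, d)
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)                # pairwise similarities
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ tokens                                # (Ns + Nt, d) fused tokens
```

In a real model the tokens would pass through learned query/key/value projections and the fused tokens would feed a regression head that predicts the quality score; this sketch only shows how self-attention lets every spatial token attend to every temporal token and vice versa.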