No-reference video quality assessment based on a combined convolution and self-attention network and spatio-temporal fusion
Abstract: Video sharing on the Internet is increasingly common, so assessing video quality is increasingly important. In existing video quality assessment research, convolution is widely used to extract spatial and temporal features from video, but because of its inherently local perception it fails to capture global video information. Moreover, existing methods do not fully fuse spatial and temporal features. To address these problems, this paper proposes a no-reference video quality assessment method that combines convolution with self-attention and performs spatio-temporal fusion. To extract more effective spatial and temporal features, the method first designs a spatial feature extraction module and a temporal feature extraction module. The spatial feature extraction module uses a feature perception enhancement module that combines convolution and self-attention, enhancing the local features extracted by convolution with global information captured through long-range dependencies. The temporal feature extraction module jointly exploits the video frame sequence and the frame-difference sequence, learning temporal features with a video Swin Transformer and 3D convolution. A spatio-temporal feature fusion and enhancement module then fuses and enhances the temporal and spatial features through self-attention, and finally the video quality is predicted. Experimental results on the KoNViD-1k and YouTube-UGC datasets show that the proposed method outperforms existing video quality assessment methods, and ablation experiments verify the effectiveness of each proposed module.
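The overall pipeline described in the abstract can be summarized, at a very high level, in the PyTorch-style sketch below. It is only an illustration under assumed layer sizes and module choices: the self-attention operators are stood in by nn.MultiheadAttention, the video Swin Transformer of the temporal branch is replaced here by a plain 3D convolution, and the module names (SpatialBranch, TemporalBranch, SpatioTemporalVQA) and all hyperparameters are hypothetical rather than the paper's actual implementation.

# Minimal sketch of the described pipeline; layer sizes and modules are assumptions.
import torch
import torch.nn as nn


class SpatialBranch(nn.Module):
    """Local features via convolution, enhanced globally with self-attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        x = self.conv(frames.flatten(0, 1))    # (B*T, dim, H/4, W/4)
        tokens = x.flatten(2).transpose(1, 2)  # (B*T, H/4*W/4, dim)
        enhanced, _ = self.attn(tokens, tokens, tokens)   # long-range enhancement
        return enhanced.mean(dim=1).view(b, t, -1)        # per-frame spatial feature


class TemporalBranch(nn.Module):
    """Temporal features from frames and frame differences; a 3D convolution
    stands in for the video Swin Transformer used in the paper."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(6, dim, (3, 3, 3), padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        diffs = frames[:, 1:] - frames[:, :-1]
        diffs = torch.cat([diffs, diffs[:, -1:]], dim=1)   # pad back to length T
        x = torch.cat([frames, diffs], dim=2)  # (B, T, 6, H, W)
        x = x.permute(0, 2, 1, 3, 4)           # (B, 6, T, H, W)
        return self.conv3d(x).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, dim)


class SpatioTemporalVQA(nn.Module):
    """Fuse spatial and temporal features with attention, then regress a score."""
    def __init__(self, dim=64):
        super().__init__()
        self.spatial = SpatialBranch(dim)
        self.temporal = TemporalBranch(dim)
        self.fusion = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        s, t = self.spatial(frames), self.temporal(frames)
        fused, _ = self.fusion(s, t, t)        # spatial queries attend to temporal keys
        return self.head(fused.mean(dim=1)).squeeze(-1)    # predicted quality score


if __name__ == "__main__":
    video = torch.randn(2, 8, 3, 64, 64)       # batch of 2 clips, 8 frames each
    print(SpatioTemporalVQA()(video).shape)    # torch.Size([2])

In this sketch the fusion step uses the spatial features as attention queries over the temporal features; whether the paper fuses in this direction, the reverse, or symmetrically is not specified by the abstract.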