Citation: Yuzhen Niu, Zhenlong Wang, Bolin Zhang, Yuzhong Chen. No-reference video quality assessment based on convolution and self-attention combined network and spatial-temporal fusion[J]. Journal of Computer-Aided Design & Computer Graphics. DOI: 10.3724/SP.J.1089.2023-00808

No-reference video quality assessment based on convolution and self-attention combined network and spatial-temporal fusion

  • Sharing videos on the Internet has become increasingly common, and video quality assessment has therefore grown in importance. In existing video quality assessment research, convolution is widely used to extract features from the spatial and temporal domains of a video, but its inherently local receptive field neglects the video's global information. In addition, existing methods do not fully integrate spatial and temporal features. To solve these problems, this paper proposes a no-reference video quality assessment method based on a network that combines convolution with a self-attention mechanism and on spatiotemporal fusion. To extract more effective spatial and temporal features, the method first designs a spatial feature extraction module and a temporal feature extraction module. The spatial feature extraction module enhances the local features extracted by convolution with global, long-range dependencies through the proposed feature perception and enhancement module, which combines convolution and self-attention. The temporal feature extraction module jointly learns temporal features using a Video Swin Transformer and 3D convolution, operating on the video frame sequence and the frame-difference sequence. A spatiotemporal feature fusion and enhancement module then fuses and enhances the spatial and temporal features through the self-attention mechanism, and finally the video quality is predicted. Experimental results on the KoNViD-1k and YouTube-UGC datasets show that the proposed method outperforms existing video quality assessment methods, and ablation experiments verify the effectiveness of each proposed module.
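The abstract describes the architecture only at a high level. The PyTorch sketch below illustrates one plausible reading of the three modules; all class names, channel widths, attention-head counts, and the wiring between modules are assumptions made for illustration, not the authors' implementation. In particular, the Video Swin Transformer backbone is replaced here by a plain 3D convolution to keep the sketch self-contained.

import torch
import torch.nn as nn

class ConvSelfAttentionBlock(nn.Module):
    # Hypothetical "feature perception and enhancement" block: local
    # features from convolution are enhanced with global long-range
    # dependencies via multi-head self-attention.
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        local = self.conv(x)                              # (B, C, H, W)
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)         # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)       # global context
        tokens = self.norm(tokens + glob)                 # residual enhancement
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TemporalBranch(nn.Module):
    # Stand-in temporal module: 3D convolution over the frame sequence
    # and the frame-difference sequence. (The paper also uses a Video
    # Swin Transformer; a full Swin backbone is omitted here.)
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv3d = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, frames):
        # frames: (B, 3, T, H, W); differences expose motion cues
        diffs = frames[:, :, 1:] - frames[:, :, :-1]
        feat = self.pool(self.conv3d(frames)).flatten(1)
        feat_diff = self.pool(self.conv3d(diffs)).flatten(1)
        return torch.cat([feat, feat_diff], dim=1)        # (B, 2 * channels)

class SpatioTemporalFusion(nn.Module):
    # Hypothetical fusion-and-enhancement head: the spatial and temporal
    # descriptors are treated as two tokens, fused with self-attention,
    # then regressed to a single quality score.
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=2, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, spatial, temporal):
        tokens = torch.stack([spatial, temporal], dim=1)  # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.head(fused.mean(dim=1)).squeeze(-1)   # predicted score

# Usage sketch on a random 8-frame clip; all sizes are illustrative.
video = torch.rand(2, 3, 8, 64, 64)                       # (B, 3, T, H, W)
stem = nn.Conv2d(3, 64, 3, padding=1)                     # toy spatial stem
spatial = ConvSelfAttentionBlock(64)(stem(video[:, :, 0])).mean(dim=(2, 3))
temporal = TemporalBranch(32)(video)                      # (B, 64)
score = SpatioTemporalFusion(64)(spatial, temporal)       # (B,)

Treating the pooled spatial and temporal descriptors as attention tokens is one simple way to realize "fusion through self-attention"; the paper may fuse richer token sequences, which this sketch does not attempt to reproduce.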