Abstract:
Most existing image aesthetics assessment methods consider only the image itself and ignore the rich semantic information in user comments, which limits their performance. Some recent works have therefore used user comments to assist image aesthetics assessment. However, these methods neither fully exploit image features nor adequately model the complex relationship between image features and text features, so image information is underused and the interaction between the two modalities is only partially captured. To address these problems, this paper proposes a cross-modal image aesthetics assessment method that integrates scene features (CIAASF). Because the scene of an image usually affects how humans perceive its aesthetics, the method first extracts scene features and aesthetic features from the image and deeply fuses them with a multi-scale feature fusion module. Second, considering the intrinsic correlation between image features and text features, it applies a multi-head cross-attention mechanism between the two, so that image and text information interact and are fused. Finally, the fused cross-modal features are used for aesthetics assessment tasks. Extensive experiments on AVA, a large-scale general image aesthetics assessment dataset, show that the proposed method achieves 86.96%, 0.8523, and 0.8648 on the ACC, SRCC, and PLCC metrics, respectively, outperforming the cross-modal image aesthetics assessment methods compared in the paper.
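For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how ACC, SRCC, and PLCC are commonly computed from predicted and ground-truth mean aesthetic scores. It is not the paper's evaluation code. The function names, the example scores, and the binary threshold of 5.0 (a common convention for splitting AVA mean scores into high and low aesthetics) are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """PLCC: Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


def binary_accuracy(pred, target, threshold=5.0):
    """ACC: fraction of images placed on the same side of the threshold.

    The default threshold of 5.0 is an assumption, not taken from the paper.
    """
    hits = sum((p > threshold) == (t > threshold) for p, t in zip(pred, target))
    return hits / len(pred)


# Made-up scores for illustration only; these are not results from the paper.
predicted = [5.8, 4.2, 6.1, 5.1, 3.9]
ground_truth = [6.0, 4.5, 5.5, 4.9, 4.0]

acc = binary_accuracy(predicted, ground_truth)
srcc = spearman(predicted, ground_truth)
plcc = pearson(predicted, ground_truth)
```

SRCC depends only on the ordering of the scores, while PLCC also rewards a linear match in score values, which is why the two are usually reported together.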