Cross-modal Image Aesthetics Assessment with Scene Features
-
Graphical Abstract
-
Abstract
Image aesthetics assessment aims to simulate human perception and cognition of beauty computationally, so that computers can automatically evaluate the aesthetic quality of images. Images on social media are typically accompanied by user comments, yet most existing image aesthetics assessment methods consider only the images and ignore the comments, which limits their performance. Because user comments carry rich semantic information about the images, some recent works have used them to assist image aesthetics assessment. However, these methods neither fully exploit image features nor adequately model the complex relationship between image features and text features, so image information is underused and the interaction between image and text information is only partially modeled. To address these problems, this paper proposes a cross-modal image aesthetics assessment method that integrates scene features (CIAASF). Since the scene depicted in an image usually affects how humans judge its aesthetics, the method first extracts scene features and aesthetic features from the image and deeply fuses them with a multi-scale feature fusion module. Second, considering the intrinsic correlation between image features and text features, it applies a multi-head cross-attention mechanism between the two, so that image and text information interact and are fused. Finally, the fused cross-modal features are used for the aesthetics assessment tasks. Extensive experiments on AVA, a large general-purpose image aesthetics assessment dataset, show that the proposed CIAASF model outperforms state-of-the-art cross-modal image aesthetics assessment methods on both the classification and score prediction tasks.
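The cross-attention step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the CIAASF implementation: the function name `multi_head_cross_attention`, the token counts, the feature width, the number of heads, and the choice of image features as queries over text features are all assumptions made for illustration, and random matrices stand in for learned projection weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(queries, context, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product cross-attention: `queries` attend to `context`.

    queries: (n_q, d_model), e.g. image tokens (assumed role)
    context: (n_c, d_model), e.g. comment tokens (assumed role)
    Returns the fused features (n_q, d_model) and the attention
    weights (num_heads, n_q, n_c).
    """
    n_q, d_model = queries.shape
    n_c = context.shape[0]
    d_head = d_model // num_heads

    def split_heads(x, n):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(queries @ w_q, n_q)
    k = split_heads(context @ w_k, n_c)
    v = split_heads(context @ w_v, n_c)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n_q, n_c)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, n_q, d_head)

    # Concatenate the heads again and apply the output projection.
    merged = heads.transpose(1, 0, 2).reshape(n_q, d_model)
    return merged @ w_o, weights


# Hypothetical sizes: 49 image tokens, 20 text tokens, width 64, 4 heads.
rng = np.random.default_rng(0)
d_model, num_heads = 64, 4
image_tokens = rng.standard_normal((49, d_model))
text_tokens = rng.standard_normal((20, d_model))
# Random stand-ins for learned projection matrices (assumption).
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

fused, attn = multi_head_cross_attention(
    image_tokens, text_tokens, w_q, w_k, w_v, w_o, num_heads
)
```

Here each image token gathers information from the comment tokens it attends to most strongly; swapping the first two arguments would give the text-to-image direction instead.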
-