Cross-modal Image Aesthetics Assessment with Scene Features
Abstract: Image aesthetics assessment aims to simulate human perception and cognition of beauty computationally, enabling computers to evaluate the aesthetic quality of images automatically. Images on social media are typically accompanied by comments, but most existing image aesthetics assessment methods rely on the image alone and ignore the rich semantic information in user comments, which limits their performance. Some recent works use user comments to assist image aesthetics assessment. However, these methods neither fully exploit image features nor adequately model the complex relationship between image features and text features, so image and text information is underused and the interaction between the two modalities is only loosely modeled. To address these problems, this paper proposes a cross-modal image aesthetics assessment method that integrates scene features (CIAASF). Because the scene of an image usually affects how people judge its aesthetics, the method first extracts scene features and aesthetic features from the image and deeply fuses them with a multi-scale feature fusion module. Second, considering the intrinsic correlation between image features and text features, it applies a multi-head cross-attention mechanism between the image features and the text features so that the information of the two modalities interacts and is fused. Finally, the fused cross-modal features are used for aesthetics assessment. Extensive experiments on AVA, a widely used large-scale image aesthetics assessment dataset, show that CIAASF reaches 86.96% on ACC, 0.8523 on SRCC, and 0.8648 on PLCC, outperforming the state-of-the-art cross-modal image aesthetics assessment methods it is compared against on both the classification prediction and score prediction tasks.
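For readers unfamiliar with the evaluation metrics, the following is a minimal sketch of how classification accuracy (ACC), the Spearman rank correlation coefficient (SRCC), and the Pearson linear correlation coefficient (PLCC) could be computed from predicted and ground-truth mean aesthetic scores. It is not the authors' evaluation code. The function names, the toy scores, and the threshold of 5 used to split images into high and low aesthetic quality are assumptions made for illustration; the abstract does not state the threshold.

```python
import math


def accuracy(pred_scores, true_scores, threshold=5.0):
    """Binary accuracy: an image counts as 'high quality' when its score
    exceeds `threshold` (assumed value, not taken from the abstract)."""
    hits = sum((p > threshold) == (t > threshold)
               for p, t in zip(pred_scores, true_scores))
    return hits / len(true_scores)


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return plcc(average_ranks(x), average_ranks(y))


# Hypothetical scores on a 1-10 scale, for illustration only.
predicted = [5.8, 4.2, 6.1, 5.1, 3.9]
ground_truth = [6.0, 4.5, 5.5, 4.9, 4.0]

print("ACC :", accuracy(predicted, ground_truth))
print("SRCC:", srcc(predicted, ground_truth))
print("PLCC:", plcc(predicted, ground_truth))
```

ACC measures the classification prediction task, while SRCC and PLCC measure how well predicted scores preserve the ranking and the linear relationship of the ground-truth scores, which corresponds to the score prediction task.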