Abstract:
Blind Image Quality Assessment (BIQA) aims to simulate human perception of image distortion levels and to provide quality scores. However, existing unimodal BIQA methods have limited representational ability when facing complex content and distortion types, and their predicted scores lack explanatory descriptions, which further undermines the credibility of their predictions. To address these challenges, we propose an eXplainable Blind Image Quality Assessment (xBIQA) method guided by a Large Language Model (LLM). Our method leverages image distortion and overall descriptions to generate global quality text, while local quality text provides detailed descriptions of specific regions. These global texts, local texts, and prompts are then jointly fed into an LLM to generate detailed semantic features. Compared to traditional BIQA methods based on a single image modality, our approach demonstrates that LLMs can effectively produce text descriptions highly correlated with image quality, thereby enhancing the performance of BIQA models based on multimodal learning. We then align and fuse the textual semantic features with the image texture features, and regress them to obtain the image quality score while outputting the corresponding explanatory quality description. Experimental results show that xBIQA achieves the best performance on the KonIQ-10k and LIVE Challenge datasets, with improvements of 1.64% and 2.60% in SRCC, respectively.
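As a rough illustration of the alignment-fusion-regression step summarized above, the following minimal PyTorch sketch projects LLM text features and image texture features into a shared space, fuses them, and regresses a scalar quality score. All module names and feature dimensions here are hypothetical placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Illustrative cross-modal fusion and quality-regression head.

    Assumes precomputed text semantic features (from an LLM) and
    image texture features (from a visual backbone); dimensions
    are placeholders.
    """
    def __init__(self, text_dim=4096, image_dim=768, fused_dim=512):
        super().__init__()
        # Project both modalities into a shared space for alignment
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.image_proj = nn.Linear(image_dim, fused_dim)
        # Fuse the aligned features and regress a scalar quality score
        self.regressor = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, 1),
        )

    def forward(self, text_feat, image_feat):
        t = self.text_proj(text_feat)    # (B, fused_dim)
        v = self.image_proj(image_feat)  # (B, fused_dim)
        fused = torch.cat([t, v], dim=-1)
        return self.regressor(fused).squeeze(-1)  # (B,) quality scores

# Usage with random stand-in features for a batch of 4 images
scores = FusionRegressor()(torch.randn(4, 4096), torch.randn(4, 768))
```

Concatenation followed by an MLP is only one plausible fusion choice; the paper's own alignment and fusion mechanism may differ.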