Visual Analysis Method for Compositional Understanding in Vision-Language Models
Graphical Abstract
[Graphical abstract figure omitted.]
Abstract
Vision-language pre-trained models have shown robust cross-modal understanding across various benchmarks, yet their compositional understanding remains underexplored. Research in computer vision often focuses on quantitative metrics and model architectures, offering few methods for dynamically exploring cross-modal alignment. We propose CouLens, an interactive analysis method that elucidates the “bag-of-objects” behavior of vision-language models from a visualization perspective. CouLens optimizes the traditional grid layout to improve visual perception of cross-modal alignment on large-scale datasets, and it interprets the multi-head attention responses that arise during cross-modal semantic understanding. In our user study, 90% of participants found that, compared with methods relying solely on quantitative metrics, CouLens offers a more innovative and effective way to investigate modality gaps in CLIP.