Li Tong, Sun Guodao, Wang Haixia, Gao Haidong, Tan Xu, Liang Ronghua. Visual Analysis Method for Compositional Understanding in Vision-Language Models[J]. Journal of Computer-Aided Design & Computer Graphics. DOI: 10.3724/SP.J.1089.2025-00071

Visual Analysis Method for Compositional Understanding in Vision-Language Models

  • Vision-language pre-trained models have shown robust cross-modal understanding across various benchmarks, yet their “compositional understanding” ability still requires investigation. Computer vision research often focuses on quantitative metrics and model architectures and lacks effective methods for dynamically exploring cross-modal alignment. We propose CouLens, an interactive analysis method that uses visualization to elucidate the “bag-of-objects” patterns exhibited by vision-language models. CouLens optimizes the traditional grid layout to enhance visual perception of cross-modal alignment on large-scale datasets, and it interprets multi-head attention responses during cross-modal semantic understanding. Of the participants, 90% found that, compared to methods relying solely on data metrics, CouLens offers a more innovative and effective approach for investigating modality gaps in CLIP.