Visual Analysis Method for Compositional Understanding in Vision-Language Models

Abstract: Vision-language pre-trained models have demonstrated strong cross-modal understanding across many benchmarks, yet their "compositional understanding" remains underexplored. Research in computer vision typically focuses on quantitative metrics and model architectures, lacking effective means of dynamically exploring cross-modal alignment. We propose CouLens, an interactive analysis method that elucidates, from a visualization perspective, the "bag-of-objects" patterns by which vision-language models attend to isolated entities. CouLens optimizes the traditional grid layout to enhance visual perception of a model's cross-modal alignment on large-scale datasets, and it interprets the responses of the multi-head attention mechanism during cross-modal semantic understanding. 90% of participants reported that, compared with methods relying solely on data metrics, CouLens offers a more novel and effective way to investigate the modality gap in vision-language models such as CLIP.
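The "modality gap" the abstract refers to is commonly summarized as the distance between the centroids of L2-normalized image and text embeddings in a model's joint space. As a minimal illustrative sketch (not the CouLens method itself), the snippet below computes this scalar on synthetic embeddings standing in for CLIP outputs; the function name `modality_gap` and the cluster offsets are assumptions for demonstration only.

```python
import numpy as np

def modality_gap(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Distance between the centroids of L2-normalized image and text
    embeddings; a common scalar summary of the modality gap."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Synthetic stand-ins for CLIP embeddings: two clusters offset along one axis,
# mimicking how image and text embeddings occupy separate cones in practice.
rng = np.random.default_rng(0)
d = 512
img = rng.normal(size=(100, d)); img[:, 0] += 5.0   # "image" cluster
txt = rng.normal(size=(100, d)); txt[:, 0] -= 5.0   # "text" cluster
print(modality_gap(img, txt))
```

In a real analysis the synthetic arrays would be replaced by embeddings from a vision-language encoder; the gap is zero when the two modalities share a centroid and grows as they separate.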
