Abstract:
3D visual grounding is an important research direction in multimodal learning. It aims to interpret the semantic requirements expressed in natural language descriptions and to extract the corresponding objects' 3D information from the scene, and it has substantial application potential in fields that rely on semantic guidance, such as human-computer interaction and autonomous driving. Existing methods depend on costly high-precision equipment and accept only English input, which limits their promotion and application in the Chinese market. To address this, we introduce a Chinese-semantic-driven benchmark for 3D visual grounding. We first construct a Chinese benchmark dataset, Traffic3DRefer, comprising 5,148 images and 10,296 Chinese natural language descriptions; the farthest object in a scene is up to 175 meters away, and the dataset covers both single-object and multi-object scene types. We then present an efficient network architecture, Traffic3DVG. It first employs a multimodal encoder to extract textual features, object image information, and 3D geometric information. A spatial-aware fusion module then integrates the 2D image features with the 3D geometric information to learn discriminative representations. Finally, a multimodal feature fusion module performs multimodal integration, yielding robust multimodal representations for visual-text matching. Extensive experiments on the Traffic3DRefer dataset demonstrate that the proposed framework significantly improves F1 score, precision, and recall on low-cost hardware, effectively advancing the development and practical application of Chinese 3D visual grounding research.
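To make the described pipeline (per-object encoding, spatial-aware 2D/3D fusion, multimodal fusion, visual-text matching) concrete, the following is a minimal PyTorch-style sketch. All module names, dimensions, and design choices (SpatialAwareFusion, Traffic3DVGSketch, d_model, the 7-dimensional geometry vector, cross-attention fusion) are illustrative assumptions, not the authors' implementation of Traffic3DVG.

```python
# Minimal sketch of a grounding-by-matching pipeline as summarized in the abstract.
# Shapes, module names, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatialAwareFusion(nn.Module):
    """Fuse per-object 2D image features with 3D geometric features (assumed design)."""

    def __init__(self, d_model: int = 256, geo_dim: int = 7):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, d_model)  # e.g. (x, y, z, w, h, l, yaw)
        self.fuse = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, img_feat: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # img_feat: (N_obj, d_model), geo: (N_obj, geo_dim)
        return self.fuse(torch.cat([img_feat, self.geo_proj(geo)], dim=-1))


class Traffic3DVGSketch(nn.Module):
    """Score each candidate object against the text query (illustrative only)."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.spatial_fusion = SpatialAwareFusion(d_model)
        self.cross_fusion = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, text_feat, obj_img_feat, obj_geo):
        # text_feat: (L, d_model) token features from a text encoder
        # obj_img_feat: (N_obj, d_model), obj_geo: (N_obj, geo_dim)
        obj_feat = self.spatial_fusion(obj_img_feat, obj_geo)        # (N_obj, d_model)
        fused, _ = self.cross_fusion(obj_feat.unsqueeze(0),          # queries: objects
                                     text_feat.unsqueeze(0),         # keys: text tokens
                                     text_feat.unsqueeze(0))         # values: text tokens
        return self.score_head(fused.squeeze(0)).squeeze(-1)         # (N_obj,) matching logits
```

In this sketch, the object with the highest logit would be taken as the referred target; the actual Traffic3DVG matching objective and fusion details are described in the paper body.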