Abstract:
3D visual grounding is an important research direction in multimodal learning. It aims to interpret the semantic requirements expressed in natural language descriptions and to extract the corresponding objects' 3D information from the scene, and it has substantial application potential in fields that rely on semantic guidance, such as human-computer interaction and autonomous driving. Existing methods depend on costly high-precision equipment and accept only English input, which limits their promotion and application in the Chinese market. To address this, we introduce a Chinese-semantic-driven benchmark for 3D visual grounding. We first construct a Chinese benchmark dataset, Traffic3DRefer, comprising 5,148 images and 10,296 Chinese natural language descriptions; the farthest object in a scene is up to 175 meters away, and the dataset covers both single-object and multi-object scene types. We then present an efficient network architecture, Traffic3DVG. It first employs a multimodal encoder to extract textual features, object image information, and 3D geometric information. A spatial-aware fusion module then integrates the 2D image features with the 3D geometric information to learn discriminative representations. Finally, a multimodal feature fusion module performs multimodal integration, yielding robust multimodal representations for visual-text matching. Extensive experiments on the Traffic3DRefer dataset demonstrate that the proposed framework significantly improves F1 score, precision, and recall on low-cost hardware, effectively advancing the development and practical application of Chinese 3D visual grounding research.
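To make the described pipeline (per-object encoding, spatial-aware 2D/3D fusion, multimodal fusion, visual-text matching) concrete, the following is a minimal PyTorch-style sketch. All module names, dimensions, and design choices (SpatialAwareFusion, Traffic3DVGSketch, d_model, the 7-dimensional geometry vector, cross-attention fusion) are illustrative assumptions, not the authors' implementation of Traffic3DVG.

```python
# Minimal sketch of a grounding-by-matching pipeline as summarized in the abstract.
# Shapes, module names, and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatialAwareFusion(nn.Module):
    """Fuse per-object 2D image features with 3D geometric features (assumed design)."""

    def __init__(self, d_model: int = 256, geo_dim: int = 7):
        super().__init__()
        self.geo_proj = nn.Linear(geo_dim, d_model)  # e.g. (x, y, z, w, h, l, yaw)
        self.fuse = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, img_feat: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # img_feat: (N_obj, d_model), geo: (N_obj, geo_dim)
        return self.fuse(torch.cat([img_feat, self.geo_proj(geo)], dim=-1))


class Traffic3DVGSketch(nn.Module):
    """Score each candidate object against the text query (illustrative only)."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.spatial_fusion = SpatialAwareFusion(d_model)
        self.cross_fusion = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, text_feat, obj_img_feat, obj_geo):
        # text_feat: (L, d_model) token features from a text encoder
        # obj_img_feat: (N_obj, d_model), obj_geo: (N_obj, geo_dim)
        obj_feat = self.spatial_fusion(obj_img_feat, obj_geo)        # (N_obj, d_model)
        fused, _ = self.cross_fusion(obj_feat.unsqueeze(0),          # queries: objects
                                     text_feat.unsqueeze(0),         # keys: text tokens
                                     text_feat.unsqueeze(0))         # values: text tokens
        return self.score_head(fused.squeeze(0)).squeeze(-1)         # (N_obj,) matching logits
```

In this sketch, the object with the highest logit would be taken as the referred target; the actual Traffic3DVG matching objective and fusion details are described in the paper body.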