Hierarchical 3D Scene Graph Generation via Multimodal Foundation Models

Abstract: 3D scene graph generation is a core challenge for embodied AI agents seeking autonomous environmental interaction. While existing methods can model geometric structures and basic inter-object relationships, they remain limited in higher-level semantic understanding, such as functional semantic representation, inference of interactive attributes, and contextual relationship modeling. To address these issues, we propose a hierarchical 3D scene graph generation method driven by multimodal foundation models. First, an incremental multimodal perception network is constructed, and a fusion strategy with joint geometric-semantic constraints is applied to achieve 3D reconstruction that is both geometrically and semantically consistent. Then, a vision-language model is leveraged to enhance the contextual semantics of object labels, generating composite descriptors that integrate functional attributes and spatial constraints. Finally, a large language model is guided to infer and predict multi-type semantic relations; these high-level semantic relations, each involving multiple objects, are formalized as hyperedge structures, systematically constructing a semantic hypergraph that couples geometric, functional, and interactive dimensions. Experimental results on the Replica dataset show that the proposed approach outperforms mainstream baselines. Ablation studies further show that incorporating high-level semantic relations improves the task planning success rate by 26.6 percentage points. The proposed method effectively constructs both geometric and multi-dimensional semantic representations of 3D scenes, providing richer and more accurate reasoning information than existing methods for agent cognitive decision-making in complex environments.
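To make the hyperedge formalization concrete, the sketch below shows one minimal way a semantic hypergraph over scene objects could be represented: object nodes carry a label, a 3D centroid, and a composite descriptor, while each hyperedge names one high-level relation spanning several objects. All class and field names here are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    # Object node: geometry from reconstruction plus a VLM-style descriptor.
    obj_id: int
    label: str
    centroid: tuple        # (x, y, z) position in the reconstructed scene
    descriptor: str = ""   # functional attributes + spatial context

@dataclass
class HyperEdge:
    # One high-level semantic relation over two or more objects.
    relation: str
    members: tuple         # obj_ids participating in the relation

class SemanticHypergraph:
    def __init__(self):
        self.objects = {}      # obj_id -> SceneObject
        self.hyperedges = []   # list of HyperEdge

    def add_object(self, obj):
        self.objects[obj.obj_id] = obj

    def add_hyperedge(self, relation, member_ids):
        # Only connect objects that already exist in the graph.
        assert all(i in self.objects for i in member_ids)
        self.hyperedges.append(HyperEdge(relation, tuple(member_ids)))

    def relations_of(self, obj_id):
        # All high-level relations this object participates in.
        return [e.relation for e in self.hyperedges if obj_id in e.members]

# Toy scene: a dining setup expressed as a single ternary hyperedge,
# something a pairwise scene graph cannot capture with one edge.
g = SemanticHypergraph()
g.add_object(SceneObject(0, "table", (1.0, 0.0, 0.4)))
g.add_object(SceneObject(1, "chair", (1.2, 0.5, 0.4)))
g.add_object(SceneObject(2, "plate", (1.0, 0.1, 0.8)))
g.add_hyperedge("dining_setup", [0, 1, 2])
print(g.relations_of(2))  # ['dining_setup']
```

The point of the hyperedge is that the three-way "dining_setup" relation is stored once, rather than being flattened into three pairwise edges that lose the grouping.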

