Abstract:
3D scene graph generation is a core challenge for embodied AI agents seeking autonomous environmental interaction. While existing methods can model geometric structure and basic inter-object relationships, they remain limited in higher-level semantic comprehension, such as functional semantic representation, inference of interactive attributes, and contextual relationship modeling. To address these issues, we propose a hierarchical 3D scene graph generation method based on multimodal foundation models. First, we construct an incremental multimodal perception network and apply a fusion strategy with joint geometric-semantic constraints to achieve 3D reconstruction with geometric and semantic consistency. Then, a vision-language model enriches the contextual semantics of object labels, generating composite descriptors that integrate functional attributes and spatial constraints. Finally, a large language model is guided to infer and predict multiple types of semantic relations; these high-level, multi-object semantic interactions are formalized as hyperedges, yielding a semantic hypergraph that couples geometric, functional, and interactive dimensions. Experimental results on the Replica dataset demonstrate the feasibility of our approach, which surpasses mainstream baselines. Ablation studies further show that incorporating high-level semantic relations improves the task planning success rate by 26.6 percentage points. The proposed method effectively constructs both geometric and multi-dimensional semantic representations of 3D scenes, providing richer and more accurate reasoning information for agent cognitive decision-making in complex environments than existing methods.