A Framework for Procedural Generation and Evaluation of 3D Assets Driven by Large Language Models
Abstract: Automated generation of 3D content matters for virtual reality, games, and film production. However, generating high-quality 3D objects from natural language instructions efficiently and controllably remains a challenge. This paper proposes a framework, driven by a large language model (LLM), for the procedural generation and evaluation of 3D assets. We first design an LLM-based procedural generation framework that converts a user's natural language description into script instructions for the Infinigen library and automatically generates 3D models that satisfy the semantic requirements, achieving a cross-modal conversion from text to code and then to 3D objects. We then propose a screening and optimization mechanism that uses the CLIP model to score the generated results and automatically filters the generated 3D models. Ablation experiments assess the impact of each module on generation quality, and the CLIP Score metric is used to quantitatively analyze the semantic alignment and visual quality of the generated results. Experimental results show that, compared with a baseline without CLIP screening, the proposed method significantly improves the quality of the generated 3D models and their consistency with the input text. The method achieves higher CLIP matching scores on 3D object generation tasks across multiple categories. This work demonstrates the feasibility of combining large models with procedural 3D generation techniques and provides an effective path toward cross-modal, controllable generation of 3D content.
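The screening step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that image and text embeddings have already been produced by a CLIP encoder (toy vectors stand in for them here), and it uses the common CLIPScore form w * max(cos, 0) with w = 2.5, which the abstract does not confirm. The function names `clip_score` and `select_best` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_emb: Sequence[float], text_emb: Sequence[float],
               w: float = 2.5) -> float:
    """Assumed CLIPScore form: w * max(cos(image, text), 0)."""
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


def select_best(candidates: Dict[str, Sequence[float]],
                text_emb: Sequence[float]) -> Tuple[str, float]:
    """Score every rendered candidate and return the best (name, score)."""
    scored = {name: clip_score(emb, text_emb) for name, emb in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Toy embeddings; in practice these would come from a CLIP image/text encoder
# applied to renders of each generated asset and to the user's prompt.
text_emb = [1.0, 0.0, 0.0]
candidates = {
    "candidate_a": [0.0, 1.0, 0.0],   # orthogonal to the prompt embedding
    "candidate_b": [1.0, 0.0, 0.0],   # aligned with the prompt embedding
    "candidate_c": [-1.0, 0.0, 0.0],  # opposite direction, clipped to 0
}
best_name, best_score = select_best(candidates, text_emb)
```

In this toy setting the aligned candidate is kept and the others would be discarded; a real pipeline would render each Infinigen output to images before scoring.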