Abstract:
Text-to-image generation aims to synthesize images that are semantically precise and spatially coherent from natural-language descriptions. However, existing diffusion models often struggle with long prompts or complex scenes, as limited semantic parsing and spatial modeling make it difficult to maintain layout consistency and fine-grained accuracy. To overcome this bottleneck, we propose a framework that integrates a multimodal large language model (MLLM) with diffusion models and incorporates Chain-of-Thought (CoT) and hierarchical reasoning, enabling layered optimization from high-level semantic understanding to fine-grained spatial reasoning. Specifically, the MLLM parses long-text inputs and derives hierarchical layouts via CoT reasoning. In addition, we introduce a modular blob scene representation that explicitly models object counts, relative positions, and global layout consistency, achieving highly accurate multi-object layouts. During synthesis, we adopt a two-stage diffusion strategy that first recovers coarse-grained structure and semantic content and then refines pixel-level details, ensuring spatially consistent, high-fidelity, and layout-controllable generation. Extensive experiments on complex multi-object long-text benchmarks, covering indoor and outdoor environments, over 50 common object categories, prompts exceeding 130 words, and six or more entities, show that our method achieves a Prompt Alignment Recall (PAR) of 90% and significantly outperforms mainstream text-to-image models, validating our framework's effectiveness in complex semantic understanding and structured image synthesis.