Abstract:
Text-to-image generation aims to synthesize images that are semantically precise and spatially coherent from natural-language descriptions. However, existing diffusion models often struggle with long prompts or complex scenes, as limited semantic parsing and spatial modeling make it difficult to maintain layout consistency and fine-grained accuracy. To overcome this bottleneck, we propose a framework that integrates a multimodal large language model (MLLM) with diffusion models and incorporates Chain-of-Thought (CoT) and hierarchical reasoning, enabling layered optimization from high-level semantic understanding to fine-grained spatial reasoning. Specifically, the MLLM parses long-text inputs and derives hierarchical layouts via CoT reasoning. In addition, we introduce a modular blob scene representation that explicitly models object counts, relative positions, and global layout consistency, achieving highly accurate multi-object layouts. During synthesis, we adopt a two-stage diffusion strategy that first recovers coarse-grained structure and semantic content and then refines pixel-level details, ensuring spatially consistent, high-fidelity, and layout-controllable generation. Extensive experiments on complex multi-object long-text benchmarks, covering indoor and outdoor environments, over 50 common object categories, prompts exceeding 130 words, and six or more entities, show that our method achieves a Prompt Alignment Recall (PAR) of 90% and significantly outperforms mainstream text-to-image models, validating our framework's effectiveness in complex semantic understanding and structured image synthesis.