3D-Aware Image Synthesis via Local Spatial Self-Attention


    Abstract: The Neural Radiance Field (NeRF), endowed with the implicit representation capability to model 3D scenes as continuous functions, enables generative models to perform 3D-aware image synthesis on unstructured 2D image collections. Nevertheless, current approaches face challenges in image synthesis, including low fidelity, poor consistency, and uncontrollability, stemming from their failure to capture the three-dimensional nature of the real world or their reliance on insufficiently expressive network architectures. To address these limitations, we introduce a local spatial neural representation method grounded in a self-attention mechanism. This approach augments the NeRF generator, enabling controllable, high-quality 3D-aware image synthesis. First, the scene to be synthesized is decomposed into objects by incorporating each object's pose information and a predefined 3D bounding box. Next, the self-attention local representation module models each object separately, and a defined composition operator combines them into the complete scene representation. Finally, the rendering module processes this representation to produce the final RGB image. Experimental results show that the proposed method accomplishes high-quality 3D-aware image synthesis with multi-view consistency and controllability, even under low-frequency sampling of spatial points. Furthermore, in synthesis experiments at 64×64 and 256×256 pixel resolutions on the highly complex CompCars public dataset, the proposed method improves FID over the baseline by 4.69 and 2.51, respectively. Similarly, on the Churches public dataset, FID improves by 1.93 and 3.56, respectively, further highlighting the effectiveness of the method.
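    The pipeline described above — per-object fields combined by a composition operator, then volume-rendered — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses the standard density-weighted composition of per-object fields and classic NeRF volume rendering along a ray; the function names and the numpy formulation are assumptions for exposition, and the paper's self-attention local representation module (which would produce `densities` and `features`) is omitted.

    ```python
    import numpy as np

    def compose_fields(densities, features):
        """Density-weighted composition of K per-object fields (a common
        choice of composition operator; the paper may define its own).
        densities: (K, N) volume density of each object at N sample points
        features:  (K, N, C) feature/color of each object at those points
        Returns combined density (N,) and combined feature (N, C)."""
        sigma = densities.sum(axis=0)                  # total density per point
        w = densities / np.maximum(sigma, 1e-8)        # per-object mixing weights
        feat = np.einsum('kn,knc->nc', w, features)    # density-weighted average
        return sigma, feat

    def volume_render(sigma, feat, deltas):
        """Classic NeRF volume rendering along a single ray.
        sigma: (N,) composed density, feat: (N, C), deltas: (N,) step sizes."""
        alpha = 1.0 - np.exp(-sigma * deltas)          # per-segment opacity
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
        weights = alpha * trans                        # contribution of each sample
        return (weights[:, None] * feat).sum(axis=0)   # rendered feature/color
    ```

    With low-frequency sampling (small N along each ray), the quality of the composed per-point features matters more, which is where a more expressive local representation module pays off.
    
    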
