Abstract:
The Neural Radiance Field (NeRF), with its implicit capability to represent 3D scenes as continuous functions, enables generative models to perform 3D-aware image synthesis from unstructured 2D image sets. Nevertheless, current approaches face challenges including low fidelity, poor multi-view consistency, and limited controllability, stemming from their inability to fully capture the three-dimensional structure of real scenes or from insufficiently expressive network architectures. To address these limitations, we introduce a local spatial neural representation method based on a self-attention mechanism. This approach augments the NeRF generator, enabling controllable, high-quality 3D-aware image synthesis. First, the scene to be synthesized is decomposed into objects using each object's pose and a predefined 3D bounding box. Next, the self-attention local representation module models each object separately, and a combination operator merges the object representations into a complete scene representation. Finally, the rendering module produces the resulting RGB image. Experimental results show that the proposed method achieves high-quality 3D-aware image synthesis with multi-view consistency and controllability, even under sparse (low-frequency) spatial point sampling. Furthermore, in synthesis tests at resolutions of 64×64 and 256×256 pixels on the highly complex CompCars public dataset, the proposed method improves FID scores over the baseline by 4.69 and 2.51, respectively. Similarly, on the Churches public dataset, FID scores improve by 1.93 and 3.56, respectively, further demonstrating the effectiveness of the method.
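To make the pipeline described above concrete, the following is a minimal sketch, not the authors' implementation: each object inside its 3D bounding box is modelled by a local representation with self-attention, the per-object densities and colours are merged by a simple density-weighted combination operator (one plausible choice; the paper's exact operator may differ), and a volume renderer would then integrate the composed field along rays. All module names, feature dimensions, and the composition rule are illustrative assumptions.

```python
# Minimal sketch of a compositional, self-attention-based object representation.
# Assumed design throughout; pose/bounding-box transforms and ray rendering are omitted.
import torch
import torch.nn as nn


class LocalObjectRepresentation(nn.Module):
    """Self-attention over points sampled inside one object's 3D bounding box,
    followed by heads predicting density and RGB (hypothetical architecture)."""

    def __init__(self, in_dim: int = 3, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(in_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.density_head = nn.Linear(feat_dim, 1)
        self.rgb_head = nn.Linear(feat_dim, 3)

    def forward(self, pts_local: torch.Tensor):
        # pts_local: (B, N, 3) points expressed in the object's canonical frame
        h = self.embed(pts_local)
        h, _ = self.attn(h, h, h)                 # local self-attention among points
        sigma = torch.relu(self.density_head(h))  # (B, N, 1) volume density
        rgb = torch.sigmoid(self.rgb_head(h))     # (B, N, 3) colour
        return sigma, rgb


def compose_scene(sigmas, rgbs):
    """Density-weighted combination of per-object fields into one scene field."""
    sigmas = torch.stack(sigmas, dim=0)           # (K, B, N, 1) for K objects
    rgbs = torch.stack(rgbs, dim=0)               # (K, B, N, 3)
    total_sigma = sigmas.sum(dim=0)
    weights = sigmas / (total_sigma + 1e-8)
    rgb = (weights * rgbs).sum(dim=0)
    return total_sigma, rgb


if __name__ == "__main__":
    objects = [LocalObjectRepresentation() for _ in range(2)]
    pts = torch.rand(1, 128, 3)                   # sampled points for a batch of rays
    per_obj = [obj(pts) for obj in objects]
    sigma, rgb = compose_scene([s for s, _ in per_obj], [c for _, c in per_obj])
    print(sigma.shape, rgb.shape)                 # (1, 128, 1) and (1, 128, 3)
```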