Abstract:
To generate personalized dance videos for users, a multi-condition-guided pre-trained diffusion model for dance video generation is proposed, which directly generates dance videos from user-provided music, a performer image, a dance style, and other conditions. First, a music encoding and control module is designed to introduce music as a conditional input to the pre-trained diffusion model; the music features guide the denoising process so that the generated dance frames match the music. Second, to condition on the performer's image, a visual contextual attention module is proposed: a ControlNet-based performer-image control module captures local features of the performer's image and injects them into the diffusion process through a cross-attention mechanism, ensuring that the generated frames remain consistent with the performer's appearance. Finally, a text prompt strategy is designed to guide the pre-trained diffusion model toward higher-quality dance frames. Experimental results on the music-dance video dataset AIST verify the effectiveness of the proposed model. Compared with the baseline model, it improves the image-quality metrics SSIM and PSNR by 10.24% and 7.04%, respectively, and improves the video-quality metric IS and the audio-video alignment metric AV-Align by 6.18% and 17.16%, respectively.
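The abstract describes the conditioning pattern only at a high level, so the following is a minimal PyTorch sketch of how music features and performer-image features could be injected into one denoising step via cross-attention, with a ControlNet-style zero-initialised fusion layer for the image branch. All module names, dimensions, and the toy denoiser are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code): multi-condition guidance of a
# denoising step via cross-attention over music and performer-image features.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Lets noisy-latent tokens attend to an external condition sequence."""
    def __init__(self, dim: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.to_kv = nn.Linear(cond_dim, dim)          # project condition into latent space
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        kv = self.to_kv(cond)                          # (B, T_cond, dim)
        out, _ = self.attn(self.norm(x), kv, kv)       # latent queries, condition keys/values
        return x + out                                 # residual injection of the condition

class ToyConditionedDenoiser(nn.Module):
    """Toy stand-in for a pre-trained denoiser with music and image branches."""
    def __init__(self, dim: int = 64, music_dim: int = 128, image_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.music_attn = CrossAttentionBlock(dim, music_dim)   # music control branch
        self.image_attn = CrossAttentionBlock(dim, image_dim)   # performer-image branch
        self.zero_proj = nn.Linear(dim, dim)           # ControlNet-style zero-initialised fusion
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, latents, music_feats, image_feats):
        h = self.backbone(latents)                     # frozen pre-trained backbone in a real setup
        h = self.music_attn(h, music_feats)            # align denoising with the music
        ctrl = self.image_attn(h, image_feats)         # capture performer appearance
        return h + self.zero_proj(ctrl)                # predicted noise for this step

# Usage: one denoising step on random tensors standing in for real features.
latents = torch.randn(2, 16, 64)        # (batch, latent tokens per frame, dim)
music = torch.randn(2, 32, 128)         # e.g. beat/mel features over time
image = torch.randn(2, 49, 256)         # e.g. patch features of the performer image
noise_pred = ToyConditionedDenoiser()(latents, music, image)
print(noise_pred.shape)                 # torch.Size([2, 16, 64])
```

The zero-initialised fusion layer mirrors the common ControlNet practice of starting the control branch as an identity-preserving no-op, so that the pre-trained backbone's behaviour is unchanged at the start of fine-tuning; whether the paper uses exactly this mechanism is an assumption here.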