Citation: Zhou Rui, Jiang Cong, Xu Qingyang, Li Yibin, Zhang Chengjin, Song Yong. Multi-Conditional Generative Adversarial Network for Text-to-Video Synthesis[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(10): 1567-1579. DOI: 10.3724/SP.J.1089.2022.19731


Multi-Conditional Generative Adversarial Network for Text-to-Video Synthesis


Abstract: To address the strong randomness of current mainstream models in text-to-video synthesis and their inability to synthesize videos of complex scenes and diverse motions, a text-to-video method based on a multi-conditional generative adversarial network is proposed, comprising a text processing module, a pose modeling and transition module, and a video frame generation and optimization module. The text processing module combines traditional generation methods (retrieval and supervised learning) with generative models to build an action retrieval database, improving the controllability of the generation process. The pose modeling and transition module extracts pose information and performs 3D modeling. The video frame generation and optimization module uses multi-conditional generative adversarial networks to synthesize and optimize video frames. The model is validated on public datasets such as iPER and DeepFashion and evaluated with IS, SSIM, PSNR, and other metrics. Compared with existing models, the proposed method achieves better semantic consistency and video quality. Relative to MonkeyNet, a current mainstream pose transfer model, SSIM on the iPER dataset improves by 16.8%, IS by 22.7%, and PSNR by 27.1%. For pose transfer evaluation on the DeepFashion baseline dataset, the FreID value improves by 26.7%.
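As a rough illustration of the three-stage dataflow summarized above (text processing → pose modeling and transition → video frame generation), the following Python sketch chains toy stand-ins for each module; all class, function, and data names are hypothetical, and the multi-conditional GAN generator is replaced by a placeholder:

```python
# Hypothetical sketch of the three-module pipeline; names are illustrative,
# not taken from the paper's implementation.

# Module 1: text processing -- map a text description to a retrieved action
# via a (toy) action retrieval database.
ACTION_DATABASE = {
    "a person waves the right hand": "wave_right",
    "a person walks forward": "walk_forward",
}

def retrieve_action(text: str) -> str:
    """Look up the closest action in the retrieval database (exact match here)."""
    return ACTION_DATABASE.get(text, "idle")

# Module 2: pose modeling and transition -- expand an action into a per-frame
# 3D pose sequence (placeholder parametric model).
def action_to_poses(action: str, num_frames: int = 4) -> list:
    return [{"action": action, "frame": t, "joint_angles": [0.1 * t] * 3}
            for t in range(num_frames)]

# Module 3: frame generation and optimization -- stand-in for the
# multi-conditional GAN generator that renders each pose into a frame.
def generate_frames(poses: list) -> list:
    return [f"frame_{p['frame']}_{p['action']}" for p in poses]

def text_to_video(text: str) -> list:
    action = retrieve_action(text)
    poses = action_to_poses(action)
    return generate_frames(poses)

video = text_to_video("a person walks forward")
print(video)
```

The sketch only shows the interfaces between modules: in the actual method each placeholder would be a learned component, with the retrieval step constraining what the generator is conditioned on.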

