Zhifeng Xie, Jiaheng Zheng, Ji Wang, Jiajia Liang, Lizhuang Ma. Speech-driven Facial Reenactment based on Implicit Neural Representations with Structured Latent Codes[J]. Journal of Computer-Aided Design & Computer Graphics.

Speech-driven Facial Reenactment based on Implicit Neural Representations with Structured Latent Codes


    Abstract:

Speech-driven facial reenactment aims to generate high-fidelity facial animation that matches the content of input speech. However, existing methods can hardly achieve high-quality reenactment because of the gap between the audio and video modalities. To address the low fidelity and poor lip synchronization of existing methods, we propose a speech-driven facial reenactment method based on implicit neural representations with structured latent codes, which takes a facial point cloud sequence as the intermediate representation and decomposes speech-driven facial reenactment into two tasks: cross-modal mapping and neural radiance field rendering. First, we predict facial expression coefficients from audio through cross-modal mapping and obtain facial identity coefficients by 3D face reconstruction. Then, we synthesize the facial point cloud animation sequence based on the 3DMM model. Next, we use the vertex positions to construct structured implicit neural representations that regress the density and color of each sampled point in the scene. Finally, we render RGB frames of the face through volume rendering and assemble them into the original images. Experimental results on multiple 3-5 minute single-speaker videos, including visual comparisons, quantitative evaluation, and subjective assessment, demonstrate that our method outperforms state-of-the-art methods such as AD-NeRF in lip-sync accuracy and image generation precision, achieving high-fidelity speech-driven facial reenactment.
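The point cloud synthesis and rendering steps described in the abstract follow standard formulations; a sketch in common notation (the symbols below are conventional, not taken from the paper itself). A 3DMM combines a mean face shape with identity coefficients and the audio-predicted expression coefficients, and a neural radiance field is rendered by integrating the regressed density and color along each camera ray:

```latex
% 3DMM face synthesis: mean shape plus identity and expression bases
% (alpha: identity coefficients from 3D reconstruction,
%  beta: expression coefficients predicted from audio)
S = \bar{S} + B_{\mathrm{id}}\,\alpha + B_{\mathrm{exp}}\,\beta

% Volume rendering along a camera ray r(t) = o + t d,
% with density sigma and view-dependent color c regressed by the network
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,
                \mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

Here the structured latent codes condition the network that predicts $\sigma$ and $\mathbf{c}$, and the accumulated color $C(\mathbf{r})$ gives the pixel value of each rendered RGB frame.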

     
