Method for Speech-Driven 3D Face Animation Generation under Personal Style Guidance
Abstract: In speech-driven 3D face animation, incorporating the personal style of the target character can significantly enhance the realism and expressiveness of the animation. Existing methods, however, are not fine-grained enough in expressing personal style, and their ability to adapt to the styles of new characters also needs improvement. To address these problems, we propose a method for generating speech-driven 3D face animation under personal style guidance. First, we design a personal style extractor based on a style attention network: it extracts latent facial motion features from facial action sequences, then adjusts and fuses these features according to the semantic space distribution of the audio features, producing style features matched to each animation frame. Next, we build a style-guided feature fusion decoder on the Transformer architecture; thanks to its multi-head attention layers, the decoder can attend to the context of the personal style features while mapping audio features to 3D face animation, so that the generated animation better reproduces the target character's style. Experiments on the public VOCASET dataset show that the proposed method reflects the styles of known characters more accurately while keeping lip movements precisely synchronized with the driving speech. In addition, with a two-stage training strategy, the method efficiently adapts to the style of a new character from a short video clip and generates face animation in that style. Experiments on a self-constructed dataset of new characters show that the method achieves small facial vertex errors and high style similarity, demonstrating good generalization to new character styles.
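To make the style-guided feature fusion concrete, the following PyTorch-style code is a minimal sketch only (not the authors' implementation; the module names, feature dimensions, and vertex count are assumptions for illustration). It shows a Transformer decoder layer in which per-frame audio features cross-attend to frame-aligned personal style features via multi-head attention and are then projected to per-frame 3D vertex offsets.

```python
# Minimal sketch of a style-guided feature fusion decoder layer.
# Assumptions: d_model, n_heads, and n_vertices are illustrative values;
# this is not the paper's released code.
import torch
import torch.nn as nn

class StyleGuidedDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_vertices=5023):
        super().__init__()
        # Self-attention over the audio-feature sequence
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: audio queries attend to per-frame style features
        self.style_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Map the fused features to per-vertex 3D displacements
        self.to_vertices = nn.Linear(d_model, n_vertices * 3)

    def forward(self, audio_feat, style_feat):
        # audio_feat: (B, T, d_model) audio features, one vector per animation frame
        # style_feat: (B, T, d_model) frame-aligned personal style features
        x = audio_feat
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        x = self.norm2(x + self.style_attn(x, style_feat, style_feat)[0])
        x = self.norm3(x + self.ffn(x))
        B, T, _ = x.shape
        # Per-frame vertex offsets, to be added to a neutral face template elsewhere
        return self.to_vertices(x).view(B, T, -1, 3)

# Usage: two clips of 100 frames each, hypothetical 256-dim features
layer = StyleGuidedDecoderLayer()
audio = torch.randn(2, 100, 256)
style = torch.randn(2, 100, 256)
offsets = layer(audio, style)  # shape (2, 100, 5023, 3)
```

In this sketch the style features stand in for the output of the personal style extractor described above, and the predicted offsets would be added to a neutral template mesh to obtain the animated face.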