Xie Zhifeng, Sun Luoyi, Sun Yuzhou, Yu Chunpeng, Ma Lizhuang. Sound Generation Method with Timing-Aligned Visual Feature Mapping[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(10): 1506-1514. DOI: 10.3724/SP.J.1089.2022.19725

Sound Generation Method with Timing-Aligned Visual Feature Mapping

Abstract: To address the problems of existing visually guided sound generation methods, such as low fidelity, noticeable noise, and poor temporal alignment with the video, we propose a sound generation method based on timing-aligned visual feature mapping. First, we design a feature aggregation window under temporal constraints, which slides over the video sequence and integrates it into a set of visual features. Second, we construct a spatio-temporally matched cross-modal audio-visual feature mapping network that transforms the visual feature set into multi-band audio features. Finally, an audio decoder decodes the audio features into a Mel-spectrogram, which a vocoder then converts into the final waveform. Qualitative and quantitative experiments on the VAS dataset show that, compared with existing methods, the proposed method achieves significant improvements in perceptual evaluation of speech quality (PESQ), average sound-onset offset, and human evaluation, reducing the average sound-onset offset to 0.2 s.
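The abstract describes a three-stage pipeline: sliding-window aggregation of per-frame visual features, cross-modal mapping of those features to multi-band audio features, and decoding to a Mel-spectrogram that a vocoder turns into the final waveform. The following is a minimal PyTorch sketch of that data flow only; the module names (FeatureAggregator, CrossModalMapper, AudioDecoder), the layer choices (linear projections, a GRU encoder), and all dimensions are illustrative assumptions rather than the paper's actual implementation, and the vocoder stage is indicated only in a comment.

```python
# Minimal sketch of the pipeline outlined in the abstract. All module names,
# layer choices, and sizes are hypothetical placeholders, not the paper's code.
import torch
import torch.nn as nn


class FeatureAggregator(nn.Module):
    """Slides a temporally constrained window over per-frame visual features
    and integrates each window into one aggregated visual feature."""
    def __init__(self, feat_dim=512, window=8, stride=4):
        super().__init__()
        self.window, self.stride = window, stride
        self.proj = nn.Linear(feat_dim * window, feat_dim)

    def forward(self, frames):                                   # (B, T, feat_dim)
        windows = frames.unfold(1, self.window, self.stride)     # (B, N, feat_dim, window)
        windows = windows.permute(0, 1, 3, 2).flatten(2)         # (B, N, window*feat_dim)
        return self.proj(windows)                                # (B, N, feat_dim)


class CrossModalMapper(nn.Module):
    """Maps aggregated visual features to multi-band audio features
    with a simple temporal encoder (a GRU here, as a stand-in)."""
    def __init__(self, feat_dim=512, audio_dim=256, bands=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, audio_dim * bands, batch_first=True)

    def forward(self, vis_feats):                                # (B, N, feat_dim)
        audio_feats, _ = self.encoder(vis_feats)
        return audio_feats                                       # (B, N, audio_dim*bands)


class AudioDecoder(nn.Module):
    """Decodes audio features into a Mel-spectrogram; a pretrained neural
    vocoder would then convert the Mel-spectrogram into the waveform."""
    def __init__(self, audio_dim=256, bands=4, n_mels=80, frames_per_step=4):
        super().__init__()
        self.frames_per_step = frames_per_step
        self.out = nn.Linear(audio_dim * bands, n_mels * frames_per_step)

    def forward(self, audio_feats):                              # (B, N, audio_dim*bands)
        mel = self.out(audio_feats)                              # (B, N, n_mels*frames_per_step)
        B, N, _ = mel.shape
        return mel.view(B, N * self.frames_per_step, -1)         # (B, T_mel, n_mels)


if __name__ == "__main__":
    frames = torch.randn(2, 64, 512)        # 64 per-frame visual features per clip
    vis = FeatureAggregator()(frames)
    mel = AudioDecoder()(CrossModalMapper()(vis))
    print(mel.shape)                        # torch.Size([2, 60, 80])
```

In this sketch the sliding aggregation window is what ties each generated audio segment to a fixed span of video frames, which is how the temporal constraint described in the abstract is reflected here.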
