Xie Zhifeng, Sun Luoyi, Sun Yuzhou, Yu Chunpeng, Ma Lizhuang. Sound Generation Method with Timing-Aligned Visual Feature Mapping[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(10): 1506-1514. DOI: 10.3724/SP.J.1089.2022.19725

Sound Generation Method with Timing-Aligned Visual Feature Mapping

Abstract: To address the problems of existing visually guided sound generation methods, such as low fidelity, noticeable noise, and poor temporal alignment with the video, we propose a sound generation method based on timing-aligned visual feature mapping. First, we design a feature aggregation window under temporal constraints, which slides over the video sequence and integrates it into a set of visual features. Second, we construct a spatio-temporally matched cross-modal audio-visual feature mapping network that transforms the visual feature set into multi-band audio features. Finally, an audio decoder decodes the audio features into a Mel-spectrogram, which a vocoder then converts into the final waveform. Qualitative and quantitative experiments on the VAS dataset show that, compared with existing methods, the proposed method achieves significant improvements in perceptual evaluation of speech quality (PESQ), average sound-onset offset, and human evaluation, reducing the average sound-onset offset to 0.2 s.
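The abstract describes a three-stage pipeline: sliding-window aggregation of per-frame visual features, cross-modal mapping of those features to multi-band audio features, and decoding to a Mel-spectrogram that a vocoder turns into the final waveform. The following is a minimal PyTorch sketch of that data flow only; the module names (FeatureAggregator, CrossModalMapper, AudioDecoder), the layer choices (linear projections, a GRU encoder), and all dimensions are illustrative assumptions rather than the paper's actual implementation, and the vocoder stage is indicated only in a comment.

```python
# Minimal sketch of the pipeline outlined in the abstract. All module names,
# layer choices, and sizes are hypothetical placeholders, not the paper's code.
import torch
import torch.nn as nn


class FeatureAggregator(nn.Module):
    """Slides a temporally constrained window over per-frame visual features
    and integrates each window into one aggregated visual feature."""
    def __init__(self, feat_dim=512, window=8, stride=4):
        super().__init__()
        self.window, self.stride = window, stride
        self.proj = nn.Linear(feat_dim * window, feat_dim)

    def forward(self, frames):                                   # (B, T, feat_dim)
        windows = frames.unfold(1, self.window, self.stride)     # (B, N, feat_dim, window)
        windows = windows.permute(0, 1, 3, 2).flatten(2)         # (B, N, window*feat_dim)
        return self.proj(windows)                                # (B, N, feat_dim)


class CrossModalMapper(nn.Module):
    """Maps aggregated visual features to multi-band audio features
    with a simple temporal encoder (a GRU here, as a stand-in)."""
    def __init__(self, feat_dim=512, audio_dim=256, bands=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, audio_dim * bands, batch_first=True)

    def forward(self, vis_feats):                                # (B, N, feat_dim)
        audio_feats, _ = self.encoder(vis_feats)
        return audio_feats                                       # (B, N, audio_dim*bands)


class AudioDecoder(nn.Module):
    """Decodes audio features into a Mel-spectrogram; a pretrained neural
    vocoder would then convert the Mel-spectrogram into the waveform."""
    def __init__(self, audio_dim=256, bands=4, n_mels=80, frames_per_step=4):
        super().__init__()
        self.frames_per_step = frames_per_step
        self.out = nn.Linear(audio_dim * bands, n_mels * frames_per_step)

    def forward(self, audio_feats):                              # (B, N, audio_dim*bands)
        mel = self.out(audio_feats)                              # (B, N, n_mels*frames_per_step)
        B, N, _ = mel.shape
        return mel.view(B, N * self.frames_per_step, -1)         # (B, T_mel, n_mels)


if __name__ == "__main__":
    frames = torch.randn(2, 64, 512)        # 64 per-frame visual features per clip
    vis = FeatureAggregator()(frames)
    mel = AudioDecoder()(CrossModalMapper()(vis))
    print(mel.shape)                        # torch.Size([2, 60, 80])
```

In this sketch the sliding aggregation window is what ties each generated audio segment to a fixed span of video frames, which is how the temporal constraint described in the abstract is reflected here.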
