Sound Generation Method with Timing-Aligned Visual Feature Mapping
Graphical Abstract
Abstract
To address the problems of existing methods, such as prominent noise, weak realism, and asynchrony with the video, we propose a sound generation method based on timing-aligned visual feature mapping. First, we design a feature aggregation window based on a temporal constraint, which extracts integrated visual features from the video sequence. Second, the integrated visual features are transformed into multi-frequency audio features by a spatio-temporal matching cross-modal mapping network. Finally, we use an audio decoder to obtain a Mel-spectrogram from the audio features, which is sent to a vocoder to produce the final waveform. We conducted qualitative and quantitative experiments on the VAS dataset, and the results show that the proposed method significantly improves audio quality, timing alignment, and audience perception.
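The abstract describes a three-stage pipeline: temporally constrained visual feature aggregation, cross-modal mapping to audio features, and decoding to a Mel-spectrogram. The following is a minimal sketch of that data flow only; all shapes, function names, and the linear maps are illustrative assumptions, standing in for the paper's learned networks.

```python
import numpy as np

def aggregate_visual_features(frames, window=4):
    # Hypothetical stand-in for the temporally constrained feature
    # aggregation window: average each causal sliding window of
    # per-frame visual features.
    T, _ = frames.shape
    return np.stack([frames[max(0, t - window + 1):t + 1].mean(axis=0)
                     for t in range(T)])  # (T, D_visual)

def cross_modal_map(visual_feats, W):
    # Toy linear visual-to-audio mapping; the paper uses a learned
    # spatio-temporal matching cross-modal network instead.
    return visual_feats @ W  # (T, D_audio)

def decode_to_mel(audio_feats, mel_basis):
    # Toy projection onto mel bins; a real audio decoder would be a
    # neural network producing the Mel-spectrogram for a vocoder.
    return audio_feats @ mel_basis  # (T, n_mels)

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))      # 16 video frames, 8-dim features (assumed)
W = rng.normal(size=(8, 12))           # visual -> audio feature map (assumed)
mel_basis = rng.normal(size=(12, 80))  # audio feature -> 80 mel bins (assumed)

mel = decode_to_mel(cross_modal_map(aggregate_visual_features(frames), W),
                    mel_basis)
print(mel.shape)  # (16, 80)
```

Because the aggregation window is causal and produces one integrated feature per frame, the resulting Mel-spectrogram keeps a one-to-one temporal correspondence with the video frames, which is the timing-alignment property the method targets.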