Abstract:
Motion video transfer is one of the key tasks in the field of computer vision, but current methods have two shortcomings. First, the accuracy of extracting motion pose features that change over time is insufficient. Second, the factorization of time-varying motion pose features and time-independent content features is flawed, which degrades video transfer performance. To address these issues, we propose an innovative two-stage network model. In the first stage, two encoders extract motion pose features and content features from the video, and a decoder fuses the features and generates the video. In this stage, a self-supervised pose alignment method is adopted to extract motion features, and adversarial loss, cycle consistency loss, and cross-training techniques are introduced to improve the factorization of the two feature types. In the second stage, the video generated by the first-stage network is fine-tuned, focusing on improving its motion consistency. A series of comparative experiments, visualization experiments, and ablation experiments demonstrates that the proposed model outperforms existing methods in terms of video frame quality, structural similarity, and mean squared error.
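The first-stage design can be summarized as a pose encoder, a content encoder, and a fusing decoder. The following is a minimal PyTorch-style sketch of that encoder/decoder layout only; all module names, layer sizes, and the specific convolutional stacks are illustrative assumptions and do not reproduce the authors' implementation or the losses described above.

```python
# Minimal sketch of a two-encoder / one-decoder transfer pipeline.
# Module names and layer configurations are assumptions for illustration.
import torch
import torch.nn as nn


class PoseEncoder(nn.Module):
    """Extracts time-varying motion/pose features from a frame (assumed design)."""
    def __init__(self, in_channels=3, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, frame):
        return self.net(frame)


class ContentEncoder(nn.Module):
    """Extracts time-independent content/appearance features (assumed design)."""
    def __init__(self, in_channels=3, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, frame):
        return self.net(frame)


class Decoder(nn.Module):
    """Fuses pose and content features and reconstructs a frame (assumed design)."""
    def __init__(self, feat_dim=64, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(2 * feat_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, pose_feat, content_feat):
        # Feature fusion by channel-wise concatenation (one common choice).
        return self.net(torch.cat([pose_feat, content_feat], dim=1))


# Transfer: motion comes from a driving frame, appearance from a source frame.
pose_enc, content_enc, dec = PoseEncoder(), ContentEncoder(), Decoder()
driving = torch.randn(1, 3, 64, 64)  # frame supplying the motion pose
source = torch.randn(1, 3, 64, 64)   # frame supplying the content
generated = dec(pose_enc(driving), content_enc(source))
print(generated.shape)  # torch.Size([1, 3, 64, 64])
```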