Abstract:
To address the problems of incomplete shape structure, inaccurate prediction of local details, and non-smooth motion modeling in clothed human reconstruction from monocular video, a spatial-temporal feature fusion method for clothed human reconstruction is proposed. First, a temporal deformation is defined in the deformation field between the observation space and the canonical space, representing and enhancing the temporal features of contextual information across consecutive frames. Second, the temporal features are used to guide the learning of fine-grained spatial features, yielding global point-level features and pixel-aligned features of the clothed human body. Finally, a spatial-temporal feature fusion module based on self-attention is proposed to fuse the global point-level features and pixel-aligned features with the temporal information, and a neural radiance field is constructed that combines the fused features with canonical-space coordinates to reconstruct an accurate clothed human model. Experimental results for novel-view and novel-pose synthesis on the ZJU-MoCap dataset show that the total peak signal-to-noise ratio (PSNR) reaches 190.96 dB and 184.03 dB respectively, which is 10.62 dB and 2.45 dB higher than the comparison methods. The proposed method improves the accuracy of clothed human reconstruction and generates clothed human models with reasonable shapes, rich clothing textures, and smooth limbs. Experimental results on text-driven models show that the proposed method also has a certain degree of generalization ability.
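
As a rough illustration of the fusion step summarized above, the following PyTorch sketch fuses global point-level, pixel-aligned, and temporal features with self-attention and conditions a small radiance-field MLP on the fused feature and canonical-space coordinates. This is a minimal sketch under assumed settings: the class names, feature dimensions, token layout, and MLP structure are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): self-attention fusion of
# point-level, pixel-aligned, and temporal features, then a small NeRF-style MLP
# conditioned on canonical-space coordinates.
import torch
import torch.nn as nn


class SpatialTemporalFusion(nn.Module):
    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Each sampled point contributes three tokens: point-level, pixel-aligned, temporal.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, point_feat, pixel_feat, temporal_feat):
        # Inputs: (N, feat_dim) tensors for N sampled 3D points.
        tokens = torch.stack([point_feat, pixel_feat, temporal_feat], dim=1)  # (N, 3, D)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention over the 3 tokens
        fused = self.norm(attended + tokens)             # residual connection + layer norm
        return fused.mean(dim=1)                         # (N, D) fused feature per point


class FusionRadianceField(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.fusion = SpatialTemporalFusion(feat_dim)
        # Radiance field conditioned on canonical coordinates plus the fused feature.
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB (3 channels) + density (1 channel)
        )

    def forward(self, x_canonical, point_feat, pixel_feat, temporal_feat):
        fused = self.fusion(point_feat, pixel_feat, temporal_feat)
        out = self.mlp(torch.cat([x_canonical, fused], dim=-1))
        rgb = torch.sigmoid(out[:, :3])
        sigma = torch.relu(out[:, 3:])
        return rgb, sigma


if __name__ == "__main__":
    N, D = 1024, 64
    model = FusionRadianceField(feat_dim=D)
    rgb, sigma = model(torch.rand(N, 3), torch.randn(N, D),
                       torch.randn(N, D), torch.randn(N, D))
    print(rgb.shape, sigma.shape)  # torch.Size([1024, 3]) torch.Size([1024, 1])
```

In this sketch the three feature types are treated as a three-token sequence per point so that self-attention can weight their contributions before the radiance-field query; the actual token arrangement and conditioning used in the paper may differ.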