Generative Visual Image Understanding Based on Disentangled Representation Learning
Abstract: Learning interpretable visual image representations that reveal the factors of variation in images is a hot research topic in computer vision. Many existing disentanglement methods discover the factors of variation and learn disentangled representations by adding extra regularization terms, but this usually leads to an imbalance between disentanglement and generative quality, degrading visual image understanding. To address this issue, a generative visual image understanding method based on disentangled representation learning is proposed, starting from the interpretable variations in images. First, a pre-trained Glow generative model is used to obtain the latent representations of target images. Second, a learning strategy based on image variation is constructed from the latent representations to obtain interpretable directions for candidate traversals. Finally, a contrast module is designed from the contrastive learning perspective to simulate image variations along the interpretable directions of the candidate traversals and then extract disentangled representations. Experimental results on the popular disentanglement datasets Shapes3D, MPI3D, Anime, MNIST, and Cars3D show that the proposed method performs well; on Cars3D, its MIG, DCI, FactorVAE score, and β-VAE score reach 0.16, 0.27, 0.89, and 0.98, respectively, verifying the effectiveness and feasibility of the method.
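The pipeline the abstract describes (invertible encoding, latent traversal along candidate directions, and contrastive identification of the traversed direction) can be sketched in miniature. The sketch below is an assumption-laden toy, not the paper's implementation: a fixed random linear map stands in for the pre-trained Glow encoder/decoder (Glow is invertible, so decoding is the exact inverse of encoding), the direction matrix `D` is randomly initialized rather than learned, and `info_nce` is a generic InfoNCE-style contrastive loss rather than the paper's exact contrast module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper these would be the pre-trained Glow
# model's forward and inverse passes; here an invertible linear map.
LATENT_DIM, NUM_DIRS = 8, 4
W = rng.standard_normal((LATENT_DIM, LATENT_DIM))

def encode(x):
    return x @ W                    # stand-in for the Glow forward pass

def decode(z):
    return z @ np.linalg.inv(W)     # exact inverse, mimicking Glow's invertibility

# Candidate traversal directions (learned in the paper; random init here),
# unit-normalized so the shift magnitude is controlled by eps alone.
D = rng.standard_normal((NUM_DIRS, LATENT_DIM))
D /= np.linalg.norm(D, axis=1, keepdims=True)

def traverse(x, k, eps):
    """Shift the latent code of sample x along direction k and decode."""
    return decode(encode(x) + eps * D[k])

def info_nce(anchor, positive, negatives, tau=0.1):
    """Generic InfoNCE-style loss: pull anchor toward the positive direction,
    push it away from the negative (non-traversed) directions."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, v) for v in [positive] + negatives]) / tau
    logits -= logits.max()          # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

# Simulate one image variation: traverse direction 0, then check that the
# latent difference recovers the traversed direction (up to the scale eps).
x = rng.standard_normal(LATENT_DIM)
x_shift = traverse(x, k=0, eps=2.0)
delta = encode(x_shift) - encode(x)   # equals 2.0 * D[0] up to round-off
```

A training loop would minimize `info_nce(delta, D[k], other_directions)` so that each learned direction produces a distinct, identifiable image variation; here the loss is only evaluated, not optimized.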