

Image Captioning Based on Conditional Generative Adversarial Nets

  • Abstract: Image captioning, the task of automatically describing the semantic content of an image, is an important research problem in computer vision. Although the combination of convolutional neural networks (CNN) and long short-term memory networks (LSTM) alleviates the vanishing- and exploding-gradient problems in caption generation, LSTM-based models generate descriptions sequentially, cannot be parallelized during training, and tend to forget earlier information while generating a caption. To address these problems, this paper introduces a conditional generative adversarial network (CGAN) into the training of the caption generation model: a CNN is used to generate the caption, sentences are produced through adversarial training, and an attention mechanism is incorporated to improve caption quality. Experiments on the MSCOCO dataset show that, compared with other CNN-based methods, the proposed method achieves a 2% improvement on CIDEr, which reflects semantic richness, and about a 1% improvement on BLEU, which reflects accuracy; it also outperforms LSTM-based image captioning methods on some metrics, especially the semantic ones. These results indicate that the generated captions are closer to the true descriptions of the images and semantically richer.
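The abstract describes the method only at a high level. As a rough, hypothetical sketch of what CNN-based caption generation with conditional adversarial training can look like (not the paper's implementation), the code below pairs a causal-convolution caption generator conditioned on an image feature with a discriminator that scores (image, caption) pairs; because sampled tokens are discrete, the generator is updated with a REINFORCE-style gradient that uses the discriminator's score as reward. All module names, sizes, and the random stand-in data are assumptions, and the attention mechanism mentioned in the abstract is omitted for brevity.

```python
# A minimal, hypothetical sketch (assumed names and sizes, not the paper's
# code): a causal-CNN caption generator trained adversarially against a
# discriminator that scores (image feature, caption) pairs.
import torch
import torch.nn as nn

VOCAB, MAXLEN, FEAT, EMB = 1000, 16, 512, 256

class ConvCaptioner(nn.Module):
    """CNN-based caption generator conditioned on an image feature."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.cond = nn.Linear(FEAT, EMB)
        # Padded-then-cropped 1-D convolutions act as a causal decoder.
        self.conv1 = nn.Conv1d(EMB, EMB, kernel_size=3, padding=2)
        self.conv2 = nn.Conv1d(EMB, EMB, kernel_size=3, padding=2)
        self.out = nn.Linear(EMB, VOCAB)

    def forward(self, feats, tokens):
        t = tokens.size(1)
        x = self.embed(tokens) + self.cond(feats).unsqueeze(1)  # (B, T, EMB)
        h = x.transpose(1, 2)                                   # (B, EMB, T)
        h = torch.relu(self.conv1(h))[:, :, :t]  # crop right: no future leak
        h = torch.relu(self.conv2(h))[:, :, :t]
        return self.out(h.transpose(1, 2))                      # (B, T, VOCAB)

class PairDiscriminator(nn.Module):
    """Scores how well a caption matches an image feature (raw logit)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.score = nn.Sequential(
            nn.Linear(EMB + FEAT, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats, tokens):
        cap = self.embed(tokens).mean(dim=1)  # crude bag-of-words summary
        return self.score(torch.cat([cap, feats], dim=-1))

def sample(gen, feats):
    """Autoregressively sample captions; returns tokens and log-probs."""
    b = feats.size(0)
    tokens = torch.zeros(b, 1, dtype=torch.long)  # token 0 acts as <bos>
    logps = []
    for _ in range(MAXLEN):
        dist = torch.distributions.Categorical(
            logits=gen(feats, tokens)[:, -1])     # next-token distribution
        nxt = dist.sample()
        logps.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
    return tokens[:, 1:], torch.stack(logps, dim=1)

gen, disc = ConvCaptioner(), PairDiscriminator()
g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

feats = torch.randn(8, FEAT)                 # stand-in image features
real = torch.randint(0, VOCAB, (8, MAXLEN))  # stand-in reference captions

for step in range(3):  # a few illustrative steps
    # Discriminator: real (image, caption) pairs vs. generated ones.
    fake, _ = sample(gen, feats)
    d_loss = (bce(disc(feats, real), torch.ones(8, 1)) +
              bce(disc(feats, fake), torch.zeros(8, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: REINFORCE update, with the discriminator's probability
    # that the pair is real used as the reward (sampled tokens are
    # discrete, so the score cannot be backpropagated directly).
    fake, logps = sample(gen, feats)
    reward = torch.sigmoid(disc(feats, fake)).detach()
    g_loss = -(reward * logps).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```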

     

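For context on the reported metrics: BLEU measures n-gram overlap between a generated caption and its references, while CIDEr weights n-gram matches by their TF-IDF consensus across reference captions. A minimal BLEU example with NLTK is sketched below on toy tokenized captions (assumed data, not the paper's outputs); CIDEr is usually computed with the pycocoevalcap toolkit rather than NLTK.

```python
# Hypothetical metric check with NLTK's BLEU on toy tokenized captions;
# real MSCOCO evaluation would use the official pycocoevalcap toolkit,
# which also provides CIDEr.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "dog", "runs", "on", "the", "grass"]],   # references for image 1
    [["two", "people", "ride", "a", "bike"]],       # references for image 2
]
hypotheses = [
    ["a", "dog", "is", "running", "on", "grass"],   # generated caption 1
    ["two", "people", "riding", "a", "bike"],       # generated caption 2
]

smooth = SmoothingFunction().method1  # avoid zero scores on tiny toy data
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```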

     

