周治平, 张威. 结合视觉属性注意力和残差连接的图像描述生成模型[J]. 计算机辅助设计与图形学学报, 2018, 30(8): 1536-1542. DOI: 10.3724/SP.J.1089.2018.16825
Zhou Zhiping, Zhang Wei. An Image Caption Generation Model Based on Visual Concept Attention and Residual Connection[J]. Journal of Computer-Aided Design & Computer Graphics, 2018, 30(8): 1536-1542. DOI: 10.3724/SP.J.1089.2018.16825

结合视觉属性注意力和残差连接的图像描述生成模型

An Image Caption Generation Model Based on Visual Concept Attention and Residual Connection

  • 摘要: 使机器自动描述图像一直是计算机视觉研究的长期目标之一.为了提高图像内容描述模型的精度,提出一种结合自适应注意力机制和残差连接的长短时间记忆网络(LSTM)的图像描述模型.首先根据pointer-net网络改进基本LSTM结构,增加记录图像视觉属性信息的单元;然后利用改进的LSTM结构,设计基于图像视觉语义属性的自适应注意力机制,自适应注意力机制根据上一时刻模型隐藏层状态,自动选择下一时刻模型需要处理的图像区域;此外,为了得到更紧密的图像与描述语句之间映射关系,构建基于残差连接的双层LSTM结构;最终得到模型能够联合图像视觉特征和语义特征对图像进行内容描述.在MSCOCO和Flickr30K图像集中进行训练和测试,并使用不同的评估方法对模型进行实验验证,结果表明所提模型的性能有较大的提高.

     

    Abstract: Enabling machines to describe images automatically has been a long-term goal of computer vision. To improve the accuracy of image captioning, an image caption model is proposed that combines an adaptive attention mechanism with a stacked Long Short-Term Memory (LSTM) network using residual connections. First, the basic LSTM structure is modified, following the pointer-net network, by adding units that record the visual attribute information of the image. Then, an adaptive attention mechanism based on the visual semantic attributes of the image is designed using the modified LSTM; it automatically chooses the image region to be processed at the next time step according to the model's hidden state at the previous time step. In addition, to obtain a closer mapping between the image and its description, a two-layer LSTM network with residual connections is constructed. The resulting model describes an image by combining its visual features with its semantic features. The model is trained and tested on the MSCOCO and Flickr30K image datasets and evaluated with several evaluation metrics; the experimental results show that its performance is considerably improved.
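
The abstract names two mechanisms without giving their equations: attention that picks image regions from the previous hidden state, and a residual connection between two stacked recurrent layers. The sketch below is only an illustration of those two ideas under stated assumptions, not the authors' formulation. It assumes dot-product scoring, a single extra "attribute" slot appended to the region list, and plain callables standing in for the LSTM layers; the names `attend`, `residual_stack`, and `attribute_vec` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(prev_hidden, regions, attribute_vec):
    """Soft attention driven by the previous hidden state.

    Assumption: each image region and one attribute vector (a stand-in
    for the visual attribute unit) compete in the same softmax, scored
    by dot product with prev_hidden.
    """
    candidates = regions + [attribute_vec]
    weights = softmax([dot(prev_hidden, c) for c in candidates])
    dim = len(prev_hidden)
    context = [
        sum(w * c[i] for w, c in zip(weights, candidates))
        for i in range(dim)
    ]
    return weights, context


def residual_stack(x, layer1, layer2):
    """Two stacked layers with a skip connection around the second.

    Assumption: layer1 and layer2 are any callables mapping a vector to
    a vector of the same size (here they replace real LSTM cells).
    """
    h1 = layer1(x)
    h2 = layer2(h1)
    return [a + b for a, b in zip(h1, h2)]
```

Calling `attend(prev_hidden, regions, attribute_vec)` returns one weight per region plus one for the attribute slot, together with the weighted context vector; `residual_stack` adds the first layer's output back onto the second layer's output, which is the usual form of a residual connection.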

     
