BiGRU-RA Model for Image Chinese Captioning via Global and Local Features

邓珍荣; 张永林; 杨睿; 蓝如师; 黄文明; 罗笑南

doi:10.3724/SP.J.1089.2021.18262

Abstract: To address the lack of detailed semantic information in current global-feature-based image captioning models, an image Chinese captioning model combining global and local features is proposed. The model adopts the encoder-decoder framework. In the encoding stage, residual networks (ResNet) and Faster R-CNN extract the global and local features of an image, respectively, improving the model's use of image features at different scales. A bi-directional gated recurrent unit (BiGRU) with embedded residual connection and visual attention structures serves as the decoder (BiGRU with residual connection and attention, BiGRU-RA). The model adaptively allocates weights between image features and text, improving the mapping between image feature regions and context information. In addition, a reinforcement learning-based policy gradient is incorporated into the loss function so that the evaluation metric CIDEr is optimized directly. The model is trained and evaluated on the image Chinese captioning dataset of the AI Challenger global challenge. Experimental results show that the proposed model obtains higher scores and that the generated captions are more accurate and detailed.
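The policy-gradient step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the form below is an assumption: a REINFORCE-style surrogate loss, -(r - b) * log p(caption), where r is the CIDEr score of a sampled caption and b is a baseline reward. The function names, the choice of baseline, and the numeric rewards are all hypothetical and are not taken from the paper.

```python
import math


def sequence_log_prob(token_probs):
    """Sum the log-probabilities the decoder assigned to each sampled token."""
    return sum(math.log(p) for p in token_probs)


def policy_gradient_loss(token_probs, sampled_reward, baseline_reward):
    """REINFORCE-style surrogate loss (assumed form, not the paper's exact loss).

    token_probs     -- probabilities of the tokens in the sampled caption
    sampled_reward  -- reward of the sampled caption (e.g. a CIDEr score)
    baseline_reward -- reward used as baseline to reduce variance (assumption)
    """
    advantage = sampled_reward - baseline_reward
    # Minimizing this raises the likelihood of captions that beat the baseline
    # and lowers the likelihood of captions that fall below it.
    return -advantage * sequence_log_prob(token_probs)


# Hypothetical numbers: a two-token caption with CIDEr 1.0 against baseline 0.4.
loss = policy_gradient_loss([0.5, 0.5], sampled_reward=1.0, baseline_reward=0.4)
```

In a real training loop the token probabilities would come from the decoder and the gradient would flow through the log-probability term only; the rewards are treated as constants.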
