Citation: Dong Rongsheng, Liu Yi, Ma Yuqi, Li Fengying. Lightweight Network with Convolutional Attention Feature Fusion for Real-Time Semantic Segmentation[J]. Journal of Computer-Aided Design & Computer Graphics, 2023, 35(6): 935-943. DOI: 10.3724/SP.J.1089.2023.19499

Lightweight Network with Convolutional Attention Feature Fusion for Real-Time Semantic Segmentation

Abstract: Lightweight convolutional neural networks have promoted the application of deep-learning-based semantic segmentation on low-power mobile devices. However, lightweight networks generally fuse features by linear combination, which does not consider the relationship between the fused features and therefore limits segmentation accuracy. To address this problem, a lightweight network with convolutional attention feature fusion, based on an encoder-decoder architecture, is proposed. In the encoder, a dilated MobileNet block is built on MobileNetv2 to obtain a sufficiently large receptive field and enhance the representation ability of the lightweight backbone. In the decoder, a convolutional attention feature fusion module learns the relationships among the channel, height and width dimensions of the feature maps to obtain relative weights between different feature maps, and uses these weights to fuse the feature maps. The proposed network has only 0.68 million (0.68×10^6) parameters. Without a pretrained model, postprocessing or extra data, experiments on the urban road scene datasets Cityscapes and CamVid with a single NVIDIA 2080Ti GPU show that the network achieves a mean intersection over union of 72.7% and 67.9% at 86 and 105 frames per second, respectively, a favorable trade-off between segmentation accuracy, model size and speed.
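To make the contrast between linear fusion and relative-weight fusion concrete, the following is a minimal sketch in plain Python. It is not the module proposed in the paper: the function names are hypothetical, the feature maps are flattened to lists with one value per channel/height/width position, and the attention scores are passed in as inputs, whereas in the network they would be produced by learned convolutional layers.

```python
import math


def linear_fusion(a, b):
    """Baseline: element-wise sum, so both inputs count equally everywhere."""
    return [x + y for x, y in zip(a, b)]


def attention_fusion(a, b, score_a, score_b):
    """Fuse two feature maps using per-position relative weights.

    a, b             -- flattened feature maps of equal length
    score_a, score_b -- hypothetical attention scores, one per position
                        (assumed inputs here; a real network would learn them)
    """
    fused = []
    for x, y, sa, sb in zip(a, b, score_a, score_b):
        # A softmax over the two scores reduces to a sigmoid of their
        # difference; w_a is the relative weight of `a` at this position.
        w_a = 1.0 / (1.0 + math.exp(sb - sa))
        fused.append(w_a * x + (1.0 - w_a) * y)
    return fused
```

With equal scores every position gets weight 0.5, so `attention_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0])` returns `[2.0, 3.0]`, the plain average. When one score dominates at a position, the output there moves toward the corresponding input, which is the behavior a fixed linear combination cannot express.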

     
