Combining an Axial Enhanced Transformer and a CNN Dual Encoder for Medical Image Segmentation
Abstract: Hybrid models that combine the Swin Transformer with a CNN have proven effective for medical image segmentation. However, a semantic gap exists between the features extracted by the two networks, so fusing them directly yields unsatisfactory segmentation accuracy. Moreover, the Swin Transformer lacks pixel-level modeling capability within patches. To address these problems, we propose a medical image segmentation method that combines an axial enhanced Transformer with a CNN dual encoder. To bridge the semantic gap between features, a feature fusion module is introduced at the encoding stage: cross-fusion, cross-domain enhancement, and channel-spatial attention modules effectively merge the features extracted by the two networks, preserving semantic consistency and effectiveness and strengthening the model's representational power. To address the Swin Transformer's limited pixel-level modeling ability, an axial enhanced Transformer encoder models correlations between pixels along both the height and width dimensions, improving pixel-level modeling and, in turn, segmentation accuracy. Experiments on four medical image datasets (GlaS, MoNuSeg, JSRT, and ISIC2018) compare the proposed method with mainstream segmentation models. The results show that it achieves the best Dice, IoU, precision, and recall on all four datasets, and that it can be applied to the segmentation of a wide range of medical images.
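The abstract names the fusion components (cross-fusion, cross-domain enhancement, channel-spatial attention) without specifying their implementation. As a rough illustration only, the following is a minimal PyTorch sketch of fusing the two encoder branches with channel and then spatial attention; the class name `ChannelSpatialFusion` and the CBAM-style design are assumptions for illustration, not the authors' actual module.

```python
import torch
import torch.nn as nn

class ChannelSpatialFusion(nn.Module):
    """Hypothetical fusion of Transformer and CNN features.

    Concatenates the two feature maps, projects back to `dim` channels,
    then applies channel attention (squeeze-and-excitation style)
    followed by spatial attention (CBAM style).
    """
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)
        # Channel attention: global average pool -> bottleneck MLP -> gate.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: gate from per-pixel mean/max channel statistics.
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_trans, f_cnn):
        # f_trans, f_cnn: (B, C, H, W) features from the two encoders.
        x = self.proj(torch.cat([f_trans, f_cnn], dim=1))
        x = x * self.channel(x)  # re-weight channels
        stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial(stats)  # re-weight spatial positions

# Example: fuse 64-channel features from both branches.
fuse = ChannelSpatialFusion(dim=64)
out = fuse(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The reduction ratio of 4 and the 7x7 spatial kernel are common defaults for modules of this kind; the paper may use different settings.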
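Likewise, the axial enhanced Transformer encoder is described only at a high level (attention along the height and width dimensions). Below is a minimal PyTorch sketch of axial self-attention, assuming standard multi-head attention applied once per column and once per row; the names `AxialAttention` and `AxialEnhancedBlock` are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Self-attention along a single spatial axis ('h' or 'w')."""
    def __init__(self, dim, heads=4, axis="h"):
        super().__init__()
        assert axis in ("h", "w")
        self.axis = axis
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, H, W). Each column (axis='h') or row (axis='w')
        # becomes an independent token sequence for multi-head attention.
        b, c, h, w = x.shape
        if self.axis == "h":
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out, _ = self.attn(seq, seq, seq)
        if self.axis == "h":
            return out.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

class AxialEnhancedBlock(nn.Module):
    """Height-axis then width-axis attention with residual connections."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.h_attn = AxialAttention(dim, heads, axis="h")
        self.w_attn = AxialAttention(dim, heads, axis="w")

    def forward(self, x):
        x = x + self.h_attn(x)  # correlate pixels within each column
        x = x + self.w_attn(x)  # correlate pixels within each row
        return x

# Example: dim must be divisible by the number of heads.
block = AxialEnhancedBlock(dim=64, heads=4)
y = block(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Splitting full 2D attention into height-axis and width-axis passes reduces the cost from O((HW)^2) to O(HW(H+W)) while still letting any pixel influence any other within two steps, which is the usual motivation for axial designs.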