Combination of Axial-Enhanced Transformer and CNN Network for Medical Image Segmentation
Graphical Abstract
Abstract
Hybrid models that combine the Swin Transformer with a CNN have proven effective for medical image segmentation. However, a semantic gap exists between the features extracted by the two networks within such hybrid models, so directly fusing these features yields unsatisfactory segmentation accuracy. Moreover, the Swin Transformer lacks pixel-level modeling capability within patches. To address these challenges, we propose a novel medical image segmentation method that integrates an axial-enhanced Transformer and a CNN in a dual-encoder architecture. To bridge the semantic gap between features, our method introduces a new feature fusion module in the encoding stage. In addition, we employ cross-fusion together with spatial-channel attention and cross-domain enhancement modules to effectively merge the features extracted by the two networks. These measures ensure that the fused features remain semantically consistent and effective, thereby enhancing the model's expressiveness. To address the Swin Transformer's limited pixel-level modeling ability, an axial-enhanced Transformer encoder captures correlations between pixels along both the height and width dimensions, significantly improving the model's pixel-level modeling capability and, in turn, its segmentation accuracy. Experiments are conducted on four medical image datasets, namely GlaS, MoNuSeg, JSRT, and ISIC2018, comparing our model with several mainstream segmentation models. The experimental results show that the proposed model achieves the best Dice, IoU, precision, and recall across these diverse datasets, and that it can be applied to segmenting a wide range of medical images.
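To illustrate the axial mechanism the abstract refers to, the sketch below shows generic axial self-attention in plain NumPy: attention is computed independently along the height axis and then along the width axis, so each pixel attends to its row and column rather than to all pixels. This is a minimal illustration of the general technique only; the function names, shapes, and single-head form are assumptions for clarity and do not reproduce the paper's actual encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, Wq, Wk, Wv, axis):
    """Single-head self-attention restricted to one spatial axis.

    x: (H, W, C) feature map; Wq/Wk/Wv: (C, C) projections (hypothetical names).
    axis=0 attends along height (each column independently);
    axis=1 attends along width (each row independently).
    """
    if axis == 0:
        x = x.transpose(1, 0, 2)  # (W, H, C): attend over H within each column
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (rows, L, L)
    out = softmax(scores, axis=-1) @ v
    if axis == 0:
        out = out.transpose(1, 0, 2)  # back to (H, W, C)
    return out

# Height-then-width axial attention covers row and column context at
# O(HW(H+W)) cost instead of the O((HW)^2) cost of full 2-D attention.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 16
x = rng.standard_normal((H, W, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
y = axial_attention(axial_attention(x, Wq, Wk, Wv, axis=0), Wq, Wk, Wv, axis=1)
print(y.shape)
```

A useful property of the width-axis pass is that each output row depends only on its own input row, which is what makes the per-axis cost linear in the attended dimension.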