Abstract:
Self-supervised learning (SSL) can capture generic knowledge about different concepts, making it beneficial for a variety of downstream image analysis tasks. To address the underutilization of multi-modal features in existing self-supervised learning methods for medical images, we propose a self-supervised learning method that exploits multi-modal complementary information, named SLeM. The method first uniformly divides the four modalities into four blocks, randomly combines these blocks to construct multi-modal images, and assigns a distinct label to each combination, so that multi-modal feature representations can be learned through a classification task. The learned multi-modal features are then passed through a contextual fusion block (CFB), which extracts features from tumors of various sizes. Finally, the learned representation is transferred to the downstream multi-modal medical image segmentation task via simple fine-tuning. Experiments on the public BraTS and CHAOS datasets compare the proposed method with multiple baselines, including methods based on JiGen, Taleb, and Supervoxel. The results show that the segmentation accuracy for whole tumor, tumor core, and enhancing tumor improves by 2.03, 3.92, and 1.75 percentage points, respectively, and the qualitative segmentation results are also noticeably better than those of the compared methods.
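To make the pretext task described above concrete, the following minimal sketch shows one plausible way a modality-block recombination sample and its classification label could be constructed. This is an illustration only, not the authors' implementation: the 2D input shape, the 2x2 block grid, the permutation-based labeling, and all function names are assumptions.

```python
# Hypothetical sketch of a modality-block recombination pretext task
# (illustrative only; not the paper's implementation).
import itertools
import numpy as np

MODALITIES = 4          # e.g. T1, T1ce, T2, FLAIR
PERMUTATIONS = list(itertools.permutations(range(MODALITIES)))  # 4! = 24 classes

def make_pretext_sample(volume, rng):
    """volume: (4, H, W) array, one channel per modality.
    Returns (recombined, label), where each of the 4 spatial blocks is
    filled from a different modality according to a random permutation,
    and the label is the index of that permutation."""
    _, h, w = volume.shape
    label = int(rng.integers(len(PERMUTATIONS)))
    perm = PERMUTATIONS[label]
    recombined = np.empty((h, w), dtype=volume.dtype)
    # Split the image uniformly into a 2x2 grid of blocks.
    blocks = [(slice(0, h // 2), slice(0, w // 2)),
              (slice(0, h // 2), slice(w // 2, w)),
              (slice(h // 2, h), slice(0, w // 2)),
              (slice(h // 2, h), slice(w // 2, w))]
    for block, modality in zip(blocks, perm):
        recombined[block] = volume[modality][block]
    return recombined, label

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128, 128)).astype(np.float32)
sample, y = make_pretext_sample(x, rng)  # a classifier is then trained to predict y
```

Training an encoder to recognize which recombination was applied forces it to attend to modality-specific appearance in every spatial region, which is the intuition behind learning multi-modal feature representations from this classification task.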