Semantic segmentation is usually formulated as a pixel-level classification task, whereas the MaskFormer model, which combines a convolutional neural network with a Transformer, reformulates it as a mask-level classification task. To address poor deformation modeling, blurred object contours, and slow convergence, an adaptive road-scene semantic segmentation network with refined multi-scale perception and optimized contours is proposed. First, a bottleneck structure that stacks standard and deformable convolutions is used in the encoder to improve the network's deformation modeling ability. Then, a feature refinement module in the decoder filters out irrelevant features, further improving the decoding ability of the feature pyramid network. To address the pixel misalignment of up-sampled features when the feature pyramid network performs multi-level feature fusion, a feature calibration module is introduced to sharpen the segmentation of object contours. Finally, the Miti-DETR decoder is employed in the Transformer module to accelerate training and improve segmentation accuracy. Experimental results show that the proposed network outperforms existing semantic segmentation models on the Cityscapes and Mapillary Vistas datasets.