To address the poor depth predictions that existing monocular depth estimation methods produce in regions of complex texture, we propose a self-supervised monocular depth estimation method based on visual attention. First, the method fuses multi-scale source images as input to the encoder, enabling better fusion of multi-level features. Second, parallel intermediate attention modules that interact across regions model semantic dependencies along the spatial and channel dimensions respectively, yielding rich contextual information. In addition, successive external-attention feature aggregation modules form the decoder, which exploits this contextual information to alleviate mispredictions in complex regions. Experiments on the KITTI and Cityscapes datasets show that our method outperforms current mainstream methods and predicts depth more accurately in complex texture regions; on KITTI, it achieves RMS and RMSlog of 4.486 and 0.181, respectively.
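For intuition, the external-attention operation underlying the decoder's feature aggregation can be sketched as follows. This is a minimal NumPy illustration of generic external attention (queries scored against two small learnable memories with double normalization); the memory size, normalization details, and function names are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def external_attention(x, mk, mv):
    """Illustrative external attention (not the paper's exact module).
    x:  (n, d) input features  (n spatial positions, d channels)
    mk: (s, d) key memory, mv: (s, d) value memory  (s memory slots)
    """
    attn = x @ mk.T                                   # (n, s) similarity to memory keys
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over memory slots
    attn /= attn.sum(axis=0, keepdims=True) + 1e-9    # second normalization over positions
    return attn @ mv                                  # (n, d) aggregated context features

# Hypothetical toy shapes for demonstration only.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))    # 6 positions, 4 channels
mk = rng.standard_normal((3, 4))   # 3 memory slots
mv = rng.standard_normal((3, 4))
out = external_attention(x, mk, mv)
print(out.shape)  # (6, 4)
```

Because the memories are small and shared across all positions, the cost is linear in the number of positions, which is one reason external attention suits dense-prediction decoders.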