Abstract:
Depth estimation from a single image is critical in applications such as robot navigation and scene understanding, and it remains a challenging problem in computer vision. To address the inaccuracy of single-image depth estimation, we propose a method based on the Vision Transformer (ViT). First, a pre-trained DenseNet downsamples the image, and the resulting features are encoded into sequences suitable for ViT. Then, a densely connected ViT processes the global context information, and the feature sequence is reassembled into high-dimensional feature maps. Finally, the feature maps are upsampled to obtain a complete depth image. We conduct comparative experiments against other depth estimation methods on the NYU V2 dataset, as well as ablation experiments on the network structure, and quantitatively analyze the average relative error (REL), root mean square error (RMS), and other error metrics. The results show that the method generates high-quality depth images with rich detail from a single image. Compared with traditional encoder-decoder methods, the proposed method increases the peak signal-to-noise ratio (PSNR) by 1.052 dB on average, reduces REL by 7.7% to 21.8%, and reduces RMS by 5.6% to 16.9%.
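For readers unfamiliar with the reported metrics, the following is a minimal sketch of how REL, RMS, and PSNR are commonly computed for depth maps. It assumes the standard definitions (the abstract does not state the exact formulas used), treats a depth map as a flat sequence of strictly positive ground-truth values, and uses hypothetical function names and a hypothetical `max_val` parameter for the peak depth value.

```python
import math


def rel(pred, gt):
    """Average relative error: mean(|pred - gt| / gt). Assumes every gt value is > 0."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rms(pred, gt):
    """Root mean square error: sqrt(mean((pred - gt)^2))."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def psnr(pred, gt, max_val):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE).

    max_val is an assumed peak depth value chosen by the caller.
    """
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    if mse == 0:
        return math.inf  # identical maps: PSNR is unbounded
    return 10 * math.log10(max_val ** 2 / mse)
```

Lower REL and RMS indicate a prediction closer to the ground truth, while a higher PSNR indicates a better reconstruction, which matches the direction of the improvements reported above.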