
A High-quality Depth Estimation Method for Single Image

  • Abstract: Depth estimation from a single image is a key task in robot navigation, scene understanding, and related fields, and remains a complex problem in computer vision. To address inaccurate single-image depth estimation, we propose a depth estimation method based on the Vision Transformer (ViT). First, a pre-trained DenseNet downsamples the input image and encodes the features into a sequence suitable for the ViT. Then, a densely connected ViT processes global context information, and the feature sequence is reassembled into high-dimensional feature maps. Finally, RefineNet upsamples these maps to produce a complete depth image. We conduct comparative experiments against recent depth estimation methods on the NYU V2 dataset, run ablation experiments on the network structure, and quantitatively analyze the mean relative error, root mean square error, and other metrics. The results show that the proposed method generates high-quality depth images with rich detail from a single image; compared with the traditional encoder-decoder approach, its PSNR is higher by 1.052 dB on average, its mean relative error is lower by 7.7%~21.8%, and its root mean square error is lower by 5.6%~16.9%.
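The abstract describes encoding CNN features into a token sequence for the ViT and later reassembling the sequence into spatial feature maps for the decoder. A minimal NumPy sketch of that encode/reassemble round trip (the shapes and variable names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Toy feature map from a CNN backbone (e.g., a DenseNet encoder):
# C channels over an H x W spatial grid; real models use larger sizes.
C, H, W = 8, 4, 4
feature_map = np.random.rand(C, H, W)

# Encode into a ViT-style token sequence: one token per spatial
# position, each token a C-dimensional feature vector.
tokens = feature_map.reshape(C, H * W).T      # shape (H*W, C)

# ... a transformer block would process `tokens` here ...

# Reassemble the sequence back into a spatial feature map so a
# convolutional decoder (e.g., RefineNet) can upsample it.
reassembled = tokens.T.reshape(C, H, W)

assert np.allclose(reassembled, feature_map)  # lossless round trip
```

In a real model the transformer changes the token contents but not the sequence layout, so the same reshape recovers a spatial map for the decoder.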


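The mean relative error (REL), root mean square error (RMSE), and PSNR quoted in the abstract are standard depth-estimation metrics. A minimal NumPy sketch of how they are computed; the maximum depth used as the PSNR peak value is an assumption for illustration:

```python
import numpy as np

def depth_metrics(pred, gt, max_val=10.0):
    """Compute REL, RMSE, and PSNR between predicted and ground-truth depth.

    pred, gt: arrays of depth values (e.g., in meters);
    max_val: assumed maximum depth, used as the PSNR peak value.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    rel = np.mean(np.abs(pred - gt) / gt)       # mean relative error
    mse = np.mean((pred - gt) ** 2)
    rmse = np.sqrt(mse)                         # root mean square error
    psnr = 10.0 * np.log10(max_val**2 / mse)    # peak signal-to-noise ratio
    return rel, rmse, psnr

# Example: a prediction off by 0.1 m everywhere on a flat 2 m scene.
rel, rmse, psnr = depth_metrics(np.full((4, 4), 2.1), np.full((4, 4), 2.0))
# rel = 0.05, rmse = 0.1, psnr = 40 dB
```

Higher PSNR and lower REL/RMSE indicate a closer match to the ground-truth depth, which is the direction of the improvements reported in the abstract.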
