What Have the Single Image Based Depth Prediction Models Learnt?
Abstract: Deep-learning-based depth estimation from a single image has recently made remarkable progress, and high prediction accuracy has been reported on public indoor and outdoor benchmark datasets. However, what have such models, whether trained in a supervised or a self-supervised framework, actually learnt from single images that lets them predict depth so well? This fundamental question appears to have been rarely discussed in the literature so far. In this work, we test and analyze it quantitatively from two aspects. First, for texture-poor or textureless regions, we examine the relationship between the depths predicted in these regions and those predicted in their immediate neighborhood, to test whether the former are a kind of filling-in of the latter. Second, we assess whether regions of high visual saliency tend to receive more accurate depth predictions. Our results show that the distributions of predicted depths in texture-poor regions and in their neighborhoods are indeed highly similar, whereas the accuracy of current depth estimation models is not strongly correlated with the visual saliency of the input image. These findings may serve as a reference for both the analysis and the improvement of depth estimation models. For example, future model design and training could explicitly take the visual saliency of the input image into account, to improve prediction accuracy in highly salient regions and thereby better serve downstream vision tasks.
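The abstract does not state how texture-poor regions, their neighborhoods, or saliency maps are obtained, nor which similarity measure or error metric is used. The following is a minimal sketch of the two tests under explicitly assumed choices, not the authors' protocol: a binary mask for the texture-poor region is given; its neighborhood is the ring obtained by a square dilation of radius `radius`; distribution similarity is measured by histogram intersection; accuracy is the common AbsRel metric (mean of |d_pred - d_gt| / d_gt, assuming d_gt > 0); and a pixel counts as highly salient when its saliency score is at least a threshold `tau`. All function names and parameters are hypothetical, and plain Python lists are used so the sketch has no dependencies.

```python
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[bool]]


def neighborhood_ring(region: Mask, radius: int = 2) -> Mask:
    """Pixels within `radius` (Chebyshev distance) of the region but outside it.

    Assumed neighborhood definition: square dilation of `region` minus the region.
    """
    h, w = len(region), len(region[0])
    ring = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if region[y][x]:
                continue
            # Mark (y, x) if any region pixel lies inside its square window.
            ring[y][x] = any(
                region[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
    return ring


def masked_values(grid: Grid, mask: Mask) -> List[float]:
    """Flatten the values of `grid` where `mask` is True (row-major order)."""
    return [v for row, mrow in zip(grid, mask) for v, m in zip(row, mrow) if m]


def histogram_intersection(a: List[float], b: List[float], bins: int = 10) -> float:
    """Overlap in [0, 1] of two normalized histograms with shared bin edges.

    Assumed similarity measure; both inputs must be non-empty.
    """
    lo, hi = min(a + b), max(a + b)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def hist(vals: List[float]) -> List[float]:
        counts = [0] * bins
        for v in vals:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(vals) for c in counts]

    return sum(min(p, q) for p, q in zip(hist(a), hist(b)))


def abs_rel(pred: List[float], gt: List[float]) -> float:
    """Mean absolute relative error; assumes every ground-truth depth is > 0."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def abs_rel_by_saliency(pred: Grid, gt: Grid, saliency: Grid,
                        tau: float = 0.5) -> Tuple[float, float]:
    """AbsRel over high-saliency (>= tau) and low-saliency (< tau) pixels."""
    high = [[s >= tau for s in row] for row in saliency]
    low = [[not m for m in row] for row in high]
    return (abs_rel(masked_values(pred, high), masked_values(gt, high)),
            abs_rel(masked_values(pred, low), masked_values(gt, low)))
```

Under these assumptions, a histogram intersection close to 1 between the predicted depths inside a textureless region and those in its ring would be consistent with the filling-in hypothesis, while similar AbsRel values for the high- and low-saliency groups would be consistent with a weak link between saliency and accuracy. A real study would also need to choose how the region masks and saliency maps are computed, which this sketch leaves open.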