Monocular Depth Estimation Based on Audio-Visual Fusion Coupled with Coordinate Self-Attention
Abstract: Motivated by the fact that both monocular images and sound echo signals contain spatial information, this paper proposes a monocular depth estimation method based on audio-visual fusion. First, echo and material features are fused and analyzed by a pooling pyramid module to adaptively estimate the discrete depth values of the monocular image. Next, the monocular image is encoded by a network that combines a convolutional neural network with a Transformer, and a coordinate self-attention module, proposed as an improvement of coordinate attention, decodes the image features into a probability distribution over the discrete depth values. Finally, the depth of each pixel is modeled as the expectation of the discrete depth values to construct the final depth map. Experimental results show that on the simulated Replica and Matterport3D datasets, the proposed method achieves root mean square errors of 0.204 and 0.875 and relative errors of 0.095 and 0.161, respectively, both competitive results. Experiments on real and noisy data further show that the method is applicable to depth estimation in real scenes.
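As a concrete illustration of the final step described above, the sketch below shows how a per-pixel probability distribution over discrete depth values can be turned into a depth map by taking the expectation. This is a minimal sketch under assumed conventions, not the paper's implementation: the tensor layout (batch × bins × height × width), the function `expected_depth`, and the stand-in `depth_bins` tensor are all hypothetical, and the network that produces the logits and the adaptive bin estimation are not reproduced here.

```python
import torch

def expected_depth(bin_logits: torch.Tensor, depth_bins: torch.Tensor) -> torch.Tensor:
    """Convert per-pixel logits over discrete depth bins into a depth map.

    bin_logits: (B, K, H, W) scores over K discrete depth values per pixel
                (assumed output of the decoder).
    depth_bins: (B, K) discrete depth values (assumed to be the adaptively
                estimated bins).
    Returns:    (B, 1, H, W) depth map; each pixel is the expectation of the bins.
    """
    probs = torch.softmax(bin_logits, dim=1)          # per-pixel probability distribution
    bins = depth_bins.view(*depth_bins.shape, 1, 1)   # (B, K, 1, 1) for broadcasting
    return (probs * bins).sum(dim=1, keepdim=True)    # expectation over the K bins

# Example with hypothetical shapes: 1 image, 64 bins, 128x128 resolution.
logits = torch.randn(1, 64, 128, 128)
bins = torch.linspace(0.1, 10.0, 64).unsqueeze(0)     # stand-in for estimated bins
print(expected_depth(logits, bins).shape)             # torch.Size([1, 1, 128, 128])
```

Taking the expectation rather than the arg-max bin keeps the depth prediction continuous and differentiable, which is what allows the discrete formulation to be trained end to end.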