Monocular Depth Estimation with Audio-Visual Fusion Coupled with Coordinate Self-Attention

马存良; 蒲江川; 许春冬; 易见兵; 嘉明珍

doi:10.3724/SP.J.1089.null.2023-00044

Research on Monocular Depth Estimation Based on Audio-visual Fusion Coupling Improved Coordinate Self-Attention in Indoor Environment

Cunliang MA, Jiangchuan PU, Chundong XU, Jianbing YI, MingZhen JIA

Abstract

Monocular images and sound echo signals both carry spatial information. Exploiting this property, this paper proposes a monocular depth estimation method based on audio-visual fusion. The method comprises a depth interval classification network, which fuses the sound echo signal with material features, and a depth interval probability distribution estimation network, which analyzes the monocular image. The final depth map is constructed as a linear combination of the depth interval classification and the probability distribution. A pyramid pooling module couples the sound echo and material information. Depth interval probability distribution prediction adopts an encoder-decoder structure. To extract local and global information effectively, the encoder combines a convolutional neural network with a Transformer. In the decoder, the coordinate attention module is improved to obtain a coordinate self-attention module. Experimental results show that the audio-visual multimodal fusion analysis achieves competitive results on both the Replica and Matterport3D datasets. Ablation experiments indicate that the proposed coordinate self-attention module improves the quality of depth estimation.
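The abstract does not give the formula for the linear combination of depth intervals and probability distributions. A minimal sketch of one plausible reading follows, assuming the audio branch yields a set of interval widths and the image branch yields a per-pixel probability over those intervals. The function names `bin_centers` and `combine`, the depth range, and all numeric values are hypothetical and chosen only for illustration.

```python
def bin_centers(widths, d_min, d_max):
    """Turn (possibly unnormalized) interval widths into interval centers.

    Assumption: the widths partition the depth range [d_min, d_max].
    """
    total = sum(widths)
    edges = [d_min]
    for w in widths:
        edges.append(edges[-1] + (d_max - d_min) * w / total)
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


def combine(centers, probs):
    """Per-pixel depth as the probability-weighted sum of interval centers.

    probs[y][x] is a list of K probabilities (summing to 1) for one pixel.
    """
    return [[sum(c * p for c, p in zip(centers, pixel)) for pixel in row]
            for row in probs]


# Hypothetical values: 4 intervals over a 0-10 m range and a 1x2 "image".
centers = bin_centers([1, 2, 3, 4], d_min=0.0, d_max=10.0)
depth = combine(centers, [[[1.0, 0.0, 0.0, 0.0],
                           [0.25, 0.25, 0.25, 0.25]]])
```

Here the first pixel puts all its probability on the nearest interval, so its depth equals that interval's center, while the second pixel spreads probability evenly and lands on the mean of the four centers.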
