Citation: Ma Cunliang, Pu Jiangchuan, Xu Chundong, Yi Jianbing, Jia Mingzhen. Monocular Depth Estimation Based on Audio-Visual Fusion Coupling Coordinate Self-Attention[J]. Journal of Computer-Aided Design & Computer Graphics, 2025, 37(2): 265-276. DOI: 10.3724/SP.J.1089.2023-00044

Monocular Depth Estimation Based on Audio-Visual Fusion Coupling Coordinate Self-Attention

  • Motivated by the fact that both monocular images and sound echo signals contain spatial information, this paper proposes a monocular depth estimation method based on audio-visual fusion. First, echo and material features are fused and analyzed by a pooling pyramid module to adaptively estimate the discrete depth values of the monocular image. Second, a method combining a convolutional neural network and a Transformer is used to encode the monocular image, and coordinate attention is improved into a coordinate self-attention module that decodes the image features to obtain the probability distribution over the discrete depth values. Finally, the depth value of each pixel is modeled as the expectation of the discrete depth values to construct the final depth map. Experimental results show that on the simulated Replica and Matterport3D datasets, the root mean square errors of the proposed method are 0.204 and 0.875, respectively, and the relative errors are 0.095 and 0.161, respectively, both competitive results. Experiments on real and noisy data further show that the method can perform monocular depth estimation in real scenes.
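The final step described in the abstract, forming the depth map as the expectation over discrete depth values, can be written compactly as follows. This is a minimal sketch with illustrative notation: the bin centers $c_k$, decoder scores $z_k$, and probabilities $p_k$ are assumed names, not necessarily those used in the paper.

$$
p_k(u,v) = \frac{\exp\!\big(z_k(u,v)\big)}{\sum_{j=1}^{K} \exp\!\big(z_j(u,v)\big)},
\qquad
\hat{d}(u,v) = \sum_{k=1}^{K} p_k(u,v)\, c_k,
$$

where $K$ is the number of adaptively estimated discrete depth values, $p_k(u,v)$ is the probability assigned to bin $k$ at pixel $(u,v)$, and $\hat{d}(u,v)$ is the per-pixel expectation used to build the final depth map.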
