Local Correspondence Aware Method for Cross-Modal Learning on Point Clouds

  • Abstract: To address the insufficient exploration of feature complementarity and correlation in cross-modal learning, this paper proposes a local correspondence aware method for cross-modal learning on point clouds. Within a dual-channel learning framework, a local correspondence aware module is designed: an image semantic guidance matrix is constructed to learn the fine-grained correlation between point cloud features and image features, and attention weighting is applied to enhance the expressiveness of the point cloud features. A residual mechanism is then introduced for semantic feature compensation, effectively strengthening the semantic guidance in cross-modal feature learning. A cross-modal self-supervised contrastive learning strategy is further proposed, combining 3D point cloud contrastive learning with 2D image semantic guidance to establish both inter-modal and intra-modal fine-grained feature associations and improve the adaptability of feature representation learning. Finally, the model is reconstructed in the feature spaces of both images and point clouds, and the joint optimization of reconstruction loss, contrastive loss, and cross-modal consistency loss significantly improves the learning performance of the network. Experimental results on the public ShapeNet and ModelNet datasets show that the proposed method improves information interaction in cross-modal learning and enhances the robustness of point cloud feature learning. Under linear probing evaluation, the method achieves 91.61% accuracy on 3D shape classification and 86.4% on segmentation, outperforming the baseline methods by 5.37 and 1.2 percentage points on average, respectively.
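
The listing below is a minimal PyTorch sketch of the two ingredients summarized in the abstract: a local correspondence aware block that builds an image semantic guidance matrix, applies attention weighting, and adds a residual compensation path, together with a joint objective combining reconstruction, contrastive, and cross-modal consistency losses. All names, tensor shapes, loss weights, and the temperature are illustrative assumptions made for exposition, not the authors' released implementation.

    # Minimal, illustrative sketch (PyTorch); module/loss names and shapes are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class LocalCorrespondenceAware(nn.Module):
        # Hypothetical local correspondence aware block: an image semantic guidance
        # matrix scores point-to-image-token affinity, attention weighting gathers
        # image semantics per point, and a residual path compensates the original
        # point features.
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)   # point-side projection
            self.k = nn.Linear(dim, dim)   # image-side projection
            self.v = nn.Linear(dim, dim)
            self.out = nn.Linear(dim, dim)

        def forward(self, point_feat, image_feat):
            # point_feat: (B, N, C) per-point features; image_feat: (B, M, C) image tokens.
            guide = torch.einsum("bnc,bmc->bnm", self.q(point_feat), self.k(image_feat))
            guide = F.softmax(guide / point_feat.shape[-1] ** 0.5, dim=-1)   # guidance matrix (B, N, M)
            gathered = torch.einsum("bnm,bmc->bnc", guide, self.v(image_feat))  # attention-weighted semantics
            return point_feat + self.out(gathered)   # residual semantic compensation


    def joint_loss(rec_loss, point_emb, image_emb, temperature=0.07,
                   w_rec=1.0, w_con=1.0, w_cmc=1.0):
        # Joint objective: reconstruction + symmetric cross-modal InfoNCE + consistency.
        # point_emb, image_emb: (B, C) global embeddings of paired point clouds and images.
        point_emb = F.normalize(point_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        logits = point_emb @ image_emb.t() / temperature
        labels = torch.arange(point_emb.shape[0], device=point_emb.device)
        contrastive = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
        consistency = F.mse_loss(point_emb, image_emb)
        return w_rec * rec_loss + w_con * contrastive + w_cmc * consistency

For instance, LocalCorrespondenceAware(dim=256) applied to point features of shape (2, 1024, 256) and image tokens of shape (2, 196, 256) returns enhanced point features of the same shape; the intra-modal contrastive term and the exact consistency formulation used in the paper would depend on the full method description.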

     
