Abstract:
Existing fine-grained 3D shape classification methods often focus on enhancing fine-grained feature extraction within individual views while neglecting inter-view feature dependencies and the effective fusion of multi-granular features. To address these limitations, we propose MSGFormer, a fine-grained 3D shape classification network based on cross-view message interaction. First, self-attention over the local patch tokens of each view enables local-region interaction and per-view feature extraction. Then, cross-view message-token interaction allows information to flow between views and captures interactive features. A local patch-token selection strategy picks out the locally dominant features, explicitly highlighting fine-grained local details. Finally, the global view features, interactive features, and locally dominant features are fused and enhanced to perform fine-grained 3D shape classification. On three subsets of the fine-grained classification dataset FG3D (Airplane, Car, and Chair), the proposed method achieves overall accuracies of 97.40%, 80.30%, and 85.70%, respectively; on the meta-category classification dataset ModelNet40, it achieves an overall accuracy of 97.81%. These results surpass those of the compared methods, demonstrating the proposed network's strong fine-grained classification performance and generalization ability.
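To make the pipeline described above more concrete, the following is a minimal PyTorch-style sketch (not the authors' implementation) of one block combining intra-view self-attention over patch tokens, cross-view interaction among per-view message tokens, and top-k patch-token selection. All module names, tensor shapes, and the `top_k` parameter are illustrative assumptions.

```python
# Hypothetical sketch: cross-view message-token interaction and patch-token selection.
# Assumed shapes: B batches, V views, N patch tokens per view, D channels.
import torch
import torch.nn as nn


class CrossViewMessageBlock(nn.Module):
    """One block: per-view self-attention over patch tokens, then attention
    among the views' message tokens so information flows between views."""

    def __init__(self, dim: int = 256, heads: int = 4, top_k: int = 8):
        super().__init__()
        self.intra_view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.top_k = top_k  # number of locally dominant patch tokens kept per view

    def forward(self, patch_tokens, message_tokens):
        # patch_tokens:   (B, V, N, D)  local patch tokens of each view
        # message_tokens: (B, V, D)     one message token per view
        B, V, N, D = patch_tokens.shape

        # 1) Intra-view: each view's message token and patch tokens attend jointly.
        x = torch.cat([message_tokens.unsqueeze(2), patch_tokens], dim=2)  # (B, V, N+1, D)
        x = x.reshape(B * V, N + 1, D)
        x, _ = self.intra_view_attn(x, x, x)
        x = x.reshape(B, V, N + 1, D)
        message_tokens, patch_tokens = x[:, :, 0], x[:, :, 1:]

        # 2) Cross-view: message tokens from all views attend to one another,
        #    exchanging information between views.
        message_tokens, _ = self.cross_view_attn(
            message_tokens, message_tokens, message_tokens
        )

        # 3) Patch-token selection: keep the top-k patches per view, scored by
        #    similarity to that view's (now cross-view-aware) message token.
        scores = (patch_tokens * message_tokens.unsqueeze(2)).sum(-1)  # (B, V, N)
        idx = scores.topk(self.top_k, dim=-1).indices                  # (B, V, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, D)
        dominant = torch.gather(patch_tokens, 2, idx)                  # (B, V, k, D)

        return patch_tokens, message_tokens, dominant


if __name__ == "__main__":
    block = CrossViewMessageBlock(dim=256, heads=4, top_k=8)
    patches = torch.randn(2, 12, 49, 256)   # e.g. 12 views, 7x7 patches each
    messages = torch.randn(2, 12, 256)
    p, m, dom = block(patches, messages)
    print(p.shape, m.shape, dom.shape)      # (2,12,49,256) (2,12,256) (2,12,8,256)
```

In a full model, the per-view message tokens (global view and interactive features) and the selected dominant patch tokens would then be fused for the final fine-grained classification head; the scoring and fusion details here are placeholders rather than the paper's exact design.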