To address the lack of spatiotemporal feature relationship modeling in existing action recognition methods, an action recognition method based on feature interaction and clustering is proposed. First, a mixed multi-scale feature extraction network is designed to extract spatial and temporal features from consecutive frames. Second, a feature interaction module based on the non-local operation is designed to enable interaction between the spatial and temporal features. Finally, a hard sample selection strategy based on the triplet loss function is designed to train the recognition network, which clusters the spatiotemporal features and improves their robustness and discriminability. Experimental results show that, compared with TSN, the accuracy on the UCF101 dataset increases by 23.25 percentage points to 94.82%, and the accuracy on the HMDB51 dataset increases by 20.27 percentage points to 44.03%.
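
As a rough illustration of the feature interaction and clustering steps, the following minimal sketch shows a non-local style interaction between spatial and temporal feature vectors, and a triplet loss with hard sample selection. It rests on several assumptions that the abstract does not confirm: the function names `nonlocal_interaction` and `batch_hard_triplet_loss` are hypothetical, the interaction uses a softmax over dot products with identity embeddings and a residual connection (the learned projections of a real non-local block are omitted), the selection rule is the common "batch-hard" rule (farthest positive, closest negative per anchor), and the margin of 0.3 is arbitrary.

```python
import math


def nonlocal_interaction(spatial, temporal):
    """Let every spatial feature attend to all temporal features.

    Assumption: dot-product similarity with a softmax, identity embeddings,
    and a residual connection. Learned projections are left out.
    """
    fused = []
    for query in spatial:
        # Pairwise similarity between one spatial vector and every temporal vector.
        scores = [sum(q * k for q, k in zip(query, key)) for key in temporal]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Softmax-weighted sum of temporal vectors, one dimension at a time.
        attended = [
            sum(w * key[d] for w, key in zip(weights, temporal)) / total
            for d in range(len(query))
        ]
        # Residual connection: keep the original feature and add the response.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Triplet loss using the hardest positive and hardest negative per anchor.

    Assumption: batch-hard mining stands in for the hard sample selection
    strategy; anchors without a positive or a negative in the batch are skipped.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, other)
            for j, (other, other_label) in enumerate(zip(embeddings, labels))
            if other_label == label and j != i
        ]
        negatives = [
            euclidean(anchor, other)
            for other, other_label in zip(embeddings, labels)
            if other_label != label
        ]
        if not positives or not negatives:
            continue
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

On well-separated toy clusters, for example `batch_hard_triplet_loss([[0.0], [0.1], [5.0], [5.1]], [0, 0, 1, 1])`, the loss is zero. It becomes positive once a sample of another class lies closer to an anchor than the anchor's farthest same-class sample plus the margin, which is the pressure that pulls same-class features into clusters.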