Abstract:
In current RGB-thermal-infrared (RGB-T) video tracking methods, the bounding box cannot accurately describe the target shape, so parameter training does not focus fully on the target area. In terms of feature representation, single-layer deep learning features have difficulty balancing category semantic information against spatial structure information. Therefore, this article proposes an RGB-T tracking algorithm with salient content perception and deep feature fusion. First, saliency maps of the target are extracted from the two modalities, the visible spectrum and the thermal-infrared spectrum, and then fused. Second, the fused saliency map is used to optimize the weighting coefficient map of the spatial regularization term, which increases the influence of training samples in the salient content region on classifier training. Finally, a pre-trained convolutional neural network is used to extract multi-layer features from the two modalities. These features carry rich semantic category and spatial structure information, and they are fused at the response level. Experimental comparisons with existing tracking algorithms on the two RGB-T tracking datasets GTOT and RGBT210 demonstrate the effectiveness of the proposed algorithm. It achieves precision rates of 88.4% and 72.7% and success rates of 71.9% and 51.0% on GTOT and RGBT210, respectively.
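
The three steps above can be illustrated with a minimal sketch. It is not the implementation used in this article: the convex-combination saliency fusion, the multiplicative weight modulation, the weighted-sum response fusion, and all function and parameter names (`fuse_saliency`, `modulate_weights`, `fuse_responses`, `alpha`, `strength`, `layer_weights`) are assumptions chosen only to make the data flow concrete. Maps are plain nested lists so the example needs no external libraries.

```python
def normalize(m):
    """Scale a 2-D map (list of rows) to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def fuse_saliency(sal_rgb, sal_tir, alpha=0.5):
    """Assumed fusion rule: convex combination of the normalized RGB and
    thermal saliency maps. alpha is a hypothetical mixing weight."""
    a, b = normalize(sal_rgb), normalize(sal_tir)
    return [[alpha * x + (1 - alpha) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]


def modulate_weights(reg_weights, fused_sal, strength=0.5):
    """Assumed modulation: w' = w * (1 - strength * s). The regularization
    penalty is lowered where the fused saliency s is high, so samples in
    the salient region contribute more to classifier training."""
    return [[w * (1.0 - strength * s) for w, s in zip(rw, rs)]
            for rw, rs in zip(reg_weights, fused_sal)]


def fuse_responses(responses, layer_weights):
    """Assumed response-level fusion: weighted sum of per-layer, per-modality
    response maps, followed by a search for the peak (row, col)."""
    h, w = len(responses[0]), len(responses[0][0])
    fused = [[sum(lw * r[i][j] for lw, r in zip(layer_weights, responses))
              for j in range(w)] for i in range(h)]
    peak = max(((i, j) for i in range(h) for j in range(w)),
               key=lambda p: fused[p[0]][p[1]])
    return fused, peak


# Toy 2x2 example with made-up values.
sal = fuse_saliency([[0, 2], [4, 8]], [[1, 1], [1, 3]])
weights = modulate_weights([[2.0, 2.0], [2.0, 2.0]], sal)
fused, peak = fuse_responses([[[0, 1], [0, 0]], [[0, 0], [0, 3]]], [0.5, 0.5])
```

In this toy case the bottom-right cell is the most salient in both modalities, so its regularization weight is reduced the most, and the fused response peaks at the same location.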