In the past few years, many frameworks based on convolutional neural networks have been proposed for image splicing forgery detection. However, most existing algorithms cannot achieve satisfactory performance when tampered areas vary in size, especially for large-scale objects. To obtain accurate forgery localization, a hybrid Transformer architecture that integrates both self-attention and cross-attention into U2-Net is proposed for image splicing forgery detection. Specifically, self-attention is applied at the last block of the encoder to capture long-range semantic dependencies, so that the network can locate large-scale tampered areas more completely. Meanwhile, in the skip connections, a cross-attention module is designed to enhance the low-level feature maps with the guidance of high-level semantic information, filter out non-semantic features, and achieve more refined spatial recovery. The hybrid model thus combines the advantages of self- and cross-attention from the Transformer and can capture more context information and spatial dependencies at different scales. In other words, by fusing convolution and Transformer components, the proposed method can locate spliced forgeries of various sizes without requiring pre-training on a large number of images. Compared with four traditional methods and six recent deep learning methods on the Casia2.0 and Columbia datasets, the proposed method achieves better performance.
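
The two attention roles described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head scaled dot-product form, the choice of low-level features as queries and high-level features as keys and values, the residual additions, and all names, shapes, and random weights are assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (hypothetical helper).

    query_feats:   (N_q, C_q) flattened feature map supplying the queries.
    context_feats: (N_c, C_c) flattened feature map supplying keys and values.
    Returns the attended features (N_q, d_v) and the weights (N_q, N_c).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # assumed projection width

# Assumed toy shapes: a 16x16 low-level map with 32 channels and a
# 4x4 high-level map with 64 channels, both flattened to (positions, channels).
low = rng.standard_normal((16 * 16, 32))
high = rng.standard_normal((4 * 4, 64))

# Cross-attention in a skip connection (assumption: low-level queries,
# high-level keys/values, values projected back to 32 channels, residual add).
cross_out, cross_w = attention(
    low, high,
    rng.standard_normal((32, d)) * 0.1,
    rng.standard_normal((64, d)) * 0.1,
    rng.standard_normal((64, 32)) * 0.1,
)
refined_low = low + cross_out

# Self-attention at the last encoder block: queries, keys, and values
# all come from the same high-level map.
self_out, self_w = attention(
    high, high,
    rng.standard_normal((64, d)) * 0.1,
    rng.standard_normal((64, d)) * 0.1,
    rng.standard_normal((64, 64)) * 0.1,
)
bottleneck = high + self_out

print(refined_low.shape, bottleneck.shape)
```

In this sketch every low-level position is reweighted by all high-level positions, which is one way semantic guidance could suppress non-semantic detail, while the self-attention step lets each bottleneck position attend to the whole map.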