To segment foreground objects of interest to the user quickly and accurately, and to obtain high-quality annotated segmentation data at low cost, an interactive image segmentation algorithm based on two-stage feature fusion and a Transformer encoder is proposed. First, a lightweight Transformer backbone network extracts multi-scale feature encodings from the input image, making better use of context information. Then, the user's subjective prior knowledge is introduced through click interaction, and the interaction features are fused into the Transformer network in two successive stages, a primary stage followed by an enhanced stage. Finally, atrous convolution, an attention mechanism and a multi-layer perceptron are combined to decode the feature maps produced by the backbone network. Experimental results show that the mNoC@90% values of the proposed algorithm on the GrabCut, Berkeley and DAVIS datasets reach 2.18, 4.04 and 7.39 respectively, better than those of the compared algorithms, and that its time and space complexity are lower than those of f-BRS-B. The proposed algorithm is also stable under perturbations of click position and click type. These results indicate that the proposed algorithm can segment objects of interest to the user quickly, accurately and stably, improving the user interaction experience.
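
As an illustration of the reported metric, the following is a minimal sketch of how an mNoC@90% value could be computed, assuming the usual reading of the metric: the mean number of clicks needed per image to reach an IoU of at least 0.90. The function names, the per-click IoU traces and the cap of 20 clicks are assumptions made for this sketch and are not taken from the abstract.

```python
from typing import Sequence


def noc_at_threshold(ious_per_click: Sequence[float],
                     threshold: float = 0.90,
                     max_clicks: int = 20) -> int:
    """Number of clicks needed for one image to reach the IoU threshold.

    ious_per_click[k - 1] is the IoU obtained after the k-th click.
    If the threshold is never reached within max_clicks, max_clicks is
    returned (the cap of 20 is an assumed convention, not from the source).
    """
    for k, iou in enumerate(ious_per_click[:max_clicks], start=1):
        if iou >= threshold:
            return k
    return max_clicks


def mean_noc(traces: Sequence[Sequence[float]],
             threshold: float = 0.90,
             max_clicks: int = 20) -> float:
    """Average the per-image click counts over a dataset (mNoC)."""
    counts = [noc_at_threshold(t, threshold, max_clicks) for t in traces]
    return sum(counts) / len(counts)


# Hypothetical IoU traces for three images (illustrative values only).
example_traces = [
    [0.70, 0.88, 0.93],  # reaches 0.90 on the 3rd click
    [0.95],              # reaches 0.90 on the 1st click
    [0.50, 0.60],        # never reaches 0.90, counted as max_clicks
]
example_mnoc = mean_noc(example_traces)
```

Under this definition a lower value is better, since it means fewer user clicks are needed to reach the target segmentation quality.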