Abstract:
Text image inpainting aims to reconstruct damaged textual structures so that the text can be accurately detected and recognized. Traditional methods based on deep convolutional networks or generative adversarial networks (GANs) often suffer from semantic distortion and stroke fragmentation when the damage is severe. In industrial scenarios, where defect regions are unpredictable and textual structures impose strict constraints, existing methods struggle to achieve pixel-level fidelity while maintaining semantic consistency. To address these limitations, a structure-semantics fusion-guided diffusion model (DiffTIN) is proposed. Specifically, a dual-stream guidance mechanism is designed: 1) in the structure reconstruction module, global text masks are predicted by an image segmentation network; 2) semantic priors generated by scene text recognizers are integrated to guide the diffusion process. The fused structure-semantics priors are coupled with the latent space representations of the diffusion model, and a progressive inpainting strategy is employed to hierarchically restore stroke details. Experimental results on the TII-ST dataset show that the proposed method improves word recognition accuracy by 1.15 percentage points and peak signal-to-noise ratio (PSNR) by 1.03 dB over baseline methods such as GSDM, improving the robustness of text recognition.
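For readers unfamiliar with the PSNR metric reported above, the following is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). It is not the evaluation code used for DiffTIN; the function names `psnr` and `mse_ratio_for_gain`, the flat pixel-list input, and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Return PSNR in dB between two equal-length flat pixel sequences.

    Assumption: pixels are given as flat sequences of numbers and the peak
    value is 255 (8-bit images). Identical inputs yield infinity.
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Return the factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10 ** (-gain_db / 10)


if __name__ == "__main__":
    # A fully inverted black/white pair gives MSE = 255^2, hence 0 dB.
    print(psnr([0, 0, 0, 0], [255, 255, 255, 255]))
    # MSE ratio implied by a 1.03 dB gain at a fixed peak value.
    print(round(mse_ratio_for_gain(1.03), 4))
```

Under this definition, and with the peak value held fixed, a 1.03 dB PSNR gain corresponds to a mean squared error that is roughly 21% lower than the baseline's.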