Abstract:
To address inaccurate image caption generation caused by the complex attribute information, high inter-class similarity, and weak correlation between semantic attributes and visual information in minority clothing images, a local attribute attention network for minority clothing image caption generation is proposed. First, a minority clothing image caption dataset is constructed, containing 55 categories and 30,000 images (about 3,600 MB); in addition, a vocabulary of 208 local key attributes and 30,089 text descriptions of minority clothing are defined. Visual features are extracted by a local attribute learning module together with text embeddings, and multi-instance learning is applied to obtain the local attributes. Then, an attention-aware module comprising semantic, visual, and gated attention is built on a two-layer long short-term memory (LSTM) network, and the caption generation results for minority clothing are refined by combining the local attributes, attribute-based visual features, and encoded text information. Experimental results on the constructed dataset show that the proposed method generates captions covering key attributes such as minority category and clothing style, and improves the accuracy metric BLEU by 1.4% and the semantic-richness metric CIDEr by 2.2% over existing methods.
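The abstract does not specify how region-level attribute predictions are aggregated into image-level local attributes. A common formulation for this multi-instance learning step is noisy-OR pooling, where an attribute is considered present if at least one image region exhibits it. The sketch below is a minimal illustration under that assumption; the region scores and the number of attributes are hypothetical, not taken from the paper:

```python
def mil_noisy_or(region_probs):
    """Aggregate per-region attribute probabilities into image-level
    probabilities with noisy-OR pooling.

    region_probs: list of rows, one per image region; each row holds the
    probability of each local attribute (e.g. collar style, embroidery)
    being visible in that region.
    Returns one image-level probability per attribute.
    """
    num_attrs = len(region_probs[0])
    image_probs = []
    for a in range(num_attrs):
        # Probability that NO region shows attribute a ...
        prob_absent = 1.0
        for region in region_probs:
            prob_absent *= 1.0 - region[a]
        # ... so the attribute is present with the complementary probability.
        image_probs.append(1.0 - prob_absent)
    return image_probs


# Hypothetical scores: 3 regions, 4 local attributes.
scores = [
    [0.9, 0.1, 0.0, 0.2],
    [0.1, 0.2, 0.0, 0.1],
    [0.0, 0.1, 0.0, 0.7],
]
image_probs = mil_noisy_or(scores)
```

With these toy scores, the first attribute receives a high image-level probability because one region detects it strongly, while the third stays at zero since no region supports it.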