Abstract:
Emotionally guided multimedia content generation enriches the public's means of expressing emotions and viewpoints and has become a key component in advancing controllable AI-generated content (AIGC) technology. To address the ambiguous emotional attributes and weak interactivity of visual content generated by large models, this work proposes FG-ECVG, an optimized text-driven video generation algorithm that enables highly controllable, strongly interactive automated generation from text instructions to video content. First, a guidance dictionary is constructed based on the valence-arousal-dominance (VAD) emotion model; the input text is analyzed for emotional polarity and matched with emotionally guiding words, achieving emotional control over the overall visual atmosphere. Second, a visual detail expansion framework is built on retrieval-augmented generation (RAG), which adds structured anthropomorphic emotional visual elements to the user's text commands and enhances the emotional granularity of the generated content. Six-category emotional content was generated for five scene types from the EmoSet dataset, followed by subjective and objective evaluations of the resulting micro-videos. The results show that, compared with using a generative visual large model alone, the proposed method yields video content with stronger emotional expressiveness: the accuracy of binary and six-category emotion classification increases by 23.33 and 20.00 percentage points, respectively. Compared with current state-of-the-art visual emotion transfer and generation algorithms, six-category classification accuracy increases by an average of 26.67 percentage points, demonstrating the effectiveness and superiority of the algorithm.
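The guidance-dictionary step described above (analyzing the input text's emotional polarity and matching it with guiding words in VAD space) can be sketched as follows; the dictionary entries, VAD values, polarity threshold, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of VAD-based guidance-word matching. A real guidance
# dictionary would be far larger; these four entries are toy examples.
import math

# Guidance dictionary: word -> (valence, arousal, dominance), each in [0, 1].
GUIDANCE_DICT = {
    "radiant": (0.9, 0.7, 0.6),
    "serene":  (0.8, 0.2, 0.5),
    "gloomy":  (0.2, 0.3, 0.3),
    "ominous": (0.1, 0.7, 0.4),
}

def polarity(vad):
    """Coarse emotional polarity from the valence dimension
    (threshold 0.5 is an assumption for this sketch)."""
    return "positive" if vad[0] >= 0.5 else "negative"

def match_guidance_words(target_vad, k=2):
    """Return the k guidance words closest to the target VAD point,
    restricted to words sharing the target's polarity."""
    candidates = [
        (math.dist(vad, target_vad), word)
        for word, vad in GUIDANCE_DICT.items()
        if polarity(vad) == polarity(target_vad)
    ]
    return [word for _, word in sorted(candidates)[:k]]

# Example: a calm, pleasant target emotion; only positive-polarity words
# are considered, ranked by Euclidean distance in VAD space.
print(match_guidance_words((0.85, 0.3, 0.5)))  # → ['serene', 'radiant']
```

The matched words can then be appended to the user's text command before it is passed to the generative model, steering the overall visual atmosphere toward the target emotion.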