iMakeup: Makeup Instructional Video Dataset for Fine-Grained Dense Video Captioning
Graphical Abstract
Abstract
Automatically describing images or videos with natural language sentences (a.k.a. image/video captioning) has received increasing attention. Most related work has focused on generating a single caption sentence for an image or a short video. However, most videos in daily life contain numerous actions or objects, and it is hard to describe the complicated information involved in such videos with a single sentence. Learning from long videos has thus become a compelling problem, yet large-scale datasets for this task remain limited. Instructional videos are a unique type of video with distinct and attractive characteristics for learning, and makeup instructional videos are very popular on commercial video websites. Hence, we present iMakeup, a large-scale makeup instructional video dataset containing 2000 videos equally distributed over 50 topics. The total duration of the dataset is about 256 hours, comprising about 12,823 video clips segmented according to makeup procedures. We describe the collection and annotation process of our dataset, and analyze its scale, text statistics, and diversity in comparison with other video datasets for similar problems. We then present the results of our baseline video captioning models on this dataset. The iMakeup dataset contains information from both the visual and auditory modalities, with large coverage and diversity of content. Beyond video captioning, it can be used for an extensive range of problems, such as video segmentation, object detection, and intelligent fashion recommendation.