With this paper, we are releasing a new dataset to allow researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,551 images from the
Visual Genome dataset. Each image is annotated with one paragraph. The training/val/test sets contain 14,575/2,487/2,489 images, respectively. We show in our paper that the paragraphs are more diverse than their corresponding sentence descriptions, with more verbs, co-references
and adjectives.
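As a rough illustration of working with the three splits, the sketch below groups paragraph records by split. The record keys (`image_id`, `paragraph`), the `assign_splits` helper, and the toy data are assumptions made for this example, not the released file format.

```python
def assign_splits(records, split_ids):
    """Group paragraph records by split name.

    records:   iterable of dicts with hypothetical keys "image_id" and "paragraph"
    split_ids: dict mapping a split name to an iterable of image ids
    """
    # Build a reverse lookup and reject ids that are listed in two splits.
    id_to_split = {}
    for name, ids in split_ids.items():
        for image_id in ids:
            if image_id in id_to_split:
                raise ValueError(f"image {image_id} appears in more than one split")
            id_to_split[image_id] = name

    # Place each record in the bucket of the split that lists its image id.
    grouped = {name: [] for name in split_ids}
    for record in records:
        grouped[id_to_split[record["image_id"]]].append(record)
    return grouped


# Toy data standing in for the real annotations (assumed layout).
records = [
    {"image_id": 1, "paragraph": "A man rides a horse. He wears a red hat."},
    {"image_id": 2, "paragraph": "Two dogs play on a beach. One holds a stick."},
    {"image_id": 3, "paragraph": "A plate of food sits on a table."},
    {"image_id": 4, "paragraph": "A train waits at a station. People stand nearby."},
]
split_ids = {"train": [1, 2], "val": [3], "test": [4]}

grouped = assign_splits(records, split_ids)
sizes = {name: len(items) for name, items in grouped.items()}
print(sizes)
```

Run on the full dataset, the same size check would be expected to report 14,575, 2,487 and 2,489 records for the training, validation and test sets.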
Since all the images are also part of the Visual Genome dataset, each image also comes with, on average, 50 region descriptions (short phrases
describing parts of an image), 35 objects, 26 attributes, 21 relationships, and 17 question-answer pairs.