Recent progress on image captioning has made it possible to generate novel sentences describing images in natural language, but compressing an image into a single sentence can describe visual content in only coarse detail. While one new captioning approach, dense captioning, can potentially describe images in finer levels of detail by captioning many regions within an image, it in turn is unable to produce a coherent story for an image. In this paper we overcome these limitations by generating entire paragraphs for describing images, which can tell detailed, unified stories. We develop a model that decomposes both images and paragraphs into their constituent parts, detecting semantic regions in images and using a hierarchical recurrent neural network to reason about language. Linguistic analysis confirms the complexity of the paragraph generation task, and thorough experiments on a new dataset of image and paragraph pairs demonstrate the effectiveness of our approach.

To appear in CVPR 2017


Download the dataset here.

Download the training, val and test splits here.


Download the paper here.


  title={A Hierarchical Approach for Generating Descriptive Image Paragraphs},
  author={Krause, Jonathan and Johnson, Justin and Krishna, Ranjay and Fei-Fei, Li},
  booktitle={Computer Vision and Patterm Recognition (CVPR)},


With this paper, we are releasing a new dataset to allow researchers to benchmark their progress in generating paragraphs that tell a story about an image. The dataset contains 19,561 images from the Visual Genome dataset. Each image contains one paragraph. The training/val/test sets contains 14,575/2487/2489 images. We show in our paper that the paragraphs are more diverse than their corresponding sentences descriptions with more verbs, co-references and adjectives.

Since all the images are also part of the Visual Genome dataset, Each image also contains 50 region descriptions (short phrases describing parts of an image), 35 objects, 26 attributes and 21 relationships and 17 question-answer pairs.


We first decompose the input image by detecting objects and other regions of interest, then aggregate features across these regions to produce a pooled representation richly expressing the image semantics. This feature vector is taken as input by a hierarchical recurrent neural network composed of two levels: a sentence RNN and a word RNN. The sentence RNN receives the image features, decides how many sentences to generate in the resulting paragraph, and produces an input topic vector for each sentence. Given this topic vector, the word RNN generates the words of a single sentence.

Example Results

We compare our approach with numerous baselines, showcasing the benefits of hierarchical modeling for generating descriptive paragraphs. Read our paper to learn more about our baselines (Sentence-concat and template methods). We compare and contrast our methods with these baselines in terms of diversity, co-references (pronouns), verbs, sentence lengths and vocabulary size.

Focused paragraph generation

As an exploratory experiment in order to highlight the interpretability of our model, we investigate generating paragraphs from particular regions in the image. We show that the model generates paragraphs describing the detected regions without much mention of objects or scenery outside of the detections. Taking the top-right image as an example, despite a few linguistic mistakes, the paragraph generated mentions the batter, catcher, dirt, and grass, which all appear in the top detected regions, but does not pay heed to the pitcher or the umpire in the background.