We introduce the dense captioning task, which requires a
computer vision system to both localize and describe salient
regions in images in natural language. The dense captioning
task generalizes object detection when the descriptions
consist of a single word, and image captioning when one
predicted region covers the full image. To address the localization
and description task jointly, we propose a Fully Convolutional
Localization Network (FCLN) architecture that
processes an image with a single, efficient forward pass, requires
no external region proposals, and can be trained
end-to-end with a single round of optimization. The architecture
is composed of a Convolutional Network, a novel
dense localization layer, and a Recurrent Neural Network
language model that generates the label sequences.
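To make the composition concrete, here is a minimal PyTorch sketch of the three-stage data flow just described: convolutional features, dense per-location localization, and an RNN decoder over region features. Everything in it is an illustrative assumption rather than the paper's implementation: the module name DenseCapSketch, the toy two-layer backbone, all dimensions, and the top-k feature gathering that stands in for the actual dense localization layer.

```python
import torch
import torch.nn as nn


class DenseCapSketch(nn.Module):
    """Hypothetical sketch: ConvNet -> dense localization -> RNN captioner."""

    def __init__(self, vocab_size=1000, feat_dim=128, hidden=256, num_regions=8):
        super().__init__()
        # 1) Convolutional backbone (toy stand-in for a pretrained ConvNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # 2) Dense localization heads: one box and one confidence score per
        #    feature-map cell, so no external region proposals are needed.
        self.box_head = nn.Conv2d(feat_dim, 4, 1)    # (x, y, w, h) per cell
        self.score_head = nn.Conv2d(feat_dim, 1, 1)  # region confidence per cell
        self.num_regions = num_regions
        # 3) RNN language model conditioned on each selected region's feature.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, R, T) token ids, R = num_regions.
        B = images.size(0)
        feats = self.backbone(images)                       # (B, C, H', W')
        scores = self.score_head(feats).flatten(1)          # (B, H'*W')
        boxes = self.box_head(feats).flatten(2)             # (B, 4, H'*W')
        # Keep the top-scoring cells as region proposals and gather their
        # features (a crude stand-in for the paper's localization layer).
        top = scores.topk(self.num_regions, dim=1).indices  # (B, R)
        flat = feats.flatten(2)                             # (B, C, H'*W')
        idx = top.unsqueeze(1).expand(-1, flat.size(1), -1)
        region_feats = flat.gather(2, idx).permute(0, 2, 1) # (B, R, C)
        region_boxes = boxes.gather(2, top.unsqueeze(1).expand(-1, 4, -1))
        # Condition the LSTM on each region feature and score word sequences.
        h0 = self.init_h(region_feats).reshape(1, B * self.num_regions, -1)
        out, _ = self.rnn(self.embed(captions.reshape(B * self.num_regions, -1)),
                          (h0, torch.zeros_like(h0)))
        return self.word_head(out), region_boxes.permute(0, 2, 1)


# Example: two 64x64 images, 8 regions each, caption length 5.
model = DenseCapSketch()
logits, boxes = model(torch.randn(2, 3, 64, 64),
                      torch.randint(0, 1000, (2, 8, 5)))
print(logits.shape, boxes.shape)  # (16, 5, 1000) word logits, (2, 8, 4) boxes
```

Because every stage in such a pipeline is differentiable, a captioning loss on the word logits (plus a localization loss on the boxes) can be backpropagated through the whole model, which is the single-round, end-to-end training property emphasized above.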
We evaluate our network on the
Visual Genome dataset, which
comprises 94,000 images and 4,100,000 region-grounded
captions. We observe both speed and accuracy improvements
over baselines based on current state-of-the-art approaches
in both generation and retrieval settings.