Abstract

Most natural videos contain numerous events. For example, in a video of a “man playing a piano”, the video might also contain “another man dancing” or “a crowd clapping”. We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report performances of our model for dense-captioning events, video retrieval and localization.

To appear at ICCV 2017

Dataset

Download the dataset here.

The test set has been withheld for evaluation purposes.

Download the C3D video features.

Paper

Download the paper here.

Bibtex

@inproceedings{krishna2017dense,
    title={Dense-Captioning Events in Videos},
    author={Krishna, Ranjay and Hata, Kenji and Ren, Frederic and Fei-Fei, Li and Niebles, Juan Carlos},
    booktitle={International Conference on Computer Vision (ICCV)},
    year={2017}
}

Example video annotations from our dataset.


Models

Complete pipeline for multi-captioning videos with event descriptions. We first extract C3D features from the input video frames. These features are fed into our proposal module at varying strides to predict both short and long events. Each proposal, with its start time, end time, and hidden representation, is fed into the captioning module, which uses context from neighboring events to generate each event description.
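
Below is a minimal PyTorch sketch of the pipeline shape described above. The layer sizes, the single-stride GRU proposal head, and the plain concatenation of past/future context vectors are illustrative assumptions, not the exact model from the paper (which uses a multi-stride proposal module and attention over neighboring events).

import torch
import torch.nn as nn

class ProposalModule(nn.Module):
    # Runs a recurrent network over the C3D feature stream and, at every time
    # step, scores a small set of anchor lengths ending at that step.
    def __init__(self, feat_dim=500, hidden=512, num_anchors=4):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, num_anchors)

    def forward(self, feats):                  # feats: (batch, T, feat_dim)
        states, _ = self.rnn(feats)            # (batch, T, hidden)
        return self.score(states), states      # anchor confidences + hidden states

class ContextCaptioner(nn.Module):
    # Decodes one event's caption, conditioning every step on the event's own
    # hidden representation plus vectors summarizing past and future events.
    def __init__(self, hidden=512, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim + 3 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, words, event, past, future):     # words: (batch, L) token ids
        w = self.embed(words)                           # (batch, L, embed_dim)
        ctx = torch.cat([event, past, future], dim=-1)  # (batch, 3 * hidden)
        ctx = ctx.unsqueeze(1).expand(-1, w.size(1), -1)
        h, _ = self.rnn(torch.cat([w, ctx], dim=-1))
        return self.out(h)                              # per-step vocabulary logits

# Toy usage: score proposals over 120 C3D time steps of a single video.
feats = torch.randn(1, 120, 500)
scores, states = ProposalModule()(feats)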

Dataset

The ActivityNet Captions dataset connects videos to a series of temporally annotated sentence descriptions. Each sentence covers a unique segment of the video, describing events that occur. These events may occur over very long or short periods of time and are not limited in any capacity, allowing them to co-occur. On average, each of the 20k videos contains 3.65 temporally localized sentences, resulting in a total of 100k sentences. We find that the number of sentences per video follows a relatively normal distribution. Furthermore, as the video duration increases, the number of sentences also increases. Each sentence has an average length of 13.48 words, which is also normally distributed. You can find more details of the dataset under the ActivityNet Captions Dataset section, and under supplementary materials in the paper.
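
If you want to iterate over the annotations programmatically, the short Python sketch below shows one way to do it. The file name and the field names ("duration", "timestamps", "sentences") reflect the released JSON as we understand it; treat them as assumptions and check them against your downloaded copy.

import json

with open("train.json") as f:        # path to your local copy of the annotations
    annotations = json.load(f)       # dict keyed by video id

for video_id, ann in annotations.items():
    duration = ann["duration"]       # video length in seconds
    for (start, end), sentence in zip(ann["timestamps"], ann["sentences"]):
        print(f"{video_id} [{start:.1f}s-{end:.1f}s]: {sentence}")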

Experiments and Results

A. Adding context can generate consistent captions.
B. Comparison of the online versus the full model.
C. Context might add more noise to rare events.

We evaluate our model by detecting multiple events in videos and describing all the events jointly. We refer to this task as Multi-captioning. We test our model on the ActivityNet Captions dataset, which was built specifically for this task.

Next, we provide baseline results on two additional tasks that are possible with our model. The first task, localization, tests whether our proposal module can adequately localize all the events in a given video. The second task, retrieval, tests a variation of our model's ability to retrieve the correct set of sentences given the video, or vice versa. These two tasks are designed to test the event proposal module (localization) and the captioning module (retrieval) individually.
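
For the localization task, proposals are typically scored against ground-truth segments with temporal intersection-over-union (tIoU). The Python sketch below is an illustrative implementation of tIoU and of recall at a fixed threshold; it is not our exact evaluation script, and the 0.5 threshold is only an example.

def temporal_iou(pred, gt):
    # Intersection-over-union of two (start, end) segments, in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_tiou(proposals, ground_truth, threshold=0.5):
    # Fraction of ground-truth events covered by at least one proposal.
    hits = sum(
        any(temporal_iou(p, g) >= threshold for p in proposals)
        for g in ground_truth
    )
    return hits / len(ground_truth) if ground_truth else 0.0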

We show example qualitative results from the variations of our models in the figures on the left. In (a), we see that the last caption in the no-context model drifts off topic, while the full model utilizes context to generate a more reasonable caption. However, context isn't always successful at generating better captions; in (b), the middle segment overpowers the neighboring events and causes the full model to repeat the caption for the middle event. Finally, in (c), we see that our full context model is able to use the knowledge that the vegetables are later "mixed in the bowl" to also mention "the bowl" in the third and fourth sentences, propagating context back through to past events.