The test set has been withheld for evaluation purposes.
Download the C3D video features.
@inproceedings{krishna2017dense,
  title={Dense-Captioning Events in Videos},
  author={Krishna, Ranjay and Hata, Kenji and Ren, Frederic and Fei-Fei, Li and Niebles, Juan Carlos},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2017}
}
Complete pipeline for multi-captioning videos with event descriptions. We first extract C3D features from the input video frames. These features are fed into our proposal network at varying strides to predict both short and long events. Each proposal, with its start and end time and its hidden representation, is input into the captioning module, which uses context from neighboring events to generate each event description.
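The multi-stride step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stride values and feature dimensions are made up, and the real proposal network operates on learned C3D features rather than the toy array used here.

```python
import numpy as np

def sample_at_strides(features, strides=(1, 2, 4, 8)):
    """Subsample a sequence of per-frame features at several temporal
    strides, so that a proposal network can cover both short and long
    event spans. Returns one subsampled sequence per stride."""
    return {s: features[::s] for s in strides}

# Toy stand-in for C3D features: 16 time steps, 4 dimensions each.
feats = np.arange(64, dtype=np.float32).reshape(16, 4)
sampled = sample_at_strides(feats)
for s, f in sampled.items():
    print(f"stride {s}: {f.shape[0]} time steps")
```

Larger strides yield coarser, longer-range views of the video, which is how a single proposal module can score events of very different durations.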
The ActivityNet Captions dataset connects videos to a series of temporally annotated sentence descriptions. Each sentence covers a unique segment of the video, describing an event that occurs. These events may span very long or very short periods of time and are not limited in any capacity, allowing them to co-occur. On average, each of the 20k videos contains 3.65 temporally localized sentences, for a total of 100k sentences. We find that the number of sentences per video follows a relatively normal distribution. Furthermore, as video duration increases, so does the number of sentences. Each sentence has an average length of 13.48 words, which is also normally distributed. You can find more details about the dataset in the ActivityNet Captions Dataset section and in the supplementary materials of the paper.
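To make the annotation structure concrete, here is a minimal sketch of working with the dataset's JSON annotations. The schema shown (video ids mapping to `duration`, `timestamps`, and `sentences`) is our assumption about the released files, and the two videos below are made-up stand-ins, not real dataset entries.

```python
import json

# Assumed annotation schema: each video id maps to its duration in
# seconds, the [start, end] timestamp of each event, and the sentence
# describing that event. Entries here are fabricated examples.
annotations = json.loads("""
{
  "v_example1": {
    "duration": 82.7,
    "timestamps": [[0.0, 12.3], [10.1, 45.0], [44.0, 82.7]],
    "sentences": ["A man walks in.", "He picks up a ball.", "He throws it."]
  },
  "v_example2": {
    "duration": 30.0,
    "timestamps": [[0.0, 30.0]],
    "sentences": ["A dog runs around a yard."]
  }
}
""")

# Average number of temporally localized sentences per video
# (3.65 over the full dataset; 2.0 for this toy sample).
counts = [len(v["sentences"]) for v in annotations.values()]
avg_sentences = sum(counts) / len(counts)
print(avg_sentences)
```

Note that the first example's second and third timestamps overlap, reflecting that events in this dataset are allowed to co-occur.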
We evaluate our model by detecting multiple events in videos and describing all of them jointly. We refer to this task as multi-captioning. We test our model on the ActivityNet Captions dataset, which was built specifically for this task.
Next, we provide baseline results on two additional tasks that our model enables. The first task, localization, tests whether our proposal module can adequately localize all the events in a given video. The second task, retrieval, tests a variation of our model's ability to retrieve the correct set of sentences given a video, and vice versa. These two tasks are designed to test the event proposal module (localization) and the captioning module (retrieval) individually.
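Localization of this kind is typically scored by temporal intersection-over-union (IoU) between predicted and ground-truth segments; a sketch of that metric is below. The IoU thresholds at which proposals count as correct are an evaluation choice, not something this snippet fixes.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union between two temporal segments, each
    given as a (start, end) pair in seconds. A proposal is usually
    counted as correct when its IoU with a ground-truth segment
    exceeds some threshold (e.g. 0.5)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Overlapping segments: intersection 5s, union 15s -> IoU = 1/3.
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
# Disjoint segments have IoU 0.
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))
```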
We show example qualitative results from variations of our model in the figures on the left. In (a), we see that the last caption from the no-context model drifts off topic, while the full model uses context to generate a more reasonable caption. However, context is not always successful at generating better captions; in (b), the middle segment overpowers the neighboring events and causes the full model to repeat the middle event's caption. Finally, in (c), we see that our full context model is able to use the knowledge that the vegetables are later "mixed in the bowl" to also mention "the bowl" in the third and fourth sentences, propagating context backwards to past events.