Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).

CVPR 2014 Paper

  title     = {Large-scale Video Classification with Convolutional Neural Networks},
  author    = {Andrej Karpathy and George Toderici and Sanketh Shetty and Thomas Leung and Rahul Sukthankar and Li Fei-Fei},
  year      = {2014},
  booktitle = {CVPR}

Sports-1M Dataset

The Sports-1M dataset is licensed under Creative Commons 3.0 and contains 1,133,158 video URLs which have been annotated automatically with 487 Sports labels using the YouTube Topics API. To download the dataset, check out our Github Repository, or simply use:

$ git clone https://github.com/gtoderici/sports-1m-dataset.git

Then see the attached README for details.

Here is a visualization of some thumbnails for every one of the 487 classes (7MB html page).

Finer details about all videos as JSON (53MB zip). Example entry:

  "stitle": "Improving Sprint Start Technique", 
  "label487": [ 205 ], 
  "thumbnail": "https://i1.ytimg.com/vi/Drdm1WsRQwA/hqdefault.jpg", 
  "width": 640, 
  "duration": 86, 
  "height": 360, 
  "id": "Drdm1WsRQwA", 
  "source487": "train"

A common question is how one can manage data of this scale. We'd like to note that the JSON information we release contains durations of all videos, so it is possible to filter to only videos below some threshold of duration. Another idea is to sample frames/segments from the videos right away and not store the full original files, or even further resize them right away to 227x277 in spatial resolution, for example. A large portion of the dataset (90%+) can thus add up to at most a few TB.

Supplemental Material

Example per-frame classification results overlayed on top of a video. Also available as a direct download (35mb).

Spatio-temporal features learned by the Slow Fusion network on the first layer. Compare to [Le et al. '11] unsupervised spatio-temporal features.

The full confusion matrix on the Sports-1M, and a few diagonal crops.