Processing Terabytes of Video on Hundreds of Machines

TLDR

There are now many state-of-the-art computer vision algorithms that are just a git clone away. We found that existing systems for distributed data analysis were not well suited to the computational challenges of applying computer vision algorithms to terabyte- or petabyte-sized video collections, so we designed and built a system called Scanner to make efficient video analysis easier.

Starting off as a graduate student at Carnegie Mellon University, I was really excited about the many new types of applications that leaned heavily on computer vision to process videos. As someone interested in the design of high-performance systems, I wondered if these video analysis applications differed computationally from more traditional data analysis – and if so, how we might develop systems that make building future video analysis applications easier.

It turns out that there are several aspects to video analysis applications that make them challenging to execute efficiently, and a few of us built a system called Scanner that tries to address these challenges. In this post, I’ll talk about these challenges, how we designed Scanner to address them, and how we’ve used Scanner to scale out a handful of video analysis applications. But first, let’s be more concrete about this class of applications I’m categorizing as ‘video analysis’ by starting with a few recent examples and describing their computational structure:

Synthesizing Stereo VR Video with Facebook Surround 360

The Surround 360 team at Facebook uses camera rigs with tens of HD cameras to generate 360 stereo videos that can be viewed in VR. These cameras can generate hundreds of gigabytes of data per minute. The algorithms for processing the raw footage into the output format (shown below) can take days for even a single minute of footage. Their first system (described here) made use of a temporally stabilized optical flow algorithm, but they are now using more sophisticated algorithms based on multi-view stereo.
Editing a Surround 360 VR video (image adapted from Surround 360, Facebook)

Understanding Humans with 2D Pose Estimation

OpenPose is a library for human pose detection developed by researchers in Yaser Sheikh's lab at Carnegie Mellon University. OpenPose is one of the most efficient and robust implementations of human pose detection publicly available. To give you an example, let's test out OpenPose on a video that it almost certainly was not trained on: the trailer for the animated Pixar film, *Incredibles 2*.

Although OpenPose is one of the fastest algorithms available, generating this video still took ~25 minutes on a single Titan X Pascal GPU when using reasonable accuracy settings (12.5x longer than the video itself).

Understanding Human Social Interactions with the Panoptic Studio

Using over 500 cameras, the Panoptic Studio (shown below) can generate accurate 3D poses of a large group of people interacting. Researchers at Carnegie Mellon University are using this system to perform quantitative analysis of human social interactions. These 3D poses are generated by performing 2D human pose detection (using OpenPose) on every frame of every camera and then fusing those 2D poses into 3D poses using a voxel grid. Given that this requires evaluating OpenPose on hundreds of cameras for each time instant, reconstructing these 3D poses can take hours per minute of footage on a single GPU.
The Panoptic Studio (image adapted from Hanbyul Joo et al.)

Large-scale, Automatic Frame Selection

At YouTube, thumbnails are automatically selected for each uploaded video by ranking frames from the video using a neural network trained for image quality (blog post). At Netflix, a system called AVA extracts metadata about each frame -- using face detection, motion estimation, shot identification, and object detection -- to find compelling images to display in the Netflix UI (blog post).
YouTube thumbnails & Netflix annotations (top from Google, bottom from Netflix)
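
The computational structure these pipelines share is roughly: decode a sample of frames, score each one, and keep the best. As a toy illustration only (this is not the YouTube or Netflix implementation), here is a sketch that uses Laplacian sharpness as a stand-in for a learned image-quality model; the function name and sampling rate are placeholders:

import cv2  # toy frame-selection sketch, for illustration only

def pick_thumbnail(path, sample_rate=30):
    # Score every `sample_rate`-th frame with a stand-in quality measure
    # (Laplacian sharpness) and keep the highest-scoring frame.
    cap = cv2.VideoCapture(path)
    best_score, best_frame, idx = -1.0, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_rate == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            score = cv2.Laplacian(gray, cv2.CV_64F).var()
            if score > best_score:
                best_score, best_frame = score, frame
        idx += 1
    cap.release()
    return best_frame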

To understand the performance and productivity challenges of these video analysis applications, I spent time reimplementing several of them (or optimizing existing implementations) to identify the common performance issues and the aspects that make building these applications tiresome.

Challenges of Scaling Efficient Video Analysis

First, let me define exactly what I mean by efficient: on a single machine, an efficient implementation of a video analysis application should maximally utilize the resources of that machine when executing a state-of-the-art algorithm; on multiple GPUs or machines, the implementation should introduce only minor overheads compared to the efficient single-machine implementation while scaling as linearly as possible. With that said, here are the challenges I experienced while trying to build efficient video analysis applications that scale:

1. How do you organize and store compressed video and derived results? Dealing with thousands of videos, as well as derived intermediates (like precomputed resized frames, depth maps, flow fields, etc), can be tedious and error prone without a structure for organizing this data. Naive solutions, such as storing each output as a single image file on disk, can cause slowdowns from excessive IO.

2. How do you efficiently access compressed video, even when these accesses are sparse? Many data analysis pipelines only look at a subset of the frames in a video, either because they are only interested in a specific region of interest (such as the frames that have people in them) or for improved computational efficiency (only looking at one frame per second, as YouTube does for thumbnail selection). But since video frames are compressed relative to each other, selecting a single frame at a time can require decoding hundreds of unused frames (see the sketch after this list).

3. How do you parallelize operations that have temporal (frame-to-frame) dependencies? Many common video analysis algorithms (tracking, optical flow, or action recognition/detection) need to process a sequence of video frames in order. But maintaining this sequential state limits parallelism, since only one frame of a video can be processed at a time.

4. How do you design applications that can efficiently make use of machines packed densely with multi-core CPUs, accelerators like GPUs, and hardware video decoders? Most video analysis algorithms need to be able to make use of multi-core CPUs and GPUs for efficiency. In addition, many applications benefit from taking advantage of the hardware video decoders that come with modern GPUs. But optimizing for complex systems with many CPU cores and GPUs can significantly increase the amount of time it takes to develop efficient video analysis applications.

5. How do you scale video analysis while dealing with all of the above challenges?
Because of the computational cost of processing thousands to millions of video frames with expensive computer vision algorithms, it is important that video analysis applications utilize the resources on a single machine well and are also able to scale up to clusters of hundreds of machines.
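
To make challenge 2 concrete, here is a minimal sketch (assuming OpenCV; the video path and frame rate are placeholders) of the naive approach to sparse sampling. Because frames are compressed relative to each other, each seek falls back to the nearest keyframe and the decoder still reconstructs every intermediate frame, so grabbing one frame per second can end up decoding a large fraction of the video:

import cv2  # naive sparse frame sampling, for illustration only

def sample_one_frame_per_second(path, fps=30):
    # Grab roughly one frame per second from a compressed video.
    cap = cv2.VideoCapture(path)
    num_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for target in range(0, num_frames, fps):
        # Seeking lands on the nearest preceding keyframe; the decoder then
        # has to decode every intermediate frame to reconstruct `target`.
        cap.set(cv2.CAP_PROP_POS_FRAMES, target)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

Expressing the sampling pattern explicitly (as the Stride operation does in the Scanner example below) lets the system plan the decode work instead of paying this cost frame by frame.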

These issues are shared by many video analysis applications, which led us to design…

Scanner

Scanner is a first attempt at designing a system for scaling video analysis applications efficiently. It provides a few key features to ease the challenges above: it organizes videos and derived per-frame data as tables in a datastore, it lets computation graphs sample frames sparsely and run stateful (frame-ordered) operations, and it schedules that work across multi-core CPUs, GPUs, and hardware video decoders on anything from a laptop to a cluster of hundreds of machines.

Scanner is open-source on GitHub. The project is still in its early stages, and we are actively working with users at Stanford, Intel, Facebook, and the University of Washington to improve it.

Check out our SIGGRAPH 2018 paper to see how we’ve used Scanner to scale video analysis applications to terabytes of video, or watch the paper video below:

If you think Scanner might be useful to you (or think that it’s lacking in any way), I’d love to hear from you! You can reach me via email or Twitter.

Example Code

Scanner tries to make it easy to write a video processing pipeline and then deploy it onto a single machine packed with many cores and GPUs, or scale out the same application to a cluster of hundreds of machines. Here’s a simple example of a full Scanner application that downsamples an input video in time and space by selecting every third frame and resizing it to 640 x 480:

from scannerpy import Database, Job

# Ingest a video into the database (create a table with a row per video frame)
db = Database()
db.ingest_videos([('example_table', 'example.mp4')])

# Define a Computation Graph
frame = db.sources.FrameColumn()                           # Read input frames from database
sampled_frame = db.streams.Stride(input=frame,             # Select every third frame
                                  stride=3)
resized = db.ops.Resize(frame=sampled_frame,               # Resize input frames
                        width=640, 
                        height=480) 
output_frame = db.sinks.Column(columns={'frame': resized}) # Save resized frames as new video

# Set parameters of computation graph ops
job = Job(op_args={
    frame: db.table('example_table').column('frame'), # Column to read input frames from
    output_frame: 'resized_example'                   # Table name for computation output
})

# Execute the computation graph and return a handle to the newly produced tables
output_tables = db.run(output=output_frame, 
                       jobs=[job])

# Save the resized video as an mp4 file
output_tables[0].column('frame').save_mp4('resized_video')

This same code, with essentially no modifications, will run on a MacBook, a beefy machine with four Titan V GPUs, or a cluster of hundreds of machines in the cloud. If you’d like to understand this example in more detail, check out the Quickstart on the Scanner documentation website, scanner.run.
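
For instance, if I remember the scannerpy API correctly, pushing an op onto the GPUs of that four-Titan-V machine is a matter of requesting a GPU device when building the graph. Treat the DeviceType import and the device argument below as assumptions to double-check against the docs at scanner.run; the rest of the job setup and db.run call are unchanged from the example above:

from scannerpy import Database, DeviceType, Job

db = Database()
frame = db.sources.FrameColumn()
sampled_frame = db.streams.Stride(input=frame, stride=3)
# Assumption: ops accept a `device` argument selecting CPU or GPU execution.
resized = db.ops.Resize(frame=sampled_frame, width=640, height=480,
                        device=DeviceType.GPU)
output_frame = db.sinks.Column(columns={'frame': resized})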

Academic Paper

Scanner will appear in the proceedings of SIGGRAPH 2018 as “Scanner: Efficient Video Analysis at Scale” by Poms, Crichton, Hanrahan, and Fatahalian. If you use Scanner in your research, we’d appreciate it if you cited the paper.

Frequently Asked Questions

Who are we?

Scanner is a collaboration between Stanford and Carnegie Mellon University. Alex Poms and Will Crichton are the Ph.D. students who lead the project. Prof. Kayvon Fatahalian and Prof. Pat Hanrahan are their academic advisors.

Why not use existing systems like MapReduce and Spark?

While frameworks such as MapReduce and Spark handle the “scale-out” scheduling challenges of distributed computing (e.g., work distribution and fault tolerance), they would require new primitives and significant changes to their internal implementations to meet the broad set of video analysis challenges we introduced above. For a more detailed explanation, check out the related work section of the SIGGRAPH 2018 paper.

Are you concerned about this being used for large-scale surveillance?

Yes, absolutely. By developing Scanner in the open, my hope is that we can have discussions about where and how such tools should be used, and the rules they should adhere to when processing video of people who did not explicitly consent to being recorded. I am very afraid of a world where such tools are developed in private, without the public’s understanding of their capabilities and without any oversight from the people they are being used to surveil.