Video Extrapolation in Space and Time
ECCV 2022
Stanford University

Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, they can both be seen as ways to observe the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point of time. These two tasks provide complementary signals to obtain a scene representation, as viewpoint changes from spatial observations inform depth, and temporal observations inform the motion of cameras and individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages the self-supervision and the complementary cues from both tasks, while existing methods can only solve one of them. Experiments show that our method achieves performance better than or comparable to several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets.

Results under different camera settings

We conduct extensive experiments across diverse scenarios: our model can learn from single- and multi-view videos, from static and moving cameras, and from indoor and outdoor scenes. During inference, our model takes two frames from a monocular video sequence as input and performs extrapolation in both space and time.
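To make the two-frame interface concrete, here is a minimal toy sketch of temporal extrapolation. It is not the learned VEST model (whose architecture is not described on this page); it simply substitutes a naive constant-velocity baseline, where the next frame is predicted by linearly extrapolating the per-pixel change between the two input frames.

```python
import numpy as np

def extrapolate_time(frame_a, frame_b):
    """Toy constant-velocity temporal extrapolation.

    Predicts the next frame as frame_b + (frame_b - frame_a),
    i.e. it assumes each pixel keeps changing at the same rate.
    This is only a hypothetical stand-in for the learned model.
    """
    pred = 2.0 * frame_b.astype(np.float32) - frame_a.astype(np.float32)
    return np.clip(pred, 0.0, 255.0).astype(np.uint8)

# Two consecutive frames from a monocular clip (H x W x 3, uint8).
f0 = np.full((4, 4, 3), 100, dtype=np.uint8)
f1 = np.full((4, 4, 3), 110, dtype=np.uint8)  # scene brightened by 10
f2 = extrapolate_time(f0, f1)
print(int(f2[0, 0, 0]))  # → 120 under the constant-velocity assumption
```

A learned model replaces this linear rule with a network that also reasons about depth and object motion, which is what allows it to extrapolate in space (novel viewpoints) as well as in time.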

I. Learning from multi-view, moving cameras
Spatial extrapolation on the KITTI dataset

Temporal extrapolation on the KITTI dataset
II. Learning from a single-view, moving camera
Spatial extrapolation on the RealEstate10K dataset
Spatial extrapolation on the ACID dataset
III. Learning from multi-view, static cameras
Spatial extrapolation on the dataset from Lin et al. 2021
IV. Learning from a single-view, static camera
Temporal extrapolation on a newly collected cloud dataset

We thank Angjoo Kanazawa, Hong-Xing (Koven) Yu, Huazhe (Harry) Xu, Noah Snavely, Ruohan Zhang, Ruohan Gao, and Shangzhe (Elliott) Wu for detailed feedback on the paper, Kaidi Cao for collecting the cloud dataset, and Samuel Clarke for the supplementary video voiceover. This work was supported in part by the Stanford Institute for Human-Centered AI (HAI), the Stanford Center for Integrated Facility Engineering (CIFE), the Samsung Global Research Outreach (GRO) Program, and Amazon, Autodesk, Meta, Google, Bosch, and Adobe.