Rendering Humans from Object-Occluded Monocular Videos

ICCV 2023

  • Stanford Vision and Learning Lab

  • Stanford University


3D understanding and rendering of moving humans from monocular videos is a challenging task. Despite recent progress, the task remains difficult in real-world scenarios, where obstacles may block the camera view and cause partial occlusions in the captured videos. Existing methods cannot handle such defects for two reasons. First, the standard rendering strategy relies on point-to-point mapping, which can lead to dramatic disparities between the visible and occluded areas of the body. Second, the naive direct regression approach does not consider any feasibility criteria (i.e., prior information) for rendering under occlusions. To tackle these drawbacks, we present OccNeRF, a neural rendering method that achieves better rendering of humans in severely occluded scenes. As direct solutions to the two drawbacks, we propose surface-based rendering that integrates geometry and visibility priors. We validate our method on both simulated and real-world occlusions and demonstrate its superiority.



OccNeRF operates on video frames and optimizes a neural radiance field to synthesize novel views of an object-occluded human. Given a pre-computed body pose, we first use a motion field to map observable ray samples to coordinates in a canonical space. The nearest parameterization vertices of each ray sample are then retrieved and used to condition our surface-based rendering. During training, we iteratively update the attention scores of all vertices as indicators of their visibility, placing more attention on frequently visible vertices to improve rendering quality. The blended vertex, together with its signed distance to each ray sample, is jointly encoded via a 4D hash grid before being fed, along with the encoded vertices, into the regression MLP. Photometric and perceptual constraints are enforced on visible pixels, while an additional loss encourages geometric completeness in occluded areas.
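The per-sample conditioning step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the number of blended neighbors `k`, and the use of a nearest-vertex distance (unsigned here, since no surface normals are available in this sketch) are all assumptions for clarity.

```python
import numpy as np

def surface_condition(samples, verts, scores, k=4):
    """Hypothetical sketch of surface-based conditioning.

    samples: (S, 3) ray-sample coordinates in canonical space
    verts:   (V, 3) body-surface parameterization vertex positions
    scores:  (V,)   per-vertex visibility attention scores
                    (assumed updated iteratively during training)

    Returns a visibility-weighted blend of each sample's k nearest
    vertices and the distance to its nearest vertex (the paper uses
    a signed distance; this sketch omits the sign for brevity).
    """
    # pairwise distances between samples and vertices: (S, V)
    d = np.linalg.norm(samples[:, None, :] - verts[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]          # (S, k) nearest vertex indices
    # blend neighbors by their visibility attention scores
    w = scores[nn]                             # (S, k)
    w = w / w.sum(axis=1, keepdims=True)
    blended = (w[..., None] * verts[nn]).sum(axis=1)      # (S, 3)
    dist = d[np.arange(len(samples)), nn[:, 0]]           # (S,)
    return blended, dist
```

In the full pipeline, `blended` and the signed distance would be encoded by the 4D hash grid before reaching the regression MLP.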



OccNeRF was primarily evaluated on the ZJU-MoCap dataset with simulated occlusions. We achieve more complete renderings than the baseline HumanNeRF.


OccNeRF was then evaluated on the OcMotion dataset with real-world occlusions. Our rendered humans show higher fidelity and fewer artifacts.


This work was partially funded by the Gordon and Betty Moore Foundation, Panasonic Holdings Corporation, NSF RI #2211258, and Stanford HAI. Tiange thanks Jiaman Li and Koven Yu for their insightful feedback.