Wild2Avatar: Rendering Humans Behind Occlusions

1Stanford University, 2Panasonic

Abstract

Rendering the visual appearance of moving humans from occluded monocular videos is a challenging task. Most existing research renders 3D humans under ideal conditions, requiring a clear and unobstructed scene; such methods cannot render humans in real-world scenes where obstacles block the camera's view and cause partial occlusions. In this work, we present Wild2Avatar, a neural rendering approach tailored to occluded in-the-wild monocular videos. We propose an occlusion-aware scene parameterization that decouples the scene into three parts: occlusion, human, and background. In addition, we design a set of objective functions that enforce the decoupling of the human from both the occlusion and the background and ensure the completeness of the human model. We verify the effectiveness of our approach with experiments on in-the-wild videos.


Method

We parameterize the scene into occlusion, human, and background, and model the human and the occlusion/background as two separate neural radiance fields.

The human is parameterized within a bounded sphere \(\Pi\): ray samples are first deformed into a canonical space with the help of the pre-computed body pose, and the canonical points \(\mathbf{x}\) are passed into a rendering network \(\mathcal{F}^{fg}\) that predicts the radiance \(\mathbf{c}\) and the signed distance \(\mathbf{s}\) to the human surface, which are then rendered via SDF-based volume rendering.

The unbounded background is represented by coordinates on the surface of \(\Pi\) together with their inverted distances, and a second rendering network \(\mathcal{F}^{scene}\) predicts the radiance and density of the background ray samples. The occlusion space is defined as the interval between the camera and an inner sphere \(\pi\); its ray samples are parameterized as coordinates on the surface of \(\pi\) together with the negation of their inverted distances to the center of \(\pi\). The same network \(\mathcal{F}^{scene}\) is used to render the occlusions.

The three renderings are aggregated sequentially and supervised with a combination of losses, in which \(\mathcal{L}_{occ}\) explicitly encourages the decoupling of the occlusion from the human and \(\mathcal{L}_{comp}\) penalizes incomplete human geometry.
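
To make the scene parameterization concrete, below is a minimal PyTorch sketch of how background and occlusion ray samples could be encoded for the shared scene network \(\mathcal{F}^{scene}\). It is an illustration rather than the released implementation: the function names, tensor layout, and the ray-sphere intersection test used to classify samples are assumptions, and the radii r_in (for the inner sphere \(\pi\)) and r_out (for the outer sphere \(\Pi\)) are hypothetical inputs.

import torch

def ray_sphere_hits(o, d, center, radius):
    """Near/far distances at which rays (o, d) intersect a sphere; d must be unit-norm."""
    oc = o - center
    b = (d * oc).sum(dim=-1)
    disc = b ** 2 - (oc.norm(dim=-1) ** 2 - radius ** 2)
    sq = disc.clamp(min=0.0).sqrt()
    return -b - sq, -b + sq  # t_near, t_far

def sphere_coords(x, center, radius):
    """Encode points as coordinates on the sphere surface plus inverted distances."""
    offset = x - center
    dist = offset.norm(dim=-1, keepdim=True)
    return offset / dist, (radius / dist).squeeze(-1)

def occlusion_aware_inputs(o, d, t, center, r_in, r_out):
    """Build inputs to the shared scene network for background and occlusion samples.

    Samples beyond the outer sphere Pi (radius r_out) belong to the background and are
    encoded as coordinates on Pi plus their inverted distances; samples between the
    camera and the inner sphere pi (radius r_in) belong to the occlusion and are encoded
    on pi with negated inverted distances, keeping the two regions disjoint in the
    network's input space.
    """
    x = o[..., None, :] + t[..., :, None] * d[..., None, :]   # sample positions, [..., N, 3]
    t_in, _ = ray_sphere_hits(o, d, center, r_in)             # first hit of the inner sphere
    _, t_out = ray_sphere_hits(o, d, center, r_out)           # exit point of the outer sphere
    coord_in, inv_in = sphere_coords(x, center, r_in)
    coord_out, inv_out = sphere_coords(x, center, r_out)
    is_occ = t < t_in[..., None]                              # in front of the inner sphere
    is_bg = t > t_out[..., None]                              # beyond the outer sphere
    coord = torch.where(is_occ[..., None], coord_in, coord_out)
    inv = torch.where(is_occ, -inv_in, inv_out)               # negation marks occlusion samples
    return torch.cat([coord, inv[..., None]], dim=-1), is_occ, is_bg

Samples flagged as neither occlusion nor background lie inside \(\Pi\) and would instead be routed to the human field \(\mathcal{F}^{fg}\) after canonical deformation. Negating the inverted distances of the occlusion samples keeps the two regions separated in input space, which is what allows a single network \(\mathcal{F}^{scene}\) to model both the background and the occlusion without mixing them.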



Results


Original video (left) vs. rendered human (right).

Sequences: OcMotion-0011, OcMotion-0013, OcMotion-0038, OcMotion-0039, OcMotion-0041, Custom-01, Youtube-02.

Citation

Acknowledgements

This work was partially funded by the Panasonic Holdings Corporation, Gordon and Betty Moore Foundation, Jaswa Innovator Award, Stanford HAI, and Wu Tsai Human Performance Alliance.

The website was built by Adam Sun; the template was borrowed from ReConFusion and the code from OCC-NeRF.