OccFusion: Rendering Occluded Humans with Generative Diffusion Priors

Stanford University


Abstract

Most existing human rendering methods require every part of the human to be fully visible throughout the input video. However, this assumption does not hold in real-life settings, where obstructions are common and leave the human only partially visible. To address this, we present OccFusion, an approach that utilizes 3D Gaussian splatting supervised by pretrained 2D diffusion models for efficient, high-fidelity human rendering. We propose a pipeline consisting of three stages. In the Initialization stage, complete human masks are generated from partial visibility masks. In the Optimization stage, 3D human Gaussians are optimized with additional supervision from Score-Distillation Sampling (SDS) to recover a complete geometry of the human. Finally, in the Refinement stage, in-context inpainting further improves rendering quality on the less observed human body parts. We evaluate OccFusion on ZJU-MoCap and challenging OcMotion sequences and find that it achieves state-of-the-art performance in rendering occluded humans.


Method

OccFusion achieves occluded human rendering via three sequential stages. In the Initialization Stage, we recover complete binary human masks \(\{\mathbf{\hat{M}}\}\) from occluded partial observations \(\{\mathbf{I}\}\) with the help of segmentation priors \(\{\mathbf{M}\}\) and pose priors \(\{\mathbf{P}\}\); \(\{\mathbf{\hat{M}}\}\) is then used to supervise the optimization of the human Gaussians \(\mathbf{\pi}\) in subsequent stages. In the Optimization Stage, we apply \(\{\mathbf{P}\}\)-conditioned SDS to both the posed and the canonical human to enforce complete human occupancy. In the Refinement Stage, we use the coarse human renderings \(\{\mathbf{\hat{I}}\}\) from the Optimization Stage to generate the RGB values missing from \(\{\mathbf{I}\}\) through our proposed in-context inpainting. Through this process, both the appearance and the geometry of the human are fine-tuned to high fidelity. Training all three stages takes only 10 minutes on a single Titan RTX GPU.
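To make the SDS supervision in the Optimization Stage concrete, the sketch below shows one common way to compute an SDS loss on a rendering with a frozen, diffusers-style latent diffusion model. This is a minimal illustration under stated assumptions, not the paper's actual implementation: `unet` and `scheduler` are assumed to follow the Hugging Face diffusers interface, the pose conditioning \(\{\mathbf{P}\}\) is abstracted into the conditional embeddings, and the rendered image is assumed to have already been encoded into VAE latents by the differentiable Gaussian renderer pipeline.

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, cond_emb, uncond_emb, unet, scheduler,
             guidance_scale=50.0):
    """One Score-Distillation Sampling step on rendered-human latents.

    latents:    (B, 4, H, W) VAE latents of the current Gaussian rendering;
                gradients flow back into the Gaussian parameters through them.
    cond_emb:   conditioning embeddings (standing in here for the
                pose-conditioned guidance described in the paper).
    uncond_emb: null embeddings for classifier-free guidance.
    """
    b = latents.shape[0]
    # Sample a random diffusion timestep and corrupt the rendering with noise.
    t = torch.randint(20, 980, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)

    # Predict the noise with and without conditioning (no gradients through
    # the frozen diffusion model), then apply classifier-free guidance.
    with torch.no_grad():
        eps_cond = unet(noisy, t, encoder_hidden_states=cond_emb).sample
        eps_uncond = unet(noisy, t, encoder_hidden_states=uncond_emb).sample
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Standard SDS reparameterization: an MSE against a detached target whose
    # gradient w.r.t. the latents equals (eps - noise), the distilled score.
    target = (latents - (eps - noise)).detach()
    return 0.5 * F.mse_loss(latents, target, reduction="sum")
```

In a full pipeline, a loss of this form would be added to the photometric loss on the visible pixels in \(\{\mathbf{I}\}\) (masked by \(\{\mathbf{\hat{M}}\}\)), so that the diffusion prior shapes the occluded regions while the observed regions remain anchored to the input video.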



Results


Try selecting different methods and humans (you may need to double-click) and move the slider!


Citation

Acknowledgements

This work was partially funded by the Panasonic Holdings Corporation, Gordon and Betty Moore Foundation, Jaswa Innovator Award, Stanford HAI, and Wu Tsai Human Performance Alliance.

The website was built by Adam Sun; the template was borrowed from ReConFusion and code was borrowed from OCC-NeRF.