TY - CONF
T1 - DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis
AU - Xu, Yinghao
AU - Chai, Menglei
AU - Shi, Zifan
AU - Peng, Sida
AU - Skorokhodov, Ivan
AU - Siarohin, Aliaksandr
AU - Yang, Ceyuan
AU - Shen, Yujun
AU - Lee, Hsin-Ying
AU - Zhou, Bolei
AU - Tulyakov, Sergey
N1 - Acknowledgements: We thank Jiatao Gu, Willi Menapace, Jian Ren, Panos Achlioptas, Tai Wang, and Zian Wang for fruitful discussions and comments about this work.
PY - 2023/8/22
Y1 - 2023/8/22
AB - Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3D-aware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general to describe various scene contents, and yet informative to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with the global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset.
UR - http://hdl.handle.net/10754/693778
UR - https://ieeexplore.ieee.org/document/10204044/
U2 - 10.1109/cvpr52729.2023.00428
DO - 10.1109/cvpr52729.2023.00428
M3 - Conference contribution
BT - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
PB - IEEE
ER -