TY - GEN
T1 - SPAD: Spatially Aware Multi-View Diffusers
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Kant, Yash
AU - Siarohin, Aliaksandr
AU - Wu, Ziyi
AU - Vasilkovsky, Michael
AU - Qian, Guocheng
AU - Ren, Jian
AU - Guler, Riza Alp
AU - Ghanem, Bernard
AU - Tulyakov, Sergey
AU - Gilitschenski, Igor
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pre-trained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high-quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g., MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plücker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason well over spatial proximity in 3D. Compared to concurrent works that can only generate views at fixed azimuth and elevation (e.g., MVDream, SyncDreamer), SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue.
AB - We present SPAD, a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation, we repurpose a pre-trained 2D diffusion model by extending its self-attention layers with cross-view interactions, and fine-tune it on a high-quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g., MVDream) leads to content copying between views. Therefore, we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency, we utilize Plücker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason well over spatial proximity in 3D. Compared to concurrent works that can only generate views at fixed azimuth and elevation (e.g., MVDream, SyncDreamer), SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue.
KW - diffusion
KW - novel view synthesis
UR - http://www.scopus.com/inward/record.url?scp=85201973066&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.00956
DO - 10.1109/CVPR52733.2024.00956
M3 - Conference contribution
AN - SCOPUS:85201973066
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 10026
EP - 10038
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
PB - IEEE Computer Society
Y2 - 16 June 2024 through 22 June 2024
ER -