NeurIPS 2022

Controllable 3D Face Synthesis with Conditional Generative Occupancy Fields

Keqiang Sun, Shangzhe Wu, Zhaoyang Huang, Ning Zhang, Quan Wang, Hongsheng Li

Abstract

Capitalizing on recent advances in image generation models, existing controllable face image synthesis methods can generate high-fidelity images with some level of controllability, e.g., over the shapes, expressions, textures, and poses of the generated face images. However, these methods focus on 2D image generative models, which are prone to producing inconsistent face images under large expression and pose changes. In this paper, we propose a new NeRF-based conditional 3D face synthesis framework, which enables 3D controllability over the generated face images by imposing explicit 3D conditions from 3D face priors. At its core is a conditional Generative Occupancy Field (cGOF) that effectively enforces the shape of the generated face to conform to a given 3D Morphable Model (3DMM) mesh. To achieve accurate control over the fine-grained 3D face shapes of the synthesized images, we additionally incorporate a 3D landmark loss as well as a volume warping loss into our synthesis algorithm. Experiments validate the effectiveness of the proposed method, which generates high-fidelity face images and shows more precise 3D controllability than state-of-the-art 2D-based controllable face synthesis methods. Code and more demos are available at https://keqiangsun.github.io/projects/cgof.

Introduction

The recent success of Generative Adversarial Networks (GANs) [13] has led to tremendous progress in face image synthesis. State-of-the-art methods, such as StyleGAN [21, 22, 20], are capable of generating photo-realistic face images. Apart from photo-realism, being able to control the appearance of the generated images is also key in many real-world applications, such as face animation, reenactment, and free-viewpoint rendering. Early works on controllable face synthesis rely on external attribute annotations to learn an attribute-guided face image generation model [27, 11, 55]. However, these attributes, such as "big nose", "chubby" and "smiling" in the CelebA dataset [26], can only provide abstract semantic-level guidance on the generation, and the generated faces often lack 3D geometric consistency. Moreover, it is often much harder to obtain low-level geometric annotations beyond semantic labels for direct 3D supervision.

Recently, researchers have attempted to incorporate 3D priors from parametric face models, such as 3D Morphable Models (3DMMs) [2, 38], into StyleGAN-based synthesis models, allowing for more precise 3D control over the generated images, including facial expressions and head poses [8, 52, 39]. Despite their impressive image quality, these models still tend to produce 3D-inconsistent faces under large expression and pose variations due to the lack of a 3D representation, as shown in Fig. 1.
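
To make the kind of explicit control offered by a parametric face model concrete, the following is a minimal sketch, assuming the common linear form of a 3DMM in which mesh vertices are a mean shape plus weighted identity and expression bases, followed by a rigid head rotation. It is an illustration only and not the method or code of this paper: the bases are random toy values rather than a real face model, and the names make_toy_3dmm, mesh_vertices, and rotate_yaw are hypothetical.

```python
import math
import random


def make_toy_3dmm(num_vertices=4, num_id=3, num_exp=2, seed=0):
    """Build a toy linear 3DMM. All values are random placeholders (assumption)."""
    rng = random.Random(seed)
    n = 3 * num_vertices  # flattened (x, y, z) coordinates per vertex
    mean = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    id_basis = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(num_id)]
    exp_basis = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(num_exp)]
    return mean, id_basis, exp_basis


def mesh_vertices(model, alpha, beta):
    """Linear 3DMM: S = mean + sum_i alpha_i * id_i + sum_j beta_j * exp_j."""
    mean, id_basis, exp_basis = model
    flat = list(mean)
    for coeff, basis in list(zip(alpha, id_basis)) + list(zip(beta, exp_basis)):
        for k in range(len(flat)):
            flat[k] += coeff * basis[k]
    # Regroup the flat coordinate list into (x, y, z) vertices.
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def rotate_yaw(vertices, yaw):
    """Rotate vertices about the vertical (y) axis by `yaw` radians (head pose)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in vertices]


model = make_toy_3dmm()
identity = [0.5, -0.2, 0.1]                                 # fixed identity coefficients
neutral = mesh_vertices(model, identity, [0.0, 0.0])        # neutral expression
expressive = mesh_vertices(model, identity, [1.0, 0.0])     # change expression only
turned = rotate_yaw(neutral, math.radians(30.0))            # change pose only
```

In this sketch, changing only the expression coefficients or only the yaw angle changes expression or pose while the identity coefficients stay fixed, which is the sense in which a 3DMM provides disentangled, explicit 3D conditions.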