Mohaghegh, Hoda and Rahmani, Hossein and Bennamoun, Mohammed (2023) Robust monocular 3D face reconstruction under challenging viewing conditions. Neurocomputing, 520. pp. 82-93. ISSN 0925-2312
Elsevier_s_CAS_Revision_Final.pdf - Accepted Version
Available under License Creative Commons Attribution-NonCommercial-NoDerivs.
Download (2MB)
Abstract
Despite extensive research, 3D face reconstruction from a single image remains an open research problem due to the high degree of variability in pose, occlusions and complex lighting conditions. While deep learning-based methods have achieved great success, they are usually limited to near frontal images and images that are free of occlusions. Also, the lack of diverse training data with 3D annotations considerably limits the performance of such methods. As such, existing methods fail to recover, with high fidelity, the facial details especially when dealing with images captured under extreme conditions. To address this issue, we propose an unsupervised coarse-to-fine framework for the reconstruction of 3D faces with detailed textures. Our core idea is that multiple images of the same person but captured under different viewing conditions should provide the same 3D face. We thus propose to leverage a self-augmentation learning technique to train a model that is robust to diverse variations. In addition, instead of directly employing image pixels, we use a set of discriminative features describing the identity and attributes of the face as input to the refinement module, making the model invariant to viewing conditions. This combination of self-augmentation learning with rich face-related features allows the reconstruction of plausible facial details even under challenging viewing conditions. We train the model end-to-end and in a self-supervised manner, without any 3D annotations, landmarks or identity labels, using a combination of an image-level photometric loss and a perception-level loss that is identity and attribute-aware. We evaluate the proposed approach on CelebA and AFLW2000 datasets, and demonstrate its robustness to appearance variations despite learning from unlabeled images. The qualitative comparisons indicate that our method produces detailed 3D faces even under extreme occlusions, out of plane rotations and noise perturbations where existing state-of-the-art methods often fail. We also quantitatively show that our method outperforms SOTA with more than 30.14%, 9.87% and 11.3% in terms of PSNR, SSIM and IDentity similarity, respectively.