Creating State-of-the-Art NeRF Head Avatars in Minutes


About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

If time were no object, Neural Radiance Fields (NeRF) might by now have made greater inroads into potential commercial implementations – particularly in the field of human avatars and facial recreation.

As it stands, even the fastest recent implementations of NeRF-based facial recreation from user-provided video come in at around twenty minutes for training time, which puts pressure on the narrow window to capture and consolidate casual consumer interest.

Therefore a great deal of effort has been expended by the NeRF-avatar research subsector over the past 18 months to speed up the process of creating usable and dynamic NeRF face/head avatars, for possible use in AR/VR environments, virtual communications, and in other applications.

The latest breakthrough, just published in a new paper from the Department of Automation at China’s Tsinghua University, offers a usable NeRF avatar in two minutes, and a state-of-the-art NeRF avatar in five.

ManVatar (second from right) achieves state-of-the-art convergence in a fraction of the time of previous methods (left, and second from left). The ground truth (source) video is on the far right. Source: https://www.liuyebin.com/manvatar/manvatar.html

With increased research along these and similar lines, near-instantaneous photoreal self-representations seem to be on the horizon within the next year or so, depending on the compromises that will need to be struck between processing/training time, quality and versatility of the final result.

ManVatar

The new work is titled ManVatar: Fast 3D Head Avatar Reconstruction Using Motion-Aware Neural Voxels. The speed increase was achieved through the use of multiple 4D tensors in a 3D Morphable Face Model (3DMM).

As we’ve discussed before, 3DMMs are ‘regular’, parametric CGI models which are used to communicate with more problematic neural representations of faces, such as the photogrammetry-based NeRF, and the arcane latent space of Generative Adversarial Networks (GANs).

3DMMs are vector-based CGI faces which can act as a bridge between the user and the often difficult-to-control embeddings of NeRF, GANs, and other neural representations. Source: https://hal.inria.fr/hal-02280281v2/document

The source for the trained ManVatar representation is, as with prior works, a monocular (i.e. not 3D or dual-lensed) portrait video, such as one might take of oneself on a smartphone.

Conceptual overview of the ManVatar workflow.

The captured head pose and facial expressions are mapped onto a 3DMM template by identification of facial landmarks (using OpenSeeFace), and pre-processed. Each expression is then converted into a voxel grid, which is similar to the pixel grid of a JPEG or other type of image, except that the mapping is 3D and volumetric:

Voxels are volumetric representations of 3D points. If you remove or hide enough of them, you can create novel shapes in a manner similar to a sculptor.

The sum of these calculated expressions is then converted and averaged into a complete motion voxel grid, in which the voxels (represented by the squares in the images above) do not necessarily remain at their initial fixed points.
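As a rough illustration of how per-expression voxel grids might be merged into one motion grid, the sketch below assumes a simple coefficient-weighted sum; the paper's exact formulation may differ. The function name `blend_motion_grid`, the grid size and the channel count are hypothetical, and NumPy stands in for PyTorch to keep the example small.

```python
import numpy as np


def blend_motion_grid(expression_grids, coefficients):
    """Combine per-expression voxel grids into a single motion grid.

    expression_grids: shape (K, C, D, H, W), one grid per base expression.
    coefficients:     shape (K,), the expression weights for one frame.
    Returns an array of shape (C, D, H, W).
    """
    expression_grids = np.asarray(expression_grids, dtype=np.float32)
    coefficients = np.asarray(coefficients, dtype=np.float32)
    if expression_grids.shape[0] != coefficients.shape[0]:
        raise ValueError("need exactly one coefficient per expression grid")
    # Weighted sum over the expression axis (assumed blending rule).
    return np.tensordot(coefficients, expression_grids, axes=(0, 0))


# Hypothetical sizes: 32 base expressions, 4 channels, a 16x16x16 grid.
rng = np.random.default_rng(0)
grids = rng.normal(size=(32, 4, 16, 16, 16)).astype(np.float32)
coeffs = np.zeros(32, dtype=np.float32)
coeffs[3] = 1.0  # activate a single base expression
motion_grid = blend_motion_grid(grids, coeffs)
```

With only one coefficient switched on, the blended grid is simply that expression's own grid, which makes the weighting easy to sanity-check.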


ManVatar also calculates a ‘canonical’ appearance – a representation of the face that contains a ‘neutral’ pose and expression – a ‘base’ against which deformations (i.e. changes in facial and head pose) can be calculated.
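The idea of a canonical ‘base’ plus deformations can be sketched as a lookup into a motion grid followed by a shift into canonical space. This is a simplified illustration, not the paper's method: it assumes the grid stores a 3D offset directly in three channels, and the helper names `trilinear_sample` and `to_canonical` are hypothetical. Each spatial dimension of the grid must have at least two voxels.

```python
import numpy as np


def trilinear_sample(grid, point):
    """Sample a (C, D, H, W) grid at continuous voxel coordinates (z, y, x)."""
    grid = np.asarray(grid, dtype=np.float64)
    upper = np.array(grid.shape[1:]) - 1           # last valid index per axis
    p = np.clip(np.asarray(point, dtype=np.float64), 0.0, upper)
    lo = np.clip(np.floor(p).astype(int), 0, upper - 1)  # lower cell corner
    t = p - lo                                     # fractional position in cell
    out = np.zeros(grid.shape[0])
    for dz in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                weight = ((t[0] if dz else 1.0 - t[0])
                          * (t[1] if dy else 1.0 - t[1])
                          * (t[2] if dx else 1.0 - t[2]))
                out += weight * grid[:, lo[0] + dz, lo[1] + dy, lo[2] + dx]
    return out


def to_canonical(point, motion_grid):
    """Shift an observed point by the offset stored in the motion grid."""
    offset = trilinear_sample(motion_grid, point)  # assumed (dz, dy, dx)
    return np.asarray(point, dtype=np.float64) + offset


# A grid whose single channel equals the x index, to check interpolation.
ramp = np.zeros((1, 2, 2, 4))
ramp[0] = np.arange(4.0)
ramp_value = trilinear_sample(ramp, (0.5, 0.5, 1.25))

# A constant offset field: every point moves by (0.5, 0.0, -1.0).
offsets = np.zeros((3, 2, 2, 2))
offsets[0] = 0.5
offsets[2] = -1.0
canonical_point = to_canonical((0.2, 0.4, 0.6), offsets)
```

Storing motion in a grid means that the per-point work at query time is a handful of memory reads and multiplications, rather than a pass through a deep network.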

The processed data in this workflow is finally passed to a very slim two-layer multilayer perceptron (MLP), which produces the final portrait image via volume rendering. Training large MLPs has been the traditional bottleneck in NeRF generation; using such a shallow MLP for ManVatar, and concentrating the effort on derived voxels, is key to the speed of the system.
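For readers unfamiliar with volume rendering, the sketch below shows the standard NeRF-style compositing of density and color samples along a single camera ray. It illustrates the general technique rather than ManVatar's own code; the function name `composite_ray` and the sample values are hypothetical.

```python
import numpy as np


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray (standard NeRF quadrature).

    densities: (N,) volume density at each sample.
    colors:    (N, 3) RGB color at each sample.
    deltas:    (N,) distance between consecutive samples.
    Returns the rendered RGB color and the per-sample weights.
    """
    densities = np.asarray(densities, dtype=np.float64)
    colors = np.asarray(colors, dtype=np.float64)
    deltas = np.asarray(deltas, dtype=np.float64)
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i unblocked.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = transmittance * alpha
    return weights @ colors, weights


# Empty space, then an opaque green surface that hides a blue sample behind it.
sample_colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel, ray_weights = composite_ray([0.0, 1e9, 5.0], sample_colors, [0.1, 0.1, 0.1])
```

In a full pipeline the densities and colors would come from the small MLP, queried at points sampled along each ray.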


The optimized nature of the workflow means that at inference time (i.e., when processing is done and it’s time to animate the head), it’s now possible to generate photoreal portraits from mere expression coefficients and base head poses; and very, very quickly.

Prior approaches have not sought or been able to separate facial expressions from the base geometry of the captured subject. This has obstructed previous attempts at fast convergence, due to the high volume of data entailed in this enmeshment.

Instead, ManVatar creates pose and expression as divergences from a base ‘neutral’ default, allowing for a more lightweight implementation, where the voxel grids are doing the work previously undertaken, at greater expense of time and resources, by MLPs.

The process is further optimized by background and body/neck removal, which produces a representation from the lower neck upwards. The researchers found that the motion-aware neural voxels obtained by the process were useful not only in representing expressions, but as contributors to the ‘base’, neutral expression of the canonical pose – a further optimization of resources.

Method and Tests

Entirely implemented in PyTorch, ManVatar uses the initial 32 base expressions in the Basel Face Model.

Registration stages on the Basel face model used as a 3DMM intermediary for ManVatar. Source: https://arxiv.org/pdf/1709.08398.pdf
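To make the role of base expressions concrete, the sketch below shows the linear blend that PCA-style morphable models generally use: a mean face plus a coefficient-weighted sum of expression offsets. It is a toy illustration with random data, not the Basel Face Model itself; the name `morph_face` and the vertex count are hypothetical.

```python
import numpy as np


def morph_face(mean_vertices, expression_basis, coefficients):
    """Apply expression coefficients to a linear morphable model.

    mean_vertices:    (V, 3) neutral face vertices.
    expression_basis: (K, V, 3) per-expression vertex offsets.
    coefficients:     (K,) expression weights.
    """
    mean_vertices = np.asarray(mean_vertices, dtype=np.float64)
    expression_basis = np.asarray(expression_basis, dtype=np.float64)
    coefficients = np.asarray(coefficients, dtype=np.float64)
    expected = (coefficients.shape[0],) + mean_vertices.shape
    if expression_basis.shape != expected:
        raise ValueError(f"basis shape {expression_basis.shape} != {expected}")
    offsets = np.tensordot(coefficients, expression_basis, axes=(0, 0))
    return mean_vertices + offsets


# Toy model: 5 vertices and 32 expression bases filled with random offsets.
rng = np.random.default_rng(0)
mean = rng.normal(size=(5, 3))
basis = rng.normal(size=(32, 5, 3))
neutral = morph_face(mean, basis, np.zeros(32))
coeffs = np.zeros(32)
coeffs[7] = 2.0  # exaggerate a single expression
expressive = morph_face(mean, basis, coeffs)
```

Zero coefficients return the neutral face, so an expression is fully described by a short vector of weights.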

The tests were conducted on an NVIDIA 3090 GPU, with the model trained for 10,000 iterations under the Adam optimizer. In an increasingly common practice in computer vision, initial training took place on images at 256×256px resolution, with the last 4,000 iterations using 512×512px resolution.
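The coarse-to-fine schedule described above can be expressed in a few lines. The sketch assumes a hard switch at iteration 6,000 (10,000 minus the final 4,000), which is inferred from the figures quoted rather than stated in the paper; `training_resolution` is a hypothetical helper.

```python
def training_resolution(iteration, total_iters=10_000, high_res_iters=4_000,
                        low_res=256, high_res=512):
    """Return the square image resolution to train on at a given iteration."""
    if not 0 <= iteration < total_iters:
        raise ValueError(f"iteration must be in [0, {total_iters})")
    # The final `high_res_iters` iterations use the larger images.
    switch_point = total_iters - high_res_iters
    return high_res if iteration >= switch_point else low_res


# Resolution at the start, just before the switch, at the switch, and at the end.
schedule = [training_resolution(i) for i in (0, 5_999, 6_000, 9_999)]
```

Spending most iterations on smaller images keeps early training cheap, while the final stretch recovers fine detail.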

Eight training videos were used, including four from the HDTF dataset. The remaining videos were created by the researchers using a hand-held mobile phone.

ManVatar was tested against three comparable SOTA approaches: Deep Video Portraits (DVP), which synthesizes 2D images instead of reconstructing a full head model; I M Avatar, which creates an implicit Signed Distance Field (SDF) based on a FLAME model; and NerFACE, which also reconstructs a NeRF model from images with 3DMM-based data, similar to ManVatar.

NerFACE, like ManVatar, uses 3DMM as an intermediary to gain control of the reconstruction process. Source: https://gafniguy.github.io/4D-Facial-Avatars/

All three competing methods were trained to the same level of convergence as ManVatar (i.e., trained enough so that the models were considered usable and high-quality). Though it presents no corresponding graph, the paper reports that I M Avatar and DVP each took an entire day to converge; that NerFACE took 12 hours; and that ManVatar took five minutes.
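The reported training times translate into speed-up factors as follows. The calculation assumes that ‘an entire day’ means 24 hours; the variable names are illustrative.

```python
def speedup(baseline_minutes, new_minutes):
    """How many times faster the new method is than the baseline."""
    if new_minutes <= 0:
        raise ValueError("new_minutes must be positive")
    return baseline_minutes / new_minutes


MANVATAR_MINUTES = 5
# Reported convergence times, converted to minutes.
reported_minutes = {
    "I M Avatar": 24 * 60,  # 'an entire day', assumed to be 24 hours
    "DVP": 24 * 60,         # 'an entire day', assumed to be 24 hours
    "NerFACE": 12 * 60,     # 12 hours
}
speedups = {name: speedup(minutes, MANVATAR_MINUTES)
            for name, minutes in reported_minutes.items()}
```

On these figures, ManVatar converges 288 times faster than I M Avatar and DVP, and 144 times faster than NerFACE.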

In qualitative terms, the authors assert that NerFACE achieves results comparable to those of ManVatar, but with a greatly increased training time.

Qualitative results from the researchers' tests.

The authors state:

‘The results validate that ManVatar achieves the highest render quality while the training time is far less than the other methods. IMAvatar reconstructs an implicit model based on a FLAME template, yet the expressiveness is insufficient. Therefore, they can hardly learn person-specific expression details. DVP inherits the GAN framework and relies on a 2D convolutional network to generate images. But in many cases, the generated details are not appropriate.’

Quantitative tests were also conducted on four popular metrics: Mean Squared Error (MSE); Peak Signal-to-Noise Ratio (PSNR); Structural Similarity Index (SSIM); and Learned Perceptual Image Patch Similarity (LPIPS). Here, ManVatar achieved results comparable to NerFACE on SSIM and LPIPS, and superior results on the remaining two metrics, MSE and PSNR.

Quantitative tests across four popular metrics. ManVatar achieves parity with NerFACE, despite requiring just a tiny fraction of that framework's training time.
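Two of these metrics are simple enough to compute by hand. The sketch below implements MSE and PSNR for images with pixel values in the range [0, 1]; SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network respectively. The function names and the test image are illustrative.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error between two images of identical shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((pred - target) ** 2))


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in decibels (higher is better)."""
    error = mse(pred, target)
    if error == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / error))


# A uniform error of 0.1 on a black 8x8 RGB image.
target_image = np.zeros((8, 8, 3))
rendered_image = np.full((8, 8, 3), 0.1)
mse_value = mse(rendered_image, target_image)
psnr_value = psnr(rendered_image, target_image)
```

Because PSNR is a logarithmic function of MSE, the two metrics always rank methods in the same order on a given image; the perceptual metrics can disagree with them.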

Further tests were conducted regarding training speed, this time against two directly comparable NeRF-based methods: NeRFBlendShape and, again, NerFACE. In this case, the finish line was defined by the time it took ManVatar to complete convergence:

In the far right column are the ground truth (source) images.

For these tests, the researchers used an official video from NeRFBlendShape as a guide to its progress through training.

The researchers note that despite the stated claims of 12 hours for training, NerFACE was able to converge within ‘a few hours’. Nonetheless, this is still a huge increase over the five-minute training time for ManVatar. The authors note that the final three minutes are primarily for ‘finish’, and that the ManVatar avatar is essentially usable after only two minutes.
