Using Diffusion Models to Create Superior NeRF Avatars

About the author

Martin Anderson


I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


A new research paper from the Hong Kong University of Science and Technology (HKUST) and Microsoft Research offers a rational and less resource-intensive way to generate 3D representations of people, by combining latent diffusion and NeRF-based approaches into a single framework.

As seen in the leftmost lower image above, the system can use real-world, genuine facial images to produce versatile avatars, but also allows the user to control the appearance of the representation through text-based prompting for semantic manipulation, in a manner similar to various strands of functionality in Stable Diffusion. Source: https://arxiv.org/pdf/2212.06135.pdf

The end user can not only ‘invent’ ad hoc avatars and representations by describing the person to be depicted (text-to-avatar), but can also use text prompts to revise the appearance of the final product.

Text-guided avatar manipulation with RODIN.

Crucially, the new system, titled Roll-Out Diffusion Network (RODIN), is trained not on web-scraped faces, but on synthetic data created by the open source Blender project (via the Microsoft Fake It Til You Make It initiative), obviating the various eventual legal implications that could emerge in developing synthesis systems from random, publicly posted images, or from datasets whose legality and applicability for computer vision uses is beginning to be questioned.

Examples of faces created in Blender, which solves issues around diversity and landmark accuracy by using traditional CGI techniques to inform neural frameworks. Source: https://arxiv.org/pdf/2109.15102.pdf

For the project, 100,000 synthetic, Blender-originated human avatars were used as training data.

Tackling the 3D Avatar

In the last few years there has been a proliferation of projects using Generative Adversarial Networks (GANs) to achieve these kinds of syntheses, using portrait inversion (a broadly applicable technique that can ‘project’ a novel image of a person into a trained network so that it can be viewed and, in some cases, edited within the latent space).

Examples of portrait inversion with RODIN, where real-world images are converted into configurable and editable neural avatars. The grey 'CGI' style images are meshes extracted from 2D information with 'marching cubes', a digital reconstruction technology first proposed by General Electric in 1987 (see http://fab.cba.mit.edu/classes/S62.12/docs/Lorensen_marching_cubes.pdf).

However, GAN-based projects along similar lines have struggled either to develop adequate instrumentality (i.e., the means to control and manipulate the output), or to adequately disentangle the various attributes of an image (i.e., changing the hair color of a person may also change non-hair elements of the image).

Conversely, most of the NeRF-based approaches, while offering greater editability (because NeRF contains more explicit and addressable 3D-centric information than GANs), have tended to either be computationally resource-intensive and/or time-consuming to train; have failed to reproduce detail adequately, or to output a suitably high-resolution product; or else have had similar problems to GANs in terms of entanglement.

The researchers compared their new approach to three prior works that blended GAN methodologies with NeRF output: Stanford University’s Pi-GAN; GIRAFFE, from the Max Planck Institute for Intelligent Systems and the University of Tübingen; and EG3D, a collaboration between Stanford University and NVIDIA – as well as to a generic autoencoder approach, obtaining notably lower Fréchet Inception Distance (FID) results than the older works.

Fréchet Inception Distance results against former approaches (lower numbers are better). To calculate FID, features extracted from the CLIP stage of the pipeline were used, in accordance with prior work (https://arxiv.org/pdf/2203.06026.pdf)
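As a rough illustration of the metric in play here, FID compares the mean and covariance of feature vectors drawn from two image sets. The sketch below uses random stand-in vectors rather than actual CLIP features, and is not the paper's evaluation code:

```python
# Minimal FID sketch: distance between two sets of feature vectors.
# The random vectors below stand in for real CLIP/Inception features.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """FID = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 16))
fake = rng.normal(0.5, 1.0, size=(500, 16))  # shifted distribution
print(fid(real, real))   # near zero: identical distributions
print(fid(real, fake))   # larger: distributions differ
```

A lower score means the generated distribution sits closer to the real one, which is why RODIN's lower FID values indicate an improvement over the older works.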

Addressing the Quality Gap

RODIN, among the earliest systems to incorporate latent diffusion into an avatar-compatible generative pipeline, is able to output 1024×1024 resolution, thanks to a complex hierarchical generation architecture, which includes upscaling modules, as well as leveraging a large array of prior works.

RODIN's results compared to the prior methods.

In recent months, Google’s Imagen Video project has illustrated the current trend for multilayer upscaling architectures, which bridge the resolution gap between training pipelines that are almost universally native at 512px or lower, and the need to generate HD content. Imagen Video upscales from a paltry 24x48px native resolution to 1280x768px across three layers of upscaling modules.

Likewise, RODIN uses a hierarchy of upscaling algorithms to arrive at its 1024px resolution, preceded by upscaling layers that increase resolution from 64px to 256px, and then to 512px.

During the ‘fitting’ stage, where the input material (such as a photo of a user) is adapted to the network, the upscaling routines randomly resample the image data to 64px and 256px, in order to ensure that the encoder (which will generate the ‘useful imagery’ that comprises the avatar) is robust to this ‘tri-plane’ workflow.

The benefits of applying random scaling.
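The random-resolution trick can be pictured as a simple augmentation: feature planes are resampled to a coarser grid and back, so the encoder must tolerate multiple scales. This is an illustrative sketch, with plane sizes and nearest-neighbour resampling as assumptions, not the paper's implementation:

```python
# Sketch of random multi-scale resampling of a feature plane.
# Plane size (512px) and the 64/256px targets follow the article's figures;
# the nearest-neighbour resize is an assumption for illustration.
import numpy as np

def resize_nearest(plane, size):
    """Nearest-neighbour resize of an (H, W, C) feature plane."""
    h, w = plane.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return plane[rows][:, cols]

def random_rescale(plane, rng, scales=(64, 256)):
    """Resample a plane down to a random coarser scale, then back up."""
    target = rng.choice(scales)
    coarse = resize_nearest(plane, target)
    return resize_nearest(coarse, plane.shape[0])

rng = np.random.default_rng(0)
plane = rng.normal(size=(512, 512, 32)).astype(np.float32)
augmented = random_rescale(plane, rng)
print(augmented.shape)  # same spatial size, but detail has been resampled
```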

Among several other ‘traditional’ bottlenecks addressed by RODIN, the researchers have made economies by adopting patch-wise training, where the convolutional neural network (CNN) being used operates on representative sections of an image, instead of on entire images.

The authors believe that RODIN’s approach may eventually be implementable for more than just avatars, and conclude:

‘While this paper only focuses on avatars, the main ideas behind the Rodin model are applicable to the diffusion model for general 3D scenes. Indeed, the prohibitive computational cost has been a challenge for 3D content creation. An efficient 2D architecture for performing coherent and 3D-aware diffusion in 3D is an important step toward tackling the challenge.’

Three-Pronged Approach

The system operates by taking a neural volume representation of a face and unpacking it into a series of 2D feature planes:

From a prior paper, 'Neural Volume Super-Resolution', we see feature planes concatenated into a 3D volume (Source: https://arxiv.org/pdf/2212.04666.pdf). This approach is used in RODIN. Inset right, we see a 'feature map' extracted from source imagery in a trained convolutional neural network (source: https://medium.com/@chriskevin_80184/feature-maps-ee8e11a71f9e). Systems that manipulate trained networks for neural image synthesis rely on 'feature activation' – the targeting of specific facets inside the latent space, either so that they can be isolated and presented in some way in an interface, or so that other activated features can be 'projected' through them, allowing for transformations to be enacted.

RODIN relies on three core architectural features: 3D-aware convolution; latent conditioning; and, as we have already seen, hierarchical synthesis.

Regarding 3D-aware convolution, this is the process where a CNN rationalizes the 2D inputs and enables cross-plane communication, helping to assemble the source material into 3D-aware data. This helps to synchronize the details that are common across all the images and to form a coherent 3D representation (see images above for examples of ‘planes’, from a prior and unrelated project).
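The underlying tri-plane idea can be sketched briefly: a 3D point is projected onto three axis-aligned feature planes, and the three retrieved feature vectors are combined. The resolutions, channel counts and nearest-neighbour lookup below are illustrative assumptions:

```python
# Sketch of a tri-plane feature lookup: project a 3D point onto the
# XY, XZ and YZ planes and sum the three feature vectors.
import numpy as np

R, C = 64, 8  # plane resolution and feature channels (assumed values)
planes = {k: np.random.default_rng(i).normal(size=(R, R, C))
          for i, k in enumerate(("xy", "xz", "yz"))}

def to_index(coord):
    """Map a coordinate in [-1, 1] to a grid index (nearest neighbour)."""
    return int(np.clip(round((coord + 1.0) * 0.5 * (R - 1)), 0, R - 1))

def query(point):
    x, y, z = (to_index(c) for c in point)
    return planes["xy"][x, y] + planes["xz"][x, z] + planes["yz"][y, z]

feat = query(np.array([0.1, -0.3, 0.7]))
print(feat.shape)  # one feature vector per queried 3D point
```

Treating a volume as three 2D planes is what lets a 2D diffusion backbone operate on what is ultimately 3D content.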

However, this is not enough for coherency in avatar output. Therefore RODIN uses an additional image encoder trained on the Blender-generated avatars. The extensive latent information drawn from 100,000 images provides a consistent rationale by which the user-submitted image (whether a real image or a text-to-image avatar) can be conformed to a consistent visual standard. The researchers call this process latent conditioning.

Examples of Blender-generated synthetic faces, from the original Microsoft paper. The new RODIN research does not include examples of the 100,000 avatars generated for the latent conditioning phase. Source: https://arxiv.org/pdf/2109.15102.pdf

Regarding latent conditioning, the authors state:

‘The latent conditioning not only leads to higher generation quality but also permits a disentangled latent space, thus allowing semantic editing of generated results.’

The tri-plane representation technique used in RODIN is derived from Neural Volume Super-Resolution, a recent collaboration from Princeton University and the University of Siegen.

Keeping the source data (i.e., for a submitted portrait inversion) in the same domain requires some additional effort, so that the fitting stage produces coherent output. This is accomplished in RODIN with a shared multi-layer perceptron (MLP) decoder (see our article on autoencoder synthesis for more details on shared encoders) that ‘pushes’ the tri-plane features into the shared latent space.

Since the data is likely to exhibit at least some inconsistencies, the MLP decoder has to be tolerant of these. The aforementioned random upscaling and downscaling helps this component to become more robust to abstract differences between the planes, pulling all the data into a coherent representation.
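The role of the shared decoder can be pictured as one small network that maps any tri-plane feature vector to density and colour, forcing all avatars' features into one common space. The layer sizes and activations below are assumptions for illustration, not RODIN's actual decoder:

```python
# Sketch of a shared MLP decoder: the same weights decode every feature
# vector to (density, rgb). Layer sizes and activations are assumed.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 4)) * 0.1, np.zeros(4)

def decode(feature):
    """Map one tri-plane feature vector to a density and an RGB colour."""
    h = np.maximum(feature @ W1 + b1, 0.0)   # ReLU hidden layer
    out = h @ W2 + b2
    density = np.log1p(np.exp(out[0]))       # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))     # sigmoid keeps colour in [0, 1]
    return density, rgb

density, rgb = decode(rng.normal(size=8))
print(rgb.shape)  # a 3-channel colour per decoded point
```

Because the decoder is shared rather than per-subject, any feature it is asked to decode must already conform to the common latent space – which is exactly why the fitting stage has to push new inputs into that space.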

From the paper, non-cherrypicked avatar generations from RODIN.

Beyond Memorization

One useful feature of having control over the training data (i.e., not having to rely on unbalanced, web-scraped datasets of real people, such as LAION or ImageNet) is that the resulting trained systems are capable of creating genuinely diverse representations.

The opposite of this is memorization, where the data is either too scant or too similar to provide the system with enough choices to generate diverse output, and where, instead, it will tend to repeat the data that it knows about.

The RODIN researchers have tested this capacity by generating the ‘nearest neighbors’ for a series of avatars (seen in the image below). If the system had become subject to memorization (a form of overfitting), the adjacent images to the avatars would be ‘variations’ on them; but as we can see, the nearest neighbors are very diverse indeed.
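A nearest-neighbour memorization check of this kind can be sketched simply: for each generated sample, find the closest training sample in some feature space, and see whether the distances collapse toward zero. The random vectors below are stand-ins for real image features:

```python
# Sketch of a nearest-neighbour memorization check. If a generator has
# memorized its data, generated features sit almost on top of training
# features; here random stand-ins are used for both sets.
import numpy as np

def nearest_neighbours(generated, training):
    """Closest training feature (index and squared distance) per sample."""
    # Pairwise squared Euclidean distances, shape (G, T).
    d = ((generated[:, None, :] - training[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1), d.min(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 16))
gen = rng.normal(size=(10, 16))
idx, dist = nearest_neighbours(gen, train)
print(dist)  # non-trivial distances suggest novel, not memorized, output
```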

CLIP, CFG and Editability

To allow the end-user to make text-based adjustments to generations, the researchers have used OpenAI’s CLIP (Contrastive Language-Image Pre-training) system. CLIP equates images and derived features with related text, so that it’s possible to use natural language to create or alter images.

The same mechanism allows RODIN to create avatars out of thin air by simple description:

Avatars summoned into existence with text.

The CLIP encoder used is ‘frozen’, which means that its weights are not updated while the rest of the pipeline trains on new data; instead, it provides fixed outputs based on its own prior training. Used centrally in OpenAI’s DALL-E 2, CLIP is a core feature in latent diffusion models.
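The frozen-encoder idea can be sketched in a few lines: fixed projections embed text and images into a shared space, where cosine similarity measures agreement. The toy 'encoders' below are random matrices standing in for CLIP's actual networks:

```python
# Sketch of frozen-encoder conditioning: fixed (never-trained) projections
# produce unit-norm embeddings, compared by cosine similarity, in the
# manner of CLIP. The random matrices are stand-ins for real encoders.
import numpy as np

rng = np.random.default_rng(0)
PROJ_TEXT = rng.normal(size=(100, 16))   # frozen: never updated in training
PROJ_IMAGE = rng.normal(size=(100, 16))  # frozen: never updated in training

def embed(vec, proj):
    e = vec @ proj
    return e / np.linalg.norm(e)         # CLIP-style unit-norm embedding

def similarity(a, b):
    return float(a @ b)                  # cosine similarity of unit vectors

text = embed(rng.normal(size=100), PROJ_TEXT)
image = embed(rng.normal(size=100), PROJ_IMAGE)
print(similarity(text, image))  # always within [-1, 1]
```

Because the projections never change, the rest of the pipeline can rely on the embedding space staying stable throughout training.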

Another important latent diffusion functionality, and one that will be familiar to users of Stable Diffusion, is Google Research’s Classifier-Free Guidance (CFG), which we took a look at in our recent examination of the InstructPix2Pix framework.

CFG allows the user to boost the fidelity of the generative system by forcing it to adhere to the user’s text prompts, and restricting its ability to freely interpret the submitted text. The flipside of this useful feature is that as the amount of CFG is increased, the quality of the output is likely to become more taut, stylized, or even to begin to tear and notably degrade (see our article on full-body deepfakes for examples). Used with restraint, however, CFG allows the user to strike a balance between accuracy of interpretation (i.e., fidelity to the prompt) and authenticity of the result.
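The mechanics of CFG are compact enough to show directly: the model's unconditional and conditional predictions are blended, with the guidance scale pushing the result toward (or past) the prompt-conditioned prediction. The 2D 'noise predictions' below are toy values:

```python
# Sketch of classifier-free guidance: blend unconditional and conditional
# noise predictions; w = 1 recovers the plain conditional prediction, and
# larger w exaggerates adherence to the prompt.
import numpy as np

def guided_prediction(eps_uncond, eps_cond, w):
    """eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 0.0])   # toy unconditional prediction
eps_c = np.array([1.0, -1.0])  # toy prompt-conditioned prediction
print(guided_prediction(eps_u, eps_c, 1.0))  # plain conditional output
print(guided_prediction(eps_u, eps_c, 3.0))  # extrapolated past it
```

The extrapolation beyond the conditional prediction at high guidance scales is exactly the mechanism behind the 'taut', over-stylized artifacts described above.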

In a qualitative ablation study, the RODIN researchers found that removing CFG had a deleterious effect on output:

Lower numbers are better. Among other results in a qualitative study, removing CFG from the RODIN avatar generation process adversely affected the outcomes.

Architecture

The diffusion model used by RODIN makes use of the U-Net model employed in OpenAI’s Guided Diffusion research. The diffusion model was trained with the AdamW optimizer at a batch size of 48 and a learning rate of 5e-5, while the upsampling diffusion model was trained similarly, but at a batch size of 16.

The base diffusion model used 1,000 diffusion steps (i.e., the number of noising steps in the diffusion process), while the upsampling model used 100 steps, with a linear noise schedule.

For inference, both models used 100 diffusion steps (somewhere between 75 and 150 is a common range for inference). All the tests were performed on NVIDIA Tesla V100 GPUs with 32GB of VRAM.
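The reported hyper-parameters and the linear noise schedule can be sketched as follows. The schedule endpoints below are common DDPM-style defaults, not values taken from the paper:

```python
# Sketch of the reported training set-up: lr 5e-5, 1,000 base diffusion
# steps, 100 upsampler steps, linear noise schedule. Beta endpoints are
# common defaults and an assumption, not the paper's values.
import numpy as np

LEARNING_RATE = 5e-5    # base model, AdamW, batch size 48 per the paper
BASE_STEPS = 1000       # diffusion timesteps for the base model
UPSAMPLER_STEPS = 100   # diffusion timesteps for the upsampling model

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced per-step noise variances."""
    return np.linspace(beta_start, beta_end, steps)

betas = linear_beta_schedule(BASE_STEPS)
alphas_cumprod = np.cumprod(1.0 - betas)  # fraction of signal surviving each step
print(betas.shape)
print(alphas_cumprod[-1])  # close to zero: near-pure noise at the final step
```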

The project has an associated website (though it was pending any actual content at the time of writing).
