3D Morphable Models (3DMMs) are parametric human-focused CGI models that are increasingly being used as a way to interact with the content of the latent space of trained neural image synthesis networks.
3DMMs were devised and introduced by Volker Blanz and Thomas Vetter of the Max Planck Institute in 1999, in the paper A Morphable Model For The Synthesis Of 3D Faces.
The late 1990s had seen the rise of consumer-level 3D programs, notably from the MetaCreations software stable, which would eventually be divided and sold among a number of companies, including Adobe and Daz.
Among these, the Poser software (which has survived many acquisition rounds) particularly captured the public imagination at the time, for its ability to recreate human faces and bodies, giving rise to various long-lived enthusiast communities, and causing many to believe that convincing recreation of dead movie stars would be a CGI (rather than AI) achievement.
Though that was not to be, the Poser-style parametric heads introduced in 1999 would evolve with the computer vision and synthesis research sector, while Poser itself is currently enjoying a revival, along with other pose-synthesis applications, as a template-generator for figurative poses in Stable Diffusion output.
The process of conforming a particular identity to a specific image for a 3DMM involves the initial use of a ‘generic’ template, and the gradual algorithmic customization of that ‘blank’ face until it matches the target face.
Once some correlation is established, the revised 3DMM can be used as a method of projecting control points into the latent space, so that movements and changes in characteristics are (hopefully) reflected in the neural model.
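The conforming step described above can be sketched as an optimization problem: deform a generic template toward a target by adjusting shape coefficients until the error vanishes. The toy model below is purely illustrative (a tiny landmark set, a synthetic orthonormal basis, and fabricated target coefficients — none of it drawn from any real 3DMM), but it shows the analysis-by-synthesis loop in miniature.

```python
import numpy as np

# Toy linear 3DMM: a mean landmark set plus a small orthonormal shape basis.
# All names, sizes and data here are illustrative, not from any real model.
rng = np.random.default_rng(0)
n_landmarks = 5
mean_face = rng.normal(size=(n_landmarks, 2))

# Orthonormalize a random basis so the fitting problem is well-conditioned.
Q, _ = np.linalg.qr(rng.normal(size=(n_landmarks * 2, 3)))
basis = Q.T.reshape(3, n_landmarks, 2)      # 3 shape components

def synthesize(coeffs):
    """Landmark positions produced by a set of shape coefficients."""
    return mean_face + np.tensordot(coeffs, basis, axes=1)

# Fabricate a 'target face' from known coefficients, then recover them
# by gradually deforming the generic template toward the target.
true_coeffs = np.array([0.5, -1.0, 0.25])
target = synthesize(true_coeffs)

coeffs = np.zeros(3)
for _ in range(100):
    residual = synthesize(coeffs) - target                 # landmark error
    grad = np.tensordot(basis, residual, axes=([1, 2], [0, 1]))
    coeffs -= 0.5 * grad                                   # gradient step

print(np.allclose(coeffs, true_coeffs, atol=1e-6))  # True
```

Real pipelines minimize dense photometric and landmark errors rather than a synthetic quadratic, but the structure — synthesize, compare, update coefficients — is the same.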
The 2019 GANFIT approach, a collaboration led by Imperial College London, uses a GAN to train a UV texture generator in a neural space (two years later the method was adapted into Fast-GANFIT, an optimized version with lower latency).
During the conforming process, the traditional UV coordinates from the model are vectorized (turned into mathematical representations) and passed through Principal Component Analysis (PCA). Prior works, including works from the same group of researchers, have created this mapping with Active Appearance Models (AAMs) built on diverse feature descriptors, such as Scale Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG).
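The vectorize-then-PCA step amounts to flattening each per-face texture map into a row vector and extracting a low-dimensional linear basis from the collection. The sketch below uses synthetic data and arbitrary sizes purely for illustration; real pipelines operate on registered face textures.

```python
import numpy as np

# Sketch of the vectorize-then-PCA step: flatten per-face UV texture maps
# into row vectors and extract a low-dimensional linear texture basis.
# The data here is synthetic; real pipelines use registered face textures.
rng = np.random.default_rng(1)
n_faces, h, w = 50, 8, 8
textures = rng.normal(size=(n_faces, h, w))

X = textures.reshape(n_faces, -1)           # one row vector per face
mean_tex = X.mean(axis=0)
Xc = X - mean_tex                           # center before PCA

# PCA via SVD: rows of Vt are the principal texture components.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:5]                         # keep the top 5 modes

# Any face texture is now approximated by just 5 coefficients.
codes = Xc @ components.T
recon = mean_tex + codes @ components
print(codes.shape)   # (50, 5)
```

The compact code vector per face is what makes the later mapping into a network's latent space tractable.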
Taming the Latent Space with CGI
A 3DMM or other parametric model has a fixed, definable and controllable set of parameters, in stark contrast to the latent codes inside the latent space of a neural network, where the underlying semantic relationships are still being studied. Fixing these known coordinates to the approximate equivalent latent code enables a kind of crude puppeteering inside the latent space.
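One minimal way to picture this puppeteering is to fit a linear map from known 3DMM parameters to the latent codes a network assigned to the same faces, then nudge a parameter and read off the induced latent direction. Everything below is a synthetic sketch under that assumption — in practice the paired codes come from a GAN-inversion or encoder step, and the mapping is rarely this clean.

```python
import numpy as np

# Crude puppeteering sketch: given paired examples of 3DMM parameters and
# the latent codes a network assigned to the same faces, fit a linear map
# from parameters to latents, then 'drive' the latent space by editing a
# parameter. All data is synthetic; real pairs come from an inversion step.
rng = np.random.default_rng(2)
n_pairs, n_params, n_latent = 200, 10, 32

params = rng.normal(size=(n_pairs, n_params))        # known 3DMM controls
W_true = rng.normal(size=(n_params, n_latent))
latents = params @ W_true + 0.01 * rng.normal(size=(n_pairs, n_latent))

# Least-squares fit of the parameter-to-latent mapping.
W, *_ = np.linalg.lstsq(params, latents, rcond=None)

# 'Puppeteer': nudge one parameter (say, a hypothetical smile control)
# and read off the predicted direction of travel in latent space.
edit = np.zeros(n_params)
edit[3] = 1.0                                        # hypothetical control
latent_direction = edit @ W
print(latent_direction.shape)    # (32,)
```

The fixed, interpretable parameter axes on the left-hand side are exactly what the raw latent space lacks.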
The features that are extracted during model training of more abstract neural networks, such as latent diffusion and Generative Adversarial Networks (GANs), do not come with rational, voxel-style 3D coordinates. In the case of latent diffusion models, the visual aspects are deeply associated with descriptive text content, making use of the labels on the images the system was trained on, which adds an additional layer of complexity when ‘targeting’ a specific latent code.
Though a Neural Radiance Field (NeRF) does encode geometric data, that geometry is not directly accessible for explicit control, since it is learned from observing pixels from real and simulated viewpoints. Therefore, though 3DMM usage has been predominant in projects that seek control over a GAN’s latent space, a growing number of initiatives are also leveraging parametric approaches as an intermediary for NeRF representations.
In the case of both NeRF and GANs, the research community turned to parametric methods only after a long and largely fruitless search for implicit methods of control that might be more native to the trained networks, and for less complex routes to instrumentality. Ultimately the general consensus formed that these architectures are not easily susceptible to external control, and an initial resignation and disappointment about this fact has evolved into a more vigorous pursuit of superior 3DMM interfaces for otherwise ‘closed’ and opaque networks.
Most of the concrete examples of 3DMM use as a neural control system are in the GAN space, if only because GANs are among the oldest of the current crop of facial synthesis architectures, and the sector has been trying to make them more interpretable and governable for quite a long time.
In 2021 Mitsubishi released MOST-GAN, which uses non-linear 3DMMs as a facial control interface, offering a solid but far from definitive attempt at disentanglement (i.e., being able to edit a facet of a face without changing other aspects, currently an equally fervent pursuit in latent diffusion).
The use of CGI heads as a neural interface has been perhaps most extensively explored by Disney Research, notably in their 2022 offering MoRF: Morphable Radiance Fields for Multiview Neural Head Modeling.
Though the Disney Research paper covers strict 3DMMs only in its Related Work section, the methodology is very similar, with controllable and parametrized 3D rigs influencing trained facets in a neural network – this time a NeRF, or, as the paper names it, a Morphable Radiance Field (MoRF).
MoRF uses a ‘deformation field’ to interact with the neural space, but a number of projects have used traditional 3DMMs to control the much older Signed Distance Field (SDF) approach.
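An SDF simply reports, for any query point, how far it is from the nearest surface — negative inside, zero on the surface, positive outside. A parametric model can drive such a field by warping the query points before evaluation, which is the essence of the deformation-field idea; the sketch below uses a unit sphere and a single translation parameter purely as a stand-in for a real rig.

```python
import numpy as np

# Minimal Signed Distance Field: negative inside the surface, zero on it,
# positive outside. A parametric model can drive such a field by warping
# query points before evaluation; here the 'rig' is a single translation
# offset, standing in for real 3DMM-style controls.
def sphere_sdf(p, radius=1.0):
    return np.linalg.norm(p, axis=-1) - radius

def deformed_sdf(p, offset):
    # Warp query points by a controllable offset, then evaluate the base SDF.
    return sphere_sdf(p - offset)

p = np.array([2.0, 0.0, 0.0])
print(sphere_sdf(p))                                # 1.0 : outside the unit sphere
print(deformed_sdf(p, np.array([2.0, 0.0, 0.0])))   # -1.0 : now inside
```

Because the surface is just the zero level set, moving the warp parameters moves the implicit geometry without ever touching an explicit mesh.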
Now 24 years old, 3DMM is getting to be quite a venerable technology in computer vision, but has enjoyed a recent resurgence as an ‘off the shelf’ approach to facial movement synthesis in otherwise intractable architectures. There has, however, been a certain amount of innovation in 3DMMs themselves.
Besides Disney’s re-conceptualization of the role of CGI in facial synthesis (see above), one 2020 project, titled i3DMM, extended the capabilities of a traditional 3DMM by adding full-head capture and extended features, including hair.
As the notion of full body synthesis becomes more achievable and gains greater traction in the generative image synthesis field, the need for extended human representations has inspired the Max Planck Institute, the original creator of the 3DMM approach, to develop full-body parametric models such as the Sparse Trained Articulated Human Body Regressor (STAR) system.
However, the prior offering, Skinned Multi-Person Linear Model (SMPL), features more prominently in research papers, perhaps because of the wider body of literature concerning its use in neural synthesis, or because it has a more extensive like-for-like history in testing rounds.
3DMM in Latent Diffusion Models
3DMM has historically represented the ‘last chance’ to impose instrumentality and composability on neural systems that have been found, after much research, to lack such ‘easy’ mechanisms. By the time 3DMM interfaces are being investigated, practically every other potential ‘native’ method of achieving these results with less complicated approaches has been exhausted.
For latent diffusion models such as Stable Diffusion, at the time of writing, the research sector still holds out hope that semantic or other similarly ‘in-built’ approaches could transform diffusion systems from static image generators to semantically-complete 3D environments. However, a small number of projects are beginning to experiment with the 3DMM ‘plan B’ approach.
The DiffFace system from Vive Studios and Korea University, for instance, has stated that its diffusion-based face-swapping method is amenable to a 3DMM approach. However, 3DMM may eventually prove more useful in diffusion models as an additional or primary method of obtaining facial landmarks, given the growing number of current projects taking an interest in quantifying and governing diffusion-based facial content through semantic segmentation and facial analysis.