Excluding ‘traditional’ CGI methods, which date back to the 1970s, there are currently three mainstream AI-based approaches to creating synthetic human faces, only one of which has attained any widespread success or societal impact: autoencoder frameworks (the architecture behind current viral deepfakes); Generative Adversarial Networks (GANs); and Neural Radiance Fields (NeRF).
Of these, NeRF — a late entrant that’s also capable of recreating the entire human form — is at the most rudimentary stage in terms of its facial generation capabilities; GANs can create the most convincing faces, but are still too volatile and ungovernable to easily output realistic video footage; and autoencoder frameworks, which have captivated (and, arguably, menaced) the world, require ‘host’ footage, and are largely confined to the inner areas of the face, which adds the further burden of finding a ‘target’ who closely resembles the ‘injected’ identity.
Last time, we took a look at the challenges facing NeRF as a future contender for the deepfake crown; in the next article, we’ll examine how the most popular current autoencoder-based deepfake approaches work, and whether they can maintain a vanguard position in face replacement.
For now, let’s see where Generative Adversarial Networks, among the most celebrated image synthesis techniques of the last five years, might fit into the future of deepfakes.
During training, a Generative Adversarial Network extracts high-level features from thousands of images in order to develop the capacity to reproduce similar images in the same domain as the dataset (i.e. ‘faces’, ‘cars’, ‘churches’, etc.).
The process is at least collaborative, arguably combative: during training, the Generator (left in the graph above) uses random noise to attempt to create images similar to the training data, while the Discriminator (right) grades the Generator’s hundreds of thousands of attempts in terms of how closely those attempts resemble the input images.
Slowly, the Generator learns to recreate the source images with more fidelity, even though it never gets access to the ‘real’ pictures, and only improves based on how the Discriminator scores its latest attempt.
The scoreboard for accuracy is loss (far right), which records the current shortfall between the Generator’s efforts and the original training material.
In this way the Generator slowly builds up a map of relationships between all the features that it has been able to extract from the source data, based on the Discriminator’s constant feedback.
When the training has converged (i.e. has reached a stage where the Generator’s results are not likely to improve any further), the Discriminator’s work is done, and we’re left with a GAN model that can ‘mix-and-match’ variations of the original input source material with photorealistic clarity.
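The adversarial dynamic described above can be sketched in miniature. The following is a hedged toy illustration rather than any real deepfake architecture: a two-parameter linear ‘Generator’ learns to imitate a 1D Gaussian purely from the scores of a logistic ‘Discriminator’, with the loss gradients written out by hand. All names and hyperparameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# 'Real' data: a 1D Gaussian the Generator never observes directly
REAL_MU, REAL_SIGMA = 4.0, 1.0

# Generator g(z) = a*z + b, starting far from the real distribution
a, b = 1.0, 0.0
# Discriminator d(x) = sigmoid(w*x + c), a logistic 'real vs fake' scorer
w, c = 0.1, 0.0

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
lr, batch = 0.05, 64

for step in range(2000):
    real = rng.normal(REAL_MU, REAL_SIGMA, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # Discriminator update: push d(real) toward 1 and d(fake) toward 0
    p_real = sigmoid(w * real + c)
    p_fake = sigmoid(w * fake + c)
    grad_w = np.mean((p_real - 1.0) * real) + np.mean(p_fake * fake)
    grad_c = np.mean(p_real - 1.0) + np.mean(p_fake)
    w -= lr * grad_w
    c -= lr * grad_c

    # Generator update: the only training signal is the Discriminator's score
    p_fake = sigmoid(w * (a * z + b) + c)
    grad_a = np.mean((p_fake - 1.0) * w * z)  # non-saturating loss -log d(fake)
    grad_b = np.mean((p_fake - 1.0) * w)
    a -= lr * grad_a
    b -= lr * grad_b

fakes = a * rng.normal(0.0, 1.0, 10000) + b
print(f"fake mean {fakes.mean():.2f} vs real mean {REAL_MU}")
```

Even at this scale, the Generator never sees the real samples; it improves solely through the Discriminator’s feedback, mirroring the full-scale training loop described above.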
The complex cloud of feature relationships that powers this extraordinary capability in the final generator model is called the latent space.
The latent space is a mathematical construct which stores all the features (and any associated labels/classes) in the most orderly and logical manner that the GAN was able to devise during training.
If you can identify the latent code of an assimilated object (such as a face) in this labyrinthine net of interrelationships, you can begin to manipulate it by ‘sending’ it to different regions of the latent space that contain other related concepts, such as old, young, male, and female – or even directly to alternate identities and poses stored elsewhere in the latent space.
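In vector terms, such a ‘send’ operation is usually nothing more exotic than addition: the latent code is shifted along a learned direction by a chosen strength. The sketch below uses random stand-in values; in a real system, the code would come from a GAN encoder and the direction from a method such as InterFaceGAN, and the names `edit` and `age_direction` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
LATENT_DIM = 512  # typical of StyleGAN-family latent spaces

# Stand-ins: a latent code for one face, and a unit-length 'age' direction
z = rng.normal(size=LATENT_DIM)
age_direction = rng.normal(size=LATENT_DIM)
age_direction /= np.linalg.norm(age_direction)

def edit(code, direction, strength):
    """Move a latent code along a semantic direction ('sending' it
    to a different region of the latent space)."""
    return code + strength * direction

older = edit(z, age_direction, 3.0)
# The code's projection onto the direction grows by exactly the strength
print(older @ age_direction - z @ age_direction)  # -> 3.0 (up to float error)
```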
Mysteries of the Latent Space
One of the key problems with a GAN-based image synthesis approach is that the latent space is difficult to control, navigate, or even to understand. By analogy, a GAN’s latent space could be considered a toolbox where many of the tools have ‘melted into’ each other.
Granular, language-based domain and sub-domain concepts (such as human > female > blonde > young) can be baked into the system alongside the image-based training data, and can help to obtain some basic level of instrumentality, by providing a kind of ersatz ‘search’ function for this sprawling web of latent space information.
However, this usually offers only a crude level of control, because the flexibility and functionality of the final neural network entails a high degree of entanglement, meaning that these visual and language encodings are not separated and sorted in an orderly, predictable, and granular way.
Every potential training image and derived feature contains concepts that are difficult to ‘disentangle’. In the illustration above, the GAN is challenged with deriving distinctive and applicable features from the source training image without dragging in associated features that are not desired.
For example, distilling the concept (and calling up the associated pixels/features) of smile in the image above may in this case result in the entanglement of male attributes that are not applicable to female subjects.
In the case of face synthesis and editing, this problem can extend to ethnicity and gender, as well as many other sub-features.
Moreover, high-level concepts such as ‘young’ or ‘ethnic’ can be ill-defined by those who label the data, or by automatic annotation processes that derive the same or similar labels in an equally biased and subjective way.
For example, what age-range should young cover, and how can that quality be discretely quantized across different ethnicities, ages and genders?
There is notable pressure for new research initiatives to produce comparable results without taking on the burden of an entirely new set of metrics or methodologies. In practical terms, this can mean constraining the labels for a new face-based GAN project to the existing classes in (for instance) the ImageNet dataset.
According to some voices in the research community, this reliance on ‘industry standard’ reference sets (and their often outdated or biased annotations) is holding back innovation, and encouraging researchers to address prior benchmarks instead of novel and bold new challenges.
This ‘technical debt’ can also extend to the metrics used by the Generator in a GAN (or the equivalent mechanisms in competing architectures, such as autoencoders) to judge the fidelity of the synthesized images to the source material in the dataset.
The ‘gold standard’ metric Fréchet Inception Distance (FID, itself derived from the ImageNet object recognition challenge) has come under criticism recently as a potentially flawed way of judging realism in an image.
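For reference, the FID formula itself is simple; the contested part is the Inception-v3 feature space in which it is computed. Below is a minimal NumPy sketch of the underlying Fréchet distance between two fitted Gaussians, ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½), applied to generic feature arrays standing in for real Inception embeddings; `frechet_distance` and `psd_sqrt` are illustrative names, not any library’s API.

```python
import numpy as np

def psd_sqrt(mat):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.
    (Real FID extracts the features from Inception-v3; any features work here.)"""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a @ cov_b)^0.5) computed via the equivalent symmetric form
    b_half = psd_sqrt(cov_b)
    tr_covmean = np.trace(psd_sqrt(b_half @ cov_a @ b_half))
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 8))           # stand-in 'real' features
print(frechet_distance(feats, feats))        # identical sets -> ~0.0
print(frechet_distance(feats, feats + 2.0))  # shifted by 2 in 8 dims -> ~32.0
```

The criticism of FID is not aimed at this arithmetic, but at whether ImageNet-derived Inception features capture the qualities that make a face look real.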
These disputes and controversies in the computer vision research community represent obstacles to the development of versatile GAN-based face synthesis frameworks.
Finding Paths in the Latent Space
Text-based classes are not the only method of effecting transformations in the GAN’s latent space. Images can effectively be used as queries by sending the extracted latent code for a particular face back into the GAN workflow as a guideline or ‘filter’ for transformations.
This process allows the potential recreation of the original faces in the dataset, as well as assembling the ‘features’ that define that face into a configurable collection of characteristics that can be manipulated.
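One common way to obtain such a latent code is optimization-based GAN inversion: starting from an arbitrary code, gradient descent nudges it until the generator’s output matches the query image. The toy version below inverts a frozen linear ‘generator’; this is a hedged stand-in for a real network, where autograd would supply the gradients, and `generate`, `A` and `bias` are invented names for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, IMG_DIM = 16, 64

# Frozen toy 'generator': g(z) = A @ z + bias
A = rng.normal(size=(IMG_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
bias = rng.normal(size=IMG_DIM)
generate = lambda z: A @ z + bias

# The 'query image' we want a latent code for
z_true = rng.normal(size=LATENT_DIM)
target = generate(z_true)

# Invert: minimise ||g(z) - target||^2 by gradient descent on z
z = np.zeros(LATENT_DIM)
lr = 0.05
for _ in range(500):
    residual = generate(z) - target
    z -= lr * 2.0 * A.T @ residual  # analytic gradient of the squared loss

print(np.linalg.norm(generate(z) - target))  # near zero: code reproduces the query
```

The recovered code can then be manipulated like any other point in the latent space, which is what makes inversion the gateway to editing real photographs with a GAN.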
The latent space in a typical StyleGAN-derived network contains 512 dimensions into which, once successfully trained, all the derived features of the thousands of input training images have been assimilated.
If we can establish the ‘route’, known as the latent direction, between an encoded facial identity and an encoded ‘quality’ (more female, more Asian, etc.), we can ‘scrub’ between these two points in the latent space in a manner similar to sliding a video marker forward or backward in a timeline.
The dividing white line in the above illustration represents the linear hyperplane that separates these encoded vectors. There are multiple linear hyperplanes in a single latent space, and many more possible destinations and transformations that can be effected besides the ‘gender’ example in the above image.
The latent code representing a specific face can, for instance, be processed into another specific identity, as well as diverse traits such as blonde, old, Asian, and so on.
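The hyperplane-based approach popularized by InterFaceGAN finds such directions by fitting a linear classifier to latent codes labelled with an attribute; the classifier’s weight vector, the hyperplane normal, is the latent direction. The sketch below is a hedged reconstruction on synthetic codes, with plain logistic regression standing in for the paper’s SVM and a fabricated ‘smile’ attribute as the label.

```python
import numpy as np

rng = np.random.default_rng(7)
LATENT_DIM, N = 512, 4000

# Synthetic setup: a hidden 'smile' direction secretly determines the label
true_dir = rng.normal(size=LATENT_DIM)
true_dir /= np.linalg.norm(true_dir)
codes = rng.normal(size=(N, LATENT_DIM))
labels = (codes @ true_dir > 0).astype(float)  # e.g. smiling vs not smiling

# Fit a linear classifier by gradient descent on the logistic loss
w = np.zeros(LATENT_DIM)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(codes @ w)))
    w -= 0.1 * (codes.T @ (p - labels)) / N

# The learned hyperplane normal recovers the hidden latent direction
direction = w / np.linalg.norm(w)
print(direction @ true_dir)  # cosine similarity, close to 1.0
```

In a real pipeline, the labels would come from an off-the-shelf attribute classifier run over generated images, and the recovered normal would then drive edits of the kind shown in the traversals below.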
Below, we see various latent direction traversals from the 2019 InterFaceGAN project, where many of the transformations seem more akin to the ‘digital dissolves’ of 1980s ‘morph’ technologies than the more temporally consistent output of viral deepfakes:
While complex hairstyles are the most obvious undesirable artifacts of these journeys along latent directions, it’s worth noting that neither autoencoder systems nor early NeRF frameworks have been able to definitively or trivially solve this problem either (except through the ‘easy option’ of depicting subjects with very short hair).
Can a Pure GAN-Based System Generate Convincing Video Deepfakes?
Though third party tools such as Gradient-weighted Class Activation Mapping (Grad-CAM) can help us to map these emerging latent space relationships better (and even to facilitate specific operations such as face restoration), the lack of ‘watershed’ progress in GAN-only deepfake video techniques over the last three years has begun to inspire a slew of ‘hybrid’ approaches – architectures that incorporate rather than depend on GANs for deepfake-style manipulations (as we’ll see later, some of these approaches make use of traditional CGI).
A recent academic collaboration (including a contribution from Adobe Research), dubbed InsetGAN, proposes ‘nesting’ GAN output into a wider host framework consisting of multiple GANs (a scalable compromise similar to the nested NeRF implementation CityNeRF).
However, as is nearly always the case with latent space approaches to human simulation, the results are static images more suited to fashion e-commerce portals than photorealistic video output.
Another fairly recent paper, titled Latent to Latent: A Learned Mapper for Identity Preserving Editing of Multiple Face Attributes in StyleGAN-generated Images, provides an impressive interface for transiting between latent space locations, and even makes some incremental progress towards disentanglement:
However, this instrumentality and level of discretization represents relatively little advance upon 2019’s InterFaceGAN (see above).
In late 2021, the US research arm of ByteDance debuted SemanticStyleGAN, which corrals the results of multiple GANs into a single face image using semantic segmentation as a boundary guideline.
In general, adjunct or ‘guiding’ technologies are now seen as potential ways through the challenges of GAN-centric facial synthesis.
For instance, in April of 2022, a new GAN face synthesis system dubbed Multi-View Consistent Generative Adversarial Networks for 3D-aware Image Synthesis (MVCGAN) suggested the use of Generative Radiance Fields (GRAF) to provide geometric constraints that can help a facial synthesis system achieve improved continuity:
Despite notable improvements in flow and pose control, temporal continuity in depicting human hair remains elusive.
CGI Guidance for Neural Human Synthesis
Increasingly, research groups are investigating the more minor roles that GANs could play in face generators and face-editing systems that are based on more stable and familiar technologies.
In November of 2021, Disney Research posited Rendering with Style, a method that injects traditional ray-traced CGI textures into a StyleGAN2 network, in an attempt to create the textural consistency (at least, for skin) that pure latent space manipulation cannot currently achieve.
Though this approach (which shares some central concepts with an NVIDIA paper of the previous year) solves some of GAN’s temporal issues with skin rendering, it does not entirely solve the aforementioned problem of generating convincing and consistent hair, and also produces rather stilted facial reproductions (check out the full video for more examples).
Primitive CGI Guides
The Skinned Multi-Person Linear Model (SMPL) CGI primitives, developed by the Max Planck Institute and famed effects house ILM in 2015 as an aid to new computer vision research, are frequently used in GAN-based and other types of generative architecture, as a compromise between the high instrumentality of traditional CGI and the new possibilities that neural networks offer.
Though SMPL has become a research standard, the Max Planck Institute has since improved on it with the Sparse Trained Articulated Human Body Regressor (STAR) system, which features improved deformations and notable optimizations, and which can address a wider range of human articulations, including expressive hands and faces.
MOST-GAN offers disentangled lighting, face shape, expression and pose manipulation via a StyleGAN2 generator, and is only one example of the current interest in shifting the incredible realism of GAN-generated faces into a controllable 3D environment, pending improved methods for native navigation and transformation in the latent space.
The Limitations of ‘Generic’ Transformations in GAN Facial Synthesis
A more recent example of 3DMM-aided facial GAN projects is South Korea’s One-Shot Face Reenactment on Megapixels, which exploits 3DMM as a potential source of instrumentality in a StyleGAN2 architecture, and attempts to maintain consistent identity across poses and expressions.
In the image above, we see examples of ‘frontalization’ under OSFR, where the system fails (bottom row) to infer an authentic likeness from an ‘off-center’ angle in a source photograph, and where the degree of occlusion (i.e., how far the subject is looking away from camera) seems to correlate directly with the degree of inaccuracy in the final result.
Fed into the ClarifAI celebrity face recognition engine, the frontalized synthetic image of Matthew Rhys (top row, second from right) scores a respectable 0.061 likelihood of being an image of the actor; however, the frontalized Ursula Andress (bottom row, second from right), whose input source image (bottom left) is at a pretty acute 45-50° angle from the camera, is interpreted by ClarifAI as singer Kacey Musgraves (0.089 probability).
The pose transformations in OSFR are not informed by multiple views, but rather inferred from generic pose knowledge across multiple identities (in datasets such as CelebA-HQ, a typical training source in a wide-ranging GAN framework).
Likewise, expression transformations are powered by ‘baseline’ transformations that are not specific to the identity in an image that you might want a GAN to alter, and therefore cannot take account of the unpredictable ways that the resting human face will distort and transform across a range of expressions.
Most GAN initiatives that attempt expression alterations publish test results of ‘unknown’ subjects, where it’s not possible for the viewer to know whether the expressions are faithful to the source identity. In May of this year, one braver group of GAN facial synthesis researchers from Denmark included some more recognizable test subjects, illustrating in the process the limits of applying ‘prototypical’ emotions to familiar faces:
Single-image novel viewpoint inference is difficult, and ‘generic’ data can’t easily compensate for the lack of multiple views or multiple examples of expression and disposition of the same subject. Such estimations are geometrically inferred from multiple viewpoint photos in NeRF (though rather rigidly), and much more elegantly abstracted in autoencoder-based deepfake systems, from custom datasets containing thousands of images of the same person.
Perhaps inevitably, the greater interpretational powers of NeRF are beginning to be used in GAN facial synthesis research, at least as some kind of variation from a pure 3DMM-style parametric control system. Last month, a collaboration between China and the UK proffered a new approach called Conditional Generative Occupancy Field (cGOF).
The new system hamstrings itself in much the same way as any typical GAN-based facial synthesis framework, in that it relies only on single-view input.
cGOF generates explorable facial scenarios by imposing 3D losses on a NeRF model through the ministrations of (you guessed it) a 3DMM model. The researchers concede that it cannot produce high-fidelity details for challenging facial aspects such as wrinkles and hair, and is heavily dependent on the 3DMM component.
Generative Adversarial Networks are capable of producing the most convincing still images of ‘non-real’ human faces of any machine learning method; but using GANs to create plausible videos of people is proving to be a problem.
Purely GAN-based facial synthesis approaches have not yet succeeded in producing temporally consistent and qualitatively convincing deepfake video output without the use of secondary technologies. Even with such aids, GAN-based video deepfakes trail far behind the quality achieved by even an ‘off the shelf’ autoencoder system like DeepFaceLab or FaceSwap.
Though the GAN’s latent space is slowly revealing its mysteries, entanglement and targeted manipulation remain formidable obstacles.
To date, there is no ‘perfect’ approach to facial synthesis: autoencoder systems such as DeepFaceLab and FaceSwap operate on restricted areas of the face in pre-existing or specially-shot real-world video footage; NeRF is effectively ‘neural CGI’, and offers, as yet, only nascent forays into hyper-realistic facial synthesis and genuine content editability; and GAN remains hamstrung by the intractability of the latent space.
Therefore, despite the ongoing efforts of the research community, the Generative Adversarial Network is currently coming into focus as a supporting technology for new deepfake initiatives, rather than a pivotal architecture.