What’s The Difference between CGI and AI?

Martin Anderson
January 18, 2024
Since the advent of Stable Diffusion in August of 2022, a growing crisis of terminology has emerged in regard to the use of AI to simulate alternate human identities, or to overwrite original human faces in images and video with AI-generated faces.

For instance, prior to this time, the term deepfake had become embedded in the culture as a reference to the autoencoder-based code that was released, controversially, in late 2017, and which was to form the basis of the popular DeepFaceLab, DeepFaceLive and FaceSwap distributions.

A typical face-swap, this time using the ROOP framework, of the type that has bedecked YouTube since 2017. Source: https://www.youtube.com/watch?v=7L8Kt4LLKOE

If you used the term deepfake, you were, until quite recently, probably talking about video content where the original face of a person was substituted with the face of another person, especially in the areas of political deepfakes and deepfake porn.

This particular technology had stagnated notably by around 2019, and remains unable to deliver a truly convincing close-up, despite efforts to throw exorbitant computing power at it. Indeed, the core DeepFaceLab and DeepFaceLive distributions were archived and frozen (essentially, abandoned) by the lead developer last November.

By contrast, the recent and ever-growing ability of Stable Diffusion to create far more convincing faces of specific people, using methods such as DreamBooth and LoRA, has begun to take over the term.

Just a few of the more photorealistic examples available to Stable Diffusion users, for free, at the civit.ai website. In nearly all cases, the models were trained on less than a hundred images, on consumer hardware†.

However, since temporal stability remains an unsolved problem for Stable Diffusion-based video, we now have at least two systems which qualify to be referred to as ‘deepfake’ approaches: the 2017 code (and its descendants and forks), which has superb temporal stability, but which can take weeks to train even one model, and which cannot reproduce intricate detail that well, usually; and Stable Diffusion, which can create far superior representations of targeted individuals, but which (currently) has extreme difficulty animating them.

Besides the growing conflation of intent that’s likely to become apparent as Stable Diffusion gets better at making videos, new and emerging approaches such as Gaussian Splatting seem set to further encroach upon the deepfake reference.

A recent paper uses the rasterization technique Gaussian Splatting to attach 'neural pixels' to the exact vertices of a CGI head, courtesy of the interstitial FLAME framework. Source: https://arxiv.org/pdf/2312.02069.pdf

Though inadequate or non-exact terminology may eventually hinder meaningful and clear discussion of what’s happening in the world of AI human synthesis, it could be argued that most people have limited interest in deepening their understanding of the enabling technologies, and that the deepfake term is likely to be pulling extra shifts for some time to come.

Semantic Accuracy

Just as the term deepfake did not anticipate later alternative approaches, neither did the term Computer-Generated Imagery (CGI) take account of the coming of generative AI. We have had CGI for a long time, and generative AI for far less; but since both approaches ‘generate images with computers’, it could be argued that the term ‘CGI’ remains applicable, with generative AI treated as a simple development of these much older (and very different) technologies.

However, the reason that ‘traditional’ CGI has never incited the current level of political and societal concern, in contrast to the current wave of generative AI, is that it rarely manages to get past the ‘uncanny valley’ in depicting humans; and that its best state-of-the-art output requires the combined efforts of scores, or even hundreds, of skilled and trained contributors.

For the 2016 Star Wars prequel 'Rogue One', ILM elaborately recreated the 1970s incarnation of actor Peter Cushing, using traditional CGI techniques. Sources: https://www.indiewire.com/awards/industry/rogue-one-visual-effects-ilm-digital-grand-moff-tarkin-cgi-princess-leia-1201766597/ and https://www.youtube.com/watch?v=xMB2sLwz0Do

By contrast, some of the best lone deepfake practitioners have not only equaled or improved upon Hollywood’s CGI efforts, in viral YouTube videos since 2017, but have also gone on to work at major visual effects houses.

In our severely reductive news climate, only the broadest and most easily assimilated facts tend to penetrate the culture. Therefore the way that generative AI systems such as Stable Diffusion can turn textual instructions into images (i.e., you instruct the AI to generate ‘a picture of a grey cat’, and it then makes you a pretty good photorealistic image of a grey cat) is probably the only widely-understood fact about the new generation of gen-AI models.

Consequently, numerous comments on viral posts around generative AI’s current capabilities reveal how many people consider that CGI, with all its effortful workflows, will be replaced by pure text-to-video processes following this paradigm, in the very near future.

This could happen, eventually; and in a period that’s close to the most recent quantum leap in any technology, such further leaps not only seem inevitable, but also imminent.

However, such predictions fail to take account of what’s likely to happen on the road from A to B. As it stands, CGI is emerging more and more as a central driving force for controllable, production-level AI-centric output.

A recent project used CGI template footage as a baseline for AI interpretations – and this ‘image-to-image’ approach is becoming increasingly popular both in the hobbyist and professional VFX community. Source: https://anonymous456852.github.io/

If one had taken the advent of early seminal CGI outings such as TRON (1982) in the same epochal light as the public suddenly considers suitable for generative AI, Jurassic Park (1993) would have been a mid-1980s movie.

Instead, the state of the art advanced more gingerly towards that level of CGI capability, with interstitial leaps such as the groundbreaking visual effects produced for James Cameron’s The Abyss (1989) and Terminator 2 (1991) – gaining momentum also from the democratic impetus of consumer-level computing, and from the digital revolution that ensued.

Just as AI has its ‘winters’, new and exciting developments in AI often need to wait far longer than anticipated for hardware innovation, economic forces, technical breakthroughs, and various other factors to align in their favor.

CGI as a Driving Force for Generative AI

Does this mean that AI systems such as Stable Diffusion and Generative Adversarial Networks (GANs) will become ‘mere’ texture generators for systems that center on old-school CGI?

As an interstitial stage, akin to the rapid development of CGI under Cameron’s 1980s/90s projects, and the innovations of other people around that period, probably yes.

Visual effects companies are frequently forced to develop prototypical techniques, in order to advance the state of the art and to solve particular problems, as ILM employee Steven “Spaz” Williams would eventually do for Jurassic Park, despite internal opposition at the company.

The opposition, in that case, occurred because Williams’ ideas were unproven, the movie had a release date to meet, and old-school stop-motion animation was a known and quantifiable technique, despite its shortcomings.

Likewise, putting cutting-edge generative AI procedures into a deadline-driven production environment is currently a risky business. The nature of the latent space (the interior functioning of a trained generative AI model) makes it hard to navigate and exploit, and frequently makes dazzling new results either difficult to obtain or (crucially) difficult to replicate a second time.

A visualization of the relative position of concepts trained into a latent space. In systems such as Stable Diffusion, words may be associated with visual concepts in ways that are not explicit or easy to manipulate. Source: http://projector.tensorflow.org/

Therefore if generative AI can indeed be integrated into long-established CGI techniques, so that VFX practitioners gain increased control and instrumentality in the short term, this seems a likely and pragmatic compromise.

Making a CGI Human

Traditional CGI creates objects, including humans, by sculpting a mathematical mesh composed of thousands, or even millions, of polygons (there are other types of unit, such as NGons, as well as parametric approaches, but polygons are the most common, because they are the most explicitly editable).

From top left: wireframe visualization in a 3D application; simple texture visualization; and a basic render, which takes too long to process in real time.

The polygons represent the joining together of points in an X/Y/Z 3D space. The shapes created by these webs can have almost infinite definition, and can represent straight and curved surfaces.
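To make this concrete, a polygon mesh is, at its core, just a list of X/Y/Z points plus a list of faces that index into them. The toy NumPy sketch below (not tied to any particular 3D package, and purely illustrative) defines a single quad split into two triangles:

```python
import numpy as np

# Four points (vertices) in X/Y/Z space.
vertices = np.array([
    [0.0, 0.0, 0.0],   # vertex 0
    [1.0, 0.0, 0.0],   # vertex 1
    [1.0, 1.0, 0.0],   # vertex 2
    [0.0, 1.0, 0.5],   # vertex 3 (raised in Z, so the surface is no longer flat)
])

# Faces are defined by indexing into the vertex list. Here a quad is split
# into two triangles -- the most common, explicitly editable unit.
faces = np.array([
    [0, 1, 2],
    [0, 2, 3],
])

# Moving one vertex deforms every face that references it -- this is the
# 'sculpting' that CGI modellers perform, at vastly greater scale.
vertices[3, 2] += 0.25
```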

X/Y/Z space represents possible movement and presence in three directions. Source: https://help.autodesk.com/view/MOBPRO/2024/ENU/?guid=GUID-F90A9BF3-1A41-4FB7-AE58-53D73BDDEF6B

Once the mesh is created, it can be textured, often using real-life photos of body and facial skin, and any other necessary visual material, such as clothing.
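In code terms, texturing amounts to pairing each vertex with a U/V coordinate into a 2D image, so that a color can be looked up for any point on the unfolded surface. The fragment below is an illustrative sketch of that lookup, with made-up texture data standing in for a photographed skin image:

```python
import numpy as np

# A tiny 4x4 RGB 'texture' standing in for a photographed facial texture.
texture = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Each vertex of the mesh is assigned a U/V coordinate in [0, 1] that maps
# it onto the 2D texture image (the 'unfolded mesh' shown above).
uvs = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])

def sample_texture(uv, tex):
    """Nearest-neighbour texture lookup for a single U/V coordinate."""
    h, w = tex.shape[:2]
    x = min(int(uv[0] * (w - 1)), w - 1)
    y = min(int(uv[1] * (h - 1)), h - 1)
    return tex[y, x]

# The color assigned to each vertex, as a renderer would resolve it.
colors = np.array([sample_texture(uv, texture) for uv in uvs])
```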

In the bottom row, we see the unfolded mesh on the left and the corresponding facial texture on the right.

It’s an unambiguous and exact process that offers pixel-level control to VFX practitioners; but it almost never delivers a truly photorealistic result, no matter how much effort and talent is thrown at the workflow.

Making a Generative AI Human

CGI processes involve a great deal of human interpretation and manual intervention – not the least of the reasons that results from these methods can look fake, since spontaneity is lacking.

Generative AI, on the other hand, in systems such as Stable Diffusion, can assimilate the central defining features of a person into a broader and more categorical understanding of human appearance. Since such systems are trained on datasets that include millions (or billions) of faces, the output for any particular person is supported by several lifetimes’ worth of understanding about the way faces work in general.

This is in stark contrast to the creation of a CGI human, which is informed at multiple stages by the personal evaluations of the people working on the project, and which therefore ends up shaped by multiple inputs and, inevitably, by a curated and often-disputed set of opinions about what looks ‘right’.

However, Latent Diffusion Models (LDMs) such as Stable Diffusion have no intrinsic understanding of X/Y/Z space – rather, the training process teaches the system, guided by text input and other word-based (semantic) methodologies, how to place groups of pixels (or, more specifically, machine learning-based features) together in configurations that are likely to look convincingly like the millions of real images that the model was exposed to during training.

This leaves the gen-AI practitioner with a limited and clumsy set of tools with which to manipulate the output, such as descriptions that prescribe ‘profile view’, ‘young’, ‘male’ – or, in the case of the aforementioned personalization systems such as LoRA, with ‘trigger words’ that were deliberately trained into a secondary system.

You don’t need to train a custom model yourself to see this effect; it’s enough to use the name of any adequately famous celebrity, in a basic install of Stable Diffusion, or in any online implementation that permits such keywords. The data/text association for celebrity data (even in the base models) becomes immediately apparent.
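This kind of crude, word-level steering can be illustrated with a few lines against the open-source diffusers library; the model ID, prompts and settings below are illustrative rather than prescriptive, and mirror the positive/negative-prompt experiment shown in the image that follows:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a base Stable Diffusion checkpoint (model ID is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Word-level steering: descriptive terms and a celebrity name in the positive
# prompt, unwanted attributes pushed away via the negative prompt.
image = pipe(
    prompt="photo of Clint Eastwood, young, profile view, photorealistic portrait",
    negative_prompt="old, blurry, cartoon",
).images[0]
image.save("steered_portrait.png")
```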

On the left, the prompt 'Clint Eastwood' demonstrates that the latest version of Stable Diffusion favors the older incarnation of the actor, likely because the spread of digital photography after Y2K made photographers less frugal, and images of any one individual more abundant. On the right, specifying 'young' as a positive prompt and 'old' as a negative prompt produces some sketchy but prompt-accurate reproductions of a younger Eastwood. Source: https://huggingface.co/spaces/stabilityai/stable-diffusion

With customization methods like LoRA and DreamBooth (as well as textual inversion and other methods), you can perform primary or secondary fine-tuning that will obtain far more accurate results, by gathering and training a small dataset of specific photos, and setting apposite trigger words.
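For instance, a pre-trained LoRA can be attached to a Stable Diffusion pipeline in diffusers roughly as follows; the local path, file name and trigger word here are placeholders for whatever a given LoRA’s author chose at training time:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a personalization LoRA trained on a small dataset of one person.
# Path, file name and trigger word below are hypothetical placeholders.
pipe.load_lora_weights("./loras", weight_name="my_subject_lora.safetensors")

# The trigger word baked into the LoRA steers generation towards the trained
# identity far more reliably than a plain name or description alone.
image = pipe("photo of mysubjecttoken, profile view, natural light").images[0]
image.save("lora_subject.png")
```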

Left, what the latest version of Stable Diffusion outputs for a 'Jennifer Connelly' prompt by default; right, the superior results from adding a free LoRA from civit.ai that has been specifically trained on a select number of images of the actress.

Seed of Destruction

The trouble is, you never know exactly what you are going to get when using words to create images with generative AI frameworks such as Stable Diffusion.

The system concatenates the text embeddings derived from your prompt and sends them through the latent space, looking for the most suitable trained latent embeddings to satisfy the text prompt request. Because the generation process works by imposing order on an image of random noise, a random seed is used to decide the pathway of the request.

The relationship between the text prompt input and the denoising process, which begins with random noise set by a random seed. Source: https://poloclub.github.io/diffusion-explainer/

If you want to, you can make a note of that seed and use it a second time, to create exactly the same image in exactly the same circumstances – but changing that seed just a little bit will give you a radically different image.
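In a diffusers-based workflow, that seed is controlled through a torch.Generator; the sketch below (illustrative model ID and prompt) shows how an identical seed reproduces an image exactly, while a one-digit change produces something entirely different:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "photorealistic portrait of a woman, studio lighting"

# Same prompt, same settings, same seed: the output is exactly reproducible.
gen_a = torch.Generator("cuda").manual_seed(1234)
image_a = pipe(prompt, generator=gen_a).images[0]

# Nudge the seed by one and the starting noise -- and therefore the final
# image -- changes radically, which is the root of the temporal problem.
gen_b = torch.Generator("cuda").manual_seed(1235)
image_b = pipe(prompt, generator=gen_b).images[0]
```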

This is the heart of Stable Diffusion’s problem with temporal continuity, and the ongoing difficulty of using it to create or alter video. Currently there is great interest in developing new prompt-based systems that will give the user more control, and certain extant approaches, such as prompt traveling, attempt to accomplish some level of temporal continuity through prompt-editing alone.
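The sketch below conveys the general idea behind prompt traveling – interpolating between two encoded prompts frame by frame while holding the seed fixed – though real implementations are considerably more involved, and the helper shown here is purely illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def encode(prompt: str) -> torch.Tensor:
    """Encode a prompt into CLIP text embeddings via the pipeline's own encoder."""
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to("cuda")
    return pipe.text_encoder(tokens)[0]

start = encode("portrait of a man, neutral expression")
end = encode("portrait of a man, broad smile")

frames = []
for i in range(8):
    t = i / 7.0
    # Blend the two prompt embeddings; keep the seed fixed so that only the
    # 'travelled' prompt changes between frames.
    embeds = torch.lerp(start, end, t)
    gen = torch.Generator("cuda").manual_seed(1234)
    frames.append(pipe(prompt_embeds=embeds, generator=gen).images[0])
```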

Likewise, every week brings new research papers that offer new methods for improving the semantic targeting of the latent space. None, to date, have brought forth methods that even vaguely approach the level of control available via CGI.

CGI and AI as Partners

At the time of writing, a newly-released video from ILM reveals the extent to which CGI aided the impressive recreation of an early 1980s Harrison Ford for the 2023 release Indiana Jones and the Dial of Destiny.

The tell-tale depth maps (top right) and polygons (bottom left), together with the characteristic blurred edges of a deepfake preview window (to the right of all images except the last) reveal the extent to which CGI has aided this impressive AI recreation of a young Harrison Ford. Source: https://www.youtube.com/watch?v=Q4P8FuQJIdM

Here, as far as we can tell from the video, a CGI version of the young Ford is acting as the ‘hook’ for the deepfake content (as well as providing apposite landmarks for mapping, toning, color-grading and other necessary VFX processing), and helping to marshal the forces of the AI system into the correct configuration.

Additionally, facial detail which must remain 100% consistent, such as beard growth, can be difficult for a trained AI model to reproduce reliably, since even the older 2017-style deepfake systems assimilate facial representations and surface detail from thousands of images, in which such factors are unlikely to be entirely consistent.

Therefore the use of a ‘transformed’ (i.e., ‘young’) version of the face will be necessary to reintegrate less critical detail through older processing techniques.

In essence, things which CGI cannot recreate convincingly, such as human eyes, can be handled by AI, which does a far better job, in these cases; and things which AI cannot create consistently, such as beard growth and hair, can be handled with authenticity by CGI, since our critical faculties are far lower for such peripheral details.
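In compositing terms, this division of labor reduces to a masked blend of the two passes; the NumPy sketch below, with placeholder images and a hypothetical mask region, illustrates the principle:

```python
import numpy as np

# Hypothetical inputs: a CGI render, an AI-generated face, and a mask
# (1.0 where the AI output should win, 0.0 where the CGI render should),
# all aligned to the same frame. Shapes and values are illustrative.
cgi_render = np.zeros((512, 512, 3), dtype=np.float32)
ai_render = np.ones((512, 512, 3), dtype=np.float32)
mask = np.zeros((512, 512, 1), dtype=np.float32)
mask[180:330, 120:390] = 1.0   # e.g., the eye/skin region assigned to the AI pass

# Simple alpha composite: AI pixels where the mask is high (eyes, skin),
# CGI pixels elsewhere (beard, hair, and other 'must-stay-consistent' detail).
composite = mask * ai_render + (1.0 - mask) * cgi_render
```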

Finally, in theory, the disparity between the lineaments of the face of the 80-year-old and the 40-year-old actor is extreme enough that the deepfake model by itself would have difficulty in consistently mapping the features, without a more controllable and consistent facial model to key on.

3DMMs Come of Age

Since the late 1990s, CGI face models have been used as an aid to research into neural face synthesis and recognition (and long before Nicolas Cage and Tom Cruise became the poster boys for facial synthesis, the actor Tom Hanks was adopted for this purpose).

A timeline of the development of 3DMMs since 1999, with some projected predictions into the future. Source: https://arxiv.org/pdf/1909.01815.pdf

3D Morphable Models (3DMMs) were at the forefront of this movement, which has largely been led by the Max Planck Institute and ETH Zurich – though Disney Research has also made a long-term commitment to the use of CGI heads and bodies as an adjunct to neural synthesis.

To find out more about 3DMMs and their descendants, check out our overview article – but, to summarize, this approach devised the notion that the exact properties and coordinates of CGI models could be used as a layer of instrumentality for neural techniques, by calculating equivalent locations between areas of the CGI head/body and a neural representation (which typically lacks any native method of control).
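In its classic linear form, a 3DMM expresses any face as a mean shape plus weighted combinations of learned identity and expression bases; the NumPy sketch below shows that arithmetic with stand-in data (the dimensions are only indicative):

```python
import numpy as np

n_vertices = 5023            # e.g., FLAME uses roughly 5k vertices; number is indicative
n_shape, n_expr = 100, 50    # sizes of the identity and expression bases

# Learned model components (random stand-ins here).
mean_shape = np.random.randn(n_vertices * 3)
shape_basis = np.random.randn(n_vertices * 3, n_shape)
expr_basis = np.random.randn(n_vertices * 3, n_expr)

# Low-dimensional parameters describing one specific face and expression.
shape_coeffs = np.random.randn(n_shape) * 0.1
expr_coeffs = np.random.randn(n_expr) * 0.1

# The classic linear 3DMM: mean + identity offsets + expression offsets.
face = mean_shape + shape_basis @ shape_coeffs + expr_basis @ expr_coeffs
face_vertices = face.reshape(n_vertices, 3)   # X/Y/Z positions, ready to render or fit
```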

An example of using 3DMM fitting to create instrumentality for otherwise ungovernable neural synthesis techniques. Source: https://github.com/Yinghao-Li/3DMM-fitting

Recently, the use of 3DMMs – and subsequent systems such as STAR and FLAME – has grown in popularity in the run of new neural human synthesis scientific papers, with the emergence of projects such as ‘Tax-free’ 3DMM Conditional Face Generation, Fake It Without Making It, and AffUNet, among many others.

Currently, the aforementioned FLAME framework is most popular with the research community, and is making inroads into workflows designed to bring greater levels of control to Stable Diffusion and other LDMs.

Nascent use of 3DMMs to control Stable Diffusion output. Source: https://arxiv.org/pdf/2307.04859.pdf

Models such as FLAME can also be used to generate accurate and temporally consistent 3D equivalents of footage of actors, and can even operate this way in a live environment. Indeed, the FLAME model powers a very recent and impressive new project that uses Gaussian Splats as a ‘deepfake texture’ technique for such an approach:

The Gaussian Avatars project uses a FLAME CGI model in a neural workflow to attach Gaussian Splats to estimated spatial equivalents of coordinates on the CGI mesh – to great effect. Source: https://www.youtube.com/watch?v=lVEY78RwU_I
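The general principle – anchoring a neural primitive to a triangle of the driving mesh so that it inherits the mesh’s motion – can be sketched with barycentric coordinates, as below; this is an illustration of the idea, not the project’s actual code:

```python
import numpy as np

def anchor_point(triangle: np.ndarray, bary: np.ndarray) -> np.ndarray:
    """Place a point on a triangle using fixed barycentric weights.

    triangle: (3, 3) array of the triangle's X/Y/Z vertex positions.
    bary:     (3,) barycentric weights summing to 1.
    """
    return bary @ triangle

# One triangle of a (hypothetical) FLAME mesh at two moments in time.
tri_frame_0 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
tri_frame_1 = tri_frame_0 + np.array([0.0, 0.0, 0.2])   # the mesh has moved

# A splat centre defined once, in barycentric terms, rides along with the
# deforming mesh -- lending the neural 'texture' temporal stability.
bary = np.array([0.2, 0.5, 0.3])
centre_0 = anchor_point(tri_frame_0, bary)
centre_1 = anchor_point(tri_frame_1, bary)
```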

In this way, a dynamic CGI ‘bridge’ can be created between obtained footage and transformed neural representations – an AI version of a technique that has long been used in traditional CGI, for instance, with the transformation of Brad Pitt’s captured facial motion into a CGI pipeline for David Fincher’s 2008 VFX-laden outing The Curious Case of Benjamin Button.

Brad Pitt's captured facial performance was transposed onto a CGI model in order to effect remarkable transformations for 'The Curious Case of Benjamin Button'. Source: https://www.youtube.com/watch?v=offEigRdSGQ

Image-to-Image

While such approaches can potentially provide control methods for otherwise ungovernable or intractable AI-based neural synthesis techniques, a simpler approach, and one that can be used whenever it provides a quick and effective solution, is image-to-image – where an unconvincing render of a CGI model is used as input for a more effective neural process such as Stable Diffusion.
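A minimal image-to-image pass of this kind can be sketched with the diffusers library as follows; the model ID, file path, prompt and strength value are illustrative:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# An unconvincing CGI render exported from a 3D package (path is illustrative).
cgi_render = Image.open("daz_render.png").convert("RGB").resize((512, 512))

# 'strength' controls how far the diffusion process may wander from the CGI
# input: low values keep the pose and framing, higher values add realism at
# the cost of fidelity to the template.
image = pipe(
    prompt="photorealistic portrait of a woman, natural skin, soft light",
    image=cgi_render,
    strength=0.5,
    guidance_scale=7.5,
).images[0]
image.save("ai_enhanced_render.png")
```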

Using a simple 'traditional' CGI model, free with Daz Studio, to effect a simple image-to-image transformation on modest domestic hardware, in about five minutes. In the leftmost image, we see the polygon mesh and vertices of the hair model; second from left, the live preview in Daz Studio; second from right, the best result that Daz Studio can obtain in terms of a realistic render; and, far right, the same image passed through an equally free LoRA downloaded from civit.ai and run through a local install of Stable Diffusion.

Once again, such an approach is likely to be of use in a selective manner, as needed, rather than as an all-in-one solution – in much the same way that the new Indiana Jones footage has married together multiple new and older techniques to arrive at an effective solution. Rotoscoped selections of image-to-image output may be used, for instance, in cases where adequate temporal continuity can be achieved with diffusion or similar techniques.

However, the current likeliest use case for this approach is to generate consistent synthetic data for training different types of models, including older autoencoder-based deepfake approaches.

A typical 2017-style deepfake dataset will be drawn from multiple sources, and will likely present the actor or subject in a variety of lighting conditions, and possibly at a variety of ages. Though a traditional deepfake model trained on such variegated data will perform well across a variety of target clips, many hobbyist deepfakers go the extra mile and train a model from scratch on only the data from the clip that they want to transform.

Two approaches to deepfake dataset curation. Ad hoc curation (image left) will lead to a versatile model, but clip-specific curation (image right) will obtain the best results for the target video, even if the model will be far less effective for other clips.

Since training times for such models can vary between 24 hours and two weeks (on domestic hardware), this is a massive commitment of time and resources. Nonetheless, at scale, and with greater availability of better hardware than the typical hobbyist has, a ‘clip-specific’ approach can be simulated by training a model in this more exacting way on synthetic data – frames in which an unconvincing CGI model of the actor has been transformed into convincing single images by a prior ‘general’ model.

Further, the image-to-image approach can help provide extreme angles of an actor for the dataset, which are frequently unavailable either at scale or in good quality (or both). In such cases, CGI>AI transformations can at least augment the limitations of the real data, particularly in cases where no new data can be obtained (i.e., in reconstructing an actor at a much younger age than they currently are).

A comparison between prior techniques for obtaining extreme angles from limited data, vs. a new approach (bottom row) – a case where image-to-image, CGI>AI methods can potentially be helpful in providing better training data. Source: https://bomcon123456.github.io/efhq/

As it stands, CGI>AI deepfaking is now a common occurrence using the Metahuman avatar framework, which removes a great deal of the effort in animating CGI humans. The output can then be passed through a deepfake process:

The templates in the Metahuman framework can be easily used as deepfake targets. Source: https://www.youtube.com/watch?v=KdfegonWz5g

Conclusion

To summarize, then – the chief difference between CGI and AI in visual effects workflows is that CGI is a primarily mathematical technique, dating back at least to the 1970s, that offers pixel-level control – but at the cost of authenticity, where rendering humans is concerned.

Generative AI, instead, is a far more recent technique that can render humans brilliantly – but tends to do what it wants, and not exactly what you want; and furthermore, draws on a trained latent space whose intricacies, potential and pathways remain a stubborn mystery to the computer vision research sector.

The advent of text-to-image systems such as Stable Diffusion has impressed the general public so much that a common conception is forming that visual effects pipelines of the very near future will be as simple as describing a scene and waiting for a perfect and photorealistic video to emerge – even though there are multiple obstructions to the emergence of such a dazzling ‘workflow’, which is likely to be decades, rather than months, away.

At the moment, as is clear from very recent trends in scientific publications, and from early AI-aided outings such as the recent Indiana Jones movie, there is not only a clear place for autoencoder and generative (LDM) systems within more conventional CGI pipelines, but a clear place for CGI within ‘purer’ neural workflows.

The evidence suggests that in the interim between the emergence of generative AI and the (currently) science-fictional text-to-movie system that so many feel is just around the corner, the two technologies are set to deepen a fruitful and potentially wonderful collaboration.

† Sources:
https://civitai.com/images/5060081
https://civitai.com/images/3934889
https://civitai.com/images/1781492
https://civitai.com/images/1645961