If you have any interest in image synthesis architectures such as Generative Adversarial Networks (GANs) or latent diffusion frameworks like Stable Diffusion, you’ll have noticed references to the ‘latent space’ of these systems.
The latent space, which we’ll look into more deeply here, is the ‘subconscious’ and overarching understanding of relationships between learned data points that a machine learning system has been able to derive from the information that it gets fed.
Arguably, we also put our own picture of the world together in this way, according to imperfect information that enters our developing consciousness either in an ordered way (didactic education) or an unordered way (random chance and happenstance).
With our very survival at stake, we’re likewise challenged to categorize and order incoming information into a cohesive and functional network of relationships which will eventually inform our rational processes and cognitive abilities.
Though we can all probably individuate pivotal events and data that defined our development, most of us have little better than an intuitive or vague understanding of our own model of the world. Like trained AI systems, we demonstrate these nascent connections most in our choices and interpretations of events and data. The operations of the supporting mechanisms, many of which were formed before our age of reason, are not always clear to us.
So it is with the latent space, which is not a pre-formed ‘mail-room’ waiting for parcels to put into cubby-holes, but which will be architecturally defined by the data that fills it – and which tends to resist active investigation.
The Utility of the Latent Space
Being able to operate at will in the latent space is a powerful method of gaining increased and more granular control over a trained system, because you can potentially tell a process or agent where to start, and what it is that you want to achieve.
For example, in the image below, we observe the process of cycling through embeddings of faces trained into the latent space of a Wasserstein GAN with Gradient Penalty (WGAN-GP) model:
In the animation below, we see researchers from the Chinese University of Hong Kong and the Australian National University cycling through church designs in a trained GAN using a simple ‘hand’ cursor – a feat made possible by using ‘heat maps’ to illuminate the otherwise-hidden routes between embeddings in the latent space, and to instrumentalize them:
A little closer to our own area of interest, greater control of the latent space means potential mastery of the transformational powers of a trained image synthesis system, such as changing just one aspect of a facial representation, by knowing exactly where, for instance, the ‘hair’ or ‘mouth expression’ codes are located, and operating exclusively on them:
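The two kinds of operation described in this section, cycling between embeddings and nudging a single attribute, can be sketched in a few lines. This is a minimal illustration on toy data, not any particular project's method: the 8-dimensional latent size, the helper names `lerp`, `latent_walk` and `shift_along`, and the axis-aligned ‘hair’ direction are all assumptions made for the example. In a real system each resulting vector would be passed to the trained generator, and attribute directions are rarely aligned with a single axis; they generally have to be discovered.

```python
import random


def lerp(z_start, z_end, t):
    """Linearly interpolate between two latent vectors at fraction t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def latent_walk(z_start, z_end, steps):
    """Return `steps` evenly spaced latent vectors from z_start to z_end inclusive."""
    if steps < 2:
        raise ValueError("a walk needs at least a start and an end point")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


def shift_along(z, direction, strength):
    """Move a latent vector along one direction, leaving other directions untouched."""
    return [a + strength * d for a, d in zip(z, direction)]


rng = random.Random(0)
LATENT_DIM = 8  # toy size (assumption); real generators use far larger latent vectors

# Two random latent codes standing in for two learned faces.
z_a = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
z_b = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

# 'Cycling' between them: five codes, each of which a generator could render.
path = latent_walk(z_a, z_b, steps=5)

# Hypothetical 'hair' direction: here simply axis 2, purely for illustration.
hair_direction = [0.0] * LATENT_DIM
hair_direction[2] = 1.0
z_edited = shift_along(z_a, hair_direction, strength=1.5)
```

The design point is that both operations are plain vector arithmetic; the difficulty lies entirely in knowing which start points and directions are meaningful.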
The Latent Maze
However, the latent space is a very strange place. You can’t explicitly design one, except to the extent that you refine and curate the data that it feeds on. Rather, it’s assembled by the training algorithm over a period of some hours (or even some weeks or months), and represents a vast, multi-dimensional array of values that the AI was able to extract from the data that you gave it.
The latent space is ‘multi-dimensional’ for two reasons: firstly, because many of the embeddings (i.e. the extracted information) that occupy it belong in more than one place, and can’t be accommodated by something as simple as a Venn diagram or a magic quadrant. Secondly, and more importantly, because it can cohesively represent a very broad grouping (such as ‘people’), a narrower group (such as ‘women’), and also represent a single embedding that belongs to all these categories (such as ‘Taylor Swift’, a valid entry in all three levels of dimensionality).
The diversity of categories in a complex latent space, such as a general latent diffusion image synthesis system like Stable Diffusion, can place embeddings in multiple ‘drill-down categories’ of this kind, such as people>women>Taylor Swift, workers>entertainers>Taylor Swift, or composers>modern>Taylor Swift.
In the animated image above, we see several searches, for the terms ‘man’, ‘woman’ and ‘person’, drilling down into the latent space of the Word2Vec 10K Natural Language Processing dataset, and revealing where these values lie in what otherwise appears to be a vast and messy self-assembled cloud of embeddings.
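A search of this kind usually comes down to ranking embeddings by their similarity to a query vector. The sketch below assumes cosine similarity as the measure, and uses a hand-made, three-dimensional toy vocabulary (`TOY_EMBEDDINGS`) rather than real Word2Vec vectors, which have hundreds of dimensions; the function names are likewise invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, embeddings, k=2):
    """Return the k words whose vectors lie closest in direction to the query word."""
    q = embeddings[query]
    scored = [
        (cosine_similarity(q, vec), word)
        for word, vec in embeddings.items()
        if word != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Toy vectors (assumption): two loose clusters, 'people' and 'buildings'.
TOY_EMBEDDINGS = {
    "man": [0.9, 0.1, 0.0],
    "woman": [0.8, 0.3, 0.0],
    "person": [0.85, 0.2, 0.05],
    "church": [0.0, 0.1, 0.95],
    "cathedral": [0.05, 0.0, 0.9],
}

people_neighbours = nearest("person", TOY_EMBEDDINGS, k=2)
building_neighbours = nearest("church", TOY_EMBEDDINGS, k=1)
```

Nothing in the code knows what a ‘person’ is; the clusters exist only because of where the vectors happen to sit, which is the sense in which the cloud is self-assembled.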
If we just stay out of the latent space of a trained system and run queries on it, we can get the useful results that were the objective of training the system in the first place, such as generating novel photorealistic faces, or obtaining conversational reasoning from a system such as OpenAI’s GPT-3.
However, compared to the extent to which we can control information in pre-AI systems such as Photoshop, or in CGI workflows, or in a program such as Microsoft Word, we have remarkably little ability to intervene in a latent space once it has formed, or even to understand how the interrelationships between the data points operate.
This makes for powerful but opaque systems, an uncomfortably Druidic workflow, and for systems that cannot easily be investigated for bias. The latter issue has made explainable AI (XAI) a driving concern for state and responsible private sector use of machine learning systems trained on under-curated, web-scraped data that’s likely to embed undesirable biases into the output of trained systems.
A Crude Flashlight in the Latent Space
Actually tracing the sequence of events that define an AI’s decisions at inference time is challenging, due to the ‘holographic’ nature of the latent space, and the complexity of the interrelationships between the embeddings that it contains.
One popular and (by now) quite mature solution, frequently featured in new research aimed at demystifying the latent space, and also used in the ‘morphing church’ project featured earlier, is Gradient-weighted Class Activation Mapping (Grad-CAM), which originated with researchers at the Georgia Institute of Technology and Facebook AI Research.
Grad-CAM uses the gradients of a target output, flowing back into the final convolutional layer of a network, to generate ‘heat-maps’ that make explicitly visible which regions of an image the trained neurons relied on in response to a request from the user. This provides some kind of ‘justifying rationale’ that could help researchers identify the ways in which undesirable bias is formed, and could also prove an aid for image synthesis systems that wish to individuate and disentangle certain visual elements, without adversely affecting other elements of a generated image or video.
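The core arithmetic of a Grad-CAM heat-map is compact enough to show directly. In the published method, the gradients of the target score with respect to each feature map of the last convolutional layer are averaged into one weight per channel; the feature maps are then combined using those weights, and negative values are clipped to zero. The sketch below assumes the activations and gradients have already been extracted from a network (the part that needs a deep learning framework), and uses tiny hand-written 2x2 maps. The function name `grad_cam`, the nested-list layout, and the final rescaling to [0, 1] are choices made for this illustration, and the usual step of upsampling the map to the input image size is omitted.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into one heat-map, weighted by their average gradient.

    activations[k][i][j]: value of feature map k at spatial position (i, j).
    gradients[k][i][j]:   gradient of the target score w.r.t. that value.
    """
    n_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # One importance weight per channel: the spatial mean of its gradients.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum across channels, then ReLU to keep only positive evidence.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Rescale to [0, 1] for display (a convenience, skipped if the map is all zero).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input (assumption): channel 0 fires top-left and supports the target;
# channel 1 fires bottom-right and counts against it.
toy_activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-0.5, -0.5], [-0.5, -0.5]],
]

heatmap = grad_cam(toy_activations, toy_gradients)
```

In this toy case only the top-left cell lights up: the bottom-right activation is strong, but its negative weight means it is clipped away, which is how the map separates supporting evidence from everything else the network noticed.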
However, Grad-CAM was released five years ago now, and the majority of new research into XAI and conceptual mapping of the latent space continues to be passive and interpretive, while the latent space itself remains a challenging mystery.
Since one objective of machine learning research is to understand the world as it is, ‘anti-bias’ latent space exploration systems are operating under a kind of ideological conflict: if they permit generative AI systems to become entirely bowdlerized (or, in the terminology of the worst of Reddit and Twitter, ‘woke’), then whatever appeal or unfettered creative impetus popularized the system will presumably migrate to less locked-down frameworks and outlets.
On the other hand, if they do nothing but observe and analyze the output of freely available latent spaces, such as Stable Diffusion, the resulting controversies seem likely only to proliferate. This characterizes the challenge as a cultural and ethical one, far beyond the purview of the ‘enabling’ technologies, which would seem to be a barometer of culture rather than a ‘villainous’ transformative technological phenomenon.
Conclusion: The Price of Automation
If you ask Mary Poppins to go in and magically clean up your messy bedroom, you’re saving a lot of personal labor; but you’re not going to know where anything is later, because she follows her own system. It’s probably a better system than yours, but that doesn’t help you find your clean socks.
So it is with the latent space, which, in formation, has traversed and categorized millions, or even billions, of data points at a post-human scale and speed, deriving the requisite compositional logic and relationships from cheaply and often badly labeled data which would be unconscionably expensive to curate manually (if far more accurately and fairly).
Whether for addressing bias or gaining greater control over image synthesis, new methodologies and more transparent tools than Grad-CAM will be needed to gain a deeper understanding of the way that the latent space functions. The challenge is compounded by the fact that the architecture of a latent space emerges directly from the data, rather than following some predefined and universal logic, the imposition of which could hinder a machine learning system’s most fruitful and inspired insights – as well as some of its most objectionable standpoints.