Contrastive Language-Image Pre-training (CLIP)


About the author

Martin Anderson


I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


Released in January of 2021, the source code for OpenAI’s Contrastive Language-Image Pre-Training (CLIP) framework has, at the time of writing, been forked 1,700 times and gathered 11,200 stars on GitHub. CLIP crops up increasingly in computer vision research papers, particularly in research related to image and video synthesis – but also as a tool for a variety of related tasks, such as auto-captioning.

CLIP was trained on 400 million image/text pairings, mostly obtained from minimally-curated web-scraped data. This means that the quality and appropriateness of the pairing depends on how appositely and accurately the images were captioned, and the extent to which the captions may reflect any biases or agenda of the captioner.

Since manually checking those captions (and the quality of their relationship to the associated images) is a logistically and financially prohibitive task, CLIP is not a perfect system, and reflects some of the biases inherent in this ‘free’ data – but since it solves so many traditional problems related to image synthesis, and in such an adroit way, CLIP has captured the imagination of the research community, and now represents a notable transformative power in generative frameworks.

One of many online resources where the interpretive powers of CLIP can be tested. This one, hosted at Hugging Face, uses your text prompts to find and present the images most closely related to the query. Source: https://huggingface.co/spaces/vivien/clip

CLIP has been adapted into non-English languages, including Chinese; has been described as a ‘strong yet embarrassingly simple baseline’ for many of the thorniest problems in continual learning; has been used for NeRF-based 3D model generation; is being leveraged across various projects for zero-shot semantic segmentation; is proving a useful resource in robotics, by bridging the semantic gap between what a machine sees and related natural language concepts; has, as mentioned, been adapted into a text-only image captioning system; and is becoming a mainstay and quality-driver in the hugely popular Stable Diffusion text-to-image latent diffusion system (through the open source OpenCLIP).

Let’s take a look at what CLIP can do – and some of the things that it can’t do; at least, yet.

How CLIP Works

CLIP attempts to form relationships between images and text by learning text/image associations from very large numbers of such data pairs – i.e., images drawn from public internet resources that have associated text, either in the form of metadata (such as alt-tags), or in explicit captions.
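The learning setup described above is typically implemented as a symmetric contrastive loss: each image in a batch should land close to its own caption's embedding and far from every other caption's, and vice versa. The following NumPy sketch illustrates the idea under the assumption of already-computed embeddings; the function name and temperature value are illustrative, not CLIP's exact implementation:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    Row i of image_emb is assumed to pair with row i of text_emb;
    all other rows in the batch serve as negatives.
    """
    # L2-normalise so the dot product becomes cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the matching caption sits on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

In the real model, the temperature is a learned parameter and the embeddings come from a Transformer text tower and a ViT/ResNet image tower; here the inputs are just arrays, which is enough to show why matched pairs drive the loss toward zero.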

Examples of the kind and quality of captions with which CLIP must assemble its semantic logic. As is clear, the captions are frequently recursive (lower-most example) or in some other way non-descriptive, due to the lack of resources available for more insightful and granular annotation. Source: https://wandb.ai/dalle-mini/dalle-mini/reports/OpenAI-CLIP-Score-exploration--VmlldzoxNjMwODM1

Because it is trained on such a vast and diverse corpus of data, CLIP is able to make zero-shot predictions relating to user queries; which is to say that it has a good chance of associating the user-submitted query ‘cat’ with a picture of a cat (see image above), or of selecting an appropriate image for the user-submitted text query ‘cat’ – without having been explicitly trained to detect cats, animals, or any related higher-level category.
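At inference time, this zero-shot step amounts to comparing an image's embedding against the embeddings of candidate text prompts and picking the closest. A toy sketch, with hand-made three-dimensional vectors standing in for the outputs of CLIP's real encoders (all names and values here are illustrative assumptions):

```python
import numpy as np

# Hypothetical pre-computed prompt embeddings; in practice these would
# come from the text tower of a trained CLIP model.
PROMPT_EMBEDDINGS = {
    "a photo of a cat": np.array([0.9, 0.1, 0.0]),
    "a photo of a dog": np.array([0.1, 0.9, 0.0]),
    "a photo of a car": np.array([0.0, 0.1, 0.9]),
}

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is most similar to the image's."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {prompt: cosine(image_embedding, emb)
              for prompt, emb in prompt_embeddings.items()}
    return max(scores, key=scores.get)

# An image embedding that (by construction) lies near the 'cat' direction
image_emb = np.array([0.8, 0.2, 0.05])
print(zero_shot_classify(image_emb, PROMPT_EMBEDDINGS))  # → a photo of a cat
```

No ‘cat classifier’ was ever trained here: the class set is defined entirely by the prompts supplied at query time, which is what makes the approach zero-shot.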

Since the prior standard in computer vision research was to train a model on task-specific data (for instance, a recognition model designed exclusively to detect intruders in a security system), CLIP represents a notable departure from, and augmentation of, existing research culture.

Besides being a multimodal (in this case, text+image) system, CLIP’s utility lies mainly in the sheer breadth of the data that it was trained on. This means that CLIP is capable of performing useful operations on out-of-distribution (OOD) data – i.e., images and concepts that it has never been directly exposed to.

Examples of predictions from diverse 'traditional' datasets such as ImageNet, where CLIP has chosen the most apposite caption from several possible contenders. Source: https://openai.com/blog/clip/

In this way, CLIP can act as a functional intermediary for far more limited computer vision and image synthesis systems, since their specific interests (such as faces, bridges, churches, etc.) are likely to already have been well-incorporated into CLIP.

The architecture of CLIP is centered around twin encoders – a text encoder and an image encoder.

From the official release paper for CLIP, the architectural workflow for the system. Source: https://arxiv.org/pdf/2103.00020.pdf

During training, image and associated text data is fed at volume into the system until common features are identified. In this sense, a feature is a persistent impression of a concept such as ‘dog’, where the most common visual canine traits coalesce into an impressionistic embedding. This embedding becomes correlated with the words that accompanied the pictures that created it.

In this way, the system now has two methods of extracting (or even creating) ‘dog’-related content – through words and through images – and can serve as pre-training for novel systems.
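The twin-encoder arrangement described above can be sketched with random linear projections standing in for the real towers: each modality is mapped from its own feature size into a single shared, L2-normalised space where dot products are directly comparable. The dimensions and weights below are arbitrary illustrative assumptions, not CLIP's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the two towers: each is just a random linear projection
# from a modality-specific feature size into a shared embedding dimension.
# (Real CLIP uses a Transformer for text and a ViT/ResNet for images.)
SHARED_DIM = 4
W_image = rng.normal(size=(512, SHARED_DIM))  # e.g. 512-dim image features
W_text = rng.normal(size=(256, SHARED_DIM))   # e.g. 256-dim text features

def encode(features, weights):
    """Project modality-specific features into the shared space, L2-normalised."""
    z = features @ weights
    return z / np.linalg.norm(z)

image_vec = encode(rng.normal(size=512), W_image)
text_vec = encode(rng.normal(size=256), W_text)

# Both vectors now live on the same unit sphere, so they can be compared directly
similarity = float(image_vec @ text_vec)
```

Because both vectors are unit-length points in the same space, their dot product is a cosine similarity in [-1, 1] – which is what makes cross-modal retrieval (text-to-image or image-to-text) a simple nearest-neighbour search.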

Working With Available Material

For SEO purposes, much of the text data for the images on which CLIP was trained features ‘keyword stuffing’ – an old trick, much diminished in effectiveness over the years, in which the available text space is packed with as many ‘related words’ and concepts as possible, in the hope that the image, once indexed by search engines, will surface in the widest possible range of results – which the uploader hopes will in turn bring additional clicks, traffic, and a better search engine ranking.

Examples of the diversity of caption quality contributing to AI systems powered by uncurated or under-curated public data. Source: https://jalammar.github.io/illustrated-stable-diffusion/

Since there is little scope to manually edit so many captions, inaccurate or tangential captions remained in place during CLIP training. In the right-most image above, we see some typical ‘black hat’ SEO tricks, where the uploader has appended the absolutely unrelated ‘home design ideas’ and other similar tags to a popular gaming/outdoors image.

If this kind of unrelated alt-tag is infrequent, it won’t become damagingly embedded into a core concept of what the image is truly about – but it does represent ‘noisy’ data, and can affect CLIP’s accuracy in some edge cases, and even in general usage.

Likewise, captions or alt-tags may be overly minimal or unhelpfully lyrical – for example, the middle-image above contains information about what the eagle depicted is doing (‘soaring’), but does not contain the words ‘eagle’ (or the higher-level concept/class ‘bird’), leaving CLIP to re-associate that concept by itself, based on whatever other ‘eagle’ or avian images it has assimilated.

For this reason, among others, CLIP-dependent text-to-image generative systems such as Stable Diffusion may exhibit strange or unexpected associations with particular phrases, or present bizarre conjunctions of concepts that most people would not associate with the user’s text prompt.

Ironically, since CLIP is currently becoming popular as a potential automated way of generating image captions, any such inaccuracies in its text/image associations risk becoming multiplied and truly ‘ingrained’ over time, as future versions of CLIP and its many derived forks begin to ‘feed’ on web-based images that were themselves captioned by CLIP.

Limitations of Prediction Capabilities in CLIP

CLIP is so impressive, and so useful, that it’s easy to forget that it’s only regurgitating human-annotated images; that the ‘intelligence’ on display is actually human intelligence, iterated systematically; and that, by itself, CLIP can make some rather unhelpful assumptions, associations and predictions in regard to image data.

One recent paper, a collaboration between Columbia University and Microsoft Research, observes ‘We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions.’

In other words, CLIP cannot easily explain why it is presenting an association or an image, because it is simply a very complex web of image/word associations, albeit one containing complex arrays of detailed class hierarchies (e.g., animal > dog > golden retriever).

The aforementioned paper offers an approach missing in CLIP – a method of justifying the way that contributing elements can lead a visual recognition system to make a prediction:

Explaining predictions. Source: https://arxiv.org/pdf/2212.06202.pdf

In the above example from the paper, it’s possible that CLIP is capable of recognizing every single constituent part of the image, such as individual vegetables; but it is more likely to make a correct ‘Greek salad’ prediction based on its trained knowledge of whole pictures of Greek salads (i.e., based on pixel-derived features of a particular formation of colors and shapes falling into a ‘food’ class hierarchy) than by identifying the individual facets of the salad and recognizing the correct association that leads to ‘Greek salad’.

This is due, again, to the nature of the training data, which is more likely to have broad and encompassing text/image captions, rather than fine-grained multiple captions for the constituent parts of any particular image. Whether such relationships are noticed and embedded in their own right falls to CLIP, and depends on the number of images available for any given concept.

Where the data is scant, such relationships are unlikely to form. If there are twenty ‘Greek salad’ images captioned ‘Greek salad’, and only one that actually describes and annotates the ingredients, the latter will probably be treated as outlier data (unless the ‘inner objects’ described correspond enough to other points in the training data).

Reading the Room

Further to the challenge of inferring objects/concepts from their context and relationship to other objects, another recent paper, this time a collaboration between KAUST and Snap Inc., explores the extent to which an object’s relationship with other objects may help us to define what that object is – or, put more simply, the extent to which we more easily recognize things when they are in a recognizable rather than abstract context.

The KAUST/Snap collaboration takes a deeper look at how context and adjacent or nearby objects could be incorporated more into visual recognition and multimodal frameworks such as CLIP, building on the popularity of the ScanRefer and ReferIt3D datasets. Source: https://arxiv.org/pdf/2212.06250.pdf

However, inference through context can be a trap rather than a benefit for CLIP, which tends to entangle objects and concepts with their environments and/or adjacent concepts; and which will usually, operating as a component in a generative model such as Stable Diffusion, produce ‘obvious’ contexts.

For example, people in swimwear will normally be on the beach, and food will normally be on a table (and, due to the large amount of Instagramming of food that’s making its way into AI-facing datasets, will often be represented in an aerial view).

In the example (non-cherrypicked) images below, from a vanilla Stable Diffusion local installation, not one of the men is out of a beach context, and not one of the ‘meals’ is presented in the sense of a ‘family meal’. Instead, dominant trends in the training data seem to have dictated the overriding ambience and style, respectively, of well-voted ‘holiday’ images and Instagram/menu content.

Standard CLIP-aided Stable Diffusion output, uncurated, for 'a man in swimwear' and 'a meal', revealing in each case a generic and quite entangled context and style.

Additional limitations, outlined by OpenAI in the original presentation of the framework, include the fact that CLIP is not resilient enough to interpret imagery not covered in its original training data. This does not necessarily mean that such images were absent from the training data; it could instead indicate that the captioning was not of sufficient quality to categorize the imagery correctly.

One example of this is that CLIP achieves only 88% accuracy on character recognition – crucial functionality for interpreting text that may appear in images, such as signs, menus, or instructions.

OpenAI also notes that CLIP’s categories and classes are not necessarily granular enough for all the types of object it may encounter – such as particular models of car or other types of vehicle, or species of flower, and so forth.

