InstructPix2Pix: Accurate, AI-Based Image-Editing With GPT-3 and Stable Diffusion

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

New research from the University of California at Berkeley improves notably on recent efforts to create AI-powered image-editing procedures – this time by combining the considerable calculative forces of OpenAI’s GPT-3 Natural Language Processing (NLP) model with the latest version of Stability.ai’s world-storming Stable Diffusion text-to-image and image-to-image latent diffusion architecture.

Left, 'Lady in a Garden' by Edmund Leighton, transformed by simple instruction to a scene within a grocery store, where the position of the orange flowers has been retained, but the flowers transformed into abundant tomato displays. Right, a scenic panorama gets some boats, and then a city skyline. Source: https://arxiv.org/pdf/2211.09800.pdf

The new work, titled InstructPix2Pix, was trained on nearly half a million completely synthetic images, along with triplets of associated and derived text, yet is capable of operating well on real-world images, and of preserving compositionality and detail while enacting extraordinary changes.

The overriding priority of the system is to respect the existing composition while accommodating the requested edit. Therefore, unlike the Img2Img functionality present across various implementations of Stable Diffusion, InstructPix2Pix will, if possible, not treat the text instruction as a ‘point of departure’, or an ‘inspiration’ for a different kind of image than the one provided, but will prioritize the continuity of placement and context between the original and the edited image.

Further examples of text-based edits from InstructPix2Pix.

Note, for instance, in the image above, that the prompt ‘Turn it into a still from a western’ does not lead to an entirely photorealistic cowboy replacing Toy Story’s Woody, as would likely happen by default with a Stable Diffusion img2img operation, but that the photorealistic element conforms instead to the existing caricatured composition of the original face.

The Power of Two

The power of InstructPix2Pix is based on paired image data, a method of training image synthesis systems by multiple A>B examples.

Paired data is commonly used for training sketch-to-image systems, allowing the generation of photorealistic object depictions from relatively crude daubs, because the system has learned core relationships between representative scribbles and high-resolution output. In the example above, the paired data is shown in the first two images, with the resulting inference on the right. Source: https://learnopencv.com/paired-image-to-image-translation-pix2pix/

In a paired image training workflow, high volumes of data ‘couplets’ are needed to demonstrate a ‘before and after’ transformation. Since the images must therefore be associated and transformative, the datasets are usually quite small, due to the effort needed to create such demonstrative transformations at scale.

Synthetic data, traditionally, has often led to synthetic-looking results on out-of-distribution data (i.e., when a model trained on synthetic data is used on real data). However, InstructPix2Pix, fueled by the photorealistic generative powers of the latest (V1.5) Stable Diffusion checkpoint, is able to quite easily straddle the margins between stylized, artificial and photorealistic editing results.

The iconic cover of the Beatles' 'Abbey Road' album, with variations provided by InstructPix2Pix.

The paired data used for training InstructPix2Pix consists of one unedited image and one altered image, with the image’s semantic content reimagined by GPT-3.

[Image: training data generation]

For the initial language element of this workflow, OpenAI’s GPT-3 model is fine-tuned on a modest dataset of human-originated captions, in three parts (see image above): the standard accompanying caption for the image; potential edit instructions (such as ‘Have her ride a dragon’, instead of the horse in the original image); and an altered caption that’s apposite for the edited image (such as ‘woman riding a dragon’).
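As a rough illustration of the three-part text record described above, the sketch below models one triplet as a small Python data structure. The class and field names are hypothetical, and the wording of the original caption is an assumption inferred from the article's example; only the instruction and the edited caption are quoted from it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EditTriplet:
    """One text-only training record (class and field names are hypothetical)."""
    input_caption: str    # caption describing the original image
    instruction: str      # edit instruction proposed by the language model
    edited_caption: str   # caption describing the image after the edit


def caption_pair(triplet: EditTriplet) -> Tuple[str, str]:
    """Return the two captions that would be rendered as 'before' and 'after' images."""
    return triplet.input_caption, triplet.edited_caption


example = EditTriplet(
    input_caption="woman riding a horse",  # assumed wording, inferred from the edit
    instruction="Have her ride a dragon",
    edited_caption="woman riding a dragon",
)
print(caption_pair(example))
```

Only the two captions are needed to synthesize the image pair; the instruction is what the final editing model is later conditioned on.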

The fine-tuning data was taken selectively from the V2 6.5+ version of the LAION Aesthetics dataset that powers Stable Diffusion.

Here, the text highlighted in green has been generated by GPT-3 for the data creation and curation process.

Once these triplets were generated, the researchers used Stable Diffusion to generate image pairs for the raw and amended captions. Since latent diffusion systems are prone to generate wildly different successive images, often even from the same prompt, the authors used the recent Google Research-led prompt-to-prompt system, which limits excessive ‘rethinking’ of a photo by forcing the image synthesis system to use text as a strict ‘mask’ for the areas that are to be changed, leaving the rest of the picture largely as it was before:

Examples of minor and major changes executed by the prompt-to-prompt framework. Source: https://arxiv.org/pdf/2208.01626.pdf

Prompt-to-prompt achieves this level of fidelity and detail retention by duplicating cross-attention weights in certain parts of the denoising process.

Keeping it Real

As seen in the earlier examples of the Beatles album cover, and in the lower-left example in the image above, some edits may require a major or overwhelming change in style, which should nonetheless retain the composition and essentiality of the submitted original picture.

Prompt-to-prompt can regulate this factor by varying the percentage of denoising steps that feature shared attention weights.
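To make the idea of a 'percentage of denoising steps' concrete, here is a minimal sketch that marks which steps of a sampling run would share attention weights for a given fraction. The function name is hypothetical, and the choice to apply the sharing to the earliest steps is an assumption made for illustration, not something the article specifies.

```python
from typing import List


def shared_attention_mask(num_steps: int, fraction: float) -> List[bool]:
    """Mark which denoising steps share attention weights with the 'before' image.

    Assumption: sharing is applied to the first `fraction` of the steps,
    and the remaining steps are left free to diverge.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    cutoff = round(fraction * num_steps)
    return [step < cutoff for step in range(num_steps)]


# A low fraction leaves most steps free, allowing a larger change in style.
print(shared_attention_mask(10, 0.3))
```

A fraction near 1 would keep the 'after' image close to the 'before' image, while a fraction near 0 would allow the kind of overwhelming restyling seen in the album cover examples.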

Even then, however, cross-image fidelity proved not to be accurate enough to provide an adequate level of quality for the training data. Therefore the new system used a CLIP-based metric (drawn from StyleGAN-NADA), i.e. a method of determining image similarity based on CLIP, OpenAI’s hugely popular multimodal image interpretation system, which can determine and regulate the relationship between an image and the words or phrases that are (or might be) associated with it.

The Stable Diffusion images were created with the Euler ancestral sampler, with both the ‘before’ and ‘after’ images proceeding from identical initial noise states. For the second image (‘after’), Prompt-to-prompt was employed to perform attention weight displacement, ensuring semantic and compositional parity between the images.

The results were filtered by CLIP so that data to be included for training would have a minimum image-image CLIP threshold of 0.75, and an image-caption threshold of 0.2, to ensure that the system was also being faithful to the text portion of the data.
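The filtering step can be pictured with the sketch below, which applies the two thresholds quoted above to precomputed embeddings. It assumes the vectors come from a CLIP image encoder and text encoder (not loaded here); the function names are hypothetical; applying the image-caption threshold to both images is an assumption; and the directional check is optional because the article does not give a value for it.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def directional_similarity(img_before: Vector, img_after: Vector,
                           cap_before: Vector, cap_after: Vector) -> float:
    """Does the change between the images point the same way as the change between captions?"""
    image_delta = [y - x for x, y in zip(img_before, img_after)]
    text_delta = [y - x for x, y in zip(cap_before, cap_after)]
    return cosine(image_delta, text_delta)


def keep_pair(img_before: Vector, img_after: Vector,
              cap_before: Vector, cap_after: Vector,
              min_image_image: float = 0.75,
              min_image_caption: float = 0.2,
              min_directional: Optional[float] = None) -> bool:
    """Return True if a generated pair passes every enabled threshold."""
    if cosine(img_before, img_after) < min_image_image:
        return False
    if cosine(img_before, cap_before) < min_image_caption:
        return False
    if cosine(img_after, cap_after) < min_image_caption:
        return False
    if min_directional is not None:
        score = directional_similarity(img_before, img_after, cap_before, cap_after)
        if score < min_directional:
            return False
    return True
```

With toy two-dimensional vectors, `keep_pair([1, 0], [1, 0.1], [1, 1], [1, 1])` passes, while swapping the second image for `[0, 1]` fails the image-image check.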

Training

The fine-tuned Stable Diffusion model for InstructPix2Pix was initialized on the standard base weights of the latest available checkpoint (i.e., the trained model that powers Stable Diffusion’s synthesis capabilities), with the addition of new input channels on the first convolutional layer (all initialized at zero).
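The effect of zero-initializing the new input channels can be shown with a deliberately simplified sketch. It assumes the extra channels carry an encoding of the image to be edited, and it reduces the convolution to a single pixel with a 1x1 kernel; the real layer has a spatial kernel and far more channels, and the function names here are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # weights[out_channel][in_channel]


def expand_input_channels(weights: Matrix, extra: int) -> Matrix:
    """Append `extra` zero-initialized input channels to every output filter."""
    return [row + [0.0] * extra for row in weights]


def pointwise_conv(weights: Matrix, pixel: List[float]) -> List[float]:
    """Apply a 1x1 convolution to the channel values of a single pixel."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


pretrained = [[0.5, -1.0], [2.0, 0.25]]        # toy stand-in for checkpoint weights
expanded = expand_input_channels(pretrained, 2)

noisy_latent = [1.0, 2.0]                      # channels the base model already expects
image_condition = [3.0, -4.0]                  # assumed content of the new channels

before = pointwise_conv(pretrained, noisy_latent)
after = pointwise_conv(expanded, noisy_latent + image_condition)
print(before, after)
```

Because the new weights start at zero, the expanded layer initially ignores the extra input and reproduces the base checkpoint's output, so fine-tuning begins from the pretrained behavior rather than from noise.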

The revised, fine-tuned model was trained for 10,000 steps on eight NVIDIA A100 GPUs, each with 40GB of VRAM, for a total training time of over 25 hours, at 256x256px resolution, with a batch size of 1024 and a learning rate of 10⁻⁴.
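The scale of that run is easier to grasp with some quick arithmetic. The sketch below uses only the figures quoted above, treating 'over 25 hours' as a lower bound and 'nearly half a million' images as an upper bound of 500,000; both roundings are assumptions made for illustration.

```python
steps = 10_000
batch_size = 1024
gpus = 8
hours_lower_bound = 25          # the run took "over 25 hours"
dataset_upper_bound = 500_000   # "nearly half a million" training images

# Total number of training examples processed across the whole run.
samples_seen = steps * batch_size

# Minimum GPU-hours consumed, since the wall-clock time is a lower bound.
gpu_hours_lower_bound = gpus * hours_lower_bound

# Minimum number of passes over the data, since the dataset size is an upper bound.
min_passes = samples_seen / dataset_upper_bound

print(samples_seen, gpu_hours_lower_bound, min_passes)
```

In other words, the model saw a little over ten million examples, or more than twenty passes over the synthetic dataset, for at least 200 A100 GPU-hours.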

The paper notes that the model performs well at 512x512px, even though trained on images half that size. The base Stable Diffusion checkpoint is trained, at greater expense of time and resources, at a native 512x512px; though the paper does not bring up the point, SD’s most stable native performance resolution (512x512px) seems to have had a beneficial effect on the fine-tuned system, despite the lower input resolution.

Classifier-Free Guidance (CFG)

InstructPix2Pix uses a novel implementation of Google’s classifier-free guidance (CFG) in order to bolster the system’s disposition to retain original structure and detail. CFG is a familiar ‘slider’ or parameter in a Stable Diffusion distribution, and controls the extent to which the generated image should be faithful to the text prompt.

In cases where the trained data simply does not include enough material or adequate data relationships to do the user’s bidding, the user can increase CFG, usually at a penalty of image quality, so that the image is accordant with the command (though the result may often be unusable at the higher values).

Thus InstructPix2Pix features two guidance scales, one for the input image and one for the edit instruction, to accommodate this tension between fidelity to the edit instruction and fidelity to the original image.
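A minimal sketch of how two guidance scales might be combined is given below. It follows the formulation as I understand it from the paper, in which the network is evaluated three times per step (with no conditioning, with the image only, and with both image and instruction) and the differences are weighted separately. The function and argument names are hypothetical, and the noise predictions are plain lists rather than real network outputs.

```python
from typing import List, Sequence


def dual_guidance(eps_uncond: Sequence[float],
                  eps_image: Sequence[float],
                  eps_full: Sequence[float],
                  image_scale: float,
                  text_scale: float) -> List[float]:
    """Combine three noise predictions using separate image and text guidance scales.

    eps_uncond: prediction with neither image nor instruction
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on both image and instruction
    """
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Illustrative values only; real predictions are tensors shaped like the latent.
guided = dual_guidance([0.0, 1.0], [1.0, 1.0], [2.0, 3.0],
                       image_scale=1.5, text_scale=7.5)
print(guided)
```

Raising `image_scale` pulls the result toward the source picture, while raising `text_scale` pushes it toward the instruction; with both set to 1 the combination reduces to the fully conditioned prediction.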

Classifier-free guidance operating at various strengths and inputs in diverse attempts to transform Michelangelo's David into a cyborg.

Results

The researchers pursued qualitative comparisons with prior methods SDEdit (by Stanford and Carnegie Mellon universities) and Text2Live (a collaboration between the Weizmann Institute of Science and NVIDIA), in addition to quantitative comparisons to SDEdit.

Qualitative comparisons of InstructPix2Pix, comparing the system's performance against SDEdit and Text2Live. The extent to which the new system can operate discretely on elements of a source image is perhaps best demonstrated by the 'add a crown' command in the second row down. SDEdit results are shown both conditioned on the output caption and also on the edit string.

In the qualitative results above, we can see best in the first two rows that ‘adjacent data’ is holding back the full power of the transformation, as the prior systems are unable to disentangle the requirements of the new instructions from the existing composition and detail of the submitted source image.

The authors observe:

‘We notice that while SDEdit works reasonably well for cases where content remains approximately constant and style is changed, it struggles to preserve identity and isolate individual objects, especially when larger changes are desired.

‘Additionally, it requires a full output description of the desired image, rather than an editing instruction.

‘On the other hand, while Text2Live is able to produce convincing results for edits involving additive layers, its formulation limits the categories of edits that it can handle.’

In terms of image consistency (as determined by CLIP), the trade-off between edit-fidelity and the original image was best-handled by the new system, as opposed to both the tested implementations of SDEdit:

With higher scores better, InstructPix2Pix beats SDEdit in terms of retaining essential details and composition from the original image, in the edited version.

The authors further note:

‘[We] find that when comparing our method with SDEdit, our results have notably higher image consistency (CLIP image similarity) for the same directional similarity values.’

Conclusion

Though the project page of InstructPix2Pix has placeholders only for code and an apparently forthcoming online demo, there is already talk of integrating the system into various popular distributions of Stable Diffusion (such as the AUTOMATIC1111 repo), or – as occurs when code is laggard or the source code closed – of recreating the described approach for a potential open source implementation.

The paper states that edits are returned in an average of nine seconds on a single A100 GPU, which, historically, indicates that consumer-level setups could be expected to wait somewhere between 30 and 90 seconds for a result from a moderate domestic GPU.

Considering the power and faithfulness of the granular edits and amendments of which InstructPix2Pix is capable, the new scheme may bring the AI image synthesis scene a notable step closer to a post-Photoshop milestone in semantic editability.
