New research from the University of California at Berkeley improves notably on recent efforts to create AI-powered image-editing procedures – this time by combining the considerable calculative forces of OpenAI's GPT-3 Natural Language Processing (NLP) model with the latest version of Stability.ai's world-storming Stable Diffusion text-to-image and image-to-image latent diffusion architecture.
The new work, titled InstructPix2Pix, was trained on nearly half a million completely synthetic images, along with triplets of associated and derived text, yet is capable of operating well on real-world images, and of preserving compositionality and detail while enacting extraordinary changes.
The overriding priority of the system is to respect the existing composition while accommodating the requested edit. Therefore, unlike the Img2Img functionality present across various implementations of Stable Diffusion, InstructPix2Pix will, if possible, not treat the text instruction as a 'point of departure', or an 'inspiration' for a different kind of image than the one provided, but will prioritize the continuity of placement and context between the original and the edited image.
Note, for instance, in the image above, that the prompt 'Turn it into a still from a western' does not lead to an entirely photorealistic cowboy replacing Toy Story's Woody, as would likely happen by default with a Stable Diffusion img2img operation, but that the photorealistic element conforms instead to the existing caricatured composition of the original face.
The power of InstructPix2Pix is based on paired image data, a method of training image synthesis systems by multiple A>B examples.
In a paired image training workflow, high volumes of data 'couplets' are needed to demonstrate a 'before and after' transformation. Since the images must therefore be associated and transformative, the datasets are usually quite small, due to the effort needed to create such demonstrative transformations at scale.
Synthetic data, traditionally, has often led to synthetic-looking results on out-of-distribution data (i.e., when a model trained on synthetic data is used on real data). However, InstructPix2Pix, fueled by the photorealistic generative powers of the latest (V1.5) Stable Diffusion checkpoint, is able to quite easily straddle the margins between stylized, artificial and photorealistic editing results.
The paired data used for training InstructPix2Pix consists of one unedited image and one altered image, with the image's semantic content reimagined by GPT-3.
For the initial language element of this workflow, OpenAI's GPT-3 model is fine-tuned on a modest dataset of human-originated captions, in three parts (see image above): the standard accompanying caption for the image; potential edit instructions (such as 'Have her ride a dragon', instead of the horse in the original image); and an altered caption that's apposite for the edited image (such as 'woman riding a dragon').
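To make the three-part structure concrete, the following is a minimal sketch of how one such triplet might be represented and serialized for language-model fine-tuning. The class name, the field names, and the prompt/completion layout are hypothetical and are not taken from the paper; the input caption 'woman riding a horse' is likewise an assumed example inferred from the horse in the original image.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriplet:
    """One training example for the language stage (hypothetical layout)."""
    input_caption: str    # caption describing the original image
    instruction: str      # the requested edit, written as a command
    edited_caption: str   # caption describing the image after the edit


def to_finetune_record(triplet: EditTriplet) -> str:
    """Serialize a triplet as one JSON line: caption in, instruction + new caption out."""
    record = {
        "prompt": triplet.input_caption,
        "completion": f"{triplet.instruction}\n{triplet.edited_caption}",
    }
    return json.dumps(record)


example = EditTriplet(
    input_caption="woman riding a horse",      # assumed example caption
    instruction="Have her ride a dragon",
    edited_caption="woman riding a dragon",
)
print(to_finetune_record(example))
```

Under this layout, the fine-tuned language model only ever sees a caption and is asked to produce both the instruction and the edited caption.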
The fine-tuning data was taken selectively from the V2 6.5+ version of the LAION Aesthetics dataset that powers Stable Diffusion.
Once these triplets were generated, the researchers used Stable Diffusion to generate image pairs for the raw and amended captions. Since latent diffusion systems are prone to generate wildly different successive images, often even from the same prompt, the authors used the recent Google Research-led prompt-to-prompt system, which limits excessive 'rethinking' of a photo by forcing the image synthesis system to use text as a strict 'mask' for the areas that are to be changed, leaving the rest of the picture largely as it was before:
Prompt-to-prompt achieves this level of fidelity and detail retention by duplicating cross-attention weights in certain parts of the denoising process.
As seen in the earlier examples of the Beatles album cover, and in the lower-left example in the image above, some edits may require a major or overwhelming change in style, which should nonetheless retain the composition and essentiality of the submitted original picture.
Prompt-to-prompt can control this factor by controlling the percentage of denoising steps that feature shared attention weights.
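As a rough illustration of that control, the sketch below marks which denoising steps would reuse the attention weights of the 'before' image for a given fraction. The function name is hypothetical, and the choice to share weights during the earliest steps is an assumption about how such a schedule is laid out, not something stated in the text above.

```python
def shared_attention_schedule(num_steps: int, fraction: float) -> list[bool]:
    """Return one flag per denoising step: True where attention weights are shared.

    Assumption: sharing is applied to the first `fraction` of the steps,
    and the remaining steps are denoised independently.
    """
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    cutoff = round(num_steps * fraction)
    return [step < cutoff for step in range(num_steps)]


# A low fraction permits a larger departure from the original image,
# a high fraction keeps the edited image closer to it.
loose = shared_attention_schedule(10, 0.3)
tight = shared_attention_schedule(10, 0.9)
print(sum(loose), sum(tight))
```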
Even then, however, cross-image fidelity proved not to be accurate enough to provide an adequate level of quality for the training data. Therefore the new system used a CLIP-based metric (introduced in StyleGAN-NADA), i.e. a method of determining image similarity based on OpenAI's hugely popular multimodal image interpretation system, which can determine and regulate the relationship between an image and the words or phrases that are (or might be) associated with it.
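The metric in question is usually described as a directional similarity: it asks whether the change between the two images points the same way, in CLIP's embedding space, as the change between the two captions. Below is a minimal sketch using toy two-dimensional vectors in place of real CLIP embeddings; the function names and the vectors are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def directional_similarity(img_before, img_after, txt_before, txt_after) -> float:
    """Cosine between the image-embedding change and the caption-embedding change."""
    image_delta = [after - before for before, after in zip(img_before, img_after)]
    text_delta = [after - before for before, after in zip(txt_before, txt_after)]
    return cosine(image_delta, text_delta)


# Toy embeddings (assumed values): the image changed in the same direction as the caption.
score = directional_similarity([1.0, 0.0], [1.0, 1.0], [2.0, 0.0], [2.0, 3.0])
print(score)
```

A score near 1 suggests that the edit moved the image in the direction the caption change describes; a score near zero or below suggests that it did not.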
The Stable Diffusion images were created with the Euler ancestral sampler, with both the 'before' and 'after' images proceeding from identical initial noise states. For the second image ('after'), Prompt-to-prompt was employed to perform attention weight displacement, ensuring semantic and compositional parity between the images.
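The 'identical initial noise' requirement amounts to seeding the noise generator the same way for both images. The sketch below shows the idea with Python's standard random number generator standing in for a real latent tensor; the function name, the seed and the vector length are illustrative assumptions.

```python
import random


def initial_noise(seed: int, size: int) -> list[float]:
    """Draw `size` standard-normal values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


seed = 1234  # hypothetical seed shared by the 'before' and 'after' generations
noise_before = initial_noise(seed, 16)
noise_after = initial_noise(seed, 16)

# Both generations start from exactly the same noise, so differences in the
# outputs come from the captions and attention handling, not from the starting point.
print(noise_before == noise_after)
```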
The results were filtered by CLIP so that data to be included for training would have a minimum image-image CLIP threshold of 0.75, and an image-caption threshold of 0.2, to ensure that the system was also being faithful to the text portion of the data.
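A filter of this kind can be sketched as follows, using the two thresholds quoted above. The embeddings here are toy vectors rather than real CLIP outputs, the function names are hypothetical, and checking each image against its own caption is an assumption about how the image-caption threshold is applied.

```python
import math

IMAGE_IMAGE_MIN = 0.75    # minimum similarity between 'before' and 'after' images
IMAGE_CAPTION_MIN = 0.2   # minimum similarity between an image and its caption


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(img_before, img_after, cap_before, cap_after) -> bool:
    """Keep a generated pair only if it clears both similarity thresholds."""
    images_consistent = cosine(img_before, img_after) >= IMAGE_IMAGE_MIN
    before_on_caption = cosine(img_before, cap_before) >= IMAGE_CAPTION_MIN
    after_on_caption = cosine(img_after, cap_after) >= IMAGE_CAPTION_MIN
    return images_consistent and before_on_caption and after_on_caption


# Toy embeddings (assumed values) for one candidate pair.
print(keep_pair([1.0, 0.0], [0.9, 0.1], [1.0, 0.5], [1.0, 1.0]))
```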
The fine-tuned Stable Diffusion model for InstructPix2Pix was initialized on the standard base weights of the latest available checkpoint (i.e., the trained model that powers Stable Diffusion's synthesis capabilities), with the addition of new input channels on the first convolutional layer (all initialized at zero).
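The point of the zero initialization is that the widened layer behaves exactly like the pretrained one at the start of fine-tuning, whatever is fed into the new channels. The sketch below demonstrates this for a single pixel of a 1x1 convolution written in plain Python; the weights, the helper names, and the channel counts (four original plus four new) are illustrative assumptions.

```python
def conv1x1(weights: list[list[float]], pixel: list[float]) -> list[float]:
    """Apply a 1x1 convolution at one pixel: weights is [out_channels][in_channels]."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


def add_zero_input_channels(weights: list[list[float]], extra: int) -> list[list[float]]:
    """Widen every output row with `extra` zero weights for the new input channels."""
    return [row + [0.0] * extra for row in weights]


# Hypothetical pretrained weights: 2 output channels, 4 input channels.
pretrained = [[0.5, -1.0, 2.0, 0.25],
              [1.0, 0.0, -0.5, 3.0]]
expanded = add_zero_input_channels(pretrained, extra=4)

noisy_latent = [0.1, 0.2, 0.3, 0.4]    # what the original layer already received
image_latent = [9.0, -7.0, 5.0, 1.0]   # new conditioning input, arbitrary values

original_out = conv1x1(pretrained, noisy_latent)
expanded_out = conv1x1(expanded, noisy_latent + image_latent)
print(original_out, expanded_out)
```

Because the new weights start at zero, the conditioning image initially contributes nothing, and its influence is learned gradually as training updates those weights.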
The revised, fine-tuned model was trained for 10,000 steps on eight NVIDIA A100 GPUs, each with 40GB of VRAM, for a total training time of over 25 hours, at 256x256px resolution, with a batch size of 1024 and a learning rate of 10⁻⁴.
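Some back-of-envelope arithmetic puts these figures in context. The sketch below uses only the numbers quoted above, plus two stated assumptions: that the batch is split evenly across the GPUs, and that the dataset is rounded up to half a million images.

```python
steps = 10_000
batch_size = 1024
num_gpus = 8
training_hours = 25          # quoted as "over 25 hours", so this is a lower bound

samples_seen = steps * batch_size          # image pairs processed during training
per_gpu_batch = batch_size // num_gpus     # assumes an even split across GPUs
gpu_hours = num_gpus * training_hours      # lower bound on total GPU time

approx_dataset_size = 500_000              # assumption: "nearly half a million", rounded up
approx_epochs = samples_seen / approx_dataset_size

print(samples_seen, per_gpu_batch, gpu_hours, approx_epochs)
```

On these assumptions, the model passes over the synthetic dataset roughly twenty times.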
The paper notes that the model performs well at 512x512px, even though trained on images half that size. The base Stable Diffusion checkpoint is trained, at greater expense of time and resources, at a native 512x512px; though the paper does not bring up the point, SD's most stable native performance resolution (512x512px) seems to have had a beneficial effect on the fine-tuned system, despite the lower input resolution.
InstructPix2Pix uses a novel implementation of Google's classifier-free guidance (CFG) in order to bolster the system's disposition to retain original structure and detail. CFG is a familiar 'slider' or parameter in a Stable Diffusion distribution, and controls the extent to which the generated image should be faithful to the text prompt.
In cases where the trained data simply does not include enough material or adequate data relationships to do the user's bidding, the user can increase CFG, usually at a penalty of image quality, so that the image is accordant with the command (though the result may often be unusable at the higher values).
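In its commonly described form, CFG works by running the denoiser twice, once with the prompt and once without, and then pushing the result away from the unconditional prediction by a chosen scale. The sketch below shows that combination on toy values; the function name and the numbers are illustrative, and short lists stand in for real noise predictions.

```python
def classifier_free_guidance(eps_uncond: list[float],
                             eps_cond: list[float],
                             scale: float) -> list[float]:
    """Extrapolate from the unconditional prediction towards the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # toy prediction with no prompt
eps_cond = [1.0, 3.0]     # toy prediction with the prompt

# scale = 0 ignores the prompt, scale = 1 is the plain conditional prediction,
# and larger values exaggerate the prompt's influence.
for scale in (0.0, 1.0, 7.5):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```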
Thus InstructPix2Pix features two guidance scales to accommodate this tension between fidelity to the edit instruction, and the original image.
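One way to picture the two scales is as a pair of separate pushes: one from 'no conditioning' towards 'image only', and another from 'image only' towards 'image plus instruction'. The sketch below follows the form of the combination given in the paper, as far as it can be summarized here; the function name, argument names and toy values are illustrative assumptions, and short lists stand in for real noise predictions.

```python
def dual_guidance(eps_none: list[float],
                  eps_image: list[float],
                  eps_full: list[float],
                  image_scale: float,
                  text_scale: float) -> list[float]:
    """Combine three denoiser predictions using separate image and text scales.

    eps_none:  prediction with neither image nor instruction
    eps_image: prediction conditioned on the input image only
    eps_full:  prediction conditioned on both image and instruction
    """
    return [
        n + image_scale * (i - n) + text_scale * (f - i)
        for n, i, f in zip(eps_none, eps_image, eps_full)
    ]


# Toy values: raising image_scale favours the original image,
# raising text_scale favours the edit instruction.
print(dual_guidance([0.0], [1.0], [3.0], image_scale=1.5, text_scale=7.5))
```

With both scales set to 1, the expression collapses to the fully conditioned prediction, which is a useful sanity check on any implementation.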
The researchers pursued qualitative comparisons with prior methods SDEdit (by Stanford and Carnegie Mellon universities) and Text2Live (a collaboration between the Weizmann Institute of Science and NVIDIA), in addition to quantitative comparisons to SDEdit.
In the qualitative results above, we can see best in the first two rows that 'adjacent data' is holding back the full power of the transformation, as the prior systems are unable to disentangle the requirements of the new instructions from the existing composition and detail of the submitted source image.
The authors observe:
'We notice that while SDEdit works reasonably well for cases where content remains approximately constant and style is changed, it struggles to preserve identity and isolate individual objects, especially when larger changes are desired.
'Additionally, it requires a full output description of the desired image, rather than an editing instruction.
'On the other hand, while Text2Live is able to produce convincing results for edits involving additive layers, its formulation limits the categories of edits that it can handle.'
In terms of image consistency (as determined by CLIP), the trade-off between edit-fidelity and the original image was best handled by the new system, compared with both of the tested implementations of SDEdit:
The authors further note:
'[We] find that when comparing our method with SDEdit, our results have notably higher image consistency (CLIP image similarity) for the same directional similarity values.'
Though the project page of InstructPix2Pix has placeholders only for code and an apparently forthcoming online demo, there is already talk of integrating the system into various popular distributions of Stable Diffusion (such as the AUTOMATIC1111 repo), or – as occurs when code is laggard or the source code closed – of recreating the described approach for a potential open source implementation.
The paper states that edits are returned in an average of nine seconds on a single A100 GPU, which, historically, indicates that consumer-level setups could be expected to wait somewhere between 30-90 seconds for a result from a moderate domestic GPU.
Considering the power and faithfulness of the granular edits and amendments of which InstructPix2Pix is capable, the new scheme may bring the AI image synthesis scene a notable step closer to a post-Photoshop milestone in semantic editability.