A researcher from Spain has developed a new method that lets users generate their own styles in Stable Diffusion (or any other publicly accessible latent diffusion model) without fine-tuning the trained model or needing access to exorbitant computing resources, as is currently the case with Google's DreamBooth and with Textual Inversion. Both of those methods are primarily intended to insert objects or people into the Stable Diffusion universe, rather than to impose environmental ambience or styles (i.e. 'in the style of Van Gogh/Kubrick/Mapplethorpe', etc.).
The new method does not specifically require Stable Diffusion, and would in theory work as well on other noise-based image generation architectures, such as DALL-E 2, Imagen or Parti, if one only had the same kind of extraordinary access to them that Stability.ai has allowed by open-sourcing Stable Diffusion.
The new system works by having the user train a novel and distinct adjunct file, almost instantly, on a limited number of photos and a single text embedding (rather than a text embedding for each photo, as is the case with the competing methods).
Since the object, in most cases, is to recreate a style rather than a specific object, only a single phrase or word is necessary, because the intention is for the user-created style to permeate and completely influence the image that results from the text prompt.
The system is titled aesthetic gradients. It is not only capable of imposing novel styles that the latent diffusion model is unaware of, but can also 'boost' existing styles which are (in the opinion of the end-user) too scantly represented in the dataset that trained the model powering the latent diffusion architecture.
In experiments, the paper's researcher 'augmented' some already-existent styles in Stable Diffusion by supplying additional image material.
In the above image, which apparently shares a frozen seed across all renders, the left-most image is basic Stable Diffusion output; the middle image, whose prompt appends a 'style-summoning' keyword, is mostly unaffected by the addition; and the final image, far-right, which uses the aesthetic gradients approach, changes notably, because it invokes far more associated material for the prompt than is extant in the standard Stable Diffusion model: material that has been supplied by the user.
In other words, if you add even more Van Gogh data to Stable Diffusion, as adjunct material, your '…in the style of Van Gogh' output will be much more…well, Van Gogh-y; and you won't, apparently, have to trash the conventional functionality of the model by fine-tuning it (a 'hijacking' exemplified by the recent Waifu Diffusion 'fork' model); buy a high-end video card; or resort to hiring external cloud-based GPU resources from the likes of Google Colab Pro and vast.ai, among others.
The method, which has been released on GitHub as a Stable Diffusion fork, only modifies the weights of the CLIP text encoder in Stable Diffusion. CLIP is the component that associates images with their labels, and it acts as an interpretive layer between the user's text prompt and the generative model, which then synthesizes an apposite image based on similar text/image associations trained into the model.
Aesthetic gradients essentially intervene in the standard prompt>CLIP>noise>image process by interposing the aesthetic embedding generated by the user (i.e., by their contributed images and single text definition).
The CLIP embeddings of the contributed images are averaged in the pipeline, and the result is finally normalized to unit norm (the paper's 'unitary norm'), so that it augments rather than substitutes the standard Stable Diffusion text2img process.
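The averaging-and-normalizing step can be illustrated with a minimal sketch. It is not the author's code: the function names are hypothetical, and the three-dimensional lists are toy stand-ins for real CLIP image embeddings, which are far larger and would come from CLIP's visual encoder. Whether each image embedding is also normalized before averaging is not stated in the article, so the sketch averages them as given.

```python
import math


def unit_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def aesthetic_embedding(image_embeddings):
    """Average a set of image embeddings, then normalize the mean to unit norm."""
    if not image_embeddings:
        raise ValueError("at least one image embedding is required")
    count = len(image_embeddings)
    dim = len(image_embeddings[0])
    mean = [sum(emb[i] for emb in image_embeddings) / count for i in range(dim)]
    return unit_normalize(mean)


# Toy stand-ins for the CLIP embeddings of three user-supplied images (assumed values).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
e = aesthetic_embedding(toy_embeddings)
```

However many images the user contributes, the result is a single vector, which is why the finished aesthetic embedding is so compact.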
The paper notes:
'The similarity between the two embeddings, computed as the dot product ceᵀ, can be used to measure the agreement between CLIP representation of the textual prompt and the preferences of the user.
'Thus, the previous expression can be used as a loss and we can perform gradient descent with respect to CLIP text encoder weights to drive the prompt representation towards the aesthetics of the user.'
In this way, the similarity between the prompt's CLIP embedding and the user's aesthetic embedding is essentially being used as a loss, so that the weights of CLIP's text encoder can be modified by gradient descent to steer the final result towards the user's preference rather than the default preference.
In the author's experiments, this process was very efficient, requiring only 20 gradient steps to make the embedding compatible with the standard CLIP encoder in Stable Diffusion (though the exact hardware used was not specified).
The author notes:
'The resulting [representation] is more aligned to the user preference, while preserving the original semantics…Note that only the weights of the CLIP text encoder are modified, nor the visual encoder nor any other component of the diffusion model.'
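To make the idea concrete, here is a toy sketch of the personalization loop, under loud assumptions: a tiny linear map followed by normalization stands in for the CLIP text encoder, the gradient is written out by hand, and the names `encode` and `personalize`, the learning rate and all numeric values are hypothetical rather than taken from the paper or its repository. Only the 'encoder' weights change; the aesthetic embedding stays fixed.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(weights, features):
    """Toy 'text encoder': a linear map followed by unit normalization."""
    raw = [dot(row, features) for row in weights]
    norm = math.sqrt(dot(raw, raw))
    return [r / norm for r in raw], norm


def personalize(weights, features, aesthetic, steps=20, lr=0.1):
    """Raise the similarity c . e by gradient steps on the encoder weights only.

    Maximizing the dot product is the same as descending on its negative.
    """
    weights = [row[:] for row in weights]  # work on a copy of the weights
    for _ in range(steps):
        c, norm = encode(weights, features)
        sim = dot(c, aesthetic)
        # Gradient of the similarity with respect to the un-normalized output.
        grad_raw = [(a - sim * ci) / norm for a, ci in zip(aesthetic, c)]
        # Chain rule through the linear map: d sim / d W[i][j] = grad_raw[i] * x[j].
        for i, row in enumerate(weights):
            for j in range(len(row)):
                row[j] += lr * grad_raw[i] * features[j]
    return weights


prompt_features = [1.0, 0.5]                            # stand-in for a tokenized prompt
initial_weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # stand-in for encoder weights
aesthetic = [0.0, 0.6, 0.8]                             # unit-norm stand-in for e

before, _ = encode(initial_weights, prompt_features)
tuned_weights = personalize(initial_weights, prompt_features, aesthetic)
after, _ = encode(tuned_weights, prompt_features)

sim_before = dot(before, aesthetic)
sim_after = dot(after, aesthetic)
```

In this toy setting `sim_after` ends up higher than `sim_before`, meaning that the prompt representation has moved towards the aesthetic embedding. In the real system the same update would be applied to the CLIP text encoder through automatic differentiation, with the rest of the diffusion model left untouched.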
The paper also observes that since the final output requires just a single embedding, the user saves storage space, and that this economy makes sharing much easier.
Though the very short paper gives scant details of the creation or training process, it does provide diverse examples of the system in use.
In initial tests, the researcher employed two sets of images: SAC8+, which is a subset of Simulacra Aesthetic Captions.
…and LAION7+, a subset of LAION Aesthetics v1, which can score images effectively based on aesthetics originally derived from human-based scores, and generalized into an applicable model:
In the latter case, users can see for themselves the difference that LAION's aesthetic score makes by activating the 'aesthetic score' drop-down menu (the feature is off by default) for 'cat' at the CLIP retrieval site for LAION.
For LAION7+, images were filtered for a rating of 7 (out of 10) or higher.
The author tested several aesthetic embeddings with a collection of prompts (extensively detailed in the source paper) of diverse complexity and length, and observes that SAC8+ produces more 'fantasy-like' imagery, while LAION7+ produces more floral patterns, exemplifying the extent to which the user can potentially gain control over a suitable environment for objects and people depicted (see 'Potential Applications' below):
If 'attractive' pictures are the primary objective (and sometimes 'accuracy' or 'photorealism' may be more important), the new system quantifiably improves the aesthetic appeal of output pictures, at least for Stable Diffusion:
The paper emphasizes that the personalized model produces improved aesthetic scores without in any way modifying the source model or architecture.
Please refer to the paper itself for further qualitative results, including the appendix material, which includes tests wherein existing and unknown terms were calculated into embeddings, and either enabled new types of style to emerge, or augmented the aesthetic appeal and/or detail of existing styles.
In these secondary experiments, 100 images each for the terms cloudcore, gloomcore and glowwave were scraped from Pinterest using these target terms, while five images of paintings by the 19th-century Russian Romantic painter Ivan Aivazovsky were added to another dedicated embedding.
As is often the case with image synthesis system innovations (particularly those based on Stable Diffusion), there are other potential applications for these user-specified styles beyond simply generating one-shot, attractive pictures.
For instance, in the problematic task of achieving temporal coherence over a series of contiguous frames in Stable Diffusion, one of the chief issues is that lighting and environment are difficult to control as the sequence progresses.
Potentially, a lightweight system such as aesthetic gradients could allow a user to cheaply and quickly integrate very specific environmental conditions that are easily summoned into Stable Diffusion by the apposite embedded phrase, effectively creating a 'locked set' in which other creations, locked into their own styles by other stochastic methods, can expect consistent lighting and reflections.