Research Proposes ‘Moral’ Sanitization for Text-To-Image Systems Such as Stable Diffusion

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

New research from Korea and the United States has proposed an integrated method for preventing text-to-image systems such as Stable Diffusion from generating ‘immoral’ images – by manipulating the generative processes within the system to intercept ‘controversial’ content and transform the generated content into what the authors characterize as ‘morally-satisfying’ images instead.

An image of a child smoking is converted by the new system into an innocuous photo of the child blowing bubbles, roughly preserving the composition while removing what the authors describe as 'immoral' imagery. Source: https://arxiv.org/pdf/2212.03507.pdf

Instances of ‘immorality’* are identified via image-based recognition, using OpenAI’s resurgent Contrastive Language–Image Pre-training (CLIP) framework, and also through Natural Language Processing (NLP) approaches, which, by contrast, can identify text prompts that are likely to result in immoral images, and transform the text before it is passed to the generative processes.

The 'Immoral Attribute Identifier' comprising the first layer of defense in the new system splits the text prompt into individual tokens and rates them, based on existing trained datasets that contain moral rankings for words. In this case, we see that a scene of gunfire in a public place can result in several non-controversial outputs, including blurred imagery, and also inpainted imagery, where the identified moral 'hot spots' have been used as masks for the substitution of less controversial content.

The new method was implemented in Stable Diffusion, but could potentially be applied to other latent diffusion frameworks, and even to different types of architecture.

In the absence of definitive large-scale datasets that rationalize and define morality in both semantic and related visual terms, the authors have used the ETHICS dataset, a 2021 collaboration between UC Berkeley, Columbia University, The University of Chicago, and Microsoft.

Remediating a morally objectionable image generation: the immoral attributes are first identified and individuated, and then either blurred or inpainted, in this case using Stable Diffusion itself. For relative semantic integrity, we can see that the blood generated in the original image has been transformed into a small decorative flower on the bridal dress. It's possible that such 'tokens' of this kind of editing could be a future indicator of the type of content that was altered. Please note that the first two images appear to have been blurred by the paper's authors, to avoid presenting offensive content.

Though the paper deals with some of the limitations or pitfalls of this kind of approach, it does not deal extensively with the innate difficulties of defining immorality in itself, particularly at an international level, and, due to lack of available data, places this burden entirely on the 2021 ETHICS work which underpins the framework’s rationale for intervention.

Compositionality is preserved during the 'moral editing' process.

Nonetheless, in the absence of a concerted global effort to either harmonize a bare minimum of standards in this regard, or else provide a per-country rating system for ‘unacceptable’ content, the new work can perhaps be seen as an applicable proof of concept that could be adjusted according to varied circumstances (such as geolocation, or whether a web-based generative platform has the capabilities to at least prove the age of those using the system).

The new work, titled Judge, Localize, and Edit: Ensuring Visual Commonsense Morality for Text-to-Image Generation, comes from two researchers at Korea University, and one from the University of California at Berkeley.

Feasibility

The paper levels criticism at the limited ability of Stable Diffusion to provide an effective filtering system, citing instances where, even when SD’s content filter is active, morally questionable output can still be created by end-users.

From the paper: examples (some blurred) of morally questionable images produced by a default installation of Stable Diffusion with the safety checker enabled. It's arguable that the US-centric LAION database underpinning Stable Diffusion's generative power is more inclined to intercept sexual rather than violent imagery, in accord with an apparent US-based tolerance for violent over sexual imagery.

Since the code for the paper’s suggested revisions is not currently available, it’s uncertain to what extent the proposed morality-filtering system could be disabled at code level (which was always a trivial matter regarding Stable Diffusion’s own NSFW filter).

The only two potentially reliable methods of preventing such circumvention would be to compile the source code and its dependent directories and libraries into an opaque executable file, or to hide the code behind a web-facing API, where users would have no access to the generative architecture (as OpenAI does with DALL-E 2).

The latter may become a compelling use case as the new generation of latent diffusion systems scale up and mature beyond the reasonable capabilities of domestic GPUs, and where the quality of content might eventually race so far ahead of local open source distributions (such as the hugely popular AUTOMATIC1111) that the improved output becomes a compelling factor for end users, despite the presence of automated morality filters. 

Architecture, Data and Approach

The new system offers four approaches to the generation of morally questionable content: the blurring of content that has been identified as immoral, which is likely to leave enough evidence of the original content that users may reasonably guess at it; the inpainting of regions of the image identified as immoral, where the need to preserve compositionality may also leave some less obvious ‘clues’ as to the original intent (see third image down, above); text-driven image manipulation, where correlated words (such as prepending ‘water’ to the word ‘gun’) are used to ameliorate the moral quality of the image; and the use of text-driven ‘moral image captions’, which leverage a corpus of data that contains moral distinctions on content. This process can recaption the ‘first draft’ of potentially problematic submissions, and then create an ‘improved’ version from that novel caption.
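
The word-swap approach described above can be illustrated in a few lines. This is a sketch, not the authors' code: the lexicon scores, threshold, and substitutions below are invented for the example, not taken from the paper or the ETHICS dataset.

```python
# Hypothetical token-level moral rating and substitution. The scores and
# replacement words here are invented for illustration only.

MORAL_LEXICON = {"gun": 0.1, "blood": 0.15, "bubbles": 0.9, "flower": 0.95}
SUBSTITUTES = {"gun": "water gun", "blood": "flower"}

def sanitize_prompt(prompt, threshold=0.5, default_score=0.8):
    """Flag low-scoring tokens and swap them for benign substitutes."""
    tokens = prompt.lower().split()
    flagged = [t for t in tokens if MORAL_LEXICON.get(t, default_score) < threshold]
    cleaned = [SUBSTITUTES.get(t, t) if t in flagged else t for t in tokens]
    return " ".join(cleaned), flagged

print(sanitize_prompt("a child holding a gun"))
# → ('a child holding a water gun', ['gun'])
```

The real system rates tokens against trained moral rankings rather than a fixed lexicon, but the control flow is the same: rate, flag, substitute, and then pass the amended prompt to the generator.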

In the image above, we see word-level annotations highlighted in red in the lower row, with a gradient from yellow to blue indicating the level of moral peril identified in the output image.

In addition to the blurring and inpainting approaches to moral editing, we see in the right-most two columns the different text-based approaches to the challenge. In the 'Immoral attribute identifier' approach (second from right), the submitted words are broken down and rated, and, if necessary, swapped out for less objectionable words; in the 'Image captioning model' approach, alternative related captions are generated that can transform the objectionable image into an acceptable image using an image-to-image approach (presumably, at some notable cost of latency).

The system’s immorality classifier is trained on the aforementioned ETHICS dataset, resulting in a classifier module informed by 13,000 pairs of sentences containing binary annotations concerning morality.

The joint embedding space of a frozen (i.e. unaffected by the training) CLIP encoder is used as a qualitative metric for morality, and the resulting component is named by the authors the Visual Commonsense Immorality Recognizer – calibrated, of course, entirely by the qualitative standards of the ETHICS dataset.
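
The general recipe – a frozen encoder with only a small trainable head on top – can be sketched briefly. This is an illustrative stand-in, assuming random vectors in place of real CLIP embeddings and a plain logistic-regression head rather than the authors' actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen CLIP embeddings of ETHICS sentences (CLIP ViT-B/32
# produces 512-dimensional embeddings); real features would come from the
# frozen encoder, which is never updated during training.
X = rng.standard_normal((200, 512))
w_true = rng.standard_normal(512)
y = (X @ w_true > 0).astype(float)  # binary moral/immoral labels

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

# Only the head's weights are trained; the 'encoder' output X stays fixed.
w = np.zeros(512)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - y) / len(y)

accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

Freezing the encoder keeps the classifier inside CLIP's joint text-image embedding space, which is what lets a module trained only on ETHICS *sentences* later judge generated *images*.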

Training schema for the Visual Commonsense Immorality Recognizer.

The areas of the output image that contain objectionable content are identified via a masking approach, which evaluates the importance of words by testing the extent to which their removal affects the outcome of a process.
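
That leave-one-out idea is compact enough to express directly. Below, a hypothetical scorer stands in for the trained immorality recognizer; each word's importance is the drop in the immorality score when that word is removed:

```python
def immorality_score(tokens):
    # Hypothetical stand-in for the trained immorality recognizer.
    bad = {"gun": 0.8, "blood": 0.7}
    return sum(bad.get(t, 0.0) for t in tokens)

def word_importance(prompt):
    """Score the full prompt, then re-score with each word removed."""
    tokens = prompt.lower().split()
    base = immorality_score(tokens)
    return {
        tok: base - immorality_score(tokens[:i] + tokens[i + 1:])
        for i, tok in enumerate(tokens)
    }

print(word_importance("bride with blood on dress"))
# → {'bride': 0.0, 'with': 0.0, 'blood': 0.7, 'on': 0.0, 'dress': 0.0}
```

The words with the highest importance then serve as the targets for masking – the regions to be blurred or inpainted in the generated image.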

In the left column in the image below, we can see the effect of various such redacted words in a multi-stage generative process, with the omissions affecting the types of images produced, and the morality of the images subsequently judged by the custom immorality classifier.

Left, the workflow for textual attribute identification; right, the equivalent workflow for image-based attribute distinction.

In the right column in the same image, we see the same process taking place through image (instead of word) analysis, where the process is essentially reversed, producing a visual ‘heat map’ of contentious content instead of just a text-based ranking, as in the left column.

The system was trained on a lone NVIDIA A100 GPU, with 40GB of VRAM, at a learning rate of 0.002, under AdamW.
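
For reference, AdamW's update (Adam with decoupled weight decay) looks as follows at the paper's stated learning rate of 0.002; the remaining hyperparameters below are common defaults, assumed rather than taken from the paper:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=0.002, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: Adam moment updates plus decoupled weight decay on w."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = adamw_step(np.ones(3), np.ones(3), np.zeros(3), np.zeros(3), t=1)
print(w)  # each weight moves from 1.0 to ≈0.99798
```

Unlike classic Adam, the weight-decay term is applied directly to the weights rather than folded into the gradient, which is AdamW's distinguishing feature.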

Tests

The qualitative tests for the project are principally represented by the various examples presented in the paper, many of which are reproduced here. It’s notable that sexual or sexualized content does not seem to be included in the criteria, which concentrate principally on violent imagery.

The authors note that their system is well able to distinguish items such as cigarettes, blood, or firearms, and to intervene along the lines mentioned.

Regarding the ‘recaptioning’ functionality, the authors state:

‘Given immoral images generated by Stable Diffusion, we apply the off-the-shelf image captioning model that is trained with the MS-COCO [dataset]. This produces descriptive captions from a moral perspective. For example, an image of “a bride is bleeding” is described as “a painting of a woman in a red dress” and an image of “I shot my gun into the crowd” is described as “a man in a black shirt is holding a black dog”.

‘Using these generated captions as a condition, we can successfully manipulate them into a moral [scene].’

Finally, the researchers conducted a human study with 65 evaluators, who were asked to rank the immorality of generated images on a Likert scale from 1 (not immoral) to 5. Here they found that the inpainting approach, where objectionable content was re-imagined rather than blurred out, received the most favorable scores.

Repeating the tests with the system’s own Visual Commonsense Immorality module (right, in the image below), the authors found the results very similar to the human evaluations, and comment:

‘We observe trends similar to our human evaluation, and this further confirms that our visual commonsense immorality recognizer matches human perception.’

Results from human and automated evaluation.

However, the criteria for selection of the survey group are not included in the new paper, and the supplementary material cited in it, which may contain such pertinent details, was not available at the time of writing. We have asked the authors for access to this material.

In closing, the researchers state:

‘Our human study and detailed analysis demonstrate the effectiveness of our proposed ethical image manipulation model. We believe our work is the first to address ethical concerns for text-to-image [generation].’

* Since morality is subjective, it’s necessary to initially acknowledge that the authors’ interpretation in this regard is a judgement call rather than an objective quality. To avoid repetition, no further acknowledgement of this is included in reference to ‘morality’ after this point.

My conversion of the authors’ inline citations to hyperlinks.
