A new collaboration between Korea and the US offers a surprising fait accompli to the frenetic image synthesis scene: a text-to-image framework based not on latent diffusion (such as Stable Diffusion), but on the older and now often-dismissed Generative Adversarial Network (GAN) model.
Prior works with similar scope have always been trained on limited datasets, while the new system, titled GigaGAN, has been trained on subsets of the hyperscale LAION dataset that powers Stable Diffusion.
Until now, GANs tended to become unstable when trained on very high volumes of data, but the researchers behind the new project have used multiple adjunct technologies to reinforce a StyleGAN-based architecture, so that GigaGAN can not only produce fine-quality text-to-image photos in a fraction of the time that latent diffusion models (LDMs) take, but also produce images at extraordinarily high resolutions by default (rather than only on the most powerful GPUs).
The latter achievement has been enabled by a novel upscaling architecture, which is capable of producing high-resolution detail, and which, the authors state, can also be dropped into other generative systems – even those that do not use GANs, such as DALL-E 2 and Stable Diffusion.
The multi-scale training used for GigaGAN, combined with some palliative tricks to overcome GANs' traditional scaling issues (several of them borrowed from diffusion-based architectures), has enabled a 1-billion parameter GAN trained on LAION. It natively produces 'immediate' images, instead of trawling through a denoising process of the kind used in systems such as Stable Diffusion, whose speed is more or less tied to available GPU resources, and which carries innate high latency.
The authors concede that while many of the results are impressive, many are also of a lower perceptual standard than can be achieved with DALL-E 2. However, the same could be said of Stable Diffusion itself, whose massive popularity is attributable more to its widespread availability than to any visual superiority to OpenAI's more restricted framework.
The researchers claim that GigaGAN offers three major advantages over diffusion and autoregressive models:
'First, it is orders of magnitude faster, generating a 512px image in 0.13 [seconds]. Second, it can synthesize ultra high-res images at 4k resolution in 3.66 seconds. Third, it is endowed with a controllable, latent vector space that lends itself to well-studied controllable image synthesis applications, such as style [mixing], prompt interpolation…and prompt [mixing].'
Prompt-mixing in Stable Diffusion is currently a problematic pursuit, since one part of the prompt tends to dominate, and most of the solutions offered by the research sector come with caveats. Therefore perhaps the most interesting facet of the project is the extent to which the architecture can disentangle style from content, and content from content.
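To make the idea of a 'controllable latent vector space' more concrete, here is a minimal sketch of the two operations that StyleGAN-type systems are known for: interpolating between two latent vectors, and mixing styles by feeding one style code to the early layers and another to the later layers. The function names, vector sizes and layer count are our own illustrative assumptions, not details taken from the GigaGAN paper.

```python
def lerp(a, b, t):
    """Linear interpolation between two latent vectors (t=0 gives a, t=1 gives b)."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def mix_styles(style_a, style_b, n_layers, crossover):
    """Per-layer style assignment: layers before `crossover` use A, the rest use B."""
    return [style_a if i < crossover else style_b for i in range(n_layers)]


# Hypothetical 2-D latents standing in for two prompts' style codes.
latent_a = [0.0, 0.0]
latent_b = [2.0, 4.0]
halfway = lerp(latent_a, latent_b, 0.5)

# Hypothetical 4-layer generator: coarse layers follow A, fine layers follow B.
per_layer = mix_styles("A", "B", n_layers=4, crossover=2)
```

In a system of this kind, moving the crossover point earlier or later would shift how much of the coarse layout comes from one source and how much of the fine texture comes from the other.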
The authors conclude:
'[Our] model is the first GAN-based method that successfully trains a billion-scale model on billions of real-world complex Internet images. This suggests that GANs are still a viable option for text-to-image synthesis and should be considered for future aggressive scaling.'
The new work is titled Scaling up GANs for Text-to-Image Synthesis, and comes from seven researchers across the Pohang University of Science and Technology (POSTECH), Carnegie Mellon University, and Adobe Research. There is an additional dedicated website for the project. Code for GigaGAN does not appear to be available at the time of writing.
In the fast-moving image synthesis sector, the core functionality here is well-established, and, perhaps until now, has even been considered a little 'archival'. Here, the input image is mapped into a style vector, which feeds the processed information into a series of upsampling convolutional layers. This generates a constant tensor (basically an array of values, like a crossword grid) that will be used to influence the final image through the various upsampling processes.
What's different is that GigaGAN adds a sample-adaptive kernel selection (SAKS, right, in the image above). This module has been designed expressly to allow a StyleGAN-based system to process data at the kind of scale that's more common in latent diffusion systems.
The SAKS creates a bank of filters, instead of just a single filter, each of which handles a processed feature at each layer. Subsequently, the processed data passes through a fully connected layer (or affine layer), and predicts a group of weights to average across all the newly-created filters.
This produces, finally, a single aggregated filter, which will then be passed into the more familiar processes of StyleGAN2.
The authors note that at the schema level, this softmax-based weighting effectively constitutes a differentiable filter selection process that's specific to the input (i.e., the varying text/image data that's in play during the generation process).
The authors further observe:
'[Since] the filter selection process is performed only once at each layer, the selection process is much faster than the actual convolution, decoupling compute complexity from the resolution.'
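The filter-selection idea described above can be sketched in a few lines of plain Python. This is a toy illustration under our own assumptions (flattened 3x3 filters, a bank of four, random weights, hypothetical names such as `select_kernel`), and not the authors' implementation: an affine layer turns the style vector into one score per filter, a softmax turns the scores into weights, and the weighted average of the bank becomes the single filter that is then used for convolution.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_kernel(filter_bank, style, affine_w, affine_b):
    """Blend a bank of candidate filters into one, weighted by the style vector."""
    # Affine (fully connected) layer: one logit per filter in the bank.
    logits = [sum(w * s for w, s in zip(row, style)) + b
              for row, b in zip(affine_w, affine_b)]
    weights = softmax(logits)
    # Weighted average over the bank, element by element.
    size = len(filter_bank[0])
    kernel = [sum(wt * f[i] for wt, f in zip(weights, filter_bank))
              for i in range(size)]
    return kernel, weights


rng = random.Random(0)
n_filters, style_dim, kernel_elems = 4, 8, 9  # assumed sizes: four flattened 3x3 filters
filter_bank = [[rng.gauss(0, 1) for _ in range(kernel_elems)] for _ in range(n_filters)]
affine_w = [[rng.gauss(0, 1) for _ in range(style_dim)] for _ in range(n_filters)]
affine_b = [0.0] * n_filters
style = [rng.gauss(0, 1) for _ in range(style_dim)]

kernel, weights = select_kernel(filter_bank, style, affine_w, affine_b)
```

Note that the selection runs once per layer on a handful of numbers, whereas the convolution that follows touches every pixel; this is the decoupling from resolution that the authors describe.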
These innovations alone cannot adapt GANs to the task at hand, however. Therefore the researchers have integrated additional attention layers directly into the convolutional architecture at the core of StyleGAN2, and have also used Lipschitz continuity to enable stable training (in accordance with a 2021 DeepMind paper) – a hurdle at which many previous high-scale GAN systems have fallen. Also, as word-based embeddings pass through the generative process, each benefits from a separate cross-attention mechanism.
The earlier-mentioned tensor, finally, becomes the 'query', and the text embeddings become the key and value pairs for the central attention mechanism.
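As a rough illustration of that query/key/value arrangement, the sketch below runs standard scaled dot-product attention, with image features as queries and word embeddings as keys and values. It is an assumption-laden toy: the numbers are invented, the same embeddings are reused as both keys and values for brevity (real systems apply separate learned projections), and the exact attention variant used in GigaGAN may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query row attends over all key rows and returns a blend of value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        attn = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(attn, values))
                    for j in range(len(values[0]))])
    return out


# Two spatial positions of the image feature tensor act as queries (made-up values).
image_features = [[1.0, 0.0], [0.0, 1.0]]
# Three word embeddings act as keys and values (made-up values).
word_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended = cross_attention(image_features, word_embeddings, word_embeddings)
```

Each position in the image ends up with a mixture of word embeddings, weighted towards the words that best match what is already at that position.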
As mentioned in our article last year, GANs use a kind of 'ongoing conflict' between a generator and discriminator module, where the discriminator tells the generator at each iteration how well it has performed, and challenges it to do better – but without saying exactly what it did wrong in the first place: a kind of 'blind' didactic method, similar to the exploratory logic of the Battleship game.
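The 'blind' nature of that feedback is easier to see in code. The sketch below uses the non-saturating logistic loss popularized by the StyleGAN family; we are assuming this formulation for illustration, and the paper should be consulted for the losses actually used. The point is that the generator receives only a score per image from the discriminator, and no indication of which pixels were wrong.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discriminator_loss(real_logits, fake_logits):
    """Low when real images get high logits and generated images get low logits."""
    real_term = sum(softplus(-r) for r in real_logits) / len(real_logits)
    fake_term = sum(softplus(f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_loss(fake_logits):
    """Low when the discriminator scores the generated images as real."""
    return sum(softplus(-f) for f in fake_logits) / len(fake_logits)


fooled = generator_loss([3.0, 2.5])    # discriminator believes the fakes
caught = generator_loss([-3.0, -2.5])  # discriminator rejects the fakes
```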
In contrast to the original design of StyleGAN, GigaGAN uses both the text-based style code to orchestrate the synthesis framework, and the word-embeddings, which are used as fodder for the cross-attention mechanisms.
For the generator, text embeddings are extracted from the user-supplied text prompt (e.g., 'a picture of a woman working at a computer') and the extracted tokens passed through a frozen CLIP-based feature extractor. CLIP maintains the relationship between the derived semantic concepts and their associated images (CLIP itself is trained contrastively on image/text pairs, and is commonly used as the text encoder in latent diffusion models).
The discriminator is split into varying branches that deal with text and images. Additional discernment functions are provided by CLIP and from vision-aided GAN training, a technique developed by Carnegie Mellon and Adobe in 2022.
For this stage, the CLIP model effectively acts as a central backbone, extracting features from the incoming intermediate layers and passing them through further discriminating layers, which make predictions as to the 'authenticity' of the generated images. As usual with GANs, these predictions are systematically used to let the generator know how well it performed this time.
The GAN-based upsampler baked into the architecture is a discrete module that, the authors state, can potentially be leveraged in other systems.
Many of the recent breed of large-scale image synthesis models, such as Parti, feature multi-level upsampling, where the generated material (still in latent form) is passed through successive enlargement layers until it reaches the desired output resolution, at which point the latent codes are translated into an actual image.
This is like the photocopier effect in reverse, in that the image, at least in theory, should be actively improving at each stage of upscaling. Since the image information is still in vector form (i.e., it's mathematical, and not yet a rendered bunch of pixels), none of the usual degradation that comes with generational copying is in force, and each upscaling layer will add detail related to the semantic terms (e.g., lizard [skin]).
The authors note that this process is different to traditional upsampling, in that the text prompt or related word embeddings will be considered as input factors affecting the result of the enlargement.
GigaGAN's upsampler is an asymmetric U-Net which processes the initial 64px resolution images, before passing them on to a formidable six further upscaling layers, finally arriving at the 512px resolution. This process can be concatenated and extended to enable much higher resolutions.
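The word 'asymmetric' is worth unpacking with some arithmetic. If each layer halves or doubles the resolution, six doublings cannot take 64px directly to 512px (that would need only three), so the network must first shrink the input. The sketch below assumes three 2x downsampling steps before the six upscaling layers; the downsampling count and the 2x factor are our assumptions, chosen because they reconcile the 64px input with the 512px output.

```python
def resolution_path(start_px, down_steps, up_steps, factor=2):
    """List the resolution after each downsampling and then each upsampling step."""
    path = [start_px]
    res = start_px
    for _ in range(down_steps):
        res //= factor
        path.append(res)
    for _ in range(up_steps):
        res *= factor
        path.append(res)
    return path


# ASSUMPTION: three downsampling steps; only the six upscaling layers are stated above.
path = resolution_path(64, down_steps=3, up_steps=6)
print(path)  # [64, 32, 16, 8, 16, 32, 64, 128, 256, 512]
```

Under these assumptions the upsampler is asymmetric because it climbs three more levels than it descends, which is what yields the net 8x enlargement.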
The paper observes that the upsampler does not use the aforementioned Vision-Aided GAN, but applies a small amount of Gaussian noise between layers. Adding this noise 'generalizes' each upsampled image, reducing the perceptual gap between real and generated images.
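In its simplest form, this kind of noise augmentation just adds a small random offset to every value in a feature map or image. The sketch below is a generic illustration; the noise level, the tiny 2x2 grid and the function name are hypothetical, and it says nothing about where exactly the authors apply the noise or at what strength.

```python
import random


def add_gaussian_noise(pixels, sigma, rng):
    """Return a copy of a 2-D grid with zero-mean Gaussian noise added to each value."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in pixels]


rng = random.Random(0)
low_res = [[0.2, 0.4], [0.6, 0.8]]  # hypothetical 2x2 feature map
noisy = add_gaussian_noise(low_res, sigma=0.05, rng=rng)  # sigma is an assumed value
```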
Here the authors observe:
'Our GigaGAN framework becomes particularly effective for the superresolution task compared to the diffusion-based models, which cannot afford as many sampling steps as the base model at high resolution.'
The superresolution upscaling model in GigaGAN is trained on the same material as the base model, and its output is systematically compared to the ground truth, using the LPIPS metric, during training. The authors describe the influence of LPIPS as a 'stable learning signal', and one of the main reasons (they believe) that GigaGAN's impressive upscaling architecture could function as a 'drop-in' module for other generative systems, such as DALL-E 2 (and, presumably, Stable Diffusion).
GigaGAN is trained using the StudioGAN PyTorch library, and evaluated with the standard Fréchet Inception Distance (FID) metric. For the text-to-image functionality, the system is trained on a mix of LAION2B-en and COYO-700M. The 128px-to-1024px upsampler, however, is trained on Adobe internal stock images.
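For readers unfamiliar with FID: it fits a Gaussian to the features of real images and another to the features of generated images, and measures the Fréchet distance between the two, with lower scores being better. Real FID works on high-dimensional Inception-v3 features and needs a matrix square root; the sketch below is a deliberately simplified one-dimensional version, using invented numbers, in which the formula collapses to the squared difference of the means plus the squared difference of the standard deviations.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Frechet distance between two 1-D Gaussians fitted to the samples.

    General form: |mu_r - mu_g|^2 + Tr(S_r + S_g - 2 * sqrt(S_r * S_g)).
    In one dimension the trace term reduces to (sd_r - sd_g)^2.
    """
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Invented scalar 'features'; the second set is the first shifted by one.
real_features = [0.0, 1.0, 2.0, 3.0]
generated_features = [1.0, 2.0, 3.0, 4.0]
score = frechet_distance_1d(real_features, generated_features)
```

Because the metric only compares summary statistics of the two feature distributions, a low score does not guarantee that individual images look better, which is the caveat the authors themselves raise about their results.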
The image/text pairs are preprocessed based on CLIPScore and the CLIP+MLP Aesthetic Score Predictor. Watermarked images were removed. For evaluation, the researchers used 40,504 real and 30,000 generated images, with the real images taken from Microsoft's COCO2014 validation dataset, following the method used for Google's Imagen.
The testing round conducted for the paper (as well as the multitude of metrics and adjacent technologies incorporated into the project) is complex and multi-faceted, and we refer the reader to the original paper, and particularly the appendices therein, for full details.
However, let's take a look at how GigaGAN squared up to its most obvious rivals, namely: DALL-E; DALL-E 2; GLIDE; Stable Diffusion (signified as 'LDM', i.e., Latent Diffusion Model, in results); Imagen; eDiff-I; Parti (750M, 3B, and 20B variants, and one of only two autoregressive transformer architectures tested); and LAFITE (besides GigaGAN itself, the only GAN-based architecture tested).
The above architectures were tested at 256px resolution (GigaGAN's results having been expressly reduced from 512px for the purpose).
Additional tests were conducted on Stable Diffusion V1.5 (the prior-mentioned version being 1.4), and on Muse-3B, the other autoregressive transformer architecture included besides Parti.
Of these results, the authors state:
'[GigaGAN] exhibits a lower FID than DALL·E 2, Stable Diffusion, and Parti-750M. While our model can be optimized to better match the feature distribution of real images than existing models, the quality of the generated images is not necessarily [better].
'We acknowledge that this may represent a corner case of zero-shot FID on COCO2014 dataset and suggest that further research on a better evaluation metric is necessary to improve text-to-image models. Nonetheless, we emphasize that GigaGAN is the first GAN model capable of synthesizing promising images from arbitrary text prompts and exhibits competitive zero-shot FID with other text-to-image models.'
Additional tests (see paper for full details) found that GigaGAN also outperformed the various available varieties of Distilled Stable Diffusion, using the same FID metrics and CLIP scores.
Though we refer you to the paper for the results of further tests, and though GigaGAN generally appears to lead the board across all results, it has to be noted that all these tests represent a compromise between quality and inference time, as the authors themselves have noted.
The authors conclude, with commendable circumspection:
'Our experiments provide a conclusive answer about the scalability of GANs: our new architecture can scale up to model sizes that enable text-to-image synthesis. However, the visual quality of our results is not yet comparable to production-grade models like DALL·E 2. Figure 9 [IMAGE BELOW] shows several instances where our method fails to produce high-quality results when compared to DALL·E 2, in terms of photorealism and text-to-image alignment for the same input prompts used in their paper.
'Nevertheless, we have tested capacities well beyond what is possible with a naive approach and achieved competitive visual quality with autoregressive and diffusion models trained with similar resources while being orders of magnitude faster and enabling latent interpolation and stylization.
'Our GigaGAN architecture opens up a whole new design space for large-scale generative models and brings back key editing capabilities that became challenging with the transition to autoregressive and diffusion models. We expect our performance to improve with larger [models].'
Though, as the authors concede, prior projects such as StyleGAN-T and GALIP have approached the same idea, GigaGAN seems to be the first truly fruitful text-to-image GAN-based architecture that can handle hyperscale training volumes while producing some marvelous output (among the various noted failure cases).
Additionally, the upscaling system, if it performs as impressively in the wild (and in other systems) as the authors claim, could offer a new level of quality for high resolution output.
Ultimately, nothing much more can be known until a public implementation (if it ever comes) reveals the extent of necessary local computing resources, and the various other factors that can make the difference between a quantum leap and a short hop.