State of the Art: Text to Image Art Generation

My art is constrained to processing my own photographs using AI. But what if I could dispense with my photographs? What happens when you use just AI?

There have been recent advances in AI text to image generation. Examples include Dall-E, Dall-E 2, Nvidia GauGAN and, more recently, Google’s Imagen.
The open source Dall-E has led to variants that build on it and improve it in various ways, such as Dall-E Mini, Artbreeder, Midjourney, NightCafe and Pollinations.

In most cases generation is limited to one or a few users at a time because the processing is intensive. The latest, Dall-E 2 and Imagen, aren’t yet ‘open to all’ due to concerns regarding the ethical challenges of misuse, social biases and stereotyping. Some have restrictive T&Cs limiting the types of images you can publish publicly, for example, no faces.

My explorations in this area revolve around trying to answer some questions. Is text to image any good? What are the limitations? Is it any good for creating art? Is it worth incorporating into any of my workflows?

I have found these generators are amazing at recognising things described in text. They are great at composing things together, even in ‘impossible’ scenarios they have certainly never seen (been trained on) before. They are also good at filling in missing information.

With the exception of Imagen, which concentrates on creating photorealistic images, the resultant images aren’t usually very realistic. Yes, there are examples of great realism shared on places like Twitter, but these have been cherry picked from results that I can only describe as mainly mutant versions of things, especially when it comes to animals and people.

However, when it comes to art, these generators tend to work very well when you add something like ‘in the style of’. The imperfections in the image almost become art in themselves rather than distractions.

Some Experiments

Here are some examples using Dall-E Mini. Dall-E Mini can be used online so you can also try it for yourself. Its model is 27 times smaller than the original DALL-E and was trained in a few days. It’s poor compared to the latest Dall-E 2 and Imagen but does enough to show the strengths and weaknesses of text to image generation and to demonstrate that art can be created.

First a British shorthair cat playing the guitar:

It’s obviously understood what’s needed but the images are poor. The results are smeary, jelly-like and of low resolution (256×256). Even the very latest Dall-E 2 and Imagen generators are upscaling to produce output that’s still very small compared to, for example, my artworks that are 10,800 pixels across. Notice the poor eyes, which seem to be a problem with most generators.

Cubism is an area I have been exploring unsuccessfully with my photo-AI work so I thought I’d give this a go with text to image AI:

These aren’t that bad. Upscaling them to 1024×1024 also isn’t that bad because Cubism uses blocks of colour rather than detail. However, my chosen (4th) image is still poor quality, has aberrations and looks ‘dirty’:

Interestingly, I was able to process it using part of my AI workflow to improve it considerably:

Nevertheless, it’s still only 1024×1024 so isn’t suitable for printing or display.
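To put the size gap in numbers, here is a minimal Python sketch. It assumes a print resolution of 300 dpi, which is a common guideline and not a figure from my workflow, and the function names are made up for illustration.

```python
# Rough arithmetic only: 300 dpi is an assumed print resolution.

def print_size_inches(pixels: int, dpi: int = 300) -> float:
    """Width in inches that `pixels` covers when printed at `dpi`."""
    return pixels / dpi


def upscale_factor(source_px: int, target_px: int) -> float:
    """How many times wider the target is than the source."""
    return target_px / source_px


sizes = [
    ("Dall-E Mini output", 256),
    ("Upscaled image", 1024),
    ("My artworks", 10_800),
]

for label, px in sizes:
    inches = print_size_inches(px)
    print(f"{label}: {px} px is about {inches:.1f} in wide at 300 dpi")

# Gap between the upscaled image and a 10,800 pixel artwork
print(f"Further upscaling needed: {upscale_factor(1024, 10_800):.1f}x")
```

At that assumed resolution a 1024 pixel image prints only a few inches wide, while a 10,800 pixel image covers 36 inches.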

Here’s another Cubism image of a man next to a tree, upscaled and processed:

Here’s another I created of sunflowers in the style of Vincent Van Gogh, upscaled and processed:

While these examples look ok, I feel they might be missing something. Maybe they are artistically one dimensional? The sunflower image could have better texture, light, depth and more content. Dare I say they are a bit ‘sterile’ or ‘soulless’? But then some people might like this minimalism. Having the foliage (is it foliage?) off the top of the image is bad form and leads the eye out of the picture. I guess I should have edited it out.

It’s possible to add more text to better describe what’s wanted and include aspects such as light direction. However, you can only go so far because text to image generators aren’t very good with very long descriptions.
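As a rough illustration of building up such a description, here is a small Python sketch. The `build_prompt` helper and its 30 word limit are my own assumptions for the example; they don't come from any particular generator, and each generator copes with a different prompt length.

```python
from typing import Sequence


def build_prompt(
    subject: str,
    style: str = "",
    modifiers: Sequence[str] = (),
    max_words: int = 30,  # assumed limit, purely illustrative
) -> str:
    """Join a subject, an optional style and extra modifiers into one prompt."""
    parts = [subject]
    if style:
        parts.append(f"in the style of {style}")
    parts.extend(modifiers)
    prompt = ", ".join(parts)

    # Long descriptions tend to work poorly, so refuse overly long prompts
    word_count = len(prompt.split())
    if word_count > max_words:
        raise ValueError(f"Prompt has {word_count} words, limit is {max_words}")
    return prompt


prompt = build_prompt(
    "sunflowers in a vase",
    style="Vincent Van Gogh",
    modifiers=["warm light from the left", "thick textured brushstrokes"],
)
print(prompt)
```

The resulting text is what you would paste into the generator. Keeping the pieces separate makes it easy to swap one modifier at a time and compare the results.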

There’s the well known phrase “A picture is worth a thousand words” – but not in text to image AI. It’s not possible to use a ‘thousand words’ to describe an image. If much of the variety in output can’t be defined, how can you fully describe what you want? In most cases, apart from Pollinations and Dall-E 2, which can take an image to modify, you are currently limited to the AI’s relatively narrow (input) view.

Is Text-based Generative Art Creative?

There are questions over whether text to image AI can be art. A recent research paper, The Creativity of Text-based Generative Art, tries to answer the question of whether text-based generative art is creative. It turns out the practice of prompt engineering is an art in itself. While you can obtain images of high aesthetic quality, this is difficult for novice practitioners:

“Writing effective prompts is a skill linked to a person’s knowledge of the training set and the neural networks’ latent space, but also the person’s knowledge of and experience with prompt modifiers. Together, this knowledge and the skills constitute the practice of prompt programming or prompt engineering”

The AI generated image is often just the start for most practitioners. As I have demonstrated above, additional post-processing, which is also an art, is often needed to create an acceptable image.


The systems described in this article require massive processing power and large amounts of memory, and yet most only produce small images with limited detail.

I am currently less attracted to text to image generation as there’s less of ‘me’ in it. I prefer using photograph-generated art that provides more compositional structure and, in my opinion, more artistic merit. I also prefer art with higher resolutions and more detail so that it can be printed and displayed. I’ll keep experimenting with this type of generator to see if I can use it to improve my work.

The Future

Obviously hardware and software will advance quickly and some of my reservations will be overcome. However, I see more far-reaching changes that could make use of text to image generation. Metaverses might make art more dynamic. Imagine a gallery that creates art to your own taste. Imagine art that changes over time. Imagine art that changes with your mood. Perhaps we need to stop trying to mimic what’s come before and instead try to work in new dimensions.

Further reading:

Wired – DALL-E 2 Creates Incredible Images—and Biased Ones You Don’t See

The Verge – All These Images Were Generated by Google’s Latest Text-to-Image AI