In the wake of OpenAI’s DALL-E 2 or, in another style, Microsoft’s XiaoIce, text-image pairing is currently in the spotlight, driven by some quite surprising artificial intelligence (AI) algorithms. This is the case with Imagen, a new Google project that creates images from descriptive text…

Do you know Imagen?

It is a Google R&D project that, from a description combining a number of concepts, creates images representative of that source text.

Here is what is explained on the official website: Imagen is “a text-to-image model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen leverages the power of large transformer language models for text understanding and the strength of diffusion models for high-fidelity image generation. Our main finding is that large generic language models (e.g., T5), pre-trained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen improves both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human evaluators find Imagen’s samples to be on par with the COCO data itself in image-text alignment. To evaluate text-to-image models more thoroughly, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. Using DrawBench, we compare Imagen to recent methods, including VQ-GAN+CLIP, latent diffusion models, and DALL-E 2, and find that human evaluators prefer Imagen to other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.”
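Imagen itself is not publicly available, but the pipeline this abstract describes (a frozen T5 text encoder feeding a 64×64 diffusion model, followed by two diffusion super-resolution stages up to 1024×1024) can be sketched in a few lines. The Python below is only a conceptual illustration: every model function is a hypothetical placeholder standing in for the real networks.

```python
# Conceptual sketch of Imagen's cascaded architecture as described above:
# frozen T5 text encoder -> 64x64 diffusion base model -> two diffusion
# super-resolution stages (64 -> 256, then 256 -> 1024). All functions here
# are hypothetical placeholders; Imagen's real models are not public.

from typing import Optional
import numpy as np

def t5_encode(prompt: str) -> np.ndarray:
    """Placeholder for the frozen, text-only-pretrained T5 encoder."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal((len(prompt.split()), 512))  # (tokens, dim)

def diffusion_stage(text_emb: np.ndarray, size: int,
                    low_res: Optional[np.ndarray] = None) -> np.ndarray:
    """Placeholder for one text-conditioned diffusion model.

    A real stage would iteratively denoise random noise, guided by the text
    embeddings (and by the upsampled low-res image for super-resolution).
    """
    if low_res is None:
        return np.random.default_rng(0).uniform(0.0, 1.0, (size, size, 3))
    factor = size // low_res.shape[0]
    return np.repeat(np.repeat(low_res, factor, axis=0), factor, axis=1)

def generate(prompt: str) -> np.ndarray:
    emb = t5_encode(prompt)                        # language understanding
    img = diffusion_stage(emb, 64)                 # 64x64 base sample
    img = diffusion_stage(emb, 256, low_res=img)   # 64 -> 256 super-res
    img = diffusion_stage(emb, 1024, low_res=img)  # 256 -> 1024 super-res
    return img

print(generate("A giant cobra snake on a farm.").shape)  # (1024, 1024, 3)
```

The structure makes the quote’s main finding visible: the text encoder and the image generators are separate components, and Google found that scaling the former matters most.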

A system that is still very basic and not very usable

For now, the system is quite basic and only lets you create images that meet certain criteria chosen from a predetermined list (a minimal sketch of this kind of prompt builder is given after the examples). Here are a few examples, with the text prompts that were used to create them:

A majestic oil painting of a raccoon Queen wearing red French royal gown. The painting is hanging on an ornate wall decorated with wallpaper. Source: Imagen

A marble statue of a Koala DJ in front of a marble statue of a turntable. The Koala is wearing large marble headphones. Source: Imagen

A bucket bag made of blue suede. The bag is decorated with intricate golden paisley patterns. The handle of the bag is made of rubies and pearls. Source: Imagen

A giant cobra snake on a farm. The snake is made out of corn. Source: Imagen

You get the idea? Obviously, these demo examples are deliberately a bit over the top, because you will probably rarely need this kind of image in real life… 🙂
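As for the “predetermined list” aspect of the demo mentioned above, it can be modeled very simply: each slot of the sentence offers a handful of fixed options, and the user’s choices are concatenated into the final prompt. Here is a minimal, hypothetical sketch; the option lists are illustrative, not the actual ones from the Imagen site.

```python
# Minimal model of a "pick from a list" prompt builder: each slot offers a
# few predefined options, and the selected options form the final prompt.
# The option lists below are illustrative placeholders.

from itertools import product

SLOTS = {
    "subject": ["a raccoon Queen", "a Koala DJ", "a giant cobra snake"],
    "material": ["made of marble", "made out of corn", "made of blue suede"],
    "setting": ["on a farm", "in front of an ornate wall", "on a stage"],
}

def build_prompt(subject: str, material: str, setting: str) -> str:
    return f"{subject} {material} {setting}."

# One specific choice per slot, as a user would pick in the demo:
print(build_prompt("a giant cobra snake", "made out of corn", "on a farm"))

# The full space of prompts such an interface can express:
all_prompts = [build_prompt(*combo) for combo in product(*SLOTS.values())]
print(len(all_prompts), "possible prompts")  # 3 * 3 * 3 = 27
```

With only a few options per slot, the space of possible images stays tiny, which is exactly why the current demo feels so constrained.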

What is more interesting is to imagine what could be done later in terms of illustration (notably in the fields of animation and advertising, for example, but not only) once these algorithms have matured and can be used in real life.

Maybe even SEO will get involved and try to understand how certain images were created, in order to rank with the same starting text. A kind of “reverse engineering” in metaverse mode? Who knows how SEO will evolve in the years to come? In any case, these are tools to watch, as much for their promise as for the possible excesses they could generate…