Cristian Najera
Mar 13, 2023
Intro
You may have seen some peculiar images suddenly populating your social media feed, sometimes possessing strange qualities. The reason for this recent phenomenon? Roughly a year and a half ago, OpenAI released an AI model called DALL-E, which revolutionized the field of artificial image generation.
Brief History
Image generation was born out of another task: assigning text phrases to images, also known as image captioning. Modern text-to-image generation techniques originated in 2015, when a research team achieved a breakthrough with DRAW, a simple attention-based image generation model. Researchers found that DRAW could generate images far more convincingly than previous attempts. Those earlier efforts had tried to create the whole image at once, but DRAW instead built the image piece by piece, allowing the model to dedicate extra "attention" to the parts of the image that matter most. It was later found that DRAW could be conditioned on image captions and synthesize images that match them.
After this milestone, significantly more research went into improving the quality of image generation. These advancements sparked interest across the world, and state-of-the-art machine learning techniques were introduced to help create a viable model. However, success was limited to specific examples or to poorly rendered images with little resemblance to the given prompt. After careful analysis and troubleshooting, researchers found the solution they had been searching for by taking inspiration from Natural Language Processing (NLP). NLP is one of the fastest-growing areas of machine learning research today, in large part because of large pre-trained models like BERT and GPT. These models contain millions or even billions of parameters and are trained on hundreds of terabytes of data, allowing them to learn with unprecedented accuracy. This led to DALL-E, which combines the pre-existing framework and concepts of image generation with those of these massive NLP models. DALL-E contains 12 billion parameters and was trained on 250 million image-text pairs, making it by far the largest image generation model at the time.

Practical Example
At the moment, image generation is a relatively novel idea. However, it has certainly attracted a lot of attention, especially in the fields of graphic design, marketing, and content creation. Even in its early stages, there have already been several innovative use cases for synthetic photos. Virtually any word you can think of can be used within the DALL-E system and can yield results tailored to whatever you might desire. This is especially helpful for content marketing, allowing you to create images with a few keystrokes that match your desired tone, subject, and art style, all of which are free of copyright, saving marketers thousands. DALL-E has also been used to generate variations of animated figures for a Japanese graphic novel publication, giving artists inspiration for new characters to include in their publications. It is also a powerful tool for image editing, with the ability to fine-tune lighting, prop placement (e.g., sunglasses, clothing), and clarity.

Methodology
DALL-E is extremely complex, but like any algorithm, it can be broken down into simple steps. As with any model, it is important to understand what happens "under the hood" in order to appreciate why certain images are generated from specific input. So, what does DALL-E do? It is trained on text-image pairs, where the text describes what is happening in the image. To understand how DALL-E works, we should first define a "token" in the context of these machine learning algorithms. A token is a piece of a word; for example, "understand" might be split into two tokens, "under" and "stand", each of which has a meaning on its own but means something entirely different when combined. These tokens are how the model understands words. Similarly, there are image tokens, which break groups of pixels up into distinct units of meaning. One issue with working with images is that they are prohibitively memory intensive, so the images are first compressed into a 32 x 32 grid of image tokens using a separate neural compression model (a discrete variational autoencoder).
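To make the token idea concrete, here is a minimal sketch in Python. It is purely illustrative: the tiny hand-made vocabulary stands in for DALL-E's learned BPE tokenizer, and a simple patch-averaging hash stands in for the discrete variational autoencoder that actually produces the image codes. Only the shapes mirror the article (a 32 x 32 grid, i.e. 1,024 image tokens); everything else is an assumption made for the example.

import numpy as np

# Toy subword vocabulary; DALL-E uses a learned BPE tokenizer instead.
TOY_VOCAB = {"under": 0, "stand": 1, "a": 2, "dog": 3, "wearing": 4, "sun": 5, "glasses": 6}

def toy_tokenize(text):
    """Greedily match the longest known subword at each position of each word."""
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):      # try the longest piece first
                if word[i:j] in TOY_VOCAB:
                    tokens.append(TOY_VOCAB[word[i:j]])
                    i = j
                    break
            else:
                i += 1                              # skip characters we cannot cover
    return tokens

def toy_image_tokens(image, grid=32, vocab_size=8192):
    """Stand-in for the discrete VAE encoder: map each image patch to one code id.
    Here we simply hash the mean pixel value of each patch into the codebook."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    codes = np.empty((grid, grid), dtype=np.int64)
    for r in range(grid):
        for c in range(grid):
            patch = image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
            codes[r, c] = int(patch.mean() * 31.9) % vocab_size
    return codes                                    # (32, 32) grid -> 1,024 tokens when flattened

text_tokens = toy_tokenize("a dog wearing sunglasses")
image_codes = toy_image_tokens(np.random.rand(256, 256, 3))
sequence = text_tokens + image_codes.flatten().tolist()   # text tokens followed by 1,024 image tokens
print(len(text_tokens), image_codes.shape, len(sequence))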
The 32 x 32 grid is then flattened into a sequence of 1,024 image tokens and combined with the text tokens. The model is trained, in a way similar to DRAW, using what are called "attention layers", which learn to pay particular attention to different groups of tokens and to link groups of tokens together; this is how the AI learns that the word "dog" represents the four-legged creature we know as a dog. These attention layers are stacked to create what are called "Transformers", named after a seminal paper published in 2017 called "Attention Is All You Need". The name refers to the idea that attention layers alone are enough to build a good model, with no other specialized algorithm required. Transformers were originally designed for NLP, but in recent years they have revolutionized machine learning approaches to all sorts of tasks, including image generation.
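The sketch below shows the core operation inside such an attention layer, scaled dot-product attention, in plain NumPy. It is not DALL-E's actual implementation, which stacks many such layers with learned projections, sparse attention patterns, and billions of parameters; the sequence length and embedding size here are arbitrary choices for illustration.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single attention head: each token gathers information from every other
    token, weighted by how strongly their query and key vectors match."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarity between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax: each row sums to 1
    return weights @ V                                    # weighted blend of the value vectors

# Toy example: a sequence of 6 tokens (e.g. a few text tokens followed by
# image tokens), each represented by an 8-dimensional embedding.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))

# In a real Transformer, Q, K, and V come from learned linear projections of
# the token embeddings; random matrices are used here as placeholders.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
output = scaled_dot_product_attention(tokens @ W_q, tokens @ W_k, tokens @ W_v)
print(output.shape)   # (6, 8): one updated vector per token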
Conclusion
DALL-E, and more recently DALL-E 2, have revolutionized text-to-image generation, and more broadly have changed how researchers approach machine learning problems. DALL-E has shown that more data, bigger models, and Transformers (attention layers) are the key ingredients of a successful machine learning model, even outside of natural language processing. Image generation is just the beginning of a new era of AI, and we're excited to see where the technology takes us next.