Nagesh Singh Chauhan
- Apr 12, 2022
- 6 min read

OpenAI's DALL·E 2 Lets You Generate Any ‘Photo’ by Just Describing It

This article briefly explains OpenAI's GPT-3-based AI model trained to generate images from text descriptions.

Introduction

Elon Musk-backed OpenAI, the San Francisco artificial intelligence research group that is closely affiliated with Microsoft, just announced it has created an A.I. system that can take a description of an object or scene and automatically generate a highly realistic image picturing it. The system also permits a person to easily edit the image with simple tools and text modifications, rather than requiring traditional Photoshop or digital art skills.

“We hope tools like this democratize the ability for people to create whatever they want,” Alex Nichol, one of the OpenAI researchers who worked on the project, said. He said the tool could be helpful for product designers, magazine cover designers, and artists—either to use for inspiration and brainstorming or to create finished works. He also said computer game companies might want to use it to generate scenes and characters—although the software currently generates still images, not animation or videos.

Because the software could also be used to more easily generate racist memes, to create fake images for propaganda or disinformation, or, for that matter, to create pornography, OpenAI says it has taken steps to limit the software’s capabilities in this area, first by trying to remove such images from the A.I.’s training data, and also by applying rule-based filters and human content reviews to the images the A.I. generates.

OpenAI is also endeavoring to carefully control the release of the new A.I., which it defines as currently just a research project and not a commercial product. It is sharing the software only with what it describes as a select and screened group of beta testers. But in the past, OpenAI’s breakthroughs based on natural-language processing have often found their way into commercial products within about 18 months.

The software OpenAI has created is called DALL-E 2, and it is an updated version of a system that OpenAI debuted in early 2021, simply called DALL-E. (The acronym is complicated, but it is meant to evoke a mashup of WALL-E, the animated robot of Pixar movie fame, and a play on words for Dali, as in Salvador, the surrealist artist, which makes sense given the surreal nature of the images the system can generate.)

DALL·E 2 System Components

Model

DALL·E 2 is an artificial intelligence model that takes a text prompt and/or existing image as an input and generates a new image as an output. DALL·E 2 was developed by researchers at OpenAI to understand the capabilities and broader implications of multimodal generative models. In order to help us and others better understand how image generation models can be used and misused, OpenAI is providing access to a subset of DALL·E 2's capabilities via the DALL·E 2 Preview.

DALL·E 2 builds on DALL·E 1 (Paper), increasing the level of resolution, fidelity, and overall photorealism it is capable of producing. DALL·E 2 is also trained to have new capabilities compared to DALL·E 1.

Model capabilities

In addition to generating images based on text description prompts ("Text to Image"), DALL·E 2 can modify existing images as prompted using a text description ("Inpainting"). It can also take an existing image as an input and be prompted to produce a creative variation on it ("Variations").

Model training data

DALL·E 2 was trained on pairs of images and their corresponding captions. Pairs were drawn from a combination of publicly available sources and sources that we licensed.

We have made an effort to filter the most explicit content from the training data for DALL·E 2. This filtered explicit content includes graphic sexual and violent content as well as images of some hate symbols. The filtering was informed by but distinct from earlier, more aggressive filtering (removing all images of people) that we performed when building GLIDE, a distinct model that we published several months ago. We performed more aggressive filtering in that context because a small version of the model was intended to be open-sourced. It is harder to prevent an open-source model from being used for harmful purposes than one that is only exposed through a controlled interface, not least due to the fact that a model, once open-sourced, can be modified and/or be combined with other third-party tools.

We conducted an internal audit of our filtering of sexual content to see if it concentrated or exacerbated any particular biases in the training data. We found that our initial approach to filtering sexual content reduced the quantity of generated images of women in general, and we made adjustments to our filtering approach as a result.

What is DALL·E?

DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.

GPT-3 showed that language can be used to teach a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.

What's new in DALL·E 2?

The research sounds very similar to NVIDIA’s GauGAN2, which is also able to take sentences and turn them into realistic photos. The previous version of DALL-E was only able to make cartoonish-looking images on a plain background that weren’t nearly as impressive as NVIDIA’s examples, but this new version is able to generate photo-quality images in high resolution with complex backgrounds, depth of field effects, shadows, shading, and reflections.

Textual description to generate this image: “a corgi on a beach.”

Images generated by DALL·E 2. Credits

One of the new DALL-E 2 features, inpainting, applies DALL-E’s text-to-image capabilities on a more granular level. Users can start with an existing picture, select an area, and tell the model to edit it. You can block out a painting on a living room wall and replace it with a different picture, for instance, or add a vase of flowers on a coffee table. The model can fill (or remove) objects while accounting for details like the directions of shadows in a room. Another feature, variations, is sort of like an image search tool for pictures that don’t exist. Users can upload a starting image and then create a range of variations similar to it. They can also blend two images, generating pictures that have elements of both. The generated images are 1,024 x 1,024 pixels, a leap over the 256 x 256 pixels the original model delivered.
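The masked-edit idea behind inpainting, and the size of the resolution jump, can be illustrated with a small sketch. This is not how DALL-E 2 works internally: the function names `composite_inpaint` and `pixel_ratio` are hypothetical, images are plain nested lists of grayscale values, and the "generated" pixels are simply supplied by the caller rather than produced by a model. The sketch only shows the final step, where pixels inside the selected area are replaced and everything outside it is left untouched.

```python
from typing import List

Image = List[List[int]]   # grayscale pixels (0-255), stored row by row
Mask = List[List[bool]]   # True marks the area the user selected for editing


def composite_inpaint(original: Image, generated: Image, mask: Mask) -> Image:
    """Keep original pixels outside the mask and take generated pixels inside it."""
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("all inputs must have the same height")
    result: Image = []
    for orig_row, gen_row, mask_row in zip(original, generated, mask):
        if not (len(orig_row) == len(gen_row) == len(mask_row)):
            raise ValueError("all inputs must have the same width")
        result.append([g if m else o for o, g, m in zip(orig_row, gen_row, mask_row)])
    return result


def pixel_ratio(new_side: int, old_side: int) -> float:
    """How many times more pixels a square image has after a resolution change."""
    return (new_side * new_side) / (old_side * old_side)


if __name__ == "__main__":
    original = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
    generated = [[99, 99, 99], [99, 99, 99], [99, 99, 99]]
    # Only the centre pixel is selected for editing.
    mask = [[False, False, False], [False, True, False], [False, False, False]]
    print(composite_inpaint(original, generated, mask))
    print(pixel_ratio(1024, 256))  # 16.0
```

Going from 256 x 256 to 1,024 x 1,024 quadruples each side, so each image carries 16 times as many pixels.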

Textual description to generate this image: "An astronaut riding a horse in space."

Textual description to generate this image: “an astronaut playing basketball with cats in space”

Textual description to generate this image: “a bowl of soup that looks like a monster knitted out of wool,”

DALL-E 2 builds on CLIP, a computer vision system that OpenAI also announced last year. “DALL-E 1 just took our GPT-3 approach from language and applied it to produce an image: we compressed images into a series of words and we just learned to predict what comes next,” says OpenAI research scientist Prafulla Dhariwal, referring to the GPT model used by many text AI apps. But the word-matching didn’t necessarily capture the qualities humans found most important, and the predictive process limited the realism of the images. CLIP was designed to look at images and summarize their contents the way a human would, and OpenAI iterated on this process to create “unCLIP” — an inverted version that starts with the description and works its way toward an image. DALL-E 2 generates the image using a process called diffusion, which Dhariwal describes as starting with a “bag of dots” and then filling in a pattern with greater and greater detail.
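The "bag of dots" description can be made concrete with a toy sketch of iterative refinement from noise. This is deliberately not a real diffusion model: an actual model uses a trained neural network, conditioned on the text, to estimate at each step what to remove from the noisy image. Here the target is known in advance, which is a shortcut taken purely for illustration, and the names `toy_denoise` and `mean_abs_error` are hypothetical.

```python
import random
from typing import List


def toy_denoise(target: List[float], steps: int = 50, seed: int = 0) -> List[List[float]]:
    """Start from random noise and move toward `target` a little at each step.

    Returns every intermediate state, so the gradual emergence of the
    pattern can be inspected. ASSUMPTION: a real diffusion model does not
    know the target; a trained network predicts each refinement instead.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # the initial "bag of dots"
    history = [list(x)]
    for step in range(steps):
        remaining = steps - step
        # Close 1/remaining of the gap, so the gap shrinks evenly to zero.
        x = [xi + (ti - xi) / remaining for xi, ti in zip(x, target)]
        history.append(list(x))
    return history


def mean_abs_error(x: List[float], target: List[float]) -> float:
    """Average absolute difference between a state and the target."""
    return sum(abs(a - b) for a, b in zip(x, target)) / len(target)


if __name__ == "__main__":
    target = [0.0, 0.5, 1.0, 0.5]  # a tiny 4-pixel "image"
    history = toy_denoise(target, steps=20, seed=1)
    for i in (0, 5, 10, 20):
        print(i, round(mean_abs_error(history[i], target), 4))
```

Printing the error at a few steps shows the pattern sharpening gradually rather than appearing all at once, which is the intuition behind the quote.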

What does the future hold?

Ilya Sutskever, OpenAI’s co-founder and chief scientist, said that DALL-E 2 was an important step toward OpenAI’s goal of trying to create artificial general intelligence (AGI), a single piece of A.I. software that can achieve human-level or better than human-level performance across a wide range of disparate tasks. AGI would need to possess “multimodal” conceptual understanding—being able to associate a word with an image or set of images and vice versa, Sutskever said. And DALL-E 2 is an attempt to create an A.I. with this sort of understanding, he said.

DALL-E 2 is far from perfect, though. The system sometimes cannot render details in complex scenes. It can get some of the lighting and shadow effects slightly wrong or merge the borders of two objects that should be distinct. It is also less adept than some other multimodal A.I. software at understanding “binding attributes.” Give it the instruction, “a red cube on top of a blue cube,” and it will sometimes offer variations in which the red cube appears below a blue cube.

Official page for DALL·E 2: https://openai.com/dall-e-2/

Anyone interested in collaborating with DALL·E 2 can register for the waitlist here.

References

https://fortune.com/2022/04/06/openai-dall-e-2-photorealistic-images-from-text-descriptions/

https://openai.com/blog/dall-e/

https://openai.com/dall-e-2/