What are diffusion models?
Diffusion models are a category of generative AI that excels at creating images, audio, video and other types of data by using a two-step process: forward diffusion and reverse diffusion. The first step involves distorting the data, such as an image, by gradually adding noise; the second involves learning to reverse that process in order to generate new data.
Imagine the forward diffusion step as drops of ink spreading through water until the water is opaque. Diffusion models learn to create a new image by gradually reversing this diffusion process. Under the hood, this involves training machine learning algorithms to learn complex patterns in the data. Machine learning (ML) engineers and researchers then design systems that let the trained model generate new content by manipulating, metaphorically speaking, the levers and knobs of the initial random noise.
This class of generative models, in which many ML techniques are mixed and matched, was inspired by century-old physics concepts, specifically the theory of nonequilibrium thermodynamics, which describes how heat and particles move around over time. In 2015, a team of Berkeley and Stanford researchers described how nonequilibrium thermodynamics could be adapted to generate new content in a paper called "Deep Unsupervised Learning using Nonequilibrium Thermodynamics."
Diffusion models excel at generating graphics, 3D content and audio, and have emerged as a compelling alternative to transformers and generative adversarial networks (GANs). Their strength lies in offering finer-grained control over outputs, because content is built up gradually over many small steps. This enables diffusion models to capture intricate details and generate high-fidelity content. However, they also suffer from hallucinations, such as adding extra fingers to hands, extra eyes to faces or unexpected objects.
How do diffusion models work?
Diffusion models are trained to create new images by learning how to reverse the process of adding noise to data. The training process begins with forward diffusion: showing the model many examples of images at different noise levels, up to the point where an image is pure noise. This type of random noise is called Gaussian noise. The model then learns to predict and remove noise -- the reverse diffusion process -- in order to generate new high-quality images. Note that this differs from the dimensionality reduction used in JPEG compression, which stores a smaller version of an image and retrieves a slightly degraded facsimile of the original; diffusion models instead learn the underlying structure of many images and use it to generate new ones.
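To make the training idea concrete, the following is a minimal, simplified sketch of a DDPM-style training step written with PyTorch. The `denoiser` placeholder, the noise schedule values and the tensor shapes are illustrative assumptions, not any specific product's implementation: the key point is that the forward process mixes a clean image with Gaussian noise, and the model learns to predict that noise.

```python
import torch

# Placeholder for a real denoising network (typically a U-Net).
# It takes a noisy image batch and a timestep batch and predicts the added noise.
def denoiser(x_t, t):
    return torch.zeros_like(x_t)  # stand-in only; a real model would be trained here

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(x0):
    """One DDPM-style training step on a batch of clean images x0 (N, C, H, W)."""
    t = torch.randint(0, T, (x0.shape[0],))          # random timestep per image
    noise = torch.randn_like(x0)                     # Gaussian noise
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    # Forward diffusion: blend each clean image with noise in a single shot.
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Reverse diffusion is learned by predicting the noise that was added.
    loss = torch.nn.functional.mse_loss(denoiser(x_t, t), noise)
    return loss
```

In a real training loop, this loss would be backpropagated through the denoising network over many batches of images.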
Algorithms and techniques used in forward diffusion and reverse diffusion models include the following:
- Denoising diffusion probabilistic models. DDPMs use a reverse diffusion algorithm in which each denoising step depends only on the previous noisy image, so the steps form a Markov chain (see the sampling sketch after this list).
- Denoising diffusion implicit models. DDIMs use a faster reverse diffusion process than DDPMs, generating a high-quality image in fewer steps, but they require more expertise and can cost more to train.
- Score-based generative models. SGMs use a forward diffusion process but learn to generate new images by estimating the score of the data distribution.
- Latent diffusion models. LDMs use an encoder to compress the image into a lower-dimensional latent representation, which lets them use less compute power to generate an image.
- Guided diffusion models. GDMs use external information, such as a text description or another image, to enhance or control the reverse diffusion process.
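As referenced in the DDPM entry above, the reverse process can be implemented as a simple loop in which each update depends only on the current noisy image. The sketch below is illustrative only; the `denoiser` argument stands in for a trained noise-prediction network such as the one in the training sketch earlier, and the `betas` schedule is assumed to match the one used during training.

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, betas):
    """Minimal DDPM-style sampling loop: start from pure Gaussian noise and
    remove it step by step. Each update uses only the current image x_t,
    which is what makes the reverse process a Markov chain."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                        # start from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t)
        eps = denoiser(x, t_batch)                # predicted noise at step t
        # Estimate the slightly less noisy image at step t - 1.
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:                                 # add fresh noise except on the last step
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```

Faster samplers such as DDIM replace this long loop with a shorter sequence of larger steps.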
Diffusion models work well when generating synthetic data similar to the original, such as slight variations in images of handwritten numbers or faces. However, it is also possible to map the knobs and levers to human concepts, such as styles and objects. This last piece requires a technical process called alignment: either the model is trained to correlate prewritten descriptions with images, or the diffusion model is combined with transformers that use vector embeddings to encode the semantic meaning of images and text.
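One widely used way to connect a text description to the generation process is classifier-free guidance, sketched below. The conditioning interface of `denoiser` (a `cond` argument) and the guidance scale value are assumptions for illustration; the idea is simply to blend a prompt-aware noise prediction with a prompt-free one and push the sample toward the prompt.

```python
def guided_noise_prediction(denoiser, x_t, t, text_embedding, guidance_scale=7.5):
    """Classifier-free guidance sketch: blend a text-conditioned noise
    prediction with an unconditioned one. A scale above 1 pushes the
    generated image toward the text description."""
    eps_uncond = denoiser(x_t, t, cond=None)           # ignore the prompt
    eps_cond = denoiser(x_t, t, cond=text_embedding)   # follow the prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```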
What are the advantages of diffusion models?
The following are some of the advantages diffusion models offer compared to other generative AI techniques.
High quality. Diffusion models excel at generating high-quality images, video and audio with realistic details. The underlying probabilistic algorithms capture intricate nuances, such as textures and styles, that contribute to this realism.
Training stability. The step-by-step training process in diffusion models lets developers refine the model gradually. This can mitigate issues, such as model collapse, that limit the model's ability to generate new variations.
Robustness. The diffusion model training process enables the models to learn the underlying data distribution rather than simply replicating the training samples. This lets them generalize to create new content different from what they were trained on.
Versatility. Diffusion models are good at modeling complex data distributions across multiple modalities. This enables them to support use cases such as inpainting, text-to-image synthesis and audio creation.
Handling of missing data. The reverse diffusion process can learn to fill in missing data during training. This is useful in cases such as filling in missing pixels when increasing the resolution of images or video. It can also fill in realistic details when editing objects out of an image.
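As a rough illustration of how a trained model can fill in missing pixels, the sketch below adapts one reverse-diffusion step for inpainting, in the spirit of replacement-based methods such as RePaint. The `denoise_step` helper and parameter names are hypothetical: the point is that known pixels are repeatedly reset to a suitably noised copy of the original image, so the model only has to invent the masked region.

```python
import torch

@torch.no_grad()
def inpaint_step(x_t, x0_known, mask, t, denoise_step, alpha_bar_prev):
    """One reverse-diffusion step adapted for inpainting (a sketch).
    mask == 1 marks missing pixels; mask == 0 marks known pixels.
    `denoise_step` is an illustrative helper that performs one ordinary
    reverse step; `alpha_bar_prev` is the cumulative noise level at t - 1."""
    x_prev = denoise_step(x_t, t)  # the model fills in everything
    # Re-noise the known pixels of the original image to the matching noise level,
    # so known and generated regions stay statistically consistent.
    known = alpha_bar_prev**0.5 * x0_known + (1 - alpha_bar_prev) ** 0.5 * torch.randn_like(x0_known)
    # Keep the model's output only where data is missing.
    return mask * x_prev + (1 - mask) * known
```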
What are the disadvantages of diffusion models?
The disadvantages of diffusion models include the following.
Computational cost. The iterative nature of the training and inferencing processes requires significant compute at each step. These costs grow quickly as the resolution of the training data or the generated content increases.
Slow speed. The sequential nature of the denoising process used to generate new content results in slower speeds than when using GANs or similar techniques. This introduces lag in interactive content generation and refinement processes. Newer algorithms can help speed up the process.
Generalization. Diffusion models trained on one type of content might not generalize well to other types not represented in the training data. For example, models trained on faces can struggle to fill in the details of landscapes.
Interpretability. The features learned during the diffusion model training process do not easily map to human concepts. This can complicate efforts to fine-tune the model for different use cases.
Hallucinations. Diffusion models can learn the textures and styles associated with different types of content but struggle to learn semantic concepts. For example, they sometimes generate faces with extra eyes or hands with extra fingers.
What are the use cases of diffusion models?
The following are some of the use cases for diffusion models.
Graphic design. Diffusion models can transform rough sketches and descriptions into more polished imagery. This enables users to experiment with different ideas and then iteratively describe adjustments, such as the style, objects, relationships between them or tone.
Captioning. A diffusion model can generate captions for existing images that describe the objects in an image, its tone or its style. These captions could be used to automatically label images for other machine learning processes, and they can also serve as the starting point for making new images and adjusting the process.
Generating details. Diffusion models can also fill in missing details for various processes, such as super-resolution, inpainting and outpainting, which consider textures, reflections and shadows. Super-resolution involves scaling the size of an image and then filling in the missing pixels. Inpainting consists of removing an object and then filling in the background. Outpainting, sometimes called uncropping, extends an image in different directions.
Films and animation. Newer diffusion models, such as OpenAI's Sora, can generate realistic videos and animations from prompts describing the scene, tone, objects, characters and their behaviors. These might be used independently or to explore variations that can be fleshed out in real life or with other editing tools.
Music generation. Diffusion models can learn the underlying patterns in music to generate new variations in a particular style or using different instruments. They can also extend existing music tracks in either direction based on a small sample.
Synthetic data generation. Nvidia Cosmos uses diffusion models to generate realistic video data for training robots and autonomous cars. This lets developers transform descriptions of rare events into realistic videos to improve responses when real-world training data is unavailable. These models can also simulate how video captured on a sunny day might look at night or in rainy or snowy conditions.
Drug discovery. MIT's DiffDock uses diffusion models to discover patterns in how drug molecules bind to proteins in the body. This enables researchers to explore variations designed to act differently on that molecular machinery with fewer side effects.
Diffusion tools
Many diffusion tools are now available to support various processes and use cases. Prominent examples include the following.
Dall-E. The name of OpenAI's Dall-E family is a portmanteau of surrealist painter Salvador Dali and Wall-E, the robotic character from the Pixar film of the same name. The original Dall-E combines variational autoencoders and transformers but not diffusion models. Dall-E 2, however, uses a diffusion model to improve realism and speed. It can generate, edit and vary existing content. Dall-E 3, integrated into ChatGPT, supports more complex prompts and improves the generation of in-image text, such as signs and labels on objects, but cannot edit or create variations. Dall-E 2 and Dall-E 3 are available as application programming interfaces (APIs) for integration into other apps.
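For developers, the official `openai` Python package exposes an images endpoint. The snippet below is a minimal sketch, assuming the package is installed and an OPENAI_API_KEY environment variable is set; model names and parameters can change over time.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Ask Dall-E 3 for a single image; the API response includes a URL to the result.
result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)
```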
Sora. OpenAI's flagship text-to-video model Sora is named after the Japanese word for sky. It was first previewed in February 2024 and later offered as part of the company's ChatGPT subscription services. It supports the ability to generate new videos from prompts, remix existing videos, combine videos, extend scenes forward or backward in time, and organize and edit sequences in a timeline.
Stable Diffusion. This is the flagship image generation brand maintained by Stability AI. The first version is based on the latent diffusion project published by German researchers in December 2021. Subsequent Stable Diffusion versions took advantage of innovations in transformers to improve results. The tool is provided both as a service and as an open source model that can be customized for various use cases, and the smaller versions can run on consumer-grade graphics processing units. It can generate images from text prompts and supports inpainting, outpainting and variations of images.
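Because the open source checkpoints can run locally, a common way to try them is Hugging Face's diffusers library. The sketch below assumes a CUDA-capable GPU and uses one widely shared Stable Diffusion v1.5 repository ID, which may differ from the latest release.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint in half precision so it fits on a consumer GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it to disk.
image = pipe("a cozy cabin in a snowy forest, oil painting").images[0]
image.save("cabin.png")
```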
Stable Audio. This music creation tool from Stability AI lets users create short clips of high-quality audio from prompts describing instruments, tempo and tone. An audio-to-audio version enables users to transfer styles and create variations. The tool can also transform vocal tracks into instrumental music. Stable Audio Open is an open source version that generates shorter audio snippets and is trained on royalty-free music to mitigate copyright infringement concerns.
Midjourney. This image generation tool and service, offered by the company of the same name, was first introduced in July 2022. It enables users to generate images with a prompt and make variations to the whole image or specific regions. It also supports an image weight feature for determining the influence of an existing image versus a text prompt on the generated image. Style and character reference features let users create templates from existing images that can be applied to guide the generation process.
Nai Diffusion. This is a family of creative tools from Neural Love for working with text, audio and video. It can generate new content, much like the other tools on this list; however, it also showcases the ability of diffusion models to make targeted improvements to existing content. For example, image editing features include the ability to uncrop, enhance, sharpen and restore images. Video editing features include enhancing quality, changing speed and colorizing. The audio tools can improve sound quality. A free version supports basic image and text generation features. Pro offerings are available via the web or API.
Imagen. This is a family of diffusion models developed by Google DeepMind for generating and editing images. It is available as part of the Gemini chatbot service, and ImageFX provides a graphical user interface to help guide the process. It excels at producing larger images with fine details, as well as rendering stylized text in images.
OmniGen. This open source diffusion model was created by researchers at the Beijing Academy of AI in November 2024. The team explored the potential of creating a single model that could complete various tasks from end to end, similar to how ChatGPT handles language tasks. Traditional diffusion model tools require combining diffusion models with multiple ML tools for pre- and post-processing. OmniGen helps unify workflows for image generation, image editing, subject-driven generation and visual conditional generation with less need for intermediate steps. It is useful as a standalone tool and as a model for future AI-powered content generation and editing innovations.
Cosmos. This is Nvidia's flagship platform for building generative world foundation models for physical AI, autonomous vehicles and robots. The platform uses diffusion models and autoregressive models for text-to-world and video-to-world generation. It demonstrates the flexibility of diffusion models to support a variety of real-world use cases. For example, it can make sense of large volumes of video to discover security or safety events. It can also generate synthetic data to better train robots and autonomous cars.