Diffusion Models are generative models which have been gaining significant popularity in the past several years, and for good reason. A handful of seminal papers released in the 2020s alone have shown the world what Diffusion Models are capable of, such as beating GANs on image synthesis.
Given the recent wave of success by Diffusion Models, many Machine Learning practitioners are surely interested in their inner workings. In this article, we will examine the theoretical foundations for Diffusion Models, and then demonstrate how to generate images with a Diffusion Model in PyTorch.
Diffusion Models - Introduction
Diffusion Models are **generative** models, meaning that they are used to generate data similar to the data on which they are trained. Fundamentally, Diffusion Models work by destroying training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, we can use the Diffusion Model to generate data by simply passing randomly sampled noise through the learned denoising process.
<div>
<img src="img/DM_1.png" width="800"/>
</div>
More specifically, a Diffusion Model is a latent variable model which maps to the latent space using a fixed Markov chain. This chain gradually adds noise to the data in order to obtain the approximate posterior $ q(x_{1:T}|x_{0}) $, where $ x_{1},\ldots,x_{T} $ are latent variables with the same dimensionality as $ x_{0} $. In the figure below, we see such a Markov chain manifested for image data.
<div>
<img src="img/DM_2.png" width="800"/>
</div>
Ultimately, the image is asymptotically transformed to pure Gaussian noise. The goal of training a diffusion model is to learn the reverse process - i.e. training $ p_{\theta}(x_{t-1}|x_{t}) $. By traversing backwards along this chain, we can generate new data.
<div>
<img src="img/DM_3.png" width="800"/>
</div>
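To build intuition for the forward process, here is a toy sketch in PyTorch. It assumes a single fixed per-step noise level purely for illustration; the actual time-dependent transition is defined in the next section:

```python
import torch

# Toy illustration of the forward noising process: repeatedly shrink the
# signal and blend in Gaussian noise. After many steps, essentially no
# information about the original image remains.
x = torch.rand(3, 64, 64)   # stand-in for a training image
beta = 0.02                 # fixed per-step noise level (illustration only)

for t in range(1000):
    noise = torch.randn_like(x)
    x = (1 - beta) ** 0.5 * x + beta ** 0.5 * noise

# x is now approximately an isotropic standard Gaussian
print(x.mean().item(), x.std().item())  # roughly 0 and 1
```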
Benefits of Diffusion Models
As mentioned above, research into Diffusion Models has exploded in recent years. Inspired by non-equilibrium thermodynamics, Diffusion Models currently produce state-of-the-art image quality, examples of which can be seen below:
<div>
<img src="img/DM_4.png" width="800"/>
</div>
Beyond cutting-edge image quality, Diffusion Models come with a host of other benefits, including not requiring adversarial training. The difficulties of adversarial training are well-documented, and in cases where non-adversarial alternatives exist with comparable performance and training efficiency, it is usually best to utilize them. On the topic of training efficiency, Diffusion Models also have the added benefits of scalability and parallelizability.
While Diffusion Models almost seem to be producing results out of thin air, there are a lot of careful and interesting mathematical choices and details that provide the foundation for these results, and best practices are still evolving in the literature. Let's take a look at the mathematical theory underpinning Diffusion Models in more detail now.
Diffusion Models - A Deeper Dive
As mentioned above, a Diffusion Model consists of a forward process (or diffusion process), in which a datum (generally an image) is progressively noised, and a reverse process (or reverse diffusion process), in which noise is transformed back into a sample from the target distribution.
The sampling chain transitions in the forward process can be set to conditional Gaussians when the noise level is sufficiently low. Combining this fact with the Markov assumption leads to a simple parameterization of the forward process:
<div>
<img src="img/1_v2.png" width="800"/>
</div>
where $ \beta_{1},\ldots,\beta_{T} $ is a variance schedule (either learned or fixed) which, if well-behaved, ensures that $ x_{T} $ is nearly an isotropic Gaussian for sufficiently large $ T $.
<div>
<img src="img/DM_5.png" width="800"/>
</div>
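A convenient consequence of this parameterization, derived in [3], is that $ x_{t} $ can be sampled directly from $ x_{0} $ in closed form rather than by simulating the chain step by step. Defining $ \alpha_{t} = 1-\beta_{t} $ and $ \bar{\alpha}_{t} = \prod_{s=1}^{t}\alpha_{s} $, we have

$$ q(x_{t}|x_{0}) = \mathcal{N}\left(x_{t};\, \sqrt{\bar{\alpha}_{t}}\,x_{0},\, (1-\bar{\alpha}_{t})I\right) $$

This property is what makes training efficient in practice: any image can be noised to an arbitrary timestep in a single operation.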
As mentioned previously, the "magic" of diffusion models comes in the reverse process. During training, the model learns to reverse this diffusion process in order to generate new data. Starting with pure Gaussian noise $ p(x_{T}) = \mathcal{N}(x_{T}; 0, I) $, the model learns the joint distribution $ p_{\theta}(x_{0:T}) $ as
<div>
<img src="img/2_v2.png" width="800"/>
</div>
where the time-dependent parameters of the Gaussian transitions are learned. Note in particular that the Markov formulation asserts that a given reverse diffusion transition distribution depends only on the previous timestep (or following timestep, depending on how you look at it):
<div>
<img src="img/3.png" width="800"/>
</div>
<div>
<img src="img/DM_6.png" width="800"/>
</div>
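As a minimal sketch, a single reverse transition amounts to sampling from a Gaussian whose mean is produced by a learned network. Here `mean_model` is a hypothetical network predicting $ \mu_{\theta}(x_{t}, t) $, and $ \sigma_{t} $ is the transition standard deviation discussed below:

```python
import torch

def reverse_step(x_t, t, mean_model, sigma_t):
    """One reverse transition x_t -> x_{t-1}: sample from
    N(mu_theta(x_t, t), sigma_t^2 I). `mean_model` is a hypothetical
    network predicting the Gaussian mean."""
    mu = mean_model(x_t, t)
    # No noise is added on the very last step, so the final sample is the mean.
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mu + sigma_t * z
```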
Training
A Diffusion Model is trained by finding the reverse Markov transitions that maximize the likelihood of the training data. In practice, training equivalently consists of minimizing the variational upper bound on the negative log likelihood.
<div>
<img src="img/4-1.png" width="800"/>
</div>
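For reference, this bound decomposes (as shown in [3]) into a sum of per-timestep terms, which is where the quantities $ L_{T} $, $ L_{1:T-1} $, and $ L_{0} $ discussed in the following sections come from:

$$ L_{vlb} = \mathbb{E}_{q}\bigg[\underbrace{D_{KL}\big(q(x_{T}|x_{0})\,\|\,p(x_{T})\big)}_{L_{T}} + \sum_{t>1}\underbrace{D_{KL}\big(q(x_{t-1}|x_{t},x_{0})\,\|\,p_{\theta}(x_{t-1}|x_{t})\big)}_{L_{t-1}} \underbrace{-\log p_{\theta}(x_{0}|x_{1})}_{L_{0}}\bigg] $$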
Choice of Model
With the mathematical foundation for our objective function established, we now need to make several choices regarding how our Diffusion Model will be implemented. For the forward process, the only choice required is defining the variance schedule, whose values generally increase during the forward process.
For the reverse process, we must choose the Gaussian distribution parameterization and the model architecture(s). Note the high degree of flexibility that Diffusion Models afford - the only requirement on our architecture is that its input and output have the same dimensionality.
Forward Process and $ L_{T} $
As noted above, we must define the variance schedule for the forward process. In particular, we set the variances to be time-dependent constants rather than learning them. For example, a linear schedule from $ \beta_{1}=10^{-4} $ to $ \beta_{T}=0.02 $ might be used, or perhaps a geometric series.
Regardless of the particular values chosen, the fact that the variance schedule is fixed results in $ L_{T} $ becoming a constant with respect to our set of learnable parameters, allowing us to ignore it as far as training is concerned.
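As a concrete sketch, precomputing a linear schedule along with the derived quantities $ \alpha_{t} $ and $ \bar{\alpha}_{t} $ introduced earlier might look like this in PyTorch:

```python
import torch

T = 1000                                     # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)        # fixed linear variance schedule
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products: alpha-bar_t

# alpha_bars[-1] is tiny, so x_T is (nearly) pure Gaussian noise
print(alpha_bars[-1].item())
```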
Reverse Process and $ L_{1:T-1} $
Now we discuss the choices required in defining the reverse process. Recall from above that we defined the reverse Markov transitions as Gaussians:
<div>
<img src="img/7.png" width="800"/>
</div>
We must now define the functional forms of $ \mu_{\theta} $ and $ \Sigma_{\theta} $. While there are more complicated ways to parameterize $ \Sigma_{\theta} $, we simply set
<div>
<img src="img/8.png" width="800"/>
</div>
That is, we assume that the multivariate Gaussian is a product of independent Gaussians with identical variance, where this shared variance can change with time. We set these variances to be equivalent to our forward process variance schedule.
Given this new formulation of $ \Sigma_{\theta} $, we have
<div>
<img src="img/9.png" width="800"/>
</div>
which allows us to transform
<div>
<img src="img/6.2.png" width="800"/>
</div>
to
<div>
<img src="img/10-1.png" width="800"/>
</div>
where the first term in the difference is a linear combination of $ x_{t} $ and $ x_{0} $ that depends on the variance schedule $ \beta_{t} $. The exact form of this function is not relevant for our purposes, but it can be found in [3].
The significance of the above proportionality is that the most straightforward parameterization of $ \mu_{\theta} $ simply predicts the diffusion posterior mean. Importantly, the authors of [3] actually found that training $ \mu_{\theta} $ to predict the noise component at any given timestep yields better results. In particular, let
<div>
<img src="img/image-16.png" width="800"/>
</div>
where
<div>
<img src="img/image-17.png" width="800"/>
</div>
This leads to the following alternative loss function,
<div>
<img src="img/image-18.png" width="800"/>
</div>
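A sketch of one training step under this simplified objective, reusing the `alpha_bars` tensor precomputed earlier and assuming a hypothetical noise-prediction network `eps_model`:

```python
import torch
import torch.nn.functional as F

def loss_simple(eps_model, x0, alpha_bars):
    """Simplified DDPM-style loss: predict the noise added to x0 at a
    random timestep and penalize the mean squared error."""
    batch = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (batch,), device=x0.device)
    a_bar = alpha_bars[t].view(batch, 1, 1, 1)

    eps = torch.randn_like(x0)                          # true noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # noise x0 to step t
    return F.mse_loss(eps_model(x_t, t), eps)           # match predicted noise
```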
Network Architecture
While our simplified loss function seeks to train a model $ \epsilon_{\theta} $, we have not yet defined the architecture of this model. Note that the only requirement for the model is that its input and output dimensionality are identical.
Given this restriction, it is perhaps unsurprising that image Diffusion Models are commonly implemented with U-Net-like architectures.
<div>
<img src="img/image-19.png" width="800"/>
</div>
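For intuition only, a heavily simplified U-Net-style module might look like the following. This is not the architecture from [3]; among other things it omits the timestep embedding that a real implementation would need:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: downsample, upsample, and a skip
    connection. Input and output shapes are identical, as required."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU())
        self.skip = nn.Conv2d(channels, hidden, 1)       # skip connection
        self.out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, t=None):  # a real model would also embed t
        h = self.up(self.down(x)) + self.skip(x)
        return self.out(h)
```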
Reverse Process Decoder and $ L_{0} $
The path along the reverse process consists of many transformations under continuous conditional Gaussian distributions. At the end of the reverse process, recall that we are trying to produce an image, which is composed of integer pixel values. Therefore, we must devise a way to obtain discrete (log) likelihoods for each possible pixel value across all pixels.
This is done by setting the last transition in the reverse diffusion chain to an independent discrete decoder. To determine the likelihood of a given image $ x_{0} $ given $ x_{1} $, we first impose independence between the data dimensions:
<div>
<img src="img/12.png" width="800"/>
</div>
where $ D $ is the dimensionality of the data and the superscript $ i $ indicates the extraction of one coordinate. The goal now is to determine how likely each integer value is for a given pixel, given the distribution across possible values for the corresponding pixel in the slightly noised image at time $ t=1 $:
<div>
<img src="img/13-1.png" width="800"/>
</div>
where the pixel distributions for $ t=1 $ are derived from the multivariate Gaussian below, whose diagonal covariance matrix allows us to split the distribution into a product of univariate Gaussians, one for each dimension of the data:
<div>
<img src="img/11-2.png" width="800"/>
</div>
We assume that the images consist of integers in $ \{0, 1, \ldots, 255\} $ (as standard RGB images do) which have been scaled linearly to $ [-1,1] $. We then break down the real line into small "buckets", where, for a given scaled pixel value $ x $, the bucket covering it is $ [x-\frac{1}{255}, x+\frac{1}{255}] $. The probability of a pixel value $ x $, given the univariate Gaussian distribution of the corresponding pixel in $ x_{1} $, is the area under that univariate Gaussian distribution within the bucket centered at $ x $.
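A sketch of this bucketed computation using the Gaussian CDF, where `mu` and `sigma` are hypothetical tensors holding the per-pixel mean and standard deviation of the decoder Gaussian (following [3], the two outermost buckets extend to $ \pm\infty $):

```python
import torch

def discrete_pixel_likelihood(x0, mu, sigma):
    """Probability of each (scaled, in [-1, 1]) pixel value in x0 under a
    univariate Gaussian per pixel, integrated over its 1/255-wide bucket."""
    normal = torch.distributions.Normal(mu, sigma)
    upper = normal.cdf(x0 + 1.0 / 255)
    lower = normal.cdf(x0 - 1.0 / 255)
    # Edge buckets cover everything beyond the first/last integer values.
    upper = torch.where(x0 >= 1.0, torch.ones_like(upper), upper)
    lower = torch.where(x0 <= -1.0, torch.zeros_like(lower), lower)
    return upper - lower
```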
Final Objective
<div>
<img src="img/17-1.png" width="800"/>
</div>
The training and sampling algorithms for our Diffusion Model can therefore be succinctly captured in the figure below:
<div>
<img src="img/image-20.png" width="800"/>
</div>
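As a final sketch, sampling walks backward through the chain using the noise-prediction network. This reuses the hypothetical `eps_model` and the `betas`, `alphas`, and `alpha_bars` tensors from the earlier sketches, with $ \sigma_{t}^{2} = \beta_{t} $ as discussed above:

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas, alphas, alpha_bars):
    """Generate images by starting from pure noise and applying the
    learned reverse transitions from t = T down to t = 1."""
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)              # predicted noise
        coef = betas[t] / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # sigma_t^2 = beta_t
        else:
            x = mean                             # no noise on the final step
    return x
```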