Diffusion Background

MathJax example

Note: Following is a compact summary from my own notes I took while reading Diffusion Models: A Comprehensive Survey of Methods and Applications https://arxiv.org/abs/2209.00796

Goal of DGM - Deep Generative Modeling:
$p_{\phi^{*}}(x) \approx p_{data}(x)$
i.e. approximate a target data distribution and generate new data points using sampling methods like Monte Carlo Sampling.

Training of DGM:
$\phi^{*} \in \arg \min_{\Phi} D(p_{data}, p_{\Phi})$
A standard choice of $D$ is KL divergence:
\[D_{KL}(p_{data} || p_{\phi}) := \int p_{data}(x) \log \frac{p_{data}(x)}{p_{\phi}(x)} dx\]
Equivalence to minimization:
\[= -E_{x \sim p_{data}} [\log p_{\phi}(x)] + \mathcal{H}(p_{data})\]
Where $\mathcal{H}(p_{data})$ is the entropy of the data distribution. Minimizing KL $\Leftrightarrow$ Maximizing MLE
\[\min_{\Phi} D_{KL}(p_{data} || p_{\Phi}) \Leftrightarrow \max_{\Phi} E_{x \sim p_{data}} [\log p_{\phi}(x)]\]
This population expectation can be replaced by its Monte Carlo Estimate from iid samples $\{x^{(i)}\}_{i=1}^{N} \sim p_{data}(x)$:
\[\mathcal{L}_{MLE}(\phi) := -\frac{1}{N} \sum_{i=1}^{N} \log p_{\phi}(x^{(i)})\]

Challenges in Modeling Distribution:
For $p_{\phi}$ to be a valid probability density function, it must satisfy:
1. Non-negativity: $p_{\phi}(x) \ge 0, \forall x$ in the domain
2. Normalization: Integral over the entire domain must equal one, i.e., $\int p(x) dx = 1$

Ensuring Non-Negativity:
This can be done by applying a positive function to the raw output of the neural network $E_{\phi}(x)$:
\[\tilde{p}_{\phi}(x) = \exp(E_{\phi}(x))\]
Enforcing Normalization:
To create a valid probability density, we must divide the function $\tilde{p}_{\phi}(x)$ by its integral over the entire space:
\[p_{\phi}(x) = \frac{\tilde{p}_{\phi}(x)}{\int \tilde{p}_{\phi}(x') dx'} = \frac{\exp(E_{\phi}(x))}{\int \exp(E_{\phi}(x')) dx'}\]
Where:
\[\int \exp(E_{\phi}(x')) dx' = Z(\phi)\]
$Z(\phi)$ is the normalizing constant. This integral is often intractable.

Families of Deep Generative Models

The intractability of the normalizing constant is a central problem that motivates the development of many different families of deep generative models designed to circumvent or reduce the computational cost of evaluating $Z(\phi)$.

Autoregressive Models

Deep AR models factorize the joint data distribution $p_{data}$ into a product of conditional probabilities using the chain rule of probability:

\[p_{data}(x) = \prod_{i=1}^{D} p_{\phi}(x_i | x_{<i})\]

Each conditional is parameterized by a neural network e.g., Transformer and is normalized by design e.g., softmax. Thus, global normalization is trivial.

Training is done by minimizing negative log-likelihood.

While they achieve strong density estimation, their sequential nature restricts sampling speed, and fixed ordering is restrictive.

Variational Autoencoders - VAE

VAEs extend classical autoencoders by introducing latent variables $z$ that capture hidden structure in data via a probabilistic view.

They learn both an encoder, $q_{\theta}(z|x)$, which approximates the unknown distribution of latent variables given the data, and a decoder $p_{\phi}(x|z)$ which reconstructs data from these latent variables.

Training is done by maximizing a tractable surrogate to the true log-likelihood called ELBO - Evidence Lower Bound:

\[\mathcal{L}_{ELBO}(\theta, \phi; x) = E_{q_{\theta}(z|x)}[\log p_{\phi}(x|z)] - D_{KL}(q_{\theta}(z|x) || p_{prior}(z))\]

The $p_{prior}(z)$ is typically Gaussian.

Challenges like sample sharpness and the tendency of the encoder to ignore latent variables.

Energy-Based Models - EBM

Define a probability distribution through an energy function $E_{\phi}(x)$ that assigns lower energy to more probable data points.

\[p_{\phi}(x) := \frac{1}{Z(\phi)} \exp(-E_{\phi}(x)) \text{ where } Z(\phi) = \int \exp(-E_{\phi}(x)) dx\]

Training EBMs requires maximizing the log-likelihood of data but faces challenges from the partition function.

Diffusion models circumvent this by generating data from the gradient of the log density, which does not depend on the normalizing constant.

Normalizing Flows:

These aim to learn a sequence of bijective mappings between a simple latent distribution $z$ and a complex data distribution $x$.

Defined as an invertible operator, these models leverage the change of variables formula for densities, enabling MLE training:
\[\log p_{\phi}(x) = \log p(z) + \log \left| \det \frac{\partial f_{\phi}^{-1}(x)}{\partial x} \right|\]

The normalization constant is absorbed analytically via the change of variables formula, making likelihood computation exact and tractable.

However, they typically impose restrictive architectural constraints to ensure bijectivity.

Generative Adversarial Networks - GANs:

GANs do not define an explicit density function and therefore bypass likelihood estimation entirely.

They focus on generating samples that closely mimic the data distribution while modeling it implicitly.

Training Objective:
\[\min_{G} \max_{D} \left[ E_{x \sim p_{data}(x)} [\log D_{\eta}(x)] + E_{z \sim p_{prior}(z)} [\log(1 - D_{\eta}(G_{\phi}(z)))] \right]\]

Although capable of generating high-quality data, the min-max training process is notoriously unstable.

Summary of Generative Strategies:
All these classical families of Deep Generative Models illustrate complementary strategies for modeling complex distributions. Diffusion methods inherit ideas from several of these perspectives:
To VAEs: They connect through variational training objectives.
To EBMs: They connect through score matching that learns gradients of the log-density.
To Normalizing Flows: They connect through continuous-time transformations.

Conceptual Lineage:

VAEs $\rightarrow$ Denoising Diffusion Probabilistic Models - DDPMs
EBMs / Score Matching $\rightarrow$ Noise Conditional Score Networks - NCSN
Normalizing Flows $\rightarrow$ Flow Matching - continuous-time formulation Score SDE

VAEs to DDPMs

VAEs represent data w/ latent variables and are trained by maximizing a tractable lower bound on the log likelihood.

In this setting, a learned encoder maps observations to latents and a learned decoder maps latents to observations, closing the modeling loop.

Hierarchical variants stack several latent layers to capture structure at multiple scales.

DDPMs: instead of jointly training both encoder and decoder, the encoder is fixed. The encoder acts as a forward noising process that gradually maps data to noise, and training learns a decoder that reverses this path in successive denoising steps. VAEs, hierarchical VAEs, & DDPMs all optimize a likelihood surrogate defined by a variational bound.

Mathematical Foundation
In VAEs, we assume each observation $x$ is generated from a latent variable $z \sim p(z) = \mathcal{N}(0, I) \xrightarrow{\text{decoder}} p_{\phi}(x|z)$.
Ideally we'd train w/ maximizing log likelihood of data - maximizing the probability the model assigns to real data:
\[\log p_{\phi}(x) = \log \int p_{\phi}(x|z) p(z) dz\]
The integral marginalizes out $z$ - it asks "summing over every possible latent that could have produced $x$, how probable is it overall?". But this integral is intractable.

Approximating the Integral
So can we approximate this integral? A standard way of approximating integrals is importance sampling — instead of integrating over all $z$, sample from some distribution and average:
\[\log p_{\phi}(x) = \log \left( \mathbb{E}_{z \sim q(z)} \left[ \frac{p_{\phi}(x|z) p(z)}{q(z)} \right] \right)\]

Ideally $q(z)$ should concentrate on the $z$ values that are likely to have produced the particular $x$ in our data.

In other words — the posterior $p_{\phi}(z|x)$.

But, $p_{\phi}(z|x) = \frac{p_{\phi}(x|z) p(z)}{p_{\phi}(x)}$ where the denominator is intractable for the same reason the likelihood is intractable.

A way out of this issue is to approximate the true posterior w/ a parameterized neural network — the encoder $q_{\theta}(z|x)$ which is the variational step, and maximize the ELBO instead of likelihood (tractably).

Defining the Training Objective

$q_{\theta}(z|x) \approx p_{\phi}(z|x)$ $= \mathcal{N}(\mu_{\theta}(x), \sigma_{\theta}^2(x)I)$
We are now in a position to define a computable training objective. While we cannot directly optimize $\log p_{\phi}(x)$, we can maximize a lower bound on it:
\[\log p_{\phi}(x) = \log \int p_{\phi}(x, z) dz\]
\[= \log \int q_{\theta}(z|x) \frac{p_{\phi}(x, z)}{q_{\theta}(z|x)} dz\]
\[= \log \left( \mathbb{E}_{z \sim q_{\theta}(z|x)} \left[ \frac{p_{\phi}(x, z)}{q_{\theta}(z|x)} \right] \right)\]
\[\ge \mathcal{L}_{ELBO}(\theta, \phi; x) \text{ via Jensen's inequality}\]
Where the ELBO is defined as:
\[\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\theta}(z|x)} [\log p_{\phi}(x|z)] - D_{KL}(q_{\theta}(z|x) || p(z))\]
Relationship to Likelihood:
\[\log p_{\phi}(x) - \mathcal{L}_{ELBO} = D_{KL}(q_{\theta}(z|x) || p_{\phi}(z|x)) \ge 0\]
Thus, maximizing the ELBO pushes up the likelihood and improves the posterior approximation at the same time.

Computability of ELBO

And $\mathcal{L}_{ELBO}$ is computable. Since the decoder is modeled as a Gaussian distribution w/ fixed variance:
\[p_{\phi}(x|z) := \mathcal{N}(x; \mu_{\phi}(z), \sigma^2 I)\]
The reconstruction term becomes:
\[\mathbb{E}_{q_{\theta}(z|x)} [\log p_{\phi}(x|z)] = -\frac{1}{2\sigma^2} \mathbb{E}_{q_{\theta}(z|x)} [||x - \mu_{\phi}(z)||^2] + C\]
Resulting in the objective:
\[\mathcal{L}_{ELBO} := \min_{\theta, \phi} \mathbb{E}_{q_{\theta}(z|x)} \left[ \frac{1}{2\sigma^2} ||x - \mu_{\phi}(z)||^2 \right] + D_{KL}(q_{\theta}(z|x) || p_{prior}(z))\]

Note: The KL term has a closed form solution for Gaussians.

The Reparameterization Trick

The only remaining issue is the fact that the sampling step $z \sim q_{\theta}(z|x)$ random sample does not allow backpropagation. Thus, we rewrite it as:
\[z = \mu_{\theta}(x) + \sigma_{\theta}(x) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
Generalizing the VAE framework, we ask: what if we stack many latent layers?

Hierarchical VAE Structure:
Encoder Path: $x \xrightarrow{\text{encode}} z_1 \xrightarrow{\text{encode}} z_2 \dots \xrightarrow{\text{encode}} z_L$
Decoder Path: $x \xleftarrow{\text{decode}} z_1 \xleftarrow{\text{decode}} z_2 \dots \xleftarrow{\text{decode}} z_L$

In this setup, each layer captures structure at a different scale.

However, training becomes unstable with deeper nets in a flat VAE; one layer is often not enough.

Denoising Diffusion Probabilistic Models - DDPMs

DDPMs solve the instability of hierarchical models by making the encoder fixed as a predetermined noise schedule.
Forward Process - Fixed Encoder:
\[x_0 \xrightarrow{+ \text{noise}_1} x_1 \xrightarrow{+ \text{noise}_2} \dots \xrightarrow{+ \text{noise}_L} x_L \sim \mathcal{N}(0, I)\]

The model only learns the reverse step - a decoder that denoises step by step.

Because the encoder is fixed, there is no joint instability.

Since each denoising step is a small, manageable task, the model learns reliably.

Crucially, the training objective is still the same ELBO—it just decomposes into a sum of simple denoising problems at each noise level.

Mathematical Formulation of DDPMs

In the DDPM framework, the encoder or forward process is fixed. We want to learn the reverse transition: given a noisy step at $i$, recover the slightly less noisy sample at step $i-1$.

The Forward Transition:
\[p(x_i | x_{i-1}) = \mathcal{N}(x_i; \sqrt{\alpha_i} x_{i-1}, \beta_i I) \text{ where } \alpha_i = \sqrt{1 - \beta_i} \mapsto \text{noising schedule}\]
The Reverse Transition via Bayes' Theorem:
\[p(x_{i-1} | x_i) = p(x_i | x_{i-1}) \cdot \frac{p(x_{i-1})}{p(x_i)}\]
Where $p(x_i) = \int p(x_i | x_0) p_{data}(x_0) dx_0$ is intractable.

Resolving Intractability with Noise Schedules:
A central insight in DDPMs resolves this intractability: we condition the reverse transition/generation on a clean data sample $x_0$.
This subtle yet powerful step transforms the intractable kernel into one that is mathematically tractable:
\[p(x_{i-1} | x_i, x_0) = p(x_i | x_{i-1}) \frac{p(x_{i-1} | x_0)}{p(x_i | x_0)}\]

All three terms on the right are now tractable because they are known Gaussians from the fixed forward process.

There are no unknown integrals over $p_{data}$ anywhere.

Key Properties of the Forward Process - Markov Property:
The next state depends only on the current state.
\[p(x_i | x_{i-1}, x_0) = p(x_i | x_{i-1})\]

Gaussian Nature of the Reverse Step

The product of Gaussians is Gaussian , so $p(x_{i-1}|x_i, x_0)$ has a closed form Gaussian expression with computable mean and variance :
\[p(x_{i-1}|x_i, x_0) = \mathcal{N}(x_{i-1}; \tilde{\mu}(x_i, x_0), \tilde{\beta}_i I)\]
Where the mean and variance $\tilde{\beta}_i$ are fully determined by the noising schedule .

Training vs. Inference

During Training: $x_0$ is available because it is a sample from the dataset. Therefore, $p(x_{i-1}|x_i, x_0)$ can be computed exactly and used as the training target for the learned reverse model.
During Inference: $x_0$ is not available; it is what we are trying to generate.
However, the model $p_{\phi}(x_{i-1}|x_i)$ has learned to approximate the marginal reverse transition, which can be viewed as the average of the conditional posterior over all plausible clean images: \[p(x_{i-1}|x_i) = \mathbb{E}_{p(x_0|x_i)} [p(x_{i-1}|x_i, x_0)]\]
So, conditioning on $x_0$ is only needed at training, not for inference. All in all, conditioning gives us a tractable objective and this forms the foundation of DDPMs.

Equivalence of KL Minimization

More specifically, equivalence between marginal and conditional KL minimization holds:
\[\mathbb{E}_{p_i(x_i)} [D_{KL}(p(x_{i-1}|x_i) || p_{\phi}(x_{i-1}|x_i))] = \mathbb{E}_{p(x_0, x_i)} [D_{KL}(p(x_{i-1}|x_i, x_0) || p_{\phi}(x_{i-1}|x_i))] + C\]
i.e., minimizing the KL divergence between marginal distributions is mathematically identical to minimizing the KL divergence between specific conditional distributions $p(x_{i-1}|x_i, x_0)$, which has a convenient closed-form expression :
\[p(x_{i-1}|x_i, x_0) = \mathcal{N}(x_{i-1}; \mu(x_i, x_0, i), \sigma^2(i) I)\]
Where the mean is defined as :
\[\mu(x_i, x_0, i) = \frac{\sqrt{\bar{\alpha}_{i-1}} \beta_i}{1 - \bar{\alpha}_i} x_0 + \frac{\sqrt{\alpha_i} (1 - \bar{\alpha}_{i-1})}{1 - \bar{\alpha}_i} x_i\]

Training Objective and Loss Simplification

So we only need to match the reverse process to the known conditional Gaussian to get our tractable surrogate (RHS):
\[\mathcal{L}_{ELBO} \approx \mathbb{E}_{p(x_0, x_i)} [D_{KL}(p(x_{i-1}|x_i, x_0) || p_{\phi}(x_{i-1}|x_i))] + C\]
Since both $p_{\phi}(x_{i-1}, x_i)$ and $p(x_{i-1}|x_i, x_0)$ are Gaussians with the same variance, minimizing their KL divergence reduces to simply matching their means.

Diffusion Loss Function:
\[\mathcal{L}_{diffusion}(x_0; \phi) = \sum_{i=1}^{T} \mathbb{E}_{p(x_i|x_0)} \left[ \frac{1}{2 \sigma^2(i)} ||\mu_{\phi}(x_i, i) - \tilde{\mu}(x_i, x_0, i)||^2 \right]\]
This is a sum of simple mean squares error losses, one per timestep.

$\epsilon$-Prediction Reparameterization

The mean matching loss can be simplified further. Recall the perturbation kernel:
\[x_i = \sqrt{\bar{\alpha}_i} x_0 + \sqrt{1 - \bar{\alpha}_i} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
Rewriting this for $x_0$:
\[x_0 = \frac{x_i - \sqrt{1 - \bar{\alpha}_i} \cdot \epsilon}{\sqrt{\bar{\alpha}_i}}\]
Substituting this into the closed-form expression for the target mean, the target mean can be rewritten in terms of the noise that was added:
\[\tilde{\mu}(x_i, x_0, i) = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon \right)\]
This motivates parameterizing the model mean using a neural network $\epsilon_{\phi}(x_i, i)$ that directly predicts the noise:
\[\mu_{\phi}(x_i, i) = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon_{\phi}(x_i, i) \right)\]
This noise prediction by the neural net is just the noise prediction independent of the unknown $x_0$ at inference.

Equivalence of Noise Prediction

Matching the model mean to the target mean is then equivalent to matching predicted noise to true noise:
\[||\mu_{\phi}(x_i, i) - \tilde{\mu}(x_i, x_0, i)||^2 \propto ||\epsilon_{\phi}(x_i, i) - \epsilon||^2\]

Matching Gaussian means is equivalent to predicting the noise.

The model does not need to predict the full distribution, just the noise that was added.

Sampling/Generation

Generation starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and iterates the learned reverse steps:
\[x_{i-1} = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon_{\phi}(x_i, i) \right) + \sigma_i z, \quad z \sim \mathcal{N}(0, I)\]
Stochasticity emerges as we repeat until $x_0$ becomes the generated sample.