Note: Following is a compact summary from my own notes I took while reading Diffusion Models: A Comprehensive Survey of Methods and Applications https://arxiv.org/abs/2209.00796
Goal of DGM - Deep Generative Modeling:
$p_{\phi^{*}}(x) \approx p_{data}(x)$
i.e. approximate a target data distribution and generate new data points using sampling methods like Monte Carlo Sampling.
Training of DGM:
$\phi^{*} \in \arg \min_{\Phi} D(p_{data}, p_{\Phi})$
A standard choice of $D$ is KL divergence:
\[D_{KL}(p_{data} || p_{\phi}) := \int p_{data}(x) \log \frac{p_{data}(x)}{p_{\phi}(x)} dx\]
Equivalence to minimization:
\[= -E_{x \sim p_{data}} [\log p_{\phi}(x)] + \mathcal{H}(p_{data})\]
Where $\mathcal{H}(p_{data})$ is the entropy of the data distribution.
Minimizing KL $\Leftrightarrow$ Maximizing MLE
\[\min_{\Phi} D_{KL}(p_{data} || p_{\Phi}) \Leftrightarrow \max_{\Phi} E_{x \sim p_{data}} [\log p_{\phi}(x)]\]
This population expectation can be replaced by its Monte Carlo Estimate from iid samples $\{x^{(i)}\}_{i=1}^{N} \sim p_{data}(x)$:
\[\mathcal{L}_{MLE}(\phi) := -\frac{1}{N} \sum_{i=1}^{N} \log p_{\phi}(x^{(i)})\]
Challenges in Modeling Distribution:
For $p_{\phi}$ to be a valid probability density function, it must satisfy:
1. Non-negativity: $p_{\phi}(x) \ge 0, \forall x$ in the domain
2. Normalization: Integral over the entire domain must equal one, i.e., $\int p(x) dx = 1$
Ensuring Non-Negativity:
This can be done by applying a positive function to the raw output of the neural network $E_{\phi}(x)$:
\[\tilde{p}_{\phi}(x) = \exp(E_{\phi}(x))\]
Enforcing Normalization:
To create a valid probability density, we must divide the function $\tilde{p}_{\phi}(x)$ by its integral over the entire space:
\[p_{\phi}(x) = \frac{\tilde{p}_{\phi}(x)}{\int \tilde{p}_{\phi}(x') dx'} = \frac{\exp(E_{\phi}(x))}{\int \exp(E_{\phi}(x')) dx'}\]
Where:
\[\int \exp(E_{\phi}(x')) dx' = Z(\phi)\]
$Z(\phi)$ is the normalizing constant. This integral is often intractable.
Families of Deep Generative Models
The intractability of the normalizing constant is a central problem that motivates the development of many different families of deep generative models designed to circumvent or reduce the computational cost of evaluating $Z(\phi)$.
Autoregressive Models
Variational Autoencoders - VAE
Energy-Based Models - EBM
Normalizing Flows:
Generative Adversarial Networks - GANs:
Summary of Generative Strategies:
All these classical families of Deep Generative Models illustrate complementary strategies for modeling complex distributions. Diffusion methods inherit ideas from several of these perspectives:
To VAEs: They connect through variational training objectives.
To EBMs: They connect through score matching that learns gradients of the log-density.
To Normalizing Flows: They connect through continuous-time transformations.
Conceptual Lineage:
VAEs $\rightarrow$ Denoising Diffusion Probabilistic Models - DDPMs
EBMs / Score Matching $\rightarrow$ Noise Conditional Score Networks - NCSN
Normalizing Flows $\rightarrow$ Flow Matching - continuous-time formulation Score SDE
VAEs to DDPMs
Defining the Training Objective
$q_{\theta}(z|x) \approx p_{\phi}(z|x)$ $= \mathcal{N}(\mu_{\theta}(x), \sigma_{\theta}^2(x)I)$
We are now in a position to define a computable training objective. While we cannot directly optimize $\log p_{\phi}(x)$, we can maximize a lower bound on it:
\[\log p_{\phi}(x) = \log \int p_{\phi}(x, z) dz\]
\[= \log \int q_{\theta}(z|x) \frac{p_{\phi}(x, z)}{q_{\theta}(z|x)} dz\]
\[= \log \left( \mathbb{E}_{z \sim q_{\theta}(z|x)} \left[ \frac{p_{\phi}(x, z)}{q_{\theta}(z|x)} \right] \right)\]
\[\ge \mathcal{L}_{ELBO}(\theta, \phi; x) \text{ via Jensen's inequality}\]
Where the ELBO is defined as:
\[\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\theta}(z|x)} [\log p_{\phi}(x|z)] - D_{KL}(q_{\theta}(z|x) || p(z))\]
Relationship to Likelihood:
\[\log p_{\phi}(x) - \mathcal{L}_{ELBO} = D_{KL}(q_{\theta}(z|x) || p_{\phi}(z|x)) \ge 0\]
Thus, maximizing the ELBO pushes up the likelihood and improves the posterior approximation at the same time.
Computability of ELBO
And $\mathcal{L}_{ELBO}$ is computable. Since the decoder is modeled as a Gaussian distribution w/ fixed variance:
\[p_{\phi}(x|z) := \mathcal{N}(x; \mu_{\phi}(z), \sigma^2 I)\]
The reconstruction term becomes:
\[\mathbb{E}_{q_{\theta}(z|x)} [\log p_{\phi}(x|z)] = -\frac{1}{2\sigma^2} \mathbb{E}_{q_{\theta}(z|x)} [||x - \mu_{\phi}(z)||^2] + C\]
Resulting in the objective:
\[\mathcal{L}_{ELBO} := \min_{\theta, \phi} \mathbb{E}_{q_{\theta}(z|x)} \left[ \frac{1}{2\sigma^2} ||x - \mu_{\phi}(z)||^2 \right] + D_{KL}(q_{\theta}(z|x) || p_{prior}(z))\]
The Reparameterization Trick
The only remaining issue is the fact that the sampling step $z \sim q_{\theta}(z|x)$ random sample does not allow backpropagation. Thus, we rewrite it as:
\[z = \mu_{\theta}(x) + \sigma_{\theta}(x) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
Generalizing the VAE framework, we ask: what if we stack many latent layers?
Hierarchical VAE Structure:
Encoder Path: $x \xrightarrow{\text{encode}} z_1 \xrightarrow{\text{encode}} z_2 \dots \xrightarrow{\text{encode}} z_L$
Decoder Path: $x \xleftarrow{\text{decode}} z_1 \xleftarrow{\text{decode}} z_2 \dots \xleftarrow{\text{decode}} z_L$
Denoising Diffusion Probabilistic Models - DDPMs
DDPMs solve the instability of hierarchical models by making the encoder fixed as a predetermined noise schedule.
Forward Process - Fixed Encoder:
\[x_0 \xrightarrow{+ \text{noise}_1} x_1 \xrightarrow{+ \text{noise}_2} \dots \xrightarrow{+ \text{noise}_L} x_L \sim \mathcal{N}(0, I)\]
Mathematical Formulation of DDPMs
In the DDPM framework, the encoder or forward process is fixed. We want to learn the reverse transition: given a noisy step at $i$, recover the slightly less noisy sample at step $i-1$.
The Forward Transition:
\[p(x_i | x_{i-1}) = \mathcal{N}(x_i; \sqrt{\alpha_i} x_{i-1}, \beta_i I) \text{ where } \alpha_i = \sqrt{1 - \beta_i} \mapsto \text{noising schedule}\]
The Reverse Transition via Bayes' Theorem:
\[p(x_{i-1} | x_i) = p(x_i | x_{i-1}) \cdot \frac{p(x_{i-1})}{p(x_i)}\]
Where $p(x_i) = \int p(x_i | x_0) p_{data}(x_0) dx_0$ is intractable.
Resolving Intractability with Noise Schedules:
A central insight in DDPMs resolves this intractability: we condition the reverse transition/generation on a clean data sample $x_0$.
This subtle yet powerful step transforms the intractable kernel into one that is mathematically tractable:
\[p(x_{i-1} | x_i, x_0) = p(x_i | x_{i-1}) \frac{p(x_{i-1} | x_0)}{p(x_i | x_0)}\]
Gaussian Nature of the Reverse Step
The product of Gaussians is Gaussian , so $p(x_{i-1}|x_i, x_0)$ has a closed form Gaussian expression with computable mean and variance :
\[p(x_{i-1}|x_i, x_0) = \mathcal{N}(x_{i-1}; \tilde{\mu}(x_i, x_0), \tilde{\beta}_i I)\]
Where the mean and variance $\tilde{\beta}_i$ are fully determined by the noising schedule .
Training vs. Inference
During Training: $x_0$ is available because it is a sample from the dataset. Therefore, $p(x_{i-1}|x_i, x_0)$ can be computed exactly and used as the training target for the learned reverse model.
During Inference: $x_0$ is not available; it is what we are trying to generate.
However, the model $p_{\phi}(x_{i-1}|x_i)$ has learned to approximate the marginal reverse transition, which can be viewed as the average of the conditional posterior over all plausible clean images:
\[p(x_{i-1}|x_i) = \mathbb{E}_{p(x_0|x_i)} [p(x_{i-1}|x_i, x_0)]\]
So, conditioning on $x_0$ is only needed at training, not for inference.
All in all, conditioning gives us a tractable objective and this forms the foundation of DDPMs.
Equivalence of KL Minimization
More specifically, equivalence between marginal and conditional KL minimization holds:
\[\mathbb{E}_{p_i(x_i)} [D_{KL}(p(x_{i-1}|x_i) || p_{\phi}(x_{i-1}|x_i))] = \mathbb{E}_{p(x_0, x_i)} [D_{KL}(p(x_{i-1}|x_i, x_0) || p_{\phi}(x_{i-1}|x_i))] + C\]
i.e., minimizing the KL divergence between marginal distributions is mathematically identical to minimizing the KL divergence between specific conditional distributions $p(x_{i-1}|x_i, x_0)$, which has a convenient closed-form expression :
\[p(x_{i-1}|x_i, x_0) = \mathcal{N}(x_{i-1}; \mu(x_i, x_0, i), \sigma^2(i) I)\]
Where the mean is defined as :
\[\mu(x_i, x_0, i) = \frac{\sqrt{\bar{\alpha}_{i-1}} \beta_i}{1 - \bar{\alpha}_i} x_0 + \frac{\sqrt{\alpha_i} (1 - \bar{\alpha}_{i-1})}{1 - \bar{\alpha}_i} x_i\]
Training Objective and Loss Simplification
So we only need to match the reverse process to the known conditional Gaussian to get our tractable surrogate (RHS):
\[\mathcal{L}_{ELBO} \approx \mathbb{E}_{p(x_0, x_i)} [D_{KL}(p(x_{i-1}|x_i, x_0) || p_{\phi}(x_{i-1}|x_i))] + C\]
Since both $p_{\phi}(x_{i-1}, x_i)$ and $p(x_{i-1}|x_i, x_0)$ are Gaussians with the same variance, minimizing their KL divergence reduces to simply matching their means.
Diffusion Loss Function:
\[\mathcal{L}_{diffusion}(x_0; \phi) = \sum_{i=1}^{T} \mathbb{E}_{p(x_i|x_0)} \left[ \frac{1}{2 \sigma^2(i)} ||\mu_{\phi}(x_i, i) - \tilde{\mu}(x_i, x_0, i)||^2 \right]\]
This is a sum of simple mean squares error losses, one per timestep.
$\epsilon$-Prediction Reparameterization
The mean matching loss can be simplified further. Recall the perturbation kernel:
\[x_i = \sqrt{\bar{\alpha}_i} x_0 + \sqrt{1 - \bar{\alpha}_i} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
Rewriting this for $x_0$:
\[x_0 = \frac{x_i - \sqrt{1 - \bar{\alpha}_i} \cdot \epsilon}{\sqrt{\bar{\alpha}_i}}\]
Substituting this into the closed-form expression for the target mean, the target mean can be rewritten in terms of the noise that was added:
\[\tilde{\mu}(x_i, x_0, i) = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon \right)\]
This motivates parameterizing the model mean using a neural network $\epsilon_{\phi}(x_i, i)$ that directly predicts the noise:
\[\mu_{\phi}(x_i, i) = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon_{\phi}(x_i, i) \right)\]
This noise prediction by the neural net is just the noise prediction independent of the unknown $x_0$ at inference.
Equivalence of Noise Prediction
Matching the model mean to the target mean is then equivalent to matching predicted noise to true noise:
\[||\mu_{\phi}(x_i, i) - \tilde{\mu}(x_i, x_0, i)||^2 \propto ||\epsilon_{\phi}(x_i, i) - \epsilon||^2\]
Sampling/Generation
Generation starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and iterates the learned reverse steps:
\[x_{i-1} = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon_{\phi}(x_i, i) \right) + \sigma_i z, \quad z \sim \mathcal{N}(0, I)\]
Stochasticity emerges as we repeat until $x_0$ becomes the generated sample.