Diffusion Background

MathJax example


Note: Following is a compact summary from my own notes I took while reading Diffusion Models: A Comprehensive Survey of Methods and Applications https://arxiv.org/abs/2209.00796

Goal of DGM - Deep Generative Modeling:
$p_{\phi^{*}}(x) \approx p_{data}(x)$
i.e. approximate a target data distribution and generate new data points using sampling methods like Monte Carlo Sampling.




Training of DGM:
$\phi^{*} \in \arg \min_{\Phi} D(p_{data}, p_{\Phi})$
A standard choice of $D$ is KL divergence:
\[D_{KL}(p_{data} || p_{\phi}) := \int p_{data}(x) \log \frac{p_{data}(x)}{p_{\phi}(x)} dx\]
Equivalence to minimization:
\[= -E_{x \sim p_{data}} [\log p_{\phi}(x)] + \mathcal{H}(p_{data})\]
Where $\mathcal{H}(p_{data})$ is the entropy of the data distribution. Minimizing KL $\Leftrightarrow$ Maximizing MLE
\[\min_{\Phi} D_{KL}(p_{data} || p_{\Phi}) \Leftrightarrow \max_{\Phi} E_{x \sim p_{data}} [\log p_{\phi}(x)]\]
This population expectation can be replaced by its Monte Carlo Estimate from iid samples $\{x^{(i)}\}_{i=1}^{N} \sim p_{data}(x)$:
\[\mathcal{L}_{MLE}(\phi) := -\frac{1}{N} \sum_{i=1}^{N} \log p_{\phi}(x^{(i)})\]




Challenges in Modeling Distribution:
For $p_{\phi}$ to be a valid probability density function, it must satisfy:
1. Non-negativity: $p_{\phi}(x) \ge 0, \forall x$ in the domain
2. Normalization: Integral over the entire domain must equal one, i.e., $\int p(x) dx = 1$

Ensuring Non-Negativity:
This can be done by applying a positive function to the raw output of the neural network $E_{\phi}(x)$:
\[\tilde{p}_{\phi}(x) = \exp(E_{\phi}(x))\]
Enforcing Normalization:
To create a valid probability density, we must divide the function $\tilde{p}_{\phi}(x)$ by its integral over the entire space:
\[p_{\phi}(x) = \frac{\tilde{p}_{\phi}(x)}{\int \tilde{p}_{\phi}(x') dx'} = \frac{\exp(E_{\phi}(x))}{\int \exp(E_{\phi}(x')) dx'}\]
Where:
\[\int \exp(E_{\phi}(x')) dx' = Z(\phi)\]
$Z(\phi)$ is the normalizing constant. This integral is often intractable.




Families of Deep Generative Models

The intractability of the normalizing constant is a central problem that motivates the development of many different families of deep generative models designed to circumvent or reduce the computational cost of evaluating $Z(\phi)$.

Autoregressive Models

  • Deep AR models factorize the joint data distribution $p_{data}$ into a product of conditional probabilities using the chain rule of probability:
  • \[p_{data}(x) = \prod_{i=1}^{D} p_{\phi}(x_i | x_{<i})\]
  • Each conditional is parameterized by a neural network e.g., Transformer and is normalized by design e.g., softmax. Thus, global normalization is trivial.
  • Training is done by minimizing negative log-likelihood.
  • While they achieve strong density estimation, their sequential nature restricts sampling speed, and fixed ordering is restrictive.
  • Variational Autoencoders - VAE

  • VAEs extend classical autoencoders by introducing latent variables $z$ that capture hidden structure in data via a probabilistic view.
  • They learn both an encoder, $q_{\theta}(z|x)$, which approximates the unknown distribution of latent variables given the data, and a decoder $p_{\phi}(x|z)$ which reconstructs data from these latent variables.
  • Training is done by maximizing a tractable surrogate to the true log-likelihood called ELBO - Evidence Lower Bound:
  • \[\mathcal{L}_{ELBO}(\theta, \phi; x) = E_{q_{\theta}(z|x)}[\log p_{\phi}(x|z)] - D_{KL}(q_{\theta}(z|x) || p_{prior}(z))\]
  • The $p_{prior}(z)$ is typically Gaussian.
  • Challenges like sample sharpness and the tendency of the encoder to ignore latent variables.
  • Energy-Based Models - EBM

  • Define a probability distribution through an energy function $E_{\phi}(x)$ that assigns lower energy to more probable data points.
  • \[p_{\phi}(x) := \frac{1}{Z(\phi)} \exp(-E_{\phi}(x)) \text{ where } Z(\phi) = \int \exp(-E_{\phi}(x)) dx\]
  • Training EBMs requires maximizing the log-likelihood of data but faces challenges from the partition function.
  • Diffusion models circumvent this by generating data from the gradient of the log density, which does not depend on the normalizing constant.

    Normalizing Flows:

  • These aim to learn a sequence of bijective mappings between a simple latent distribution $z$ and a complex data distribution $x$.
  • Defined as an invertible operator, these models leverage the change of variables formula for densities, enabling MLE training:
    \[\log p_{\phi}(x) = \log p(z) + \log \left| \det \frac{\partial f_{\phi}^{-1}(x)}{\partial x} \right|\]
  • The normalization constant is absorbed analytically via the change of variables formula, making likelihood computation exact and tractable.
  • However, they typically impose restrictive architectural constraints to ensure bijectivity.



  • Generative Adversarial Networks - GANs:

  • GANs do not define an explicit density function and therefore bypass likelihood estimation entirely.
  • They focus on generating samples that closely mimic the data distribution while modeling it implicitly.
  • Training Objective:
    \[\min_{G} \max_{D} \left[ E_{x \sim p_{data}(x)} [\log D_{\eta}(x)] + E_{z \sim p_{prior}(z)} [\log(1 - D_{\eta}(G_{\phi}(z)))] \right]\]
  • Although capable of generating high-quality data, the min-max training process is notoriously unstable.



  • Summary of Generative Strategies:
    All these classical families of Deep Generative Models illustrate complementary strategies for modeling complex distributions. Diffusion methods inherit ideas from several of these perspectives:
    To VAEs: They connect through variational training objectives.
    To EBMs: They connect through score matching that learns gradients of the log-density.
    To Normalizing Flows: They connect through continuous-time transformations.

    Conceptual Lineage:

    VAEs $\rightarrow$ Denoising Diffusion Probabilistic Models - DDPMs
    EBMs / Score Matching $\rightarrow$ Noise Conditional Score Networks - NCSN
    Normalizing Flows $\rightarrow$ Flow Matching - continuous-time formulation Score SDE




    VAEs to DDPMs

  • VAEs represent data w/ latent variables and are trained by maximizing a tractable lower bound on the log likelihood.
  • In this setting, a learned encoder maps observations to latents and a learned decoder maps latents to observations, closing the modeling loop.
  • Hierarchical variants stack several latent layers to capture structure at multiple scales.
  • DDPMs: instead of jointly training both encoder and decoder, the encoder is fixed. The encoder acts as a forward noising process that gradually maps data to noise, and training learns a decoder that reverses this path in successive denoising steps. VAEs, hierarchical VAEs, & DDPMs all optimize a likelihood surrogate defined by a variational bound.

  • Mathematical Foundation
    In VAEs, we assume each observation $x$ is generated from a latent variable $z \sim p(z) = \mathcal{N}(0, I) \xrightarrow{\text{decoder}} p_{\phi}(x|z)$.
    Ideally we'd train w/ maximizing log likelihood of data - maximizing the probability the model assigns to real data:
    \[\log p_{\phi}(x) = \log \int p_{\phi}(x|z) p(z) dz\]
    The integral marginalizes out $z$ - it asks "summing over every possible latent that could have produced $x$, how probable is it overall?". But this integral is intractable.

    Approximating the Integral
    So can we approximate this integral? A standard way of approximating integrals is importance sampling — instead of integrating over all $z$, sample from some distribution and average:
    \[\log p_{\phi}(x) = \log \left( \mathbb{E}_{z \sim q(z)} \left[ \frac{p_{\phi}(x|z) p(z)}{q(z)} \right] \right)\]
  • Ideally $q(z)$ should concentrate on the $z$ values that are likely to have produced the particular $x$ in our data.
  • In other words — the posterior $p_{\phi}(z|x)$.
  • But, $p_{\phi}(z|x) = \frac{p_{\phi}(x|z) p(z)}{p_{\phi}(x)}$ where the denominator is intractable for the same reason the likelihood is intractable.
  • A way out of this issue is to approximate the true posterior w/ a parameterized neural network — the encoder $q_{\theta}(z|x)$ which is the variational step, and maximize the ELBO instead of likelihood (tractably).

    Defining the Training Objective

    $q_{\theta}(z|x) \approx p_{\phi}(z|x)$ $= \mathcal{N}(\mu_{\theta}(x), \sigma_{\theta}^2(x)I)$
    We are now in a position to define a computable training objective. While we cannot directly optimize $\log p_{\phi}(x)$, we can maximize a lower bound on it:
    \[\log p_{\phi}(x) = \log \int p_{\phi}(x, z) dz\]
    \[= \log \int q_{\theta}(z|x) \frac{p_{\phi}(x, z)}{q_{\theta}(z|x)} dz\]
    \[= \log \left( \mathbb{E}_{z \sim q_{\theta}(z|x)} \left[ \frac{p_{\phi}(x, z)}{q_{\theta}(z|x)} \right] \right)\]
    \[\ge \mathcal{L}_{ELBO}(\theta, \phi; x) \text{ via Jensen's inequality}\]
    Where the ELBO is defined as:
    \[\mathcal{L}_{ELBO} = \mathbb{E}_{z \sim q_{\theta}(z|x)} [\log p_{\phi}(x|z)] - D_{KL}(q_{\theta}(z|x) || p(z))\]
    Relationship to Likelihood:
    \[\log p_{\phi}(x) - \mathcal{L}_{ELBO} = D_{KL}(q_{\theta}(z|x) || p_{\phi}(z|x)) \ge 0\]
    Thus, maximizing the ELBO pushes up the likelihood and improves the posterior approximation at the same time.




    Computability of ELBO

    And $\mathcal{L}_{ELBO}$ is computable. Since the decoder is modeled as a Gaussian distribution w/ fixed variance:
    \[p_{\phi}(x|z) := \mathcal{N}(x; \mu_{\phi}(z), \sigma^2 I)\]
    The reconstruction term becomes:
    \[\mathbb{E}_{q_{\theta}(z|x)} [\log p_{\phi}(x|z)] = -\frac{1}{2\sigma^2} \mathbb{E}_{q_{\theta}(z|x)} [||x - \mu_{\phi}(z)||^2] + C\]
    Resulting in the objective:
    \[\mathcal{L}_{ELBO} := \min_{\theta, \phi} \mathbb{E}_{q_{\theta}(z|x)} \left[ \frac{1}{2\sigma^2} ||x - \mu_{\phi}(z)||^2 \right] + D_{KL}(q_{\theta}(z|x) || p_{prior}(z))\]

  • Note: The KL term has a closed form solution for Gaussians.



  • The Reparameterization Trick

    The only remaining issue is the fact that the sampling step $z \sim q_{\theta}(z|x)$ random sample does not allow backpropagation. Thus, we rewrite it as:
    \[z = \mu_{\theta}(x) + \sigma_{\theta}(x) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
    Generalizing the VAE framework, we ask: what if we stack many latent layers?

    Hierarchical VAE Structure:
    Encoder Path: $x \xrightarrow{\text{encode}} z_1 \xrightarrow{\text{encode}} z_2 \dots \xrightarrow{\text{encode}} z_L$
    Decoder Path: $x \xleftarrow{\text{decode}} z_1 \xleftarrow{\text{decode}} z_2 \dots \xleftarrow{\text{decode}} z_L$

  • In this setup, each layer captures structure at a different scale.
  • However, training becomes unstable with deeper nets in a flat VAE; one layer is often not enough.



  • Denoising Diffusion Probabilistic Models - DDPMs

    DDPMs solve the instability of hierarchical models by making the encoder fixed as a predetermined noise schedule.
    Forward Process - Fixed Encoder:
    \[x_0 \xrightarrow{+ \text{noise}_1} x_1 \xrightarrow{+ \text{noise}_2} \dots \xrightarrow{+ \text{noise}_L} x_L \sim \mathcal{N}(0, I)\]

  • The model only learns the reverse step - a decoder that denoises step by step.
  • Because the encoder is fixed, there is no joint instability.
  • Since each denoising step is a small, manageable task, the model learns reliably.
  • Crucially, the training objective is still the same ELBO—it just decomposes into a sum of simple denoising problems at each noise level.



  • Mathematical Formulation of DDPMs

    In the DDPM framework, the encoder or forward process is fixed. We want to learn the reverse transition: given a noisy step at $i$, recover the slightly less noisy sample at step $i-1$.

    The Forward Transition:
    \[p(x_i | x_{i-1}) = \mathcal{N}(x_i; \sqrt{\alpha_i} x_{i-1}, \beta_i I) \text{ where } \alpha_i = \sqrt{1 - \beta_i} \mapsto \text{noising schedule}\]
    The Reverse Transition via Bayes' Theorem:
    \[p(x_{i-1} | x_i) = p(x_i | x_{i-1}) \cdot \frac{p(x_{i-1})}{p(x_i)}\]
    Where $p(x_i) = \int p(x_i | x_0) p_{data}(x_0) dx_0$ is intractable.

    Resolving Intractability with Noise Schedules:
    A central insight in DDPMs resolves this intractability: we condition the reverse transition/generation on a clean data sample $x_0$.
    This subtle yet powerful step transforms the intractable kernel into one that is mathematically tractable:
    \[p(x_{i-1} | x_i, x_0) = p(x_i | x_{i-1}) \frac{p(x_{i-1} | x_0)}{p(x_i | x_0)}\]

  • All three terms on the right are now tractable because they are known Gaussians from the fixed forward process.
  • There are no unknown integrals over $p_{data}$ anywhere.

  • Key Properties of the Forward Process - Markov Property:
    The next state depends only on the current state.
    \[p(x_i | x_{i-1}, x_0) = p(x_i | x_{i-1})\]

    Gaussian Nature of the Reverse Step

    The product of Gaussians is Gaussian , so $p(x_{i-1}|x_i, x_0)$ has a closed form Gaussian expression with computable mean and variance :
    \[p(x_{i-1}|x_i, x_0) = \mathcal{N}(x_{i-1}; \tilde{\mu}(x_i, x_0), \tilde{\beta}_i I)\]
    Where the mean and variance $\tilde{\beta}_i$ are fully determined by the noising schedule .

    Training vs. Inference

    During Training: $x_0$ is available because it is a sample from the dataset. Therefore, $p(x_{i-1}|x_i, x_0)$ can be computed exactly and used as the training target for the learned reverse model.
    During Inference: $x_0$ is not available; it is what we are trying to generate.
    However, the model $p_{\phi}(x_{i-1}|x_i)$ has learned to approximate the marginal reverse transition, which can be viewed as the average of the conditional posterior over all plausible clean images: \[p(x_{i-1}|x_i) = \mathbb{E}_{p(x_0|x_i)} [p(x_{i-1}|x_i, x_0)]\]
    So, conditioning on $x_0$ is only needed at training, not for inference. All in all, conditioning gives us a tractable objective and this forms the foundation of DDPMs.




    Equivalence of KL Minimization

    More specifically, equivalence between marginal and conditional KL minimization holds:
    \[\mathbb{E}_{p_i(x_i)} [D_{KL}(p(x_{i-1}|x_i) || p_{\phi}(x_{i-1}|x_i))] = \mathbb{E}_{p(x_0, x_i)} [D_{KL}(p(x_{i-1}|x_i, x_0) || p_{\phi}(x_{i-1}|x_i))] + C\]
    i.e., minimizing the KL divergence between marginal distributions is mathematically identical to minimizing the KL divergence between specific conditional distributions $p(x_{i-1}|x_i, x_0)$, which has a convenient closed-form expression :
    \[p(x_{i-1}|x_i, x_0) = \mathcal{N}(x_{i-1}; \mu(x_i, x_0, i), \sigma^2(i) I)\]
    Where the mean is defined as :
    \[\mu(x_i, x_0, i) = \frac{\sqrt{\bar{\alpha}_{i-1}} \beta_i}{1 - \bar{\alpha}_i} x_0 + \frac{\sqrt{\alpha_i} (1 - \bar{\alpha}_{i-1})}{1 - \bar{\alpha}_i} x_i\]

    Training Objective and Loss Simplification

    So we only need to match the reverse process to the known conditional Gaussian to get our tractable surrogate (RHS):
    \[\mathcal{L}_{ELBO} \approx \mathbb{E}_{p(x_0, x_i)} [D_{KL}(p(x_{i-1}|x_i, x_0) || p_{\phi}(x_{i-1}|x_i))] + C\]
    Since both $p_{\phi}(x_{i-1}, x_i)$ and $p(x_{i-1}|x_i, x_0)$ are Gaussians with the same variance, minimizing their KL divergence reduces to simply matching their means.

    Diffusion Loss Function:
    \[\mathcal{L}_{diffusion}(x_0; \phi) = \sum_{i=1}^{T} \mathbb{E}_{p(x_i|x_0)} \left[ \frac{1}{2 \sigma^2(i)} ||\mu_{\phi}(x_i, i) - \tilde{\mu}(x_i, x_0, i)||^2 \right]\]
    This is a sum of simple mean squares error losses, one per timestep.




    $\epsilon$-Prediction Reparameterization

    The mean matching loss can be simplified further. Recall the perturbation kernel:
    \[x_i = \sqrt{\bar{\alpha}_i} x_0 + \sqrt{1 - \bar{\alpha}_i} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
    Rewriting this for $x_0$:
    \[x_0 = \frac{x_i - \sqrt{1 - \bar{\alpha}_i} \cdot \epsilon}{\sqrt{\bar{\alpha}_i}}\]
    Substituting this into the closed-form expression for the target mean, the target mean can be rewritten in terms of the noise that was added:
    \[\tilde{\mu}(x_i, x_0, i) = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon \right)\]
    This motivates parameterizing the model mean using a neural network $\epsilon_{\phi}(x_i, i)$ that directly predicts the noise:
    \[\mu_{\phi}(x_i, i) = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon_{\phi}(x_i, i) \right)\]
    This noise prediction by the neural net is just the noise prediction independent of the unknown $x_0$ at inference.




    Equivalence of Noise Prediction

    Matching the model mean to the target mean is then equivalent to matching predicted noise to true noise:
    \[||\mu_{\phi}(x_i, i) - \tilde{\mu}(x_i, x_0, i)||^2 \propto ||\epsilon_{\phi}(x_i, i) - \epsilon||^2\]

  • Matching Gaussian means is equivalent to predicting the noise.
  • The model does not need to predict the full distribution, just the noise that was added.



  • Sampling/Generation

    Generation starts from pure noise $x_T \sim \mathcal{N}(0, I)$ and iterates the learned reverse steps:
    \[x_{i-1} = \frac{1}{\sqrt{\alpha_i}} \left( x_i - \frac{1 - \alpha_i}{\sqrt{1 - \bar{\alpha}_i}} \epsilon_{\phi}(x_i, i) \right) + \sigma_i z, \quad z \sim \mathcal{N}(0, I)\]
    Stochasticity emerges as we repeat until $x_0$ becomes the generated sample.