DDPM

MathJax example

A succinct summary of DDPM:

Process	Direction	State
Forward	Data $\rightarrow$ Fixed Encoding	Pure noise
Reverse	Pure noise $\rightarrow$ Decoding with Learned NN	Data

The forward process:
Defines a Markov chain that progressively adds Gaussian noise.
\[q(x_{t}|x_{t-1}) = \mathcal{N}(x_{t}; \sqrt{1-\beta_{t}}x_{t-1}, \beta_{t}I); q(x_{1:T}|x_{0}) = \prod_{t=1}^{T}q(x_{t}|x_{t-1})\]
Where,
$t$ runs from 1 to $T$ timesteps.
$\beta$ is the noise variance at step $t$ chosen from a schedule.
$x_{0}$ is original clean data point.
$x_{T}$ is pure Gaussian noise.
At each step, the signal is slightly attenuated by $\sqrt{1-\beta_{t}}$. A Gaussian noise $\sqrt{\beta_{t}}$ is injected.
Additionally, we define a reparameterization trick:
$\alpha_{t} = \mathbf{1} - \beta_{\mathbf{t}}$
$\bar{\alpha}_{t} = \prod_{j=1}^{t} \alpha_{j}$
Where $\bar{\alpha}_{t}$ tells us how much of the original signal survives after $t$ steps of noising. $\mu = \sqrt{1-\beta_{t}}x_{t-1}$ and $\sigma^{2} = \beta_{t}$.
Given a standard Gaussian $\sim \mathcal{N}(0, I)$ gives $\mathcal{N}(\mu, \sigma^{2})$.
\[x_{t} = \sqrt{1-\beta_{t}}x_{t-1} + \sqrt{\beta_{t}}\epsilon_{t-1}, \epsilon_{t-1} \sim \mathcal{N}(0, I)\]
A deterministic version which can be backpropagated through - compared to $q(x_{t}|x_{t-1})$ which is a stochastic distribution—we can't backpropagate through random sampling.
Given:
\[x_{t} = \sqrt{\alpha_{t}}x_{t-1} + \sqrt{1-\alpha_{t}}\epsilon_{t-1}\]
\[x_{t-1} = \sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t-1}}\epsilon_{t-2}\]
Substituting:
\[x_{t} = \sqrt{\alpha_{t}}(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t-1}}\epsilon_{t-2}) + \sqrt{1-\alpha_{t}}\epsilon_{t-1}\]
The last two terms in $x_t$ are gaussian and their sum is also Gaussian. $\mathcal{N}(0, \sigma_1^2 I) + \mathcal{N}(0, \sigma_2^2 I)$.
$\alpha_1 \epsilon_1 + \alpha_2 \epsilon_2 \sim \mathcal{N}(0, (\sigma_1^2 + \sigma_2^2)I)$.

Combined Variance Derivation
Using $\sigma_1^2 = \alpha_t(1 - \alpha_{t-1})$, combined variance:
\[\sigma_1^2 + \sigma_2^2 = \alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t)\]
\[= \alpha_t - \alpha_t \alpha_{t-1} + 1 - \alpha_t = 1 - \alpha_t \alpha_{t-1}\]
The merged noise is:
\[\sqrt{\alpha_t(1 - \alpha_{t-1})} \epsilon_{t-1} + \sqrt{1 - \alpha_t} \epsilon_{t-1} = \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\epsilon}_{t-1}\]
where $\bar{\epsilon}_{t-1} \sim \mathcal{N}(0, I)$.
\[x_t = \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\epsilon}_{t-1}\]

Closed Form Expression

Recall that $x_t$ in terms of $x_{t-1}$:
\[x_t = \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} \epsilon\]
We just saw that in terms of $x_{t-2}$:
\[x_t = \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\epsilon}_{t-1}\]
We can continue unrolling; we will arrive at:
\[x_t = \sqrt{\alpha_t \alpha_{t-1} \dots \alpha_1} x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \dots \alpha_1} \epsilon\]
Since $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$:
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon\]
where $\epsilon \sim \mathcal{N}(0, I)$.
This closed form expression lets us jump to any timestep without simulating the intermediate time steps in (1).

def q_sample(self, x0, t, noise=None):
    sqrt_ab = self.gather(self.sqrt_alpha_bars, t, x0.shape)
    sqrt_1mab = self.gather(self.sqrt_one_minus_alpha_bars, t, x0.shape)
    return sqrt_ab * x0 + sqrt_1mab * noise

The Reverse Process

We can reverse the forward process and sample from $q(x_{t-1}|x_{t})$. By doing so, we will be able to recreate the true sample from Gaussian input noise $x_{T} \sim \mathcal{N}(0, I)$. However, the true reverse process $q(x_{t-1}|x_{t})$ is intractable — it requires knowing the full data distribution.

The Tractable Posterior

As discussed before, when conditioned on $x_0$, the posterior is tractable:
\[q(x_{t-1}|x_{t}, x_{0}) = \mathbf{\mathcal{N}}(x_{t-1}; \tilde{\mu}(x_{t}, x_{0}), \tilde{\beta}_{t}I)\]
Using Bayes' Rule:
\[q(x_{t-1}|x_{t}, x_{0}) = \frac{q(x_{t}|x_{t-1}, x_{0}) q(x_{t-1}|x_{0})}{q(x_{t}|x_{0})}\]
Expanding the Gaussian densities:
\[\mathbf{q} \propto \exp \left[ -\frac{1}{2} \left( \frac{x_{t}^2 - 2\sqrt{\alpha_{t}}x_{t}x_{t-1} + \alpha_{t}x_{t-1}^2}{\beta_{t}} + \frac{x_{t-1}^2 - 2\sqrt{\bar{\alpha}_{t-1}}x_{t-1}x_{0} + \bar{\alpha}_{t-1}x_{0}^2}{1 - \bar{\alpha}_{t-1}} - \frac{(x_{t} - \sqrt{\bar{\alpha}_{t}}x_{0})^2}{1 - \bar{\alpha}_{t}} \right) \right]\]
Grouped by $x_{t-1}$:
\[= \exp \left[ -\frac{1}{2} \left( \left\{ \frac{\alpha_{t}}{\beta_{t}} + \frac{1}{1 - \bar{\alpha}_{t-1}} \right\} x_{t-1}^2 - 2 \left( \frac{\sqrt{\alpha_{t}}}{\beta_{t}}x_{t} + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}x_{0} \right) x_{t-1} + C(x_{t}, x_{0}) \right) \right]\]

Identifying Mean and Variance

Given the standard Gaussian definition $N(x) = \exp \left[ -\frac{1}{2\sigma^2}(x - \mu)^2 \right]$:

This matches the form $-\frac{A}{2}x^2 + Bx - C$.

Where $A = \frac{1}{\sigma^2} \implies \sigma^2 = \frac{1}{A}$.

The mean is $\mu = B \sigma^2 = \frac{B}{A}$.

Solving for variance (\tilde{\beta}_{t}):

\[\tilde{\beta}_{t} = 1 / \left( \frac{\alpha_{t}}{\beta_{t}} + \frac{1}{1 - \bar{\alpha}_{t-1}} \right)\]
\[= 1 / \left( \frac{\alpha_{t}(1 - \bar{\alpha}_{t-1}) + \beta_{t}}{\beta_{t}(1 - \bar{\alpha}_{t-1})} \right)\]
\[\tilde{\beta}_{t} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \beta_{t}\]

# From self.posterior_var implementation
self.posterior_var = self.betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)

Note: This is the theoretically correct variance for the reverse process when $x_0$ is known. In inference, the model approximates $x_0$ via noise prediction, serving as a fixed lower bound.

Derivation of the Posterior Mean (\tilde{\mu})

Similarly, since $\mu = \frac{B}{A}$:
\[\tilde{\mu}_{t}(x_{t}, x_{0}) := \frac{\left( \frac{\sqrt{\alpha_{t}}}{\beta_{t}}x_{t} + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}x_{0} \right)}{\left( \frac{\alpha_{t}}{\beta_{t}} + \frac{1}{1 - \bar{\alpha}_{t-1}} \right)}\]
\[= \left( \frac{\sqrt{\alpha_{t}}}{\beta_{t}}x_{t} + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}x_{0} \right) \left( \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_{t}} \cdot \beta_{t} \right)\]
\[= \frac{\sqrt{\alpha_{t}}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_{t}}x_{t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1 - \bar{\alpha}_{t}}x_{0}\]

Parameterization via Noise Prediction

Recall the forward process $x_t$ in terms of $x_0$:
\[x_{t} = \sqrt{\bar{\alpha}_{t}}x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\epsilon_{t}, \quad \epsilon \sim \mathcal{N}(0, I)\]
Solving for $x_{0}$:
\[x_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}}(x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\epsilon_{t})\]
Ho et al. showed it is better to parameterize the model to predict noise $\epsilon_{\theta}$ rather than $x_{0}$. Substituting the expression for $x_{0}$ into the posterior mean:
\[\tilde{\mu}_{t}(x_{t}, \mathbf{x_{0}}) = \frac{\sqrt{\alpha_{t}}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_{t}}x_{t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1 - \bar{\alpha}_{t}} \left( \frac{1}{\sqrt{\bar{\alpha}_{t}}}(x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\epsilon_{\theta}) \right)\]
This simplifies to the parameterized mean:
\[\tilde{\mu}_{\theta}(x_{t}, t) = \frac{1}{\sqrt{\alpha_{t}}} \left( x_{t} - \frac{1 - \alpha_{t}}{\sqrt{1 - \bar{\alpha}_{t}}} \epsilon_{\theta}(x_{t}, t) \right)\]

Sampling

The model predicts the noise $\epsilon_{\theta}$ which is available at train time.

def compute_mu_theta(self, xt, t_scalar):
    # eps_theta is parameterized by model
    eps_theta = self.model(xt, t_scalar) 
    
    mu_theta = (1.0 / torch.sqrt(alpha_t)) * (xt - (beta_t / torch.sqrt(1.0 - alpha_bar_t)) * eps_theta) 
    return mu_theta, eps_theta

To get $x_{t-1}$ from $x_t$:
\[x_{t-1} \sim \mathcal{N} \left( x_{t-1}; \frac{1}{\sqrt{\alpha_{t}}} \left( x_{t} - \frac{1 - \alpha_{t}}{\sqrt{1 - \bar{\alpha}_{t}}} \epsilon_{\theta}(x_{t}, t) \right), \Sigma_{\theta}(x_{t}, t) \right)\]
We can sample from the above distribution:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_{t}}} \left( x_{t} - \frac{1 - \alpha_{t}}{\sqrt{1 - \bar{\alpha}_{t}}} \epsilon_{\theta}(x_{t}, t) \right) + \sigma_{t}z\]
Where $z \sim \mathcal{N}(0, I)$.

def p_sample(self, xt, t_scalar):
    mu_theta, eps_theta = self.compute_mu_theta(xt, t_scalar) 
    
    # Sample noise z
    z = torch.randn_like(xt) if t_scalar > 0 else torch.zeros_like(xt) 
    
    x_prev = mu_theta + torch.sqrt(posterior_vars) * z 
    return x_prev, mu_theta, eps_theta 

def sample(self, n_samples=2000):
    x = torch.randn(n_samples, 2) 
    for t in reversed(range(T)): 
        x, mu_theta, eps_theta = self.p_sample(x, t) 
    return x

Loss Term

Recall our tractable surrogate in equation given conditioning, as equivalence in marginal and conditional KL minimization holds:
\[\mathbb{E}_{q(x_0)} \mathbb{E}_{q(x_{1:T}|x_0)} [D_{KL}(q(x_{t-1}|x_t, x_0) || p_{\theta}(x_{t-1}|x_t))] + C\]
Since both $p_{\theta}(x_{t-1}, x_t)$ and $q(x_{t-1}|x_t, x_0)$ are Gaussians with the same variance, minimizing their KL divergence reduces to simply matching their means:
\[L_t = \mathbb{E}_{x_0, \epsilon, t} \left[ \frac{1}{2 ||\Sigma_\theta||_t^2} ||\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)||^2 \right]\]
From an earlier derivation:
\[L_t = \mathbb{E}_{x_0, \epsilon, t} \left[ \frac{1}{2 ||\Sigma_\theta||_t^2} \left|\left| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_t \right) - \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t) \right) \right|\right|^2 \right]\]
\[= \mathbb{E}_{x_0, \epsilon, t} \left[ \frac{(1-\alpha_t)^2}{2\alpha_t(1-\bar{\alpha}_t)||\Sigma_\theta||_t^2} ||\epsilon_t - \epsilon_\theta(x_t, t)||^2 \right]\]
Ho et al. found that training works better by ignoring the weighting term:
\[L_t^{simple} = \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [||\epsilon_t - \epsilon_\theta(x_t, t)||^2]\]
Parameterized by a neural net:
\[L_t = \mathbb{E}_{t \sim [1, T], x_0, \epsilon_t} [||\epsilon_t - \epsilon_\theta(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon_t, t)||^2]\]

def training_loss(self, x0):
    # Sample random timestep
    t = torch.randint(0, self.schedule.T, (x0.shape[0], 1)) 
    
    # Sample noise
    eps = torch.randn_like(x0) 
    
    # Corrupt x0 to xt
    xt = self.schedule.q_sample(x0, t, eps) 
    
    # Predict the noise
    eps_pred = self.model(xt, t) 
    
    loss = F.mse_loss(eps_pred, eps) 
    return loss

Algorithms
Training
repeat
$x_{0} \sim q(x_{0})$
$t \sim \text{Uniform}(\{1, \dots, T\})$
$\epsilon \sim \mathcal{N}(0, I)$
Take gradient descent step on: \[\nabla_{\mathbf{\theta}} || \epsilon - \epsilon_{\mathbf{\theta}}(\sqrt{\bar{\alpha}_{t}}x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\epsilon, t) ||^{2}\]
until converged

Sampling
$x_{T} \sim \mathcal{N}(0, I)$
for $t = T, \dots, 1$ do
$z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$
\[x_{t-1} = \frac{1}{\sqrt{\alpha_{t}}} \left( x_{t} - \frac{1 - \alpha_{t}}{\sqrt{1 - \bar{\alpha}_{t}}} \epsilon_{\theta}(x_{t}, t) \right) + \mathbf{\sigma}_{t}z\]
end for
return $x_{0}$