In the preceding notes, we covered the extensive mathematical framework behind Denoising Diffusion Probabilistic Models (DDPM). We traced the Markov chains, derived the tractable posterior, and proved why minimizing the Variational Lower Bound simplifies to a noise-prediction objective.
Now we pivot from theoretical derivations to engineering practice. We follow diffusion models through three progressive layers: a 2D Swiss Roll toy model, the scale-up to high-resolution latent text-to-image synthesizers such as Stable Diffusion XL (SD-XL), and finally their adaptation into a highly specialized industrial inspection pipeline for localized defect generation and inpainting.
1. The Toy Model: Swiss Roll 2D Diffusion
Before tackling gigabyte-scale neural networks, we look at the mathematical engine in its simplest, most plottable form: a 2D point cloud. In this setting, the data space represents physical coordinates, allowing us to visually inspect how structure dissolves into noise and how a neural network learns to reconstruct it.
Rather than an arbitrary data pattern, we adopt the classic Swiss Roll dataset, a two-dimensional spiral distribution. This spiral provides a non-linear manifold that is challenging for standard density estimators but perfectly suited for diffusion.
Data Generation & Normalization
The dataset consists of points along a spiral with added Gaussian noise. Crucially, before training, the coordinates are zero-mean, unit-variance normalized. Because the forward process corrupts coordinates toward standard normal noise $\mathcal{N}(0, \mathbf{I})$, any unnormalized coordinates would require recalibrating schedule hyper-parameters.
import numpy as np

def make_swiss_roll(n_samples=10000, noise=0.1):
    # Generate the 2D spiral coordinates
    theta = 1.5 * np.pi * (1 + 2 * np.random.rand(n_samples))
    r = theta
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    data = np.stack([x, y], axis=1)
    # Inject Gaussian jitter
    data += noise * np.random.randn(*data.shape)
    # Critical step: standardize each coordinate to zero mean and unit variance
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    data_normalized = (data - mean) / std
    # mean and std are returned so samples can be mapped back to the original scale
    return data_normalized, mean, std
The Noise Schedule
The noise schedule is managed by a dedicated class. In our toy model, we employ a linear variance schedule over $T = 100$ timesteps, starting from a tiny variance $\beta_1 = 10^{-4}$ (which leaves the data almost clean) and rising to $\beta_T = 0.02$ (the strongest per-step corruption). Structure is destroyed by the cumulative effect of all steps rather than by any single one.
| Schedule Variable | Mathematical Symbol | Physical / Engineering Interpretation |
| self.betas | $\beta_t$ | Noise variance injected at timestep $t$ |
| self.alphas | $\alpha_t = 1 - \beta_t$ | Signal retention coefficient at step $t$ |
| self.alpha_bars | $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$ | Cumulative signal remaining from the clean sample $x_0$ |
| self.sqrt_alpha_bars | $\sqrt{\bar{\alpha}_t}$ | Coefficient on $x_0$ (mean scale) in the direct forward jump |
| self.sqrt_one_minus_alpha_bars | $\sqrt{1 - \bar{\alpha}_t}$ | Noise scale (standard deviation) in the direct forward jump |
| self.posterior_variance | $\tilde{\beta}_t$ | Theoretical variance of the tractable reverse posterior when $x_0$ is known |
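As a rough illustration of how the quantities in the table relate to one another, the sketch below builds them with NumPy for the linear schedule described above. The function name `make_linear_schedule` and the dictionary layout are assumptions made for this example, not the notes' actual schedule class. The posterior variance uses the standard DDPM expression $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$ with $\bar{\alpha}_0 = 1$.

```python
import numpy as np

def make_linear_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    """Hypothetical helper: precompute the DDPM schedule constants."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # alpha_bar at the previous step, with alpha_bar_0 = 1 for the first step
    alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])
    posterior_variance = betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)
    return {
        "betas": betas,
        "alphas": alphas,
        "alpha_bars": alpha_bars,
        "sqrt_alpha_bars": np.sqrt(alpha_bars),
        "sqrt_one_minus_alpha_bars": np.sqrt(1.0 - alpha_bars),
        "posterior_variance": posterior_variance,
    }

schedule = make_linear_schedule()
# Inspect how much signal survives at the first, middle and last step
print(schedule["alpha_bars"][[0, 49, 99]])
```

Printing `alpha_bars` at the last step is a quick sanity check of how much signal survives. For these particular values the final $\bar{\alpha}_T$ is roughly 0.36, so if samples at $t = T$ still show visible spiral structure, a longer schedule or a larger $\beta_T$ may be worth trying.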
The Closed-Form Forward Jump
By exploiting the fact that the linear combination of independent Gaussians is itself Gaussian, we bypass step-by-step Markov simulation. We can jump directly from $x_0$ to any arbitrary timestep $t$ in one step:
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\]
This is implemented in q_sample:
def q_sample(self, x0, t, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_ab = self.gather(self.sqrt_alpha_bars, t, x0.shape)
    sqrt_1mab = self.gather(self.sqrt_one_minus_alpha_bars, t, x0.shape)
    return sqrt_ab * x0 + sqrt_1mab * noise

def gather(self, values, t, x_shape):
    # Extracts the schedule values for the batch of timesteps and reshapes them for broadcasting
    out = values.gather(0, t)
    return out.view(-1, *([1] * (len(x_shape) - 1)))
Without this reparameterization trick, each training step would require simulating $t$ steps of Markovian noise addition. q_sample collapses this complexity from $\mathcal{O}(T)$ to $\mathcal{O}(1)$.
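To make the equivalence concrete, here is a small numerical check, assuming the standard DDPM one-step kernel $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}$ and the linear schedule from above. Rather than sampling, it tracks the coefficient on $x_0$ and the accumulated noise variance step by step and compares them with the closed-form values.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Apply x_t = sqrt(alpha_t) * x_{t-1} + sqrt(beta_t) * eps one step at a time,
# tracking only the coefficient on x0 and the variance of the accumulated noise.
signal_coef, noise_var = 1.0, 0.0
for t in range(T):
    signal_coef *= np.sqrt(alphas[t])
    noise_var = alphas[t] * noise_var + betas[t]
    # The step-by-step chain agrees with the one-shot jump at every t
    assert np.isclose(signal_coef, np.sqrt(alpha_bars[t]))
    assert np.isclose(noise_var, 1.0 - alpha_bars[t])

print(signal_coef, noise_var)
```

The recursion works because $\alpha_t (1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$, which is exactly why the $T$ sequential steps collapse into a single Gaussian.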
The Neural Network & Positional Time Embeddings
The network must predict the injected noise \(\boldsymbol{\epsilon}\) given the noisy coordinates \(x_t\) and the current timestep \(t\). A scalar integer \(t\) is not expressive enough for deep models; we project it into a high-dimensional vector using sinusoidal positional encodings:
\[\text{emb}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{emb}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)\]
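import numpy as np
import torch
import torch.nn as nn

class SinusoidalTimeEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim  # must be at least 4 so that (half - 1) below is non-zero

    def forward(self, t):
        device = t.device
        half = self.dim // 2
        # Geometric frequency ladder from 1 down to 1/10000
        factor = np.log(10000) / (half - 1)
        freqs = torch.exp(torch.arange(half, device=device) * -factor)
        args = t.float().unsqueeze(1) * freqs.unsqueeze(0)
        # Sin and cos halves are concatenated rather than interleaved as in the formula;
        # the ordering differs but the network receives the same information
        emb = torch.cat([torch.sin(args), torch.cos(args)], dim=1)
        return emb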
The model itself, EpsilonMLP, projects the time embedding, concatenates it with the 2D spatial coordinate \(x_t\), and processes the combined vector through a 4-layer MLP using SiLU (Swish) activations. Normalization layers are omitted to avoid destabilizing the learning dynamics of this highly constrained 2D system.
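The description above could be realized roughly as follows. This is a minimal sketch, not the notes' actual EpsilonMLP: the class name `EpsilonMLPSketch`, the embedding width of 64 and the hidden width of 128 are assumptions chosen for illustration, and the time embedding is inlined so the block is self-contained.

```python
import math
import torch
import torch.nn as nn

class EpsilonMLPSketch(nn.Module):
    """Illustrative noise predictor: time embedding + 2D input -> 4 linear layers."""

    def __init__(self, time_dim=64, hidden_dim=128):
        super().__init__()
        self.time_dim = time_dim
        # Learned projection of the sinusoidal embedding
        self.time_proj = nn.Sequential(nn.Linear(time_dim, time_dim), nn.SiLU())
        # Four linear layers with SiLU activations and no normalization layers
        self.net = nn.Sequential(
            nn.Linear(2 + time_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, 2),
        )

    def time_embedding(self, t):
        half = self.time_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1))
        args = t.float().unsqueeze(1) * freqs.unsqueeze(0)
        return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

    def forward(self, xt, t):
        temb = self.time_proj(self.time_embedding(t))
        # Concatenate coordinates and time features along the feature axis
        return self.net(torch.cat([xt, temb], dim=1))

model = EpsilonMLPSketch()
xt = torch.randn(8, 2)
t = torch.randint(0, 100, (8,))
eps_pred = model(xt, t)
print(eps_pred.shape)
```

The output has the same shape as the input coordinates, since the network predicts one noise vector per 2D point.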
Training Loss & Reverse Sampling
The simplified training objective measures the Mean Squared Error (MSE) between the true noise injected in the forward jump and the network's prediction:
\[\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},\; t\right)\right\|^2\right]\]
def training_loss(self, x0):
    batch_size = x0.shape[0]
    # One random timestep per sample, drawn uniformly from {0, ..., T-1}
    t = torch.randint(0, self.schedule.T, (batch_size,), device=x0.device)
    eps = torch.randn_like(x0)
    xt = self.schedule.q_sample(x0, t, eps)
    eps_pred = self.model(xt, t)
    return F.mse_loss(eps_pred, eps)
At inference, we begin with a sample from standard normal noise \(x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and step backward from \(t = T-1 \rightarrow 0\). The reverse process uses the noise predicted by the model to approximate the posterior mean:
\[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(x_t, t)\right)\]
def p_sample(self, xt, t_scalar):
    # Broadcast the scalar timestep to one entry per sample so the time embedding matches the batch
    t_batch = torch.full((xt.shape[0],), t_scalar, device=xt.device, dtype=torch.long)
    # Retrieve model prediction
    eps_theta = self.model(xt, t_batch)
    # Retrieve schedule constants
    alpha_t = self.schedule.alphas[t_scalar]
    beta_t = self.schedule.betas[t_scalar]
    alpha_bar_t = self.schedule.alpha_bars[t_scalar]
    post_var_t = self.schedule.posterior_variance[t_scalar]
    # Compute the posterior mean
    mu_theta = (1.0 / torch.sqrt(alpha_t)) * (xt - (beta_t / torch.sqrt(1.0 - alpha_bar_t)) * eps_theta)
    # Add Langevin-style noise if not at the final step
    z = torch.randn_like(xt) if t_scalar > 0 else torch.zeros_like(xt)
    x_prev = mu_theta + torch.sqrt(post_var_t) * z
    return x_prev

@torch.no_grad()  # sampling needs no gradients
def sample(self, n_samples=2000):
    x = torch.randn(n_samples, 2, device=self.device)
    for t in reversed(range(self.schedule.T)):
        x = self.p_sample(x, t)
    return x
Stabilization with Exponential Moving Average (EMA)
Because the reverse trajectory must be smooth across all timesteps, training is highly sensitive to parameter fluctuations. We maintain a running **EMA shadow copy** of the weights:
\[\theta_{\text{shadow}} \leftarrow \lambda \theta_{\text{shadow}} + (1 - \lambda) \theta_{\text{current}}\]
where \(\lambda = 0.995\). Swapping in these shadow weights for inference smooths out SGD noise and consistently yields higher-fidelity spiral curves.
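A minimal sketch of such a shadow copy in PyTorch is shown below. The class name `EMASketch` and its interface are assumptions for illustration; the update rule is the one given above with \(\lambda = 0.995\), and only parameters (not buffers) are tracked.

```python
import copy
import torch
import torch.nn as nn

class EMASketch:
    """Illustrative EMA tracker: keeps a frozen shadow copy of a model's parameters."""

    def __init__(self, model, decay=0.995):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # shadow <- decay * shadow + (1 - decay) * current
        for p_shadow, p in zip(self.shadow.parameters(), model.parameters()):
            p_shadow.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

model = nn.Linear(2, 2)
ema = EMASketch(model)
initial_weight = model.weight.detach().clone()

# Simulate one optimizer step that shifts every weight by +1
with torch.no_grad():
    model.weight.add_(1.0)
ema.update(model)
print(ema.shadow.weight - initial_weight)
```

In a training loop, `ema.update(model)` would typically be called right after each optimizer step, and `ema.shadow` used in place of `model` when sampling.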
The Three Core Equations of DDPM
All diffusion models, regardless of scale or complexity, rely on this trio of fundamental equations:
1. Forward Jump: \(x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon\) (Code: self.q_sample(x0, t, eps))
2. Training Loss: \(\mathcal{L} = \|\epsilon - \epsilon_\theta(x_t, t)\|^2\) (Code: F.mse_loss(eps_pred, eps))
3. Reverse Mean: \(\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta\right)\) (Code: the mu_theta line inside p_sample(xt, t_scalar))
2. Scaling Up: From Toy to Real-World Stable Diffusion
When we transition from a 2D Swiss Roll toy model to a massive commercial image synthesizer like Stable Diffusion XL (SD-XL), the core mathematical engine remains intact. What changes is the addition of engineering layers around it to handle high-dimensional image manifolds, textual conditioning, and inference latency.
The Blueprint: Identical, Scaled-Up, and Genuinely New
We categorize the changes according to the following framework:
🟢 Essentially Identical Elements
The same q_sample equation is run. HuggingFace's DDPMScheduler.add_noise() is functionally identical to our toy implementation.
Code Comparison: The Core Denoising Loop & Forward Jump
Under the hood, the core mathematical loops inside a production Stable Diffusion pipeline are functionally identical to our 2D toy model, only scaled to high-dimensional image tensors:
Toy Model (Swiss Roll):
# Closed-Form Forward Jump (q_sample)
def q_sample(self, x0, t, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_ab = self.gather(self.sqrt_alpha_bars, t, x0.shape)
    sqrt_1mab = self.gather(self.sqrt_one_minus_alpha_bars, t, x0.shape)
    return sqrt_ab * x0 + sqrt_1mab * noise

# Sequential Reverse Sampling Loop (Inference)
def sample(self, n_samples=2000):
    x = torch.randn(n_samples, 2, device=self.device)  # Start from pure Gaussian noise
    for t in reversed(range(self.schedule.T)):
        x = self.p_sample(x, t)
    return x
HuggingFace Diffusers Equivalent (Stable Diffusion):
# Closed-Form Forward Jump (add_noise)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
# Internally functions as:
#   sqrt_alpha_prod * sample + sqrt_one_minus_alpha_prod * noise

# Sequential Reverse Sampling Loop (Inference)
# Simplified: guidance and scheduler-specific input scaling are omitted here
latents = torch.randn(B, 4, H // 8, W // 8)  # Start from pure latent Gaussian noise
for t in scheduler.timesteps:  # Reversed denoising timesteps
    noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
# Undo the latent scaling, then decode the final latent back to pixel space
image = vae.decode(latents / vae.config.scaling_factor).sample
🟡 Scaled-Up Elements
EpsilonMLP is replaced by a massive 2.6-billion-parameter 2D UNet. Time embeddings are still sinusoidal, but instead of simple concatenation they are passed through a small MLP and injected into every ResNet block (added to the feature maps in the standard Stable Diffusion UNet; an AdaGN-style scale-and-shift is a common variant in other architectures).
🔴 Genuinely New Elements
Code Comparison: VAE Latent Encoding & Classifier-Free Guidance (CFG)
The genuinely new architectural components (the VAE latent space and the dual-conditioning inference loop) are wrapped by production pipelines behind a single call. Conceptually, each denoising step runs one unconditional and one conditional forward pass; in practice the two are usually batched into a single UNet call:
from diffusers import StableDiffusionPipeline

# Load the pretrained components
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
pipe.to("cuda")

# Generate an image using a prompt, negative prompt, and guidance scale
image = pipe(
    prompt="A realistic macro photo of a colorful glass marble on sand",
    negative_prompt="blurry, low quality, oversaturated",
    guidance_scale=8.0,
    num_inference_steps=35,
).images[0]
Under the Hood: VAE Compression & Classifier-Free Guidance Denoising Step:
# 1. Encoding healthy images/tensors into the latent space (VAE)
latents = vae.encode(pixel_images).latent_dist.sample()
latents = latents * 0.18215  # Scaling factor toward unit variance (SD 1.x/2.x value; the SD-XL VAE uses 0.13025)
# 2. Classifier-Free Guidance (CFG) double forward pass at each timestep t
# Extract prompt embeddings (conditional) and negative prompt embeddings (unconditional)
cond_emb = text_encoder(prompt_ids)[0]
uncond_emb = text_encoder(negative_prompt_ids)[0]
# Run double forward pass to predict noise
noise_pred_uncond = unet(latents, t, uncond_emb).sample
noise_pred_cond = unet(latents, t, cond_emb).sample
# Extrapolate: push away from unconditional, toward conditional
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
# 3. Step backward in the trajectory using the guided noise prediction
latents = scheduler.step(noise_pred, t, latents).prev_sample
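The extrapolation step is easy to examine in isolation. The sketch below uses random NumPy arrays as stand-ins for the two UNet outputs (the helper name `cfg_combine` is an assumption for this example) and checks the two limiting cases: a guidance scale of 0 returns the unconditional prediction, and a scale of 1 returns the conditional one.

```python
import numpy as np

def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional toward the conditional prediction."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

rng = np.random.default_rng(0)
eps_u = rng.standard_normal((1, 4, 8, 8))  # stand-in for the unconditional UNet output
eps_c = rng.standard_normal((1, 4, 8, 8))  # stand-in for the conditional UNet output

assert np.allclose(cfg_combine(eps_u, eps_c, 0.0), eps_u)
assert np.allclose(cfg_combine(eps_u, eps_c, 1.0), eps_c)

# With a scale above 1 the result overshoots the conditional prediction
guided = cfg_combine(eps_u, eps_c, 8.0)
print(np.abs(guided - eps_c).mean())
```

Scales above 1 push the prediction past the conditional output, which is what strengthens prompt adherence at the cost of diversity.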
Summary Comparison: Toy vs. Stable Diffusion XL
| Dimension | Toy Model (Swiss Roll) | Stable Diffusion XL (SD-XL) | Status |
| Data Space | 2D Physical coordinates \((B, 2)\) | Latent space \((B, 4, H//8, W//8)\) | 🔴 Genuinely New (VAE) |
| Noise Schedule | Linear schedule (\(T=100\)) | Scaled-linear schedule (\(T=1000\)) | 🟡 Scaled-Up |
| Conditioning | Timestep \(t\) only | Timestep \(t\) + CLIP text embeddings | 🔴 Genuinely New |
| UNet Injection | MLP concatenation | ResNet-block injection (time) + Cross-Attention (text) | 🟡 Scaled-Up |
| Inference Sampler | Stochastic DDPM (100 steps) | Deterministic DPM-Solver++ (20 to 35 steps) | 🔴 Genuinely New |
| CFG / Neg Prompt | None | Guidance Scale and Negative prompts | 🔴 Genuinely New |
THE SD-XL INFERENCE LAYERS PIPELINE
┌─────────────────────────────────────────────────────────────┐
│ 4. Fast samplers (DDIM / DPM-Solver)                        │ ← Inference Efficiency (30 steps)
│ 3. CFG + CLIP text conditioning                             │ ← Guidance & Prompt Adherence
│ 2. UNet (spatial structures at scale)                       │ ← Feature extraction & timestep injection
│ 1. VAE latent space                                         │ ← Computational space compression
├─────────────────────────────────────────────────────────────┤
│ DDPM CORE ENGINE (Forward jump, loss, reverse math, EMA)    │ ← Functions identical to Toy Model
└─────────────────────────────────────────────────────────────┘
3. Localized Defect Generation and Inpainting
In creative content generation, any plausible, visually pleasing image is a success. However, in industrial visual inspection (e.g., detecting cracks in concrete, wafer defects, or metallic scratches), the requirements are radically different. We must generate highly realistic defects on a healthy surface, ensuring the background is completely unaltered.
This is the precise domain of localized defect generation pipelines. Built on top of latent diffusion models, they introduce structural changes to guarantee pixel-perfect background preservation and precise spatial control over the generated defect.
1. UNet Input: 4 to 9 Channels (Inpainting)
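Standard SD-XL accepts a 4-channel noisy latent. The localized inpainting model extends the first convolutional layer of the UNet to accept **9 channels** by concatenating three latent tensors along the channel axis: the noisy latent \(x_t\) (4 channels), the mask \(M\) at latent resolution (1 channel), and the VAE-encoded background \(b\) (4 channels):
\[x_t^{\text{inpaint}} = \text{concat}(x_t,\; M,\; b)\]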
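The channel arithmetic can be checked with a few lines of NumPy. The shapes below are illustrative assumptions (a 64x64 latent grid, random arrays in place of real VAE latents); the channel split of 4 + 1 + 4 follows the usual latent-inpainting layout.

```python
import numpy as np

B, H, W = 2, 64, 64  # batch size and latent resolution (assumed for illustration)

x_t = np.random.randn(B, 4, H, W)         # noisy latent
mask = np.zeros((B, 1, H, W))             # binary mask at latent resolution
mask[:, :, 20:30, 20:30] = 1.0            # region where the defect will be painted
background = np.random.randn(B, 4, H, W)  # stand-in for the VAE-encoded background

# Concatenate along the channel axis: 4 + 1 + 4 = 9 input channels
unet_input = np.concatenate([x_t, mask, background], axis=1)
print(unet_input.shape)  # (2, 9, 64, 64)
```

Only the first convolution of the UNet has to change to consume this tensor; the rest of the network still operates on its usual feature widths.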
2. Text Conditioning: One CLIP + sks Token
Rather than writing complex prompts, the defect generation model uses a **special concept token** (the rare token sks) initialized neutrally to act as a blank canvas. By training LoRA parameters on the attention layers of the UNet and the text encoder, the model associates the sks token with the texture of the target defect class.
3. The Core Contribution: The Three-Loss Objective
To train a standard model to paint a defect inside a tiny masked region using as few as **7 image-mask pairs**, standard MSE is insufficient. The localized defect generation pipeline optimizes a weighted joint objective of three losses:
\[\mathcal{L}_{\text{ours}} = 0.5 \cdot \mathcal{L}_{\text{def}} + 0.2 \cdot \mathcal{L}_{\text{obj}} + 0.05 \cdot \mathcal{L}_{\text{attn}}\]
Loss 1: Defect Texture Loss (\(\mathcal{L}_{\text{def}}\))
Defines what the defect looks like. The MSE loss is calculated **strictly inside the masked defect region**, ignoring any background errors. This forces the sks token to encode the defect texture, rather than memorizing the healthy background.
\[\mathcal{L}_{\text{def}} = \mathbb{E}\left[\left\| M \odot \left(\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(x_t^{\text{def}}, t, c^{\text{def}})\right) \right\|_2^2\right]\]
Loss 2: Object Context Loss (\(\mathcal{L}_{\text{obj}}\))
Defines how the defect sits within the wider object. We construct a temporary random mask (e.g., 30 random boxes) to corrupt healthy regions. The loss is computed over the whole image, but weighted: the defect region has weight 1.0, and the healthy background has weight 0.3. This forces the model to learn context, that is, how a defect naturally blends with wafer lines, metallic grains, or fabric fibers.
\[M' = M + 0.3 \cdot (1 - M)\]
\[\mathcal{L}_{\text{obj}} = \mathbb{E}\left[\left\| M' \odot \left(\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(x_t^{\text{obj}}, t, c^{\text{obj}})\right) \right\|_2^2\right]\]
Loss 3: Cross-Attention Alignment Loss (\(\mathcal{L}_{\text{attn}}\))
Defines where the defect is generated. By forcing the spatial cross-attention map of the sks token (written \(A_t^{[V^*]}\), with \(V^*\) denoting the sks token) to match the target mask \(M\), we prevent the defect's features from bleeding outside the mask.
\[\mathcal{L}_{\text{attn}} = \mathbb{E}\left[\left\| A_t^{[V^*]} - M \right\|_2^2\right]\]
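The three terms and their weighting can be sketched numerically as follows. Random arrays stand in for the true noise, the UNet predictions, and the sks attention map, and the helper name `masked_mse` is an assumption for this example. Note that the sketch averages over elements instead of summing, which differs from the squared norm in the equations only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (1, 4, 16, 16)

M = np.zeros((1, 1, 16, 16))
M[:, :, 4:8, 4:8] = 1.0                           # defect mask at latent resolution

eps = rng.standard_normal(shape)                  # true noise (shared by both passes)
eps_pred_def = rng.standard_normal(shape)         # stand-in for the defect-pass prediction
eps_pred_obj = rng.standard_normal(shape)         # stand-in for the object-pass prediction
attn_map = rng.uniform(size=(1, 1, 16, 16))       # stand-in for the sks attention map

def masked_mse(weight, target, pred):
    """Mean squared error after elementwise weighting of the residual."""
    return np.mean((weight * (target - pred)) ** 2)

loss_def = masked_mse(M, eps, eps_pred_def)       # only the masked region contributes
M_prime = M + 0.3 * (1.0 - M)                     # 1.0 inside the mask, 0.3 outside
loss_obj = masked_mse(M_prime, eps, eps_pred_obj)
loss_attn = np.mean((attn_map - M) ** 2)

total = 0.5 * loss_def + 0.2 * loss_obj + 0.05 * loss_attn
print(loss_def, loss_obj, loss_attn, total)
```

Because the residual is multiplied by \(M\) before squaring, errors outside the mask have no effect on the defect term, which is the property the text relies on.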
4. UNet Attention Hooks
To compute \(\mathcal{L}_{\text{attn}}\), we inject a custom processor into the decoder (up_blocks) cross-attention (attn2) layers. This processor intercepts and aggregates attention maps for the sks token across all heads and layers:
import math

class AttnMapCaptureProcessor:
    def __init__(self):
        self.attn_map = None

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
        # Standard projections (keys and values come from the text tokens)
        query = attn.to_q(hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)
        # Calculate attention weights manually to capture the map (cannot use SDPA fusion).
        # Simplified: the multi-head split and merge are omitted in this listing.
        scale = 1.0 / math.sqrt(query.shape[-1])
        attn_weights = (query @ key.transpose(-2, -1) * scale).softmax(dim=-1)
        # Store the full (batch, pixels, tokens) map; the sks token column is selected for L_attn.
        # The map is kept attached to the graph so L_attn can backpropagate into the attention layers.
        self.attn_map = attn_weights
        # Standard output projection
        return attn.to_out[0](attn_weights @ value)
5. Step-Wise Latent Blending
Standard image editors blend background pixels only at the very end of inference. However, because the UNet's reverse denoising path relies on surrounding context, a pixel-only post-hoc blend causes visual artifacts.
The model solves this by performing **step-wise latent blending** at *every single DDIM step*:
\[z_{t-1} = M \cdot z_{t-1}^{\text{model}} + (1 - M) \cdot \text{add\_noise}(z_0^{\text{healthy}},\; \boldsymbol{\epsilon},\; t_{\text{next}})\]
This guarantees that the surrounding background is kept physically correct at all stages of generation, guiding the model to align textures perfectly at the boundaries.
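A single blending step can be sketched in NumPy as follows. The helper names `add_noise` and `blend_step` are assumptions for this example, random arrays stand in for real latents, and the noise level for \(t_{\text{next}}\) is passed directly as a cumulative coefficient `alpha_bar_next` rather than looked up from a scheduler.

```python
import numpy as np

def add_noise(z0, noise, alpha_bar):
    """Forward jump to the noise level given by alpha_bar (same form as q_sample)."""
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * noise

def blend_step(z_model_prev, z0_healthy, mask, noise, alpha_bar_next):
    """Keep the model's latent inside the mask and the re-noised healthy latent outside."""
    noised_healthy = add_noise(z0_healthy, noise, alpha_bar_next)
    return mask * z_model_prev + (1.0 - mask) * noised_healthy

rng = np.random.default_rng(0)
z0_healthy = rng.standard_normal((1, 4, 8, 8))    # clean latent of the healthy image
z_model_prev = rng.standard_normal((1, 4, 8, 8))  # stand-in for the sampler's output at this step
noise = rng.standard_normal((1, 4, 8, 8))
mask = np.zeros((1, 1, 8, 8))
mask[:, :, 2:5, 2:5] = 1.0

z_prev = blend_step(z_model_prev, z0_healthy, mask, noise, alpha_bar_next=0.7)
print(z_prev.shape)
```

Because the background is re-noised to the same level as the model's latent at every step, the UNet always sees a context that is statistically consistent with the region it is denoising.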
6. Low-Fidelity Selection (LFS)
Diffusion models are highly stochastic; some random seeds may over-reconstruct the background and fail to show a clear defect.
To resolve this, the pipeline generates **8 candidate samples** in parallel and selects the best candidate using a **perceptual metric (LPIPS)** inside the masked region:
import numpy as np

def low_fidelity_selection(pipeline, healthy_image, mask, prompt, n_candidates=8):
    # compute_masked_lpips is a project helper that is not shown in these notes
    candidates = []
    lpips_scores = []
    for seed in range(n_candidates):
        img_out = pipeline(healthy_image, mask, prompt, seed=seed)
        candidates.append(img_out)
        # Compute LPIPS distance strictly inside the mask
        score = compute_masked_lpips(img_out, healthy_image, mask)
        lpips_scores.append(score)
    # Counter-intuitive: pick the HIGHEST score.
    # High LPIPS = clearly visible, realistic defect. Low LPIPS = failed flat generation.
    best_idx = np.argmax(lpips_scores)
    return candidates[best_idx]
Comparison Table: SD-XL vs. Localized Defect Generation
| Dimension | HuggingFace SD-XL | Localized Defect Generation | Status |
| UNet Channels | 4 channels (Noisy latent) | 9 channels (Noisy latent, mask, background) | 🟡 Adapted |
| Training Objective | Global noise prediction MSE | \(\mathcal{L}_{\text{def}}\) + \(\mathcal{L}_{\text{obj}}\) + \(\mathcal{L}_{\text{attn}}\) | 🔴 Genuinely New |
| Data Scale | Millions of image-text pairs | ~7 defect-mask pairs (few-shot LoRA) | 🟡 Adapted |
| Background Preservation | Pixel-level blend at the very end | Step-wise latent blending + final pixel blend | 🔴 Genuinely New |
| Inference Selection | Single sample (stochastic) | LFS: 8 seeds, select maximum LPIPS change | 🔴 Genuinely New |
THE COMPLETE LOCALIZED DEFECT GENERATION TRAINING & INFERENCE PIPELINE
TRAINING (LoRA Fine-tuning):
~7 image-mask pairs ──► Dual-Masking Loader ──► VAE Encode ──► Shared Noise & Timestep
        │
        ├── Pass 1: "A photo of sks" ──► concat(x_t, M, b_def) ──► UNet ──► L_def
        └── Pass 2: "A wafer with sks" ──► concat(x_t, M_rand, b_obj) ──► UNet + Hooks ──► L_obj & L_attn
        ▼
Update LoRA weights ◄── Loss = 0.5·L_def + 0.2·L_obj + 0.05·L_attn
INFERENCE (Defect Generation):
Healthy Image + Mask ──► VAE Encode ──► Step-Wise DDIM Loop (50 steps)
        │
        ├── concat(x_t, M, b) ──► UNet ──► CFG step
        └── Latent Blend: M·x_model + (1-M)·noised_healthy
        ▼
VAE Decode ──► Pixel-level background blend
        ▼
LFS Selection (Pick max LPIPS out of 8 seeds) ──► Realistic Defect Image