Applied Diffusion: From Toy Swiss Roll to Localized Inpainting


In the preceding notes, we covered the extensive mathematical framework behind Denoising Diffusion Probabilistic Models (DDPM). We traced the Markov chains, derived the tractable posterior, and proved why minimizing the Variational Lower Bound simplifies to a noise-prediction objective.

Now, we pivot from theoretical derivations to engineering practice. We follow diffusion models through three progressive layers: a 2D Swiss Roll toy model, the scale-up to high-resolution latent text-to-image synthesizers like Stable Diffusion XL (SD-XL), and finally their adaptation into a highly specialized industrial inspection pipeline for localized defect generation and inpainting.




1. The Toy Model: Swiss Roll 2D Diffusion

Before tackling gigabyte-scale neural networks, we look at the mathematical engine in its simplest, most plottable form: a 2D point cloud. In this setting, the data space represents physical coordinates, allowing us to visually inspect how structure dissolves into noise and how a neural network learns to reconstruct it.

Rather than an arbitrary data pattern, we adopt the classic Swiss Roll dataset, a two-dimensional spiral distribution. This spiral provides a non-linear manifold that is challenging for standard density estimators but well suited for diffusion.

Data Generation & Normalization

The dataset consists of points along a spiral with added Gaussian noise. Crucially, before training, the coordinates are standardized to zero mean and unit variance. Because the forward process corrupts coordinates toward standard normal noise $\mathcal{N}(0, \mathbf{I})$, unnormalized coordinates (the raw spiral reaches a radius of about 14) would require recalibrating the schedule hyper-parameters.

import numpy as np

def make_swiss_roll(n_samples=10000, noise=0.1):
    # Generate the 2D spiral coordinates
    theta = 1.5 * np.pi * (1 + 2 * np.random.rand(n_samples))
    r = theta
    x = r * np.cos(theta)
    y = r * np.sin(theta)
    data = np.stack([x, y], axis=1)
    
    # Inject Gaussian jitter
    data += noise * np.random.randn(*data.shape)
    
    # Critical step: Standardize to zero-mean and unit-variance
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    data_normalized = (data - mean) / std
    return data_normalized, mean, std



The Noise Schedule

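The noise schedule is managed by a dedicated class. In our toy model, we employ a linear variance schedule over $T = 100$ timesteps, rising from a tiny variance $\beta_1 = 10^{-4}$ (the first steps barely perturb the data) to $\beta_T = 0.02$ (the last steps inject the most noise). These endpoints are the ones Ho et al. pair with $T = 1000$; with only 100 steps, $\bar{\alpha}_T \approx 0.36$, so a noticeable fraction of the signal survives at $x_T$ unless the endpoints are raised or more steps are used.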

| Schedule Variable | Mathematical Symbol | Physical / Engineering Interpretation |
| --- | --- | --- |
| self.betas | $\beta_t$ | Noise variance injected at timestep $t$ |
| self.alphas | $\alpha_t = 1 - \beta_t$ | Signal retention coefficient at step $t$ |
| self.alpha_bars | $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$ | Cumulative signal remaining from the clean sample $x_0$ |
| self.sqrt_alpha_bars | $\sqrt{\bar{\alpha}_t}$ | Scale applied to $x_0$ (the mean coefficient) in the direct forward jump |
| self.sqrt_one_minus_alpha_bars | $\sqrt{1 - \bar{\alpha}_t}$ | Scale applied to the noise (its standard deviation) in the direct forward jump |
| self.posterior_variance | $\tilde{\beta}_t$ | Theoretical variance of the tractable reverse posterior when $x_0$ is known |

The derived posterior variance $\tilde{\beta}_t$ is calculated as: \[\tilde{\beta}_t = \frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\] During sampling, $x_0$ is not known, so the model's predicted noise is substituted into the posterior mean formula, and $\tilde{\beta}_t$ serves as the fixed variance of each stochastic step. It is the smaller of the two standard choices (the other is $\beta_t$) and vanishes at the first timestep, where $\bar{\alpha}_0 = 1$.
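The quantities in the schedule table can be computed in a few lines. The sketch below uses NumPy instead of the schedule class from these notes, and the helper name `make_linear_schedule` is an assumption made for illustration. Arrays are 0-indexed, so index 0 corresponds to $t = 1$.

```python
import numpy as np

def make_linear_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    """Return the schedule arrays from the table (hypothetical helper)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # alpha_bar at the previous step; defined as 1 before the first step
    alpha_bars_prev = np.concatenate([[1.0], alpha_bars[:-1]])
    posterior_variance = betas * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)
    return {
        "betas": betas,
        "alphas": alphas,
        "alpha_bars": alpha_bars,
        "sqrt_alpha_bars": np.sqrt(alpha_bars),
        "sqrt_one_minus_alpha_bars": np.sqrt(1.0 - alpha_bars),
        "posterior_variance": posterior_variance,
    }

schedule = make_linear_schedule()
print("alpha_bar at the last step:", float(schedule["alpha_bars"][-1]))
print("posterior variance at the first step:", float(schedule["posterior_variance"][0]))
```

Printing the last entry of `alpha_bars` is a cheap sanity check on how much of the clean signal survives at the end of the forward process for a given choice of $T$ and endpoints.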




The Closed-Form Forward Jump

By exploiting the fact that the linear combination of independent Gaussians is itself Gaussian, we bypass step-by-step Markov simulation. We can jump directly from $x_0$ to any arbitrary timestep $t$ in one step: \[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\] This is implemented in q_sample:

def q_sample(self, x0, t, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_ab   = self.gather(self.sqrt_alpha_bars, t, x0.shape)
    sqrt_1mab = self.gather(self.sqrt_one_minus_alpha_bars, t, x0.shape)
    return sqrt_ab * x0 + sqrt_1mab * noise

def gather(self, values, t, x_shape):
    # Extracts the schedule values for the batch of timesteps and shapes it for broadcasting
    out = values.gather(0, t)
    return out.view(-1, *([1] * (len(x_shape) - 1)))

Without this closed form, producing a training sample at timestep $t$ would require simulating $t$ sequential noising steps. q_sample reduces that cost from $\mathcal{O}(T)$ to $\mathcal{O}(1)$.
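One way to check that the jump is exact is to track the coefficient on $x_0$ and the accumulated noise variance through the step-by-step chain $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}_t$ and compare them with the closed form. The sketch below does this in NumPy with the linear schedule values quoted earlier; the variable names are illustrative only.

```python
import numpy as np

T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Propagate the coefficient on x_0 and the accumulated noise variance through
# the one-step kernel q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, beta_t I).
mean_coef, noise_var = 1.0, 0.0
mean_coefs, noise_vars = [], []
for alpha_t, beta_t in zip(alphas, betas):
    mean_coef = np.sqrt(alpha_t) * mean_coef
    noise_var = alpha_t * noise_var + beta_t
    mean_coefs.append(mean_coef)
    noise_vars.append(noise_var)

# Both quantities agree with the closed-form jump at every timestep.
print(np.allclose(mean_coefs, np.sqrt(alpha_bars)))   # True
print(np.allclose(noise_vars, 1.0 - alpha_bars))      # True
```

The variance recursion is the whole argument in miniature: if the variance before step $t$ is $1 - \bar{\alpha}_{t-1}$, then after the step it is $\alpha_t (1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$.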




The Neural Network & Positional Time Embeddings

The network must predict the injected noise \(\boldsymbol{\epsilon}\) given the noisy coordinates \(x_t\) and the current timestep \(t\). A scalar integer \(t\) is not expressive enough for deep models; we project it into a high-dimensional vector using sinusoidal positional encodings: \[\text{emb}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \text{emb}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right)\]

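import numpy as np
import torch
import torch.nn as nn

class SinusoidalTimeEmbedding(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim  # assumed even and >= 4, so that (half - 1) is nonzero

    def forward(self, t):
        device = t.device
        half = self.dim // 2
        # Geometric ladder of frequencies from 1 down to exactly 1/10000
        factor = np.log(10000) / (half - 1)
        freqs = torch.exp(torch.arange(half, device=device) * -factor)
        # Outer product: (B, 1) * (1, half) -> (B, half)
        args  = t.float().unsqueeze(1) * freqs.unsqueeze(0)
        # Sines and cosines are concatenated rather than interleaved, so the
        # channel order differs from the formula above by a fixed permutation
        emb   = torch.cat([torch.sin(args), torch.cos(args)], dim=1)
        return emb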

The model itself, EpsilonMLP, projects the time embedding, concatenates it with the 2D spatial coordinate \(x_t\), and processes the combined vector through a 4-layer MLP using SiLU (Swish) activations. Normalization layers are omitted to avoid destabilizing the learning dynamics of this highly-constrained 2D system.
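EpsilonMLP itself is not listed in these notes, so the following is only a shape-level sketch of the forward pass just described, written in NumPy with random, untrained weights. The layer widths (`time_dim=32`, `hidden=128`), the single linear time projection, and all function names are assumptions rather than the actual model.

```python
import numpy as np

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def sinusoidal_embedding(t, dim):
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / (half - 1))
    args = t[:, None].astype(np.float64) * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def init_params(rng, time_dim=32, hidden=128):
    # Time projection, then a 4-layer MLP: (2 + time_dim) -> hidden -> hidden -> hidden -> 2
    sizes = [(time_dim, time_dim), (2 + time_dim, hidden),
             (hidden, hidden), (hidden, hidden), (hidden, 2)]
    return [(rng.normal(0.0, 1.0 / np.sqrt(i), size=(i, o)), np.zeros(o)) for i, o in sizes]

def epsilon_mlp_forward(params, x_t, t, time_dim=32):
    (w_t, b_t), *layers = params
    t_emb = silu(sinusoidal_embedding(t, time_dim) @ w_t + b_t)  # project the time embedding
    h = np.concatenate([x_t, t_emb], axis=1)                     # concatenate with the 2D point
    for w, b in layers[:-1]:
        h = silu(h @ w + b)                                      # hidden layers, no normalization
    w_out, b_out = layers[-1]
    return h @ w_out + b_out                                     # predicted noise, shape (B, 2)

rng = np.random.default_rng(0)
params = init_params(rng)
x_t = rng.normal(size=(5, 2))
t = rng.integers(0, 100, size=5)
eps_pred = epsilon_mlp_forward(params, x_t, t)
print(eps_pred.shape)  # (5, 2)
```

Note that the timestep array has one entry per batch element; the concatenation would fail if a single timestep were passed for a whole batch without being expanded first.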




Training Loss & Reverse Sampling

The simplified training objective measures the Mean Squared Error (MSE) between the true noise injected in the forward jump and the network's prediction: \[\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},\; t\right)\right\|^2\right]\]

def training_loss(self, x0):
    batch_size = x0.shape[0]
    t = torch.randint(0, self.schedule.T, (batch_size,), device=x0.device)
    eps = torch.randn_like(x0)
    xt = self.schedule.q_sample(x0, t, eps)
    eps_pred = self.model(xt, t)
    return F.mse_loss(eps_pred, eps)

At inference, we begin with a sample from standard normal noise \(x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and step backward from \(t = T-1 \rightarrow 0\). The reverse process uses the noise predicted by the model to approximate the posterior mean: \[\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(x_t, t)\right)\]

def p_sample(self, xt, t_scalar):
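    # Retrieve model prediction (one timestep index per batch element, so the
    # time embedding can be concatenated with xt along the feature axis)
    t_batch = torch.full((xt.shape[0],), t_scalar, device=xt.device, dtype=torch.long)
    eps_theta = self.model(xt, t_batch)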
    
    # Retrieve schedule constants
    alpha_t = self.schedule.alphas[t_scalar]
    beta_t = self.schedule.betas[t_scalar]
    alpha_bar_t = self.schedule.alpha_bars[t_scalar]
    post_var_t = self.schedule.posterior_variance[t_scalar]
    
    # Compute the posterior mean
    mu_theta = (1.0 / torch.sqrt(alpha_t)) * (xt - (beta_t / torch.sqrt(1.0 - alpha_bar_t)) * eps_theta)
    
    # Add Langevin-style noise if not at the final step
    z = torch.randn_like(xt) if t_scalar > 0 else torch.zeros_like(xt)
    x_prev = mu_theta + torch.sqrt(post_var_t) * z
    return x_prev

@torch.no_grad()  # no gradients are needed at inference
def sample(self, n_samples=2000):
    x = torch.randn(n_samples, 2, device=self.device)
    for t in reversed(range(self.schedule.T)):
        x = self.p_sample(x, t)
    return x



Stabilization with Exponential Moving Average (EMA)

Because the reverse trajectory must be smooth across all timesteps, training is highly sensitive to parameter fluctuations. We maintain a running **EMA shadow copy** of the weights: \[\theta_{\text{shadow}} \leftarrow \lambda \theta_{\text{shadow}} + (1 - \lambda) \theta_{\text{current}}\] where \(\lambda = 0.995\). Swapping in these shadow weights for inference smooths out SGD noise and consistently yields higher-fidelity spiral curves.
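As a minimal sketch of the update rule above, the class below keeps a shadow copy of a dictionary of NumPy arrays. The `EMA` class and its method names are hypothetical, and the "training loop" is a stand-in that simply pins the weights at 1 so the behaviour of the average is easy to see.

```python
import numpy as np

class EMA:
    """Keeps an exponential moving average of named parameters (hypothetical helper)."""
    def __init__(self, params, decay=0.995):
        self.decay = decay
        self.shadow = {name: value.copy() for name, value in params.items()}

    def update(self, params):
        # theta_shadow <- lambda * theta_shadow + (1 - lambda) * theta_current
        for name, value in params.items():
            self.shadow[name] = self.decay * self.shadow[name] + (1.0 - self.decay) * value

params = {"w": np.zeros(3)}
ema = EMA(params)
for _ in range(1000):               # stand-in for training steps
    params["w"] = np.ones(3)        # pretend the optimizer moved the weights to 1
    ema.update(params)
print(ema.shadow["w"])              # close to 1, lagging by a factor of 0.995 ** 1000
```

With $\lambda = 0.995$ the shadow weights average over roughly the last $1 / (1 - \lambda) = 200$ updates, which is why they react slowly to any single noisy step.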

The Three Core Equations of DDPM
All DDPM-style diffusion models, regardless of scale or complexity, rely on this trio of fundamental equations:

1. Forward Jump: \(x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon\) (Code: self.q_sample(x0, t, eps))
2. Training Loss: \(\mathcal{L} = \|\epsilon - \epsilon_\theta(x_t, t)\|^2\) (Code: F.mse_loss(eps_pred, eps))
3. Reverse Mean: \(\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta\right)\) (Code: the mu_theta line inside self.p_sample(xt, t_scalar))




2. Scaling Up: From Toy to Real-World Stable Diffusion

When we transition from a 2D Swiss Roll toy model to a massive commercial image synthesizer like Stable Diffusion XL (SD-XL), the core mathematical engine remains intact. What changes is the addition of engineering layers around it to handle high-dimensional image manifolds, textual conditioning, and inference latency.

The Blueprint: Identical, Scaled-Up, and Genuinely New

We categorize the changes according to the following framework:

  • 🟢 Essentially Identical: No math or logic changes; only tensor shapes grow.
  • 🟡 Scaled-Up: The underlying concept is identical, but scaled up or parameterized more robustly.
  • 🔴 Genuinely New: Features introduced exclusively to make high-resolution text-to-image synthesis possible.


    🟢 Essentially Identical Elements

  • The Forward Process: The stochastic drift-diffusion equation is identical. \(x_0\) is simply formatted as a 4-channel latent tensor rather than a 2D coordinate.
  • The Closed-Form Jump: The exact same q_sample equation is run. HuggingFace's DDPMScheduler.add_noise() is functionally identical to our toy implementation.
  • Loss Objective: The model is trained to minimize the simple MSE between injected and predicted noise.
  • EMA: Swapping in shadow weights at inference is standard practice, and many released checkpoints ship EMA weights for exactly this reason.


    Code Comparison: The Core Denoising Loop & Forward Jump

    Underneath the hood, the core mathematical loops inside a production Stable Diffusion pipeline are functionally identical to our 2D toy model, only scaled to high-dimensional image tensors:

    Toy Model Implementation (Swiss Roll 2D):

    # Closed-Form Forward Jump (q_sample)
    def q_sample(self, x0, t, noise=None):
        if noise is None:
            noise = torch.randn_like(x0)
        sqrt_ab   = self.gather(self.sqrt_alpha_bars, t, x0.shape)
        sqrt_1mab = self.gather(self.sqrt_one_minus_alpha_bars, t, x0.shape)
        return sqrt_ab * x0 + sqrt_1mab * noise

    # Sequential Reverse Sampling Loop (Inference)
    def sample(self, n_samples=2000):
        x = torch.randn(n_samples, 2, device=self.device)   # Start from pure Gaussian noise
        for t in reversed(range(self.schedule.T)):
            x = self.p_sample(x, t)                          # p_sample returns x_{t-1}
        return x

    HuggingFace Diffusers Equivalent (Stable Diffusion):

    # Closed-Form Forward Jump (add_noise)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    # Internally functions as:
    # sqrt_alpha_prod * sample + sqrt_one_minus_alpha_prod * noise

    # Sequential Reverse Sampling Loop (Inference)
    latents = torch.randn(B, 4, H // 8, W // 8)   # Start from pure latent Gaussian noise
    for t in scheduler.timesteps:                 # Reversed denoising timesteps
        # The UNet returns an output object; the tensor lives in .sample
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        latents    = scheduler.step(noise_pred, t, latents).prev_sample
    # Undo the latent scaling, then decode the final latent back to pixel space
    image = vae.decode(latents / vae.config.scaling_factor).sample

    🟡 Scaled-Up Elements

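  • Noise Schedule (Linear \(\rightarrow\) Scaled-Linear / Cosine): Stable Diffusion trains with \(T = 1000\) steps and a scaled-linear \(\beta\) schedule (linear in \(\sqrt{\beta_t}\)). The cosine schedule (Nichol & Dhariwal, 2021) is a widely used alternative, designed to prevent the abrupt signal-to-noise ratio drop-offs that occur at the extremes of a linear schedule.
  • Architecture (MLP \(\rightarrow\) UNet): The 4-layer EpsilonMLP is replaced by a massive 2.6-billion parameter 2D UNet. Time embeddings are still sinusoidal, but instead of simple concatenation they are injected into every ResNet block through a learned projection: added to the feature maps in Stable Diffusion's UNet, or applied as an Adaptive Group Normalization (AdaGN) scale and shift in ADM-style models.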


    🔴 Genuinely New Elements

  • The Latent Space (VAE): Diffusion in pixel space (\(3 \times 1024 \times 1024\)) is computationally prohibitive. Stable Diffusion solves this by training a Variational Autoencoder (VAE) to compress the image spatially by 8× into a lower-dimensional latent space (\(4 \times 128 \times 128\)). The diffusion process is run *entirely* within this latent space.
  • CLIP Text Conditioning: To inject a text prompt, the text is tokenized and passed through a frozen CLIP text encoder (SD-XL uses two). The resulting embeddings are integrated into the UNet via cross-attention layers at several resolutions, where the noisy image features act as queries and the text features act as keys and values.
  • Classifier-Free Guidance (CFG): To control how strongly the model respects the prompt, the model is trained conditionally (80% of the time) and unconditionally (20% of the time). At inference, we extrapolate from the unconditional prediction in the direction of the conditional one: \[\boldsymbol{\epsilon}_{\text{guided}} = \boldsymbol{\epsilon}_{\text{uncond}} + s \cdot (\boldsymbol{\epsilon}_{\text{cond}} - \boldsymbol{\epsilon}_{\text{uncond}})\] where \(s\) is the guidance scale (typically 7–9).
  • Fast Samplers (DDIM / DPM-Solver): DDPM sampling requires hundreds of sequential steps. By rewriting the reverse process as a deterministic ODE (Song et al., 2020) and setting stochastic noise \(z = 0\), DDIM allows us to skip timesteps, reducing inference from 1000 steps down to 20–35 steps.
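To make the timestep skipping concrete, here is a minimal NumPy sketch of the deterministic DDIM update (the \(\eta = 0\) case). It is not the diffusers scheduler: the function name `ddim_step`, the plain linear schedule, and the "oracle" that returns the true noise in place of a trained \(\boldsymbol{\epsilon}_\theta\) are all assumptions made so the example is self-checking.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from alpha_bar_t to alpha_bar_prev."""
    # Estimate the clean sample implied by the current noise prediction
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (possibly much earlier) target noise level
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0 = rng.normal(size=(4, 2))
eps = rng.normal(size=(4, 2))
x = np.sqrt(alpha_bars[-1]) * x0 + np.sqrt(1.0 - alpha_bars[-1]) * eps

# Visit only 20 of the 1000 training timesteps
timesteps = np.linspace(999, 0, 20).round().astype(int)
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, eps, alpha_bars[t], alpha_bars[t_prev])   # oracle: eps is the true noise
x = ddim_step(x, eps, alpha_bars[timesteps[-1]], 1.0)          # final jump to the clean sample

print(np.allclose(x, x0))  # True
```

Because no fresh noise is added, the update depends only on the two noise levels involved, not on how many training steps lie between them. With a real network the noise prediction is imperfect, so fewer steps trade some quality for speed.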


    Code Comparison: VAE Latent Encoding & Classifier-Free Guidance (CFG)

    The genuinely new architectural components (the VAE latent space and the dual-conditioning inference loop) are implemented cleanly in production pipelines using separate forward passes for unconditional and conditional paths:

    Inference Pipeline Call (HuggingFace Diffusers):
    from diffusers import StableDiffusionPipeline
    
    # Load the pretrained components
    pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
    pipe.to("cuda")
    
    # Generate an image using a prompt, negative prompt, and guidance scale
    image = pipe(
        prompt="A realistic macro photo of a colorful glass marble on sand",
        negative_prompt="blurry, low quality, oversaturated",
        guidance_scale=8.0,
        num_inference_steps=35
    ).images[0]

    Under the Hood: VAE Compression & Classifier-Free Guidance Denoising Step:

    # 1. Encoding images into the latent space (VAE)
    latents = vae.encode(pixel_images).latent_dist.sample()
    latents = latents * vae.config.scaling_factor  # 0.18215 for SD 1.x/2.x; brings latents to roughly unit variance
    
    # 2. Classifier-Free Guidance (CFG) double forward pass at each timestep t
    # Extract prompt embeddings (conditional) and negative prompt embeddings (unconditional)
    cond_emb   = text_encoder(prompt_ids)[0]
    uncond_emb = text_encoder(negative_prompt_ids)[0]
    
    # Run double forward pass to predict noise
    noise_pred_uncond = unet(latents, t, uncond_emb).sample
    noise_pred_cond   = unet(latents, t, cond_emb).sample
    
    # Extrapolate: push away from unconditional, toward conditional
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
    
    # 3. Step backward in the trajectory using the guided noise prediction
    latents = scheduler.step(noise_pred, t, latents).prev_sample

    Summary Comparison: Toy vs. Stable Diffusion XL

    | Dimension | Toy Model (Swiss Roll) | Stable Diffusion XL (SD-XL) | Status |
    | --- | --- | --- | --- |
    | Data Space | 2D physical coordinates \((B, 2)\) | Latent space \((B, 4, H//8, W//8)\) | 🔴 Genuinely New (VAE) |
    | Noise Schedule | Linear schedule (\(T=100\)) | Scaled-linear schedule (\(T=1000\)) | 🟡 Scaled-Up |
    | Conditioning | Timestep \(t\) only | Timestep \(t\) + CLIP text embeddings | 🔴 Genuinely New |
    | Conditioning Injection | MLP concatenation | ResNet-block time injection + cross-attention (text) | 🟡 Scaled-Up |
    | Inference Sampler | Stochastic DDPM (100 steps) | Deterministic DPM-Solver++ (20–35 steps) | 🔴 Genuinely New |
    | CFG / Neg Prompt | None | Guidance scale and negative prompts | 🔴 Genuinely New |

    THE SD-XL INFERENCE LAYERS PIPELINE

      ┌─────────────────────────────────────────────────────────────┐
      │  4. Fast samplers (DDIM / DPM-Solver)                       │  ← Inference Efficiency (30 steps)
      │  3. CFG + CLIP text conditioning                            │  ← Guidance & Prompt Adherence
      │  2. UNet (spatial structures at scale)                      │  ← Feature extraction & time injection
      │  1. VAE latent space                                        │  ← Computational space compression
      ├─────────────────────────────────────────────────────────────┤
      │  DDPM CORE ENGINE (Forward jump, loss, reverse math, EMA)   │  ← Functions identical to Toy Model
      └─────────────────────────────────────────────────────────────┘



    3. Localized Defect Generation and Inpainting

    In creative content generation, any plausible, visually pleasing image is a success. However, in industrial visual inspection (e.g., detecting cracks in concrete, wafer defects, or metallic scratches), the requirements are radically different. We must generate highly realistic defects on a healthy surface, ensuring the background is completely unaltered.

    This is the precise domain of localized defect generation pipelines. Built on top of latent diffusion models, they introduce structural changes to guarantee pixel-perfect background preservation and precise spatial control over the generated defect.

    1. UNet Input: 4 to 9 Channels (Inpainting)

    Standard SD-XL accepts a 4-channel noisy latent. The localized inpainting model extends the first convolutional layer of the UNet to accept **9 channels**, concatenating three distinct latent tensors: \[x_t^{\text{inpaint}} = \text{concat}(x_t,\; M,\; b)\]

  • \(x_t\) (4 channels): The current noisy latent in the diffusion trajectory.
  • Mask \(M\) (1 channel): The binary inpainting mask downsampled to latent resolution (\(1 \times 64 \times 64\)).
  • Background \(b\) (4 channels): The VAE-encoded healthy background image, with the pixels inside the mask zeroed out before encoding so that only the region to preserve remains visible.
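The channel bookkeeping can be sketched with plain arrays. The example below is a NumPy illustration, not the pipeline's code: the helper name `build_inpaint_input` is hypothetical, the background latent is random data standing in for a real VAE encoding, and downsampling the mask by block-wise max pooling is an assumption (nearest-neighbour interpolation is another common choice).

```python
import numpy as np

def build_inpaint_input(noisy_latent, mask_pixels, background_latent, factor=8):
    """Concatenate (x_t, M, b) along the channel axis -> (B, 9, h, w)."""
    B, _, H, W = mask_pixels.shape
    # Downsample the pixel-space mask to latent resolution by block-wise max pooling
    mask_latent = mask_pixels.reshape(
        B, 1, H // factor, factor, W // factor, factor
    ).max(axis=(3, 5))
    return np.concatenate([noisy_latent, mask_latent, background_latent], axis=1)

rng = np.random.default_rng(0)
x_t = rng.normal(size=(2, 4, 64, 64))      # noisy latent
b = rng.normal(size=(2, 4, 64, 64))        # stand-in for the VAE-encoded masked background
mask = np.zeros((2, 1, 512, 512))
mask[:, :, 200:260, 300:380] = 1.0         # defect region in pixel space

unet_input = build_inpaint_input(x_t, mask, b)
print(unet_input.shape)  # (2, 9, 64, 64)
```

Channels 0 to 3 carry the noisy latent, channel 4 the mask, and channels 5 to 8 the background, which is the layout the widened first convolution has to accept.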

    2. Text Conditioning: One CLIP + sks Token

    Rather than writing complex prompts, the defect generation model uses a **special concept token** (the rare token sks) initialized neutrally to act as a blank canvas. By training LoRA parameters on the attention layers of the UNet and the text encoder, the model associates the sks token with the texture of the target defect class.

    3. The Core Contribution: The Three-Loss Objective

    To train a standard model to paint a defect inside a tiny masked region using as few as **7 pair-samples**, standard MSE is insufficient. The localized defect generation pipeline optimizes a weighted joint objective of three losses: \[\mathcal{L}_{\text{ours}} = 0.5 \cdot \mathcal{L}_{\text{def}} + 0.2 \cdot \mathcal{L}_{\text{obj}} + 0.05 \cdot \mathcal{L}_{\text{attn}}\]

    Loss 1: Defect Texture Loss (\(\mathcal{L}_{\text{def}}\))
    Defines what the defect looks like. The MSE loss is calculated **strictly inside the masked defect region**, ignoring any background errors. This forces the sks token to encode the defect texture, rather than memorizing the healthy background. \[\mathcal{L}_{\text{def}} = \mathbb{E}\left[\left\| M \odot \left(\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(x_t^{\text{def}}, t, c^{\text{def}})\right) \right\|_2^2\right]\]

    Loss 2: Object Context Loss (\(\mathcal{L}_{\text{obj}}\))
    Defines how the defect sits within the wider object. We construct a temporary random mask (e.g., 30 random boxes) to corrupt healthy regions. The loss is computed over the whole image, but weighted: the defect region has weight 1.0, and the healthy background has weight 0.3. This forces the model to learn context: how a defect naturally blends with wafer lines, metallic grains, or fabric fibers. \[M' = M + 0.3 \cdot (1 - M)\] \[\mathcal{L}_{\text{obj}} = \mathbb{E}\left[\left\| M' \odot \left(\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(x_t^{\text{obj}}, t, c^{\text{obj}})\right) \right\|_2^2\right]\]

    Loss 3: Cross-Attention Alignment Loss (\(\mathcal{L}_{\text{attn}}\))
    Defines where the defect is generated. By pulling the spatial cross-attention map of the sks token, written \(A_t^{[V^*]}\), toward the target mask \(M\) (with both resized to a common resolution), we discourage the defect's features from bleeding outside the mask. \[\mathcal{L}_{\text{attn}} = \mathbb{E}\left[\left\| A_t^{[V^*]} - M \right\|_2^2\right]\]
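The three terms combine as in the NumPy sketch below. It only illustrates the weighting arithmetic: the function names are hypothetical, the noise predictions and attention map are synthetic arrays rather than UNet outputs, and taking a plain mean over all elements as the estimate of each expectation is an assumption about normalization.

```python
import numpy as np

def masked_mse(eps, eps_pred, weight):
    """Mean of the squared, mask-weighted noise residual."""
    return np.mean((weight * (eps - eps_pred)) ** 2)

def three_loss_objective(eps, eps_pred_def, eps_pred_obj, attn_map, M,
                         w_def=0.5, w_obj=0.2, w_attn=0.05, bg_weight=0.3):
    M_prime = M + bg_weight * (1.0 - M)              # 1.0 inside the defect, 0.3 elsewhere
    l_def = masked_mse(eps, eps_pred_def, M)         # texture: inside the mask only
    l_obj = masked_mse(eps, eps_pred_obj, M_prime)   # context: whole image, down-weighted background
    l_attn = np.mean((attn_map - M) ** 2)            # location: attention map vs. mask
    total = w_def * l_def + w_obj * l_obj + w_attn * l_attn
    return total, (l_def, l_obj, l_attn)

rng = np.random.default_rng(0)
M = np.zeros((1, 1, 64, 64))
M[..., 20:30, 20:30] = 1.0                           # 10 x 10 defect region at latent resolution
eps = rng.normal(size=(1, 4, 64, 64))                # shared noise target

# Synthetic predictions that are off by a constant 0.1 everywhere
total, (l_def, l_obj, l_attn) = three_loss_objective(eps, eps + 0.1, eps + 0.1, M.copy(), M)
print(total, l_def, l_obj, l_attn)
```

With the same uniform error in both passes, `l_obj` comes out larger than `l_def` because it also counts the background at weight 0.3, while `l_def` sees only the 100 masked positions.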

    4. UNet Attention Hooks

    To compute \(\mathcal{L}_{\text{attn}}\), we inject a custom processor into the decoder (up_blocks) cross-attention (attn2) layers. This processor intercepts and aggregates attention maps for the sks token across all heads and layers:

    import numpy as np

    class AttnMapCaptureProcessor:
        def __init__(self):
            self.attn_map = None

        def __call__(self, attn, hidden_states, encoder_hidden_states=None, **kwargs):
            # Standard projections, split into heads: (B * heads, tokens, head_dim)
            query = attn.head_to_batch_dim(attn.to_q(hidden_states))
            key   = attn.head_to_batch_dim(attn.to_k(encoder_hidden_states))
            value = attn.head_to_batch_dim(attn.to_v(encoder_hidden_states))

            # Calculate attention weights manually to capture the map (fused SDPA does not expose it)
            scale = 1.0 / np.sqrt(query.shape[-1])
            attn_weights = (query @ key.transpose(-2, -1) * scale).softmax(dim=-1)

            # Store the full map; the sks token column is sliced out downstream for L_attn.
            # It is deliberately not detached, because L_attn must backpropagate through it.
            self.attn_map = attn_weights

            # Merge the heads again, then apply the standard output projection
            out = attn.batch_to_head_dim(attn_weights @ value)
            return attn.to_out[0](out)



    5. Step-Wise Latent Blending

    Standard image editors blend background pixels only at the very end of inference. However, because the UNet's reverse denoising path relies on surrounding context, a pixel-only post-hoc blend causes visual artifacts.

    The model solves this by performing **step-wise latent blending** at *every single DDIM step*: \[z_{t-1} = M \cdot z_{t-1}^{\text{model}} + (1 - M) \cdot \text{add\_noise}(z_0^{\text{healthy}},\; \boldsymbol{\epsilon},\; t_{\text{next}})\] This keeps the surrounding background at the noise level the UNet expects at every stage of generation, so the model can align textures with it at the mask boundary.
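A single blending step can be sketched as follows. This is a NumPy illustration of the formula only: `add_noise` here is a local re-implementation of the closed-form forward jump rather than the scheduler method, `blend_step` is a hypothetical name, and the "model output" is random data.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar):
    """Closed-form forward jump to the noise level given by alpha_bar."""
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

def blend_step(z_model_prev, z0_healthy, eps, M, alpha_bar_next):
    """Keep the model output inside the mask and the re-noised healthy latent outside."""
    z_background = add_noise(z0_healthy, eps, alpha_bar_next)
    return M * z_model_prev + (1.0 - M) * z_background

rng = np.random.default_rng(0)
M = np.zeros((1, 1, 64, 64))
M[..., 20:30, 20:30] = 1.0                      # defect region at latent resolution
z0_healthy = rng.normal(size=(1, 4, 64, 64))    # stand-in for the encoded healthy image
z_model = rng.normal(size=(1, 4, 64, 64))       # stand-in for the sampler's output at this step
eps = rng.normal(size=(1, 4, 64, 64))

z_prev = blend_step(z_model, z0_healthy, eps, M, alpha_bar_next=0.7)
print(z_prev.shape)  # (1, 4, 64, 64)
```

The mask has a single channel and broadcasts across the four latent channels. Calling this after every sampler step means the region outside the mask is always an exactly noised copy of the healthy latent.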




    6. Low-Fidelity Selection (LFS)

    Diffusion models are highly stochastic; some random seeds may over-reconstruct the background and fail to show a clear defect. To resolve this, the pipeline generates **8 candidate samples** in parallel and selects the best candidate using a **perceptual metric (LPIPS)** inside the masked region:

    import numpy as np

    def low_fidelity_selection(pipeline, healthy_image, mask, prompt, n_candidates=8):
        # compute_masked_lpips is assumed to be defined elsewhere in the pipeline
        candidates = []
        lpips_scores = []
        
        for seed in range(n_candidates):
            img_out = pipeline(healthy_image, mask, prompt, seed=seed)
            candidates.append(img_out)
            
            # Compute LPIPS distance strictly inside the mask
            score = compute_masked_lpips(img_out, healthy_image, mask)
            lpips_scores.append(score)
            
        # Counter-intuitive: pick the HIGHEST score. 
        # High LPIPS = clearly visible, realistic defect. Low LPIPS = failed flat generation.
        best_idx = np.argmax(lpips_scores)
        return candidates[best_idx]



    Comparison Table: SD-XL vs. Localized Defect Generation

    | Dimension | HuggingFace SD-XL | Localized Defect Generation | Status |
    | --- | --- | --- | --- |
    | UNet Channels | 4 channels (noisy latent) | 9 channels (noisy latent, mask, background) | 🟡 Adapted |
    | Training Objective | Global noise-prediction MSE | \(\mathcal{L}_{\text{def}}\) + \(\mathcal{L}_{\text{obj}}\) + \(\mathcal{L}_{\text{attn}}\) | 🔴 Genuinely New |
    | Data Scale | Millions of image-text pairs | ~7 defect-mask pairs (few-shot LoRA) | 🟡 Adapted |
    | Background Preservation | Pixel-level blend at the very end | Step-wise latent blending + final pixel blend | 🔴 Genuinely New |
    | Inference Selection | Single sample (stochastic) | LFS: 8 seeds, select maximum LPIPS change | 🔴 Genuinely New |

    THE COMPLETE LOCALIZED DEFECT GENERATION TRAINING & INFERENCE PIPELINE

      TRAINING (LoRA Fine-tuning):
      ~7 image-mask pairs ──► Dual-Masking Loader ──► VAE Encode ──► Shared Noise & Timestep
                                  │
                                  ├── Pass 1: "A photo of sks" ──► concat(x_t, M, b_def) ──► UNet ──► L_def
                                  └── Pass 2: "A wafer with sks" ──► concat(x_t, M_rand, b_obj) ──► UNet + Hooks ──► L_obj & L_attn
                                                                                                          │
                                                  Update LoRA weights ◄── Loss = 0.5·L_def + 0.2·L_obj + 0.05·L_attn

      INFERENCE (Defect Generation):
      Healthy Image + Mask ──► VAE Encode ──► Step-Wise DDIM Loop (50 steps)
                                                    │
                                                    ├── concat(x_t, M, b) ──► UNet ──► CFG step
                                                    └── Latent Blend: M·x_model + (1-M)·noised_healthy
                                                                  │
                                                VAE Decode ──► Pixel-level background blend
                                                                  │
                                                LFS Selection (Pick max LPIPS out of 8 seeds) ──► Realistic Defect Image



    References

  • Song, Y. et al. (2025). DefectFill: Realistic Defect Generation with Inpainting Diffusion Model for Visual Inspection. arXiv:2503.13985v2.
  • Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS.
  • Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. ICLR 2021.
  • Rombach, R. et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR.
  • Ho, J., & Salimans, T. (2022). Classifier-Free Diffusion Guidance. NeurIPS Workshop.
  • Hu, E. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
  • Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML.
  • Zhang, R. et al. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (LPIPS). CVPR.
  • Podell, D. et al. (2023). SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952.