TL;DR

Diffusion models are currently the most advanced image generation technology, producing high-quality images by learning to gradually remove noise. This guide takes an in-depth look at the core ideas behind diffusion models: the forward (noise-adding) and reverse (denoising) processes, the DDPM and DDIM algorithms, the Stable Diffusion architecture, and how diffusion models compare with GANs and VAEs. It also covers practical applications, including text-to-image, image-to-image, and inpainting, with code examples using the Diffusers library.

Introduction

In 2022, the open-source release of Stable Diffusion revolutionized the AI image generation field, enabling anyone to generate stunning images on consumer-grade GPUs. With their excellent generation quality and training stability, diffusion models have surpassed GANs to become the mainstream technology for image generation.

In this guide, you will learn:

  • The core concept of diffusion models: why "adding noise then denoising" can generate images
  • Mathematical principles of the forward diffusion process
  • How the reverse denoising process learns to generate
  • Differences and connections between DDPM and DDIM algorithms
  • Complete architecture analysis of Stable Diffusion
  • Comparison of diffusion models with GAN and VAE
  • Implementation principles of text-to-image, image-to-image, and inpainting
  • Practical code using the Diffusers library

What are Diffusion Models

Diffusion models are a class of probability-based generative models. Their core idea comes from non-equilibrium thermodynamics: by defining a forward process that gradually adds noise, then learning to reverse this process to generate data.

graph LR
    subgraph "Diffusion Model Core Concept"
        X0[Clear Image x₀] --> |"Add Noise t=1"| X1[Slight Noise x₁]
        X1 --> |"Add Noise t=2"| X2[More Noise x₂]
        X2 --> |"..."| X3["..."]
        X3 --> |"Add Noise t=T"| XT[Pure Noise xₜ]
        XT --> |"Denoise t=T"| Y3["..."]
        Y3 --> |"..."| Y2[Less Noise]
        Y2 --> |"Denoise t=2"| Y1[Slight Noise]
        Y1 --> |"Denoise t=1"| Y0[Generated Image]
    end

Core Intuition

Imagine you have a clear photo and keep sprinkling sand on it (adding noise). Eventually, the photo will be completely covered by sand, becoming a pile of random grains. The training goal of diffusion models is to learn to "sweep away the sand"—given an image with any level of noise, predict and remove the noise, gradually recovering the original image.

Once the model learns to denoise, generating new images becomes simple: start from pure random noise, repeatedly apply the denoising process, and eventually obtain a completely new image that has never been seen before.

Forward Diffusion Process

The forward diffusion process is a fixed Markov chain that gradually adds Gaussian noise to the data.

Mathematical Definition

Given original data x₀, the forward process is defined as a Gaussian distribution:

code
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)

Where β_t is a predefined noise schedule that controls how much noise is added at each step.

Noise Schedule Strategies

graph TB
    subgraph "Noise Schedule Comparison"
        Linear["Linear Schedule β from 0.0001 to 0.02"]
        Cosine["Cosine Schedule Smoother noise growth"]
        Scaled["Scaled Linear Schedule Optimized for high resolution"]
    end
    Linear --> |"Simple but average results"| Result1[Common in early research]
    Cosine --> |"Better generation quality"| Result2[Improved DDPM]
    Scaled --> |"Large image optimization"| Result3[Stable Diffusion]
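
The linear and cosine schedules above take only a few lines to implement. This is a minimal sketch (the cosine form follows the Improved DDPM parameterization, where `s` is a small offset constant); it is illustrative rather than Stable Diffusion's exact scaled-linear schedule:

```python
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Linear schedule from the original DDPM paper
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    # Cosine schedule from Improved DDPM: define alpha_bar directly,
    # then recover per-step betas from ratios of consecutive alpha_bars
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos((steps / T + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

betas = linear_beta_schedule(1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```

Note that ᾱ_T ends up very close to 0 under either schedule, which is exactly what makes x_T indistinguishable from pure noise.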

Key Property: One-Step Noise Addition

An important property of diffusion models is the ability to compute x_t at any timestep directly from x₀:

code
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)

Where ᾱ_t = ∏_{s=1}^{t} (1-β_s) is the cumulative product of the per-step signal coefficients.

This means during training, we don't need to add noise step by step—we can directly sample noisy images at any timestep:

python
def forward_diffusion(x0, t, noise_schedule):
    # Closed-form sampling of x_t; no need to iterate through t steps
    # Reshape alpha_bar so it broadcasts over (B, C, H, W)
    alpha_bar = noise_schedule.alpha_bar[t].reshape(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
    return xt, noise
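
The closed form can be checked numerically, assuming a linear schedule: noising a sample step by step should produce the same mean and variance that the one-step formula predicts. A small Monte Carlo sanity check:

```python
import torch

torch.manual_seed(0)
T = 200
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

# 100k scalar "pixels", all starting at value 2.0, noised step by step
x = torch.full((100_000,), 2.0)
for t in range(T):
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * torch.randn_like(x)

# The closed form predicts mean sqrt(alpha_bar_T) * x0 and variance 1 - alpha_bar_T
mean_pred = (torch.sqrt(alpha_bar[-1]) * 2.0).item()
var_pred = (1 - alpha_bar[-1]).item()

# x.mean() and x.var() should match mean_pred and var_pred closely
```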

Reverse Denoising Process

The reverse process is the core of diffusion models, learning to recover data from noise through neural networks.

Denoising Network

The reverse process is also a Markov chain, but the transition probabilities need to be learned:

code
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

The neural network ε_θ is trained to predict the added noise, then the mean is computed using:

code
μ_θ(x_t, t) = (1/√α_t) * (x_t - (β_t/√(1-ᾱ_t)) * ε_θ(x_t, t))
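
This expression for μ_θ is not arbitrary: the forward process has a tractable Gaussian posterior when conditioned on x₀, and substituting the model's estimate of x₀ into that posterior mean yields the formula above. A sketch of the derivation, in the same notation:

```
q(x_{t-1} | x_t, x_0) = N(x_{t-1}; μ̃_t(x_t, x_0), β̃_t * I)

μ̃_t = (√(ᾱ_{t-1}) * β_t / (1-ᾱ_t)) * x_0 + (√(α_t) * (1-ᾱ_{t-1}) / (1-ᾱ_t)) * x_t
β̃_t = ((1-ᾱ_{t-1}) / (1-ᾱ_t)) * β_t

Substituting x_0 ≈ (x_t - √(1-ᾱ_t) * ε_θ(x_t, t)) / √(ᾱ_t), with α_t = 1-β_t,
collapses μ̃_t to μ_θ(x_t, t) = (1/√α_t) * (x_t - (β_t/√(1-ᾱ_t)) * ε_θ(x_t, t)).
```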

Training Objective

The training objective of diffusion models is elegantly simple—minimize the mean squared error between predicted and actual noise:

code
L = E[||ε - ε_θ(x_t, t)||²]

This is the MSE loss between predicted and actual noise.

python
def training_step(model, x0, noise_schedule):
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (x0.shape[0],))
    
    # forward_diffusion returns both the noisy image and the noise used,
    # which serves as the regression target
    xt, noise = forward_diffusion(x0, t, noise_schedule)
    
    predicted_noise = model(xt, t)
    
    loss = F.mse_loss(predicted_noise, noise)
    return loss

Sampling Process

graph TB
    subgraph "Reverse Sampling Flow"
        Start["Start from pure noise x_T"] --> Loop{"t > 0?"}
        Loop --> |Yes| Predict["Predict noise"]
        Predict --> Compute["Compute x_t-1"]
        Compute --> Add["Add random noise"]
        Add --> Update["t = t - 1"]
        Update --> Loop
        Loop --> |No| End["Output generated image x_0"]
    end

DDPM Algorithm Explained

DDPM (Denoising Diffusion Probabilistic Models) is a groundbreaking work from 2020 that laid the foundation for modern diffusion models.

DDPM Sampling Algorithm

python
@torch.no_grad()
def ddpm_sample(model, shape, noise_schedule, device):
    # Start from pure Gaussian noise
    x = torch.randn(shape, device=device)
    
    for t in reversed(range(noise_schedule.T)):
        t_tensor = torch.full((shape[0],), t, device=device)
        
        predicted_noise = model(x, t_tensor)
        
        alpha = noise_schedule.alpha[t]
        alpha_bar = noise_schedule.alpha_bar[t]
        beta = noise_schedule.beta[t]
        
        # Posterior mean, computed from the predicted noise
        mean = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )
        
        if t > 0:
            # Add fresh noise at every step except the last
            noise = torch.randn_like(x)
            sigma = torch.sqrt(beta)
            x = mean + sigma * noise
        else:
            x = mean
    
    return x

Limitations of DDPM

While DDPM produces excellent generation quality, slow sampling speed is its main drawback:

  • Requires 1000 iterations to generate a single image
  • Each step requires a complete neural network forward pass
  • Generating a 512×512 image can take tens of seconds

DDIM Algorithm Explained

DDIM (Denoising Diffusion Implicit Models) achieves accelerated sampling by re-deriving the sampling process.

Core Improvement of DDIM

DDIM discovered that the reverse process doesn't need to be Markovian—a deterministic sampling process can be defined:

code
x_{t-1} = √(ᾱ_{t-1}) * ((x_t - √(1-ᾱ_t) * ε_θ(x_t,t)) / √(ᾱ_t)) + √(1-ᾱ_{t-1}) * ε_θ(x_t,t)

DDIM Sampling Algorithm

python
@torch.no_grad()
def ddim_sample(model, shape, noise_schedule, device, steps=50, eta=0.0):
    x = torch.randn(shape, device=device)
    
    # Evenly spaced subset of the training timesteps, from T-1 down to 0
    timesteps = torch.linspace(noise_schedule.T - 1, 0, steps).long()
    
    for i, t in enumerate(timesteps):
        t_tensor = torch.full((shape[0],), t, device=device)
        
        predicted_noise = model(x, t_tensor)
        
        alpha_bar_t = noise_schedule.alpha_bar[t]
        # alpha_bar at the next (smaller) timestep; 1.0 at the final step
        if i < len(timesteps) - 1:
            alpha_bar_prev = noise_schedule.alpha_bar[timesteps[i + 1]]
        else:
            alpha_bar_prev = torch.tensor(1.0, device=device)
        
        # Predict x_0 from the current noisy sample
        x0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
        
        # eta = 0 gives fully deterministic DDIM sampling
        sigma = eta * torch.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t)) * torch.sqrt(1 - alpha_bar_t / alpha_bar_prev)
        
        # Direction pointing toward x_{t-1}
        dir_xt = torch.sqrt(1 - alpha_bar_prev - sigma ** 2) * predicted_noise
        
        noise = torch.randn_like(x) if i < len(timesteps) - 1 else torch.zeros_like(x)
        x = torch.sqrt(alpha_bar_prev) * x0_pred + dir_xt + sigma * noise
    
    return x

DDPM vs DDIM Comparison

| Feature | DDPM | DDIM |
|---|---|---|
| Sampling Steps | 1000 steps | 10-50 steps |
| Sampling Speed | Slow | 10-100x faster |
| Randomness | Stochastic sampling | Deterministic sampling possible |
| Generation Quality | Excellent | Close to DDPM |
| Interpolation | Difficult | Supports latent space interpolation |

Stable Diffusion Architecture Analysis

Stable Diffusion is currently the most popular open-source diffusion model, with highly innovative architecture design.

Overall Architecture

graph TB
    subgraph "Stable Diffusion Architecture"
        Text[Text Prompt] --> CLIP[CLIP Text Encoder]
        CLIP --> Cross[Cross Attention]
        Image["Input Image Optional"] --> VAE_E[VAE Encoder]
        VAE_E --> Latent[Latent Space z]
        Noise[Random Noise] --> Latent
        Latent --> UNet["U-Net with Cross Attention"]
        Cross --> UNet
        Time[Timestep t] --> UNet
        UNet --> Denoised[Denoised Latent]
        Denoised --> VAE_D[VAE Decoder]
        VAE_D --> Output[Generated Image]
    end

Latent Space Diffusion

The key innovation of Stable Diffusion is performing diffusion in latent space rather than pixel space:

  1. VAE Encoder: Compresses 512×512 images to 64×64 latent space
  2. Latent Diffusion: Performs diffusion process in compressed space
  3. VAE Decoder: Decodes denoised latent vectors back to images

This design reduces computation by approximately 64x, making it possible to run on consumer GPUs.

U-Net Structure

python
class UNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, time_emb_dim, context_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.time_mlp = nn.Linear(time_emb_dim, out_channels)
        self.cross_attn = CrossAttention(out_channels, context_dim)
        self.self_attn = SelfAttention(out_channels)
        
    def forward(self, x, t_emb, context):
        h = self.conv1(x)
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.self_attn(h)
        h = self.cross_attn(h, context)
        h = self.conv2(h)
        return h

Conditional Generation Mechanism

Stable Diffusion achieves text-conditional control through Cross Attention:

  1. CLIP Encoding: Converts text to 77×768 embedding sequence
  2. Cross Attention: Each layer of U-Net performs cross-attention with text embeddings
  3. Classifier-Free Guidance: Predicts both conditional and unconditional noise to enhance control

python
def classifier_free_guidance(model, x, t, text_emb, guidance_scale=7.5):
    uncond_emb = model.get_unconditional_embedding()
    
    noise_uncond = model(x, t, uncond_emb)
    noise_cond = model(x, t, text_emb)
    
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    return noise_pred

Diffusion Models vs GAN vs VAE

Comparison of Three Generative Models

graph TB
    subgraph "Generative Model Comparison"
        GAN["GAN Adversarial Training"]
        VAE["VAE Variational Inference"]
        Diffusion["Diffusion Models Denoising Learning"]
    end
    GAN --> G1[Fast generation]
    GAN --> G2[Unstable training]
    GAN --> G3[Mode collapse risk]
    VAE --> V1[Stable training]
    VAE --> V2[Blurry generation]
    VAE --> V3[Continuous latent space]
    Diffusion --> D1[Highest generation quality]
    Diffusion --> D2[Stable training]
    Diffusion --> D3[Slow sampling]

Detailed Comparison Table

| Feature | GAN | VAE | Diffusion Models |
|---|---|---|---|
| Generation Quality | High | Medium | Highest |
| Training Stability | Low | High | High |
| Sampling Speed | Fast (single forward pass) | Fast (single forward pass) | Slow (multi-step iteration) |
| Mode Coverage | Possible mode collapse | Complete coverage | Complete coverage |
| Likelihood Estimation | Cannot compute | Can compute lower bound | Can compute |
| Conditional Generation | Requires extra design | Natural support | Natural support |
| Interpolation Ability | Limited | Excellent | Excellent |

Selection Recommendations

  • For real-time generation: Choose GAN (e.g., StyleGAN)
  • For latent space operations: Choose VAE
  • For highest quality: Choose diffusion models
  • For comprehensive needs: Stable Diffusion (VAE + Diffusion Models)

Application Scenarios

Text-to-Image

Text-to-image is the most widespread application of diffusion models:

python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

prompt = "a beautiful sunset over mountains, digital art, highly detailed"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("sunset.png")

Image-to-Image

Generate new images based on reference images:

python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB")
prompt = "a fantasy castle, oil painting style"

image = pipe(
    prompt=prompt,
    image=init_image,
    strength=0.75,
    guidance_scale=7.5
).images[0]

Inpainting

Repair or replace specified regions of images:

python
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("image.png")
mask = Image.open("mask.png")

result = pipe(
    prompt="a cute cat sitting",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet Conditional Control

Use edges, poses, and other conditions for precise generation control:

python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

canny_image = get_canny_edges(input_image)
result = pipe(prompt, image=canny_image).images[0]

Practical Guide

Environment Setup

bash
pip install diffusers transformers accelerate torch
pip install xformers  # Optional, for memory optimization

Memory Optimization Tips

python
# Compute attention in slices to lower peak VRAM usage
pipe.enable_attention_slicing()

# Decode latents in slices instead of all at once
pipe.enable_vae_slicing()

# Keep model components on CPU and move them to GPU only when needed
pipe.enable_model_cpu_offload()

# Use xFormers memory-efficient attention kernels (requires xformers)
pipe.enable_xformers_memory_efficient_attention()

Generation Quality Optimization

  1. Prompt Engineering: Use detailed, specific descriptions
  2. Negative Prompts: Exclude unwanted elements
  3. Adjust guidance_scale: Usually 7-12 works well
  4. Increase sampling steps: More steps generally mean better quality
python
result = pipe(
    prompt="masterpiece, best quality, detailed face, " + user_prompt,
    negative_prompt="low quality, blurry, distorted, deformed",
    num_inference_steps=50,
    guidance_scale=7.5,
    width=768,
    height=768
).images[0]


Summary

Key points about diffusion models:

  1. Core Concept: Generate data by learning to reverse the noise-adding process—an elegant "add noise then denoise" design
  2. Forward Process: Fixed Markov chain that gradually adds Gaussian noise until data becomes pure noise
  3. Reverse Process: Neural network learns to predict noise, gradually denoising to recover data
  4. DDPM vs DDIM: DDPM has excellent quality but is slow; DDIM accelerates 10-100x through deterministic sampling
  5. Stable Diffusion: Diffusion in latent space, combining VAE and Cross Attention for efficient text-to-image
  6. Comparative Advantages: More stable training than GAN, higher generation quality than VAE
  7. Rich Applications: Text-to-image, image-to-image, inpainting, ControlNet, and more

Diffusion models represent a major breakthrough in generative AI. Understanding their principles is crucial for mastering AI image generation technology.

FAQ

Why can diffusion models generate high-quality images?

The high quality of diffusion models stems from the simplicity and stability of their training objective. Unlike GAN's adversarial training, diffusion models only need to learn the simple task of predicting noise, avoiding mode collapse. Additionally, the multi-step denoising process allows the model to gradually refine image details, improving generation quality at each step. Furthermore, diffusion models can fully cover the data distribution without missing certain modes.

Should I choose DDPM or DDIM?

It depends on your needs. If you want the highest generation quality and have time to spare, choose DDPM (1000 steps). If you need fast generation or real-time applications, choose DDIM (20-50 steps) with minimal quality loss. DDIM also supports deterministic sampling—the same noise input produces the same output, which is useful in certain scenarios like image editing. In practice, DDIM with 50 steps is usually the best balance between quality and speed.

What hardware does Stable Diffusion require?

Basic operation requires at least 8GB VRAM GPU (e.g., RTX 3060). Recommended configuration is 12GB+ VRAM (e.g., RTX 3080/4070) for better experience. With optimizations like half-precision (float16), attention slicing, and VAE slicing, 6GB VRAM can also work. CPU operation is possible but very slow—generating one image may take several minutes. Apple Silicon Macs can use the MPS backend, with acceptable performance on M1/M2 chips.

How can I improve diffusion model generation results?

Key tips for improvement include: 1) Write detailed, specific prompts including style, quality, and detail descriptions; 2) Use negative prompts to exclude unwanted elements; 3) Adjust guidance_scale—usually 7-12 works well; 4) Increase sampling steps to 50-100; 5) Use high-quality base models or fine-tuned models; 6) Try different samplers (e.g., DPM++ 2M Karras); 7) Use conditional control techniques like ControlNet for precise guidance.

Is training diffusion models expensive?

Training costs depend on model scale and data volume. Training a Stable Diffusion-level model from scratch requires thousands of GPU hours, costing hundreds of thousands of dollars. However, fine-tuning costs much less: LoRA fine-tuning only needs a few hours and a few GB of VRAM; DreamBooth personalization training only requires 20-30 images and a few hours of training time. For most applications, using pre-trained models + fine-tuning is the most economical choice.

Are there copyright risks with AI-generated images?

This is a complex legal issue with varying regulations across countries. Main considerations: 1) Model training data may contain copyrighted works, posing potential infringement risks; 2) Copyright ownership of AI-generated content is unclear, with some countries not recognizing copyright for AI works; 3) If generated content is too similar to existing works, it may constitute infringement. Recommendations for commercial use: choose models trained on compliant data (e.g., Adobe Firefly); avoid deliberately imitating specific artist styles; keep records of the generation process; consult legal professionals when necessary.