TL;DR

Diffusion models are currently the most advanced image generation technology, producing high-quality images by learning to gradually remove noise. This guide takes an in-depth look at the core ideas behind diffusion models: the forward (noise-adding) and reverse (denoising) processes, the DDPM and DDIM algorithms, the Stable Diffusion architecture, and how diffusion models compare with GANs and VAEs. It also covers practical applications, including text-to-image, image-to-image, and inpainting, with code examples using the Diffusers library.

Introduction

In 2022, the open-source release of Stable Diffusion revolutionized the AI image generation field, enabling anyone to generate stunning images on consumer-grade GPUs. With their excellent generation quality and training stability, diffusion models have surpassed GANs to become the mainstream technology for image generation.

In this guide, you will learn:

  • The core concept of diffusion models: why "adding noise then denoising" can generate images
  • Mathematical principles of the forward diffusion process
  • How the reverse denoising process learns to generate
  • Differences and connections between DDPM and DDIM algorithms
  • Complete architecture analysis of Stable Diffusion
  • Comparison of diffusion models with GAN and VAE
  • Implementation principles of text-to-image, image-to-image, and inpainting
  • Practical code using the Diffusers library

What are Diffusion Models

Diffusion models are a class of probability-based generative models. Their core idea comes from non-equilibrium thermodynamics: by defining a forward process that gradually adds noise, then learning to reverse this process to generate data.

graph LR
    subgraph "Diffusion Model Core Concept"
        X0[Clear Image x₀] --> |"Add Noise t=1"| X1[Slight Noise x₁]
        X1 --> |"Add Noise t=2"| X2[More Noise x₂]
        X2 --> |"..."| X3["..."]
        X3 --> |"Add Noise t=T"| XT[Pure Noise xₜ]
        XT --> |"Denoise t=T"| Y3["..."]
        Y3 --> |"..."| Y2[Less Noise]
        Y2 --> |"Denoise t=2"| Y1[Slight Noise]
        Y1 --> |"Denoise t=1"| Y0[Generated Image]
    end

Core Intuition

Imagine you have a clear photo and keep sprinkling sand on it (adding noise). Eventually, the photo will be completely covered by sand, becoming a pile of random grains. The training goal of diffusion models is to learn to "sweep away the sand"—given an image with any level of noise, predict and remove the noise, gradually recovering the original image.

Once the model learns to denoise, generating new images becomes simple: start from pure random noise, repeatedly apply the denoising process, and eventually obtain a completely new image that has never been seen before.

Forward Diffusion Process

The forward diffusion process is a fixed Markov chain that gradually adds Gaussian noise to the data.

Mathematical Definition

Given original data x₀, the forward process is defined as a Gaussian distribution:

code
q(x_t | x_{t-1}) = N(x_t; √(1-β_t) * x_{t-1}, β_t * I)

Where β_t is a predefined noise schedule that controls how much noise is added at each step.

Noise Schedule Strategies

graph TB
    subgraph "Noise Schedule Comparison"
        Linear["Linear Schedule β from 0.0001 to 0.02"]
        Cosine["Cosine Schedule Smoother noise growth"]
        Scaled["Scaled Linear Schedule Optimized for high resolution"]
    end
    Linear --> |"Simple but average results"| Result1[Common in early research]
    Cosine --> |"Better generation quality"| Result2[Improved DDPM]
    Scaled --> |"Large image optimization"| Result3[Stable Diffusion]
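
The linear and cosine schedules above take only a few lines to implement. This is a minimal sketch (the cosine form follows the Improved DDPM parameterization, where `s` is a small offset constant); it is illustrative rather than Stable Diffusion's exact scaled-linear schedule:

```python
import torch

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    # Linear schedule from the original DDPM paper
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008):
    # Cosine schedule from Improved DDPM: define alpha_bar directly,
    # then recover per-step betas from ratios of consecutive alpha_bars
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos((steps / T + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

betas = linear_beta_schedule(1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)
```

Note that ᾱ_T ends up very close to 0 under either schedule, which is exactly what makes x_T indistinguishable from pure noise.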

Key Property: One-Step Noise Addition

An important property of diffusion models is the ability to compute x_t at any timestep directly from x₀:

code
q(x_t | x_0) = N(x_t; √(ᾱ_t) * x_0, (1-ᾱ_t) * I)

Where ᾱ_t = ∏_{s=1}^{t} (1-β_s) is the cumulative product of the per-step signal coefficients.

This means during training, we don't need to add noise step by step—we can directly sample noisy images at any timestep:

python
def forward_diffusion(x0, t, noise_schedule):
    # Closed-form sampling of x_t; no need to iterate through t steps
    # Reshape alpha_bar so it broadcasts over (B, C, H, W)
    alpha_bar = noise_schedule.alpha_bar[t].reshape(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise
    return xt, noise
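
The closed form can be checked numerically, assuming a linear schedule: noising a sample step by step should produce the same mean and variance that the one-step formula predicts. A small Monte Carlo sanity check:

```python
import torch

torch.manual_seed(0)
T = 200
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1 - betas, dim=0)

# 100k scalar "pixels", all starting at value 2.0, noised step by step
x = torch.full((100_000,), 2.0)
for t in range(T):
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * torch.randn_like(x)

# The closed form predicts mean sqrt(alpha_bar_T) * x0 and variance 1 - alpha_bar_T
mean_pred = (torch.sqrt(alpha_bar[-1]) * 2.0).item()
var_pred = (1 - alpha_bar[-1]).item()

# x.mean() and x.var() should match mean_pred and var_pred closely
```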

Reverse Denoising Process

The reverse process is the core of diffusion models, learning to recover data from noise through neural networks.

Denoising Network

The reverse process is also a Markov chain, but the transition probabilities need to be learned:

code
p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

The neural network ε_θ is trained to predict the added noise, then the mean is computed using:

code
μ_θ(x_t, t) = (1/√α_t) * (x_t - (β_t/√(1-ᾱ_t)) * ε_θ(x_t, t))
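
This expression for μ_θ is not arbitrary: the forward process has a tractable Gaussian posterior when conditioned on x₀, and substituting the model's estimate of x₀ into that posterior mean yields the formula above. A sketch of the derivation, in the same notation:

```
q(x_{t-1} | x_t, x_0) = N(x_{t-1}; μ̃_t(x_t, x_0), β̃_t * I)

μ̃_t = (√(ᾱ_{t-1}) * β_t / (1-ᾱ_t)) * x_0 + (√(α_t) * (1-ᾱ_{t-1}) / (1-ᾱ_t)) * x_t
β̃_t = ((1-ᾱ_{t-1}) / (1-ᾱ_t)) * β_t

Substituting x_0 ≈ (x_t - √(1-ᾱ_t) * ε_θ(x_t, t)) / √(ᾱ_t), with α_t = 1-β_t,
collapses μ̃_t to μ_θ(x_t, t) = (1/√α_t) * (x_t - (β_t/√(1-ᾱ_t)) * ε_θ(x_t, t)).
```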

Training Objective

The training objective of diffusion models is elegantly simple—minimize the mean squared error between predicted and actual noise:

code
L = E[||ε - ε_θ(x_t, t)||²]

This is the MSE loss between predicted and actual noise.

python
def training_step(model, x0, noise_schedule):
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, noise_schedule.T, (x0.shape[0],))
    
    # forward_diffusion returns both the noisy image and the noise used,
    # which serves as the regression target
    xt, noise = forward_diffusion(x0, t, noise_schedule)
    
    predicted_noise = model(xt, t)
    
    loss = F.mse_loss(predicted_noise, noise)
    return loss

Sampling Process

graph TB
    subgraph "Reverse Sampling Flow"
        Start["Start from pure noise x_T"] --> Loop{"t > 0?"}
        Loop --> |Yes| Predict["Predict noise"]
        Predict --> Compute["Compute x_t-1"]
        Compute --> Add["Add random noise"]
        Add --> Update["t = t - 1"]
        Update --> Loop
        Loop --> |No| End["Output generated image x_0"]
    end

DDPM Algorithm Explained

DDPM (Denoising Diffusion Probabilistic Models) is a groundbreaking work from 2020 that laid the foundation for modern diffusion models.

DDPM Sampling Algorithm

python
@torch.no_grad()
def ddpm_sample(model, shape, noise_schedule, device):
    # Start from pure Gaussian noise
    x = torch.randn(shape, device=device)
    
    for t in reversed(range(noise_schedule.T)):
        t_tensor = torch.full((shape[0],), t, device=device)
        
        predicted_noise = model(x, t_tensor)
        
        alpha = noise_schedule.alpha[t]
        alpha_bar = noise_schedule.alpha_bar[t]
        beta = noise_schedule.beta[t]
        
        # Posterior mean, computed from the predicted noise
        mean = (1 / torch.sqrt(alpha)) * (
            x - (beta / torch.sqrt(1 - alpha_bar)) * predicted_noise
        )
        
        if t > 0:
            # Add fresh noise at every step except the last
            noise = torch.randn_like(x)
            sigma = torch.sqrt(beta)
            x = mean + sigma * noise
        else:
            x = mean
    
    return x

Limitations of DDPM

While DDPM produces excellent generation quality, slow sampling speed is its main drawback:

  • Requires 1000 iterations to generate a single image
  • Each step requires a complete neural network forward pass
  • Generating a 512×512 image can take tens of seconds

DDIM Algorithm Explained

DDIM (Denoising Diffusion Implicit Models) achieves accelerated sampling by re-deriving the sampling process.

Core Improvement of DDIM

DDIM discovered that the reverse process doesn't need to be Markovian—a deterministic sampling process can be defined:

code
x_{t-1} = √(ᾱ_{t-1}) * ((x_t - √(1-ᾱ_t) * ε_θ(x_t,t)) / √(ᾱ_t)) + √(1-ᾱ_{t-1}) * ε_θ(x_t,t)

DDIM Sampling Algorithm

python
@torch.no_grad()
def ddim_sample(model, shape, noise_schedule, device, steps=50, eta=0.0):
    x = torch.randn(shape, device=device)
    
    # Evenly spaced subset of the training timesteps, from T-1 down to 0
    timesteps = torch.linspace(noise_schedule.T - 1, 0, steps).long()
    
    for i, t in enumerate(timesteps):
        t_tensor = torch.full((shape[0],), t, device=device)
        
        predicted_noise = model(x, t_tensor)
        
        alpha_bar_t = noise_schedule.alpha_bar[t]
        # alpha_bar at the next (smaller) timestep; 1.0 at the final step
        if i < len(timesteps) - 1:
            alpha_bar_prev = noise_schedule.alpha_bar[timesteps[i + 1]]
        else:
            alpha_bar_prev = torch.tensor(1.0, device=device)
        
        # Predict x_0 from the current noisy sample
        x0_pred = (x - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t)
        
        # eta = 0 gives fully deterministic DDIM sampling
        sigma = eta * torch.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t)) * torch.sqrt(1 - alpha_bar_t / alpha_bar_prev)
        
        # Direction pointing toward x_{t-1}
        dir_xt = torch.sqrt(1 - alpha_bar_prev - sigma ** 2) * predicted_noise
        
        noise = torch.randn_like(x) if i < len(timesteps) - 1 else torch.zeros_like(x)
        x = torch.sqrt(alpha_bar_prev) * x0_pred + dir_xt + sigma * noise
    
    return x

DDPM vs DDIM Comparison

| Feature | DDPM | DDIM |
|---|---|---|
| Sampling Steps | 1000 steps | 10-50 steps |
| Sampling Speed | Slow | 10-100x faster |
| Randomness | Stochastic sampling | Deterministic sampling possible |
| Generation Quality | Excellent | Close to DDPM |
| Interpolation | Difficult | Supports latent space interpolation |

Stable Diffusion Architecture Analysis

Stable Diffusion is currently the most popular open-source diffusion model, with highly innovative architecture design.

Overall Architecture

graph TB
    subgraph "Stable Diffusion Architecture"
        Text[Text Prompt] --> CLIP[CLIP Text Encoder]
        CLIP --> Cross[Cross Attention]
        Image["Input Image Optional"] --> VAE_E[VAE Encoder]
        VAE_E --> Latent[Latent Space z]
        Noise[Random Noise] --> Latent
        Latent --> UNet["U-Net with Cross Attention"]
        Cross --> UNet
        Time[Timestep t] --> UNet
        UNet --> Denoised[Denoised Latent]
        Denoised --> VAE_D[VAE Decoder]
        VAE_D --> Output[Generated Image]
    end

Latent Space Diffusion

The key innovation of Stable Diffusion is performing diffusion in latent space rather than pixel space:

  1. VAE Encoder: Compresses 512×512 images to 64×64 latent space
  2. Latent Diffusion: Performs diffusion process in compressed space
  3. VAE Decoder: Decodes denoised latent vectors back to images

This design reduces computation by approximately 64x, making it possible to run on consumer GPUs.

U-Net Structure

python
class UNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, time_emb_dim, context_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.time_mlp = nn.Linear(time_emb_dim, out_channels)
        self.cross_attn = CrossAttention(out_channels, context_dim)
        self.self_attn = SelfAttention(out_channels)
        
    def forward(self, x, t_emb, context):
        h = self.conv1(x)
        h = h + self.time_mlp(t_emb)[:, :, None, None]
        h = self.self_attn(h)
        h = self.cross_attn(h, context)
        h = self.conv2(h)
        return h

Conditional Generation Mechanism

Stable Diffusion achieves text-conditional control through Cross Attention:

  1. CLIP Encoding: Converts text to 77×768 embedding sequence
  2. Cross Attention: Each layer of U-Net performs cross-attention with text embeddings
  3. Classifier-Free Guidance: Predicts both conditional and unconditional noise to enhance control

python
def classifier_free_guidance(model, x, t, text_emb, guidance_scale=7.5):
    uncond_emb = model.get_unconditional_embedding()
    
    noise_uncond = model(x, t, uncond_emb)
    noise_cond = model(x, t, text_emb)
    
    noise_pred = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    return noise_pred

Diffusion Models vs GAN vs VAE

Comparison of Three Generative Models

graph TB
    subgraph "Generative Model Comparison"
        GAN["GAN Adversarial Training"]
        VAE["VAE Variational Inference"]
        Diffusion["Diffusion Models Denoising Learning"]
    end
    GAN --> G1[Fast generation]
    GAN --> G2[Unstable training]
    GAN --> G3[Mode collapse risk]
    VAE --> V1[Stable training]
    VAE --> V2[Blurry generation]
    VAE --> V3[Continuous latent space]
    Diffusion --> D1[Highest generation quality]
    Diffusion --> D2[Stable training]
    Diffusion --> D3[Slow sampling]

Detailed Comparison Table

| Feature | GAN | VAE | Diffusion Models |
|---|---|---|---|
| Generation Quality | High | Medium | Highest |
| Training Stability | Low | High | High |
| Sampling Speed | Fast (single forward pass) | Fast (single forward pass) | Slow (multi-step iteration) |
| Mode Coverage | Possible mode collapse | Complete coverage | Complete coverage |
| Likelihood Estimation | Cannot compute | Can compute lower bound | Can compute |
| Conditional Generation | Requires extra design | Natural support | Natural support |
| Interpolation Ability | Limited | Excellent | Excellent |

Selection Recommendations

  • For real-time generation: Choose GAN (e.g., StyleGAN)
  • For latent space operations: Choose VAE
  • For highest quality: Choose diffusion models
  • For comprehensive needs: Stable Diffusion (VAE + Diffusion Models)

Application Scenarios

Text-to-Image

Text-to-image is the most widespread application of diffusion models:

python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

prompt = "a beautiful sunset over mountains, digital art, highly detailed"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("sunset.png")

Image-to-Image

Generate new images based on reference images:

python
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("input.png").convert("RGB")
prompt = "a fantasy castle, oil painting style"

image = pipe(
    prompt=prompt,
    image=init_image,
    strength=0.75,
    guidance_scale=7.5
).images[0]

Inpainting

Repair or replace specified regions of images:

python
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("image.png")
mask = Image.open("mask.png")

result = pipe(
    prompt="a cute cat sitting",
    image=image,
    mask_image=mask,
    num_inference_steps=50
).images[0]

ControlNet Conditional Control

Use edges, poses, and other conditions for precise generation control:

python
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

canny_image = get_canny_edges(input_image)
result = pipe(prompt, image=canny_image).images[0]

Practical Guide

Environment Setup

bash
pip install diffusers transformers accelerate torch
pip install xformers  # Optional, for memory optimization

Memory Optimization Tips

python
# Compute attention in slices to lower peak VRAM usage
pipe.enable_attention_slicing()

# Decode latents in slices instead of all at once
pipe.enable_vae_slicing()

# Keep model components on CPU and move them to GPU only when needed
pipe.enable_model_cpu_offload()

# Use xFormers memory-efficient attention kernels (requires xformers)
pipe.enable_xformers_memory_efficient_attention()

Generation Quality Optimization

  1. Prompt Engineering: Use detailed, specific descriptions
  2. Negative Prompts: Exclude unwanted elements
  3. Adjust guidance_scale: Usually 7-12 works well
  4. Increase sampling steps: More steps generally mean better quality
python
result = pipe(
    prompt="masterpiece, best quality, detailed face, " + user_prompt,
    negative_prompt="low quality, blurry, distorted, deformed",
    num_inference_steps=50,
    guidance_scale=7.5,
    width=768,
    height=768
).images[0]


Summary

Key points about diffusion models:

  1. Core Concept: Generate data by learning to reverse the noise-adding process—an elegant "add noise then denoise" design
  2. Forward Process: Fixed Markov chain that gradually adds Gaussian noise until data becomes pure noise
  3. Reverse Process: Neural network learns to predict noise, gradually denoising to recover data
  4. DDPM vs DDIM: DDPM has excellent quality but is slow; DDIM accelerates 10-100x through deterministic sampling
  5. Stable Diffusion: Diffusion in latent space, combining VAE and Cross Attention for efficient text-to-image
  6. Comparative Advantages: More stable training than GAN, higher generation quality than VAE
  7. Rich Applications: Text-to-image, image-to-image, inpainting, ControlNet, and more

Diffusion models represent a major breakthrough in generative AI. Understanding their principles is crucial for mastering AI image generation technology.

FAQ

Why can diffusion models generate high-quality images?

The high quality of diffusion models stems from the simplicity and stability of their training objective. Unlike GAN's adversarial training, diffusion models only need to learn the simple task of predicting noise, avoiding mode collapse. Additionally, the multi-step denoising process allows the model to gradually refine image details, improving generation quality at each step. Furthermore, diffusion models can fully cover the data distribution without missing certain modes.

Should I choose DDPM or DDIM?

It depends on your needs. If you want the highest generation quality and have time to spare, choose DDPM (1000 steps). If you need fast generation or real-time applications, choose DDIM (20-50 steps) with minimal quality loss. DDIM also supports deterministic sampling—the same noise input produces the same output, which is useful in certain scenarios like image editing. In practice, DDIM with 50 steps is usually the best balance between quality and speed.

What hardware does Stable Diffusion require?

Basic operation requires at least 8GB VRAM GPU (e.g., RTX 3060). Recommended configuration is 12GB+ VRAM (e.g., RTX 3080/4070) for better experience. With optimizations like half-precision (float16), attention slicing, and VAE slicing, 6GB VRAM can also work. CPU operation is possible but very slow—generating one image may take several minutes. Apple Silicon Macs can use the MPS backend, with acceptable performance on M1/M2 chips.

How can I improve diffusion model generation results?

Key tips for improvement include: 1) Write detailed, specific prompts including style, quality, and detail descriptions; 2) Use negative prompts to exclude unwanted elements; 3) Adjust guidance_scale—usually 7-12 works well; 4) Increase sampling steps to 50-100; 5) Use high-quality base models or fine-tuned models; 6) Try different samplers (e.g., DPM++ 2M Karras); 7) Use conditional control techniques like ControlNet for precise guidance.

Is training diffusion models expensive?

Training costs depend on model scale and data volume. Training a Stable Diffusion-level model from scratch requires thousands of GPU hours, costing hundreds of thousands of dollars. However, fine-tuning costs much less: LoRA fine-tuning only needs a few hours and a few GB of VRAM; DreamBooth personalization training only requires 20-30 images and a few hours of training time. For most applications, using pre-trained models + fine-tuning is the most economical choice.

Are there copyright risks with AI-generated images?

This is a complex legal issue with varying regulations across countries. Main considerations: 1) Model training data may contain copyrighted works, posing potential infringement risks; 2) Copyright ownership of AI-generated content is unclear, with some countries not recognizing copyright for AI works; 3) If generated content is too similar to existing works, it may constitute infringement. Recommendations for commercial use: choose models trained on compliant data (e.g., Adobe Firefly); avoid deliberately imitating specific artist styles; keep records of the generation process; consult legal professionals when necessary.