What is DiT?

DiT (Diffusion Transformer) is a generative model architecture that replaces the traditional U-Net backbone in diffusion models with a transformer, enabling superior scaling properties and higher-quality image and video generation.

Quick Facts

Full Name: Diffusion Transformer
Created: 2023 by William Peebles and Saining Xie (Meta/UC Berkeley)
Specification: Official Specification

How It Works

Diffusion Transformers mark a major architectural shift in generative AI. Introduced by Peebles and Xie in 2023, DiT showed that a transformer can serve as the denoiser in a diffusion process and that its sample quality improves steadily as model compute grows, giving it better scaling behavior than U-Nets. The architecture splits a noisy latent into patches, embeds them as tokens, and passes them through transformer blocks whose layer normalization is modulated by the conditioning signal (adaptive layer normalization, adaLN). By 2026, DiT and its variants, such as the MMDiT joint-attention design used in Stable Diffusion 3, underpin many state-of-the-art generation systems, including OpenAI's Sora, Stability AI's Stable Diffusion 3, Black Forest Labs' FLUX 2, and ByteDance's Seedance. The key insight is that the scaling behavior observed for transformers in other domains carries over to diffusion backbones, so quality improves fairly predictably with added compute.

Key Characteristics

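  • Transformer backbone: replaces the U-Net with a stack of standard transformer blocks, which scales more predictably
  • Patch-based processing: divides the latent representation into patches and treats them as tokens, as in Vision Transformers
  • Adaptive layer normalization (adaLN): conditions generation on the timestep and class label by modulating the normalization layers; text conditioning in later variants is usually added through cross-attention or joint attention
  • Predictable scaling: sample quality improves steadily with model compute, similar in spirit to the scaling trends seen in language models
  • Joint attention variants: MMDiT lets text and image tokens attend to each other in a single attention operation
  • Flexible resolution: a token sequence adapts to different input sizes more naturally than a fixed U-Net layout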
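
The patch-based processing listed above can be sketched in NumPy as follows. This is an illustrative sketch, not the reference implementation: the function names `patchify` and `unpatchify` are hypothetical, the 4 x 32 x 32 latent and patch size 2 are assumed example values, and the learned linear projection that normally follows patch extraction is omitted.

```python
import numpy as np

def patchify(latent: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a (C, H, W) latent into (num_patches, patch_size * patch_size * C) tokens."""
    c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = latent.reshape(c, gh, patch_size, gw, patch_size)
    x = x.transpose(1, 3, 2, 4, 0)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def unpatchify(tokens: np.ndarray, patch_size: int, channels: int,
               height: int, width: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (C, H, W) latent from the token sequence."""
    gh, gw = height // patch_size, width // patch_size
    x = tokens.reshape(gh, gw, patch_size, patch_size, channels)
    x = x.transpose(4, 0, 2, 1, 3)  # (c, gh, p, gw, p)
    return x.reshape(channels, height, width)

# Assumed example: a 4-channel 32x32 latent cut into 2x2 patches.
latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent, patch_size=2)
print(tokens.shape)  # (256, 16): 16 * 16 patches, each holding 2 * 2 * 4 values
```

Halving the patch size quadruples the number of tokens, which is why smaller patches cost more compute in attention.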

Common Use Cases

  1. High-resolution image generation: producing photorealistic images at multi-megapixel resolutions
  2. Video generation: powering temporally consistent video synthesis (Sora, Seedance)
  3. Text rendering: more legible text inside images, commonly attributed to stronger text conditioning such as joint attention
  4. Multi-modal generation: joint image-text generation with a unified architecture
  5. Image editing: inpainting and outpainting that draw on global context through attention
  6. 3D generation: extending to multi-view-consistent 3D asset creation

Example
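
A minimal NumPy sketch of a single DiT block with adaLN-Zero conditioning is given here as an illustration. It makes several simplifying assumptions: attention is single-head, ReLU stands in for GELU, the conditioning vector `c` is random rather than a real timestep-plus-class embedding, and all names (`dit_block`, `init_params`, the parameter keys) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last axis, with no learned scale or bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def silu(x):
    return x / (1.0 + np.exp(-x))

def self_attention(x, w_qkv, w_out):
    """Single-head self-attention over a (tokens, dim) sequence."""
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ w_out

def mlp(x, w1, w2):
    """Two-layer feed-forward network (ReLU stands in for GELU here)."""
    return np.maximum(x @ w1, 0.0) @ w2

def dit_block(x, c, p):
    """x: (tokens, dim) patch tokens, c: (dim,) conditioning vector."""
    # Regress shift, scale and gate vectors for both sub-layers from c.
    mod = silu(c) @ p["w_mod"] + p["b_mod"]
    shift1, scale1, gate1, shift2, scale2, gate2 = np.split(mod, 6)
    h = layer_norm(x) * (1.0 + scale1) + shift1
    x = x + gate1 * self_attention(h, p["w_qkv"], p["w_out"])
    h = layer_norm(x) * (1.0 + scale2) + shift2
    return x + gate2 * mlp(h, p["w1"], p["w2"])

def init_params(dim, rng, std=0.02):
    return {
        "w_qkv": rng.standard_normal((dim, 3 * dim)) * std,
        "w_out": rng.standard_normal((dim, dim)) * std,
        "w1": rng.standard_normal((dim, 4 * dim)) * std,
        "w2": rng.standard_normal((4 * dim, dim)) * std,
        # The "Zero" in adaLN-Zero: the modulation layer starts at zero.
        "w_mod": np.zeros((dim, 6 * dim)),
        "b_mod": np.zeros(6 * dim),
    }

rng = np.random.default_rng(0)
dim, num_tokens = 16, 8
x = rng.standard_normal((num_tokens, dim))  # patch tokens
c = rng.standard_normal(dim)                # stand-in for timestep + class embedding
params = init_params(dim, rng)

out_at_init = dit_block(x, c, params)       # gates are zero, so the block is an identity
params["w_mod"] = rng.standard_normal((dim, 6 * dim)) * 0.1
out_modulated = dit_block(x, c, params)     # conditioning now changes the output
```

Because the modulation weights start at zero, every gate is zero at initialization and each block initially passes its input through unchanged.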


Frequently Asked Questions

Why did DiT replace U-Net in diffusion models?

U-Nets were the original backbone for diffusion models, but their mix of convolutions, down- and up-sampling stages, and skip connections makes them harder to scale in a uniform way. DiT showed that with a standard transformer, increasing model compute (deeper or wider networks, or more tokens) reliably improves sample quality. Transformers also handle variable resolutions more naturally and benefit from years of optimization work on training and inference.

Which products use DiT architecture?

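Products built on, or reported to use, DiT variants include OpenAI Sora (video), Stable Diffusion 3 (images), FLUX 1/2 by Black Forest Labs (images), and ByteDance Seedance (video). Google Imagen 3 and Veo (images/video) and Midjourney V7 are often listed as well, although their architectures have not been published in detail. SDXL Turbo is not a DiT model: it is distilled from SDXL and keeps a U-Net backbone. Most state-of-the-art generation systems in 2026 use transformer-based diffusion backbones.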

What is MMDiT?

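MMDiT (Multi-Modal Diffusion Transformer) is a variant used in Stable Diffusion 3 where text and image tokens attend to each other through joint self-attention. The original DiT conditions on class labels through adaLN, and earlier text-conditioned derivatives typically inject text through cross-attention. MMDiT instead keeps separate weights for each modality but concatenates both token sets into one sequence for the attention operation, treating both modalities as first-class tokens and enabling deeper text-image alignment.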
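
The joint attention described above can be sketched as follows. This is a simplified single-head illustration under stated assumptions, not Stable Diffusion 3's actual code: the function name `joint_attention`, the parameter keys, and the token counts (64 image tokens, 7 text tokens) are all hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(img, txt, p):
    """img: (n_img, d) image tokens, txt: (n_txt, d) text tokens."""
    n_txt = txt.shape[0]
    # Each modality is projected with its own weights...
    q = np.concatenate([txt @ p["txt_q"], img @ p["img_q"]], axis=0)
    k = np.concatenate([txt @ p["txt_k"], img @ p["img_k"]], axis=0)
    v = np.concatenate([txt @ p["txt_v"], img @ p["img_v"]], axis=0)
    # ...but attention runs once over the combined sequence.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = weights @ v
    # Split back into per-modality streams and apply separate output projections.
    txt_out = mixed[:n_txt] @ p["txt_o"]
    img_out = mixed[n_txt:] @ p["img_o"]
    return img_out, txt_out, weights

rng = np.random.default_rng(0)
dim = 16
p = {f"{m}_{n}": rng.standard_normal((dim, dim)) * 0.1
     for m in ("txt", "img") for n in "qkvo"}
img = rng.standard_normal((64, dim))
txt = rng.standard_normal((7, dim))

img_out, txt_out, weights = joint_attention(img, txt, p)
print(weights.shape)  # (71, 71): every token can attend to both modalities
```

In a cross-attention design only the image tokens would query the text; here the text tokens are also updated by the image tokens in the same operation.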

How large are DiT models?

DiT models range from roughly 33M parameters (DiT-S) to 30B+ parameters (FLUX 2). The original paper tested configurations up to DiT-XL/2 (~675M params). OpenAI has not disclosed Sora's size; outside estimates put it at 3-10B parameters. As with language models, larger DiT models generally produce better results, with quality improving steadily as compute increases.
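
As a rough sanity check on these sizes, the parameter count of a DiT can be approximated from its depth and hidden width. The sketch below counts only the large weight matrices in each block and ignores biases, embeddings, and the final layer. The function name `approx_dit_params` is hypothetical, and the depth and width values are the DiT-S/B/L/XL configurations as commonly reported for the original paper, so treat them as assumptions.

```python
def approx_dit_params(depth: int, hidden: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of a DiT with adaLN-Zero blocks."""
    attn = 4 * hidden * hidden             # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden * hidden  # two linear layers in the feed-forward network
    adaln = 6 * hidden * hidden            # conditioning vector -> 6 modulation vectors
    return depth * (attn + mlp + adaln)

# (depth, hidden width) per model size; assumed values, see the note above.
configs = {
    "DiT-S": (12, 384),
    "DiT-B": (12, 768),
    "DiT-L": (24, 1024),
    "DiT-XL": (28, 1152),
}

for name, (depth, hidden) in configs.items():
    print(f"{name}: ~{approx_dit_params(depth, hidden) / 1e6:.0f}M")
# DiT-S: ~32M
# DiT-B: ~127M
# DiT-L: ~453M
# DiT-XL: ~669M
```

The DiT-XL estimate lands close to the ~675M figure quoted for DiT-XL/2; the remainder comes from the terms this approximation leaves out.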

Can DiT generate video?

Yes. Video generation systems like Sora extend DiT to video by treating a clip as a sequence of spacetime patches, each covering a small spatial region across a few frames. The transformer attends over spatial and temporal positions jointly, which enables temporally coherent video generation while inheriting DiT's scaling advantages.
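
One way to picture spacetime patches is as a three-dimensional version of image patchification. The sketch below is illustrative only: Sora's actual patch sizes and latent shapes are not public, so the 16-frame 4 x 32 x 32 latent, the patch sizes, and the function name `spacetime_patchify` are all assumptions.

```python
import numpy as np

def spacetime_patchify(video: np.ndarray, pt: int, p: int) -> np.ndarray:
    """Split a (T, C, H, W) video latent into (num_tokens, pt * p * p * C) tokens."""
    t, c, h, w = video.shape
    assert t % pt == 0 and h % p == 0 and w % p == 0
    gt, gh, gw = t // pt, h // p, w // p
    x = video.reshape(gt, pt, c, gh, p, gw, p)
    x = x.transpose(0, 3, 5, 1, 4, 6, 2)  # (gt, gh, gw, pt, p, p, c)
    return x.reshape(gt * gh * gw, pt * p * p * c)

# Assumed example: 16 latent frames, 4 channels, 32x32 spatial grid.
video = np.arange(16 * 4 * 32 * 32, dtype=np.float32).reshape(16, 4, 32, 32)
tokens = spacetime_patchify(video, pt=2, p=2)
print(tokens.shape)  # (2048, 32): 8 * 16 * 16 patches of 2 frames x 2 x 2 pixels x 4 channels
```

The token count grows with clip length as well as resolution, so the cost of full attention over all spacetime tokens rises quickly for longer videos.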

Related Terms

Diffusion Model

Diffusion Model is a class of generative deep learning models that learn to generate data by gradually denoising a normally distributed variable, reversing a forward diffusion process that progressively adds Gaussian noise to training data until it becomes pure noise.
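
The forward noising process in this definition has a closed form, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, which the sketch below illustrates. The linear noise schedule from 1e-4 to 0.02 over 1000 steps is an assumed, commonly used default, and `q_sample` is a hypothetical helper name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left at each step

def q_sample(x0: np.ndarray, t: int, noise: np.ndarray) -> np.ndarray:
    """Jump directly from clean data x0 to the noisy sample at step t."""
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))  # stand-in for a clean latent
noise = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, noise)      # still almost identical to x0
x_late = q_sample(x0, T - 1, noise)    # almost pure Gaussian noise
```

A denoiser such as a DiT is typically trained to recover `noise` (or an equivalent target) from the noisy sample and the step index, and generation runs the process in reverse starting from pure noise.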

Transformer

Transformer is a deep learning architecture introduced in the landmark paper 'Attention Is All You Need' (2017) by Google researchers, which revolutionized natural language processing by replacing recurrent neural networks with a self-attention mechanism that enables parallel processing of sequential data and captures long-range dependencies more effectively.

Text-to-Image

Text-to-Image is an artificial intelligence technology that generates visual images from natural language text descriptions, using deep learning models to interpret textual prompts and synthesize corresponding photorealistic or artistic images.

Attention Mechanism

Attention Mechanism is a neural network technique that enables models to dynamically focus on relevant parts of the input data by computing weighted importance scores, allowing the network to selectively attend to the most pertinent information when making predictions or generating outputs. The three primary variants are Self-Attention (each position attends to all positions within the same sequence), Cross-Attention (one sequence attends to another, e.g., decoder attending to encoder outputs), and Multi-Head Attention (multiple parallel attention operations with independent learned projections that jointly capture different types of relationships). Attention is the core building block of the Transformer architecture and underpins virtually all modern large language models (GPT, Claude, Gemini, LLaMA), vision transformers (ViT, DINO), and multimodal models.
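
The three variants named above can share one function, as in this NumPy sketch. It is a simplified illustration: the learned query, key, value, and output projections are omitted, heads are formed by slicing the feature dimension, and the shapes are arbitrary example values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: q is (n_q, d), k is (n_k, d), v is (n_k, d_v)."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # importance scores, rows sum to 1
    return weights @ v, weights

def multi_head_attention(q, k, v, num_heads):
    """Run attention independently on slices of the feature dimension, then concatenate."""
    heads = [
        attention(qh, kh, vh)[0]
        for qh, kh, vh in zip(
            np.split(q, num_heads, axis=-1),
            np.split(k, num_heads, axis=-1),
            np.split(v, num_heads, axis=-1),
        )
    ]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))    # one sequence of 5 tokens
ctx = rng.standard_normal((3, 8))  # a second sequence of 3 tokens

self_out, self_w = attention(x, x, x)         # self-attention: x attends to itself
cross_out, cross_w = attention(x, ctx, ctx)   # cross-attention: x attends to ctx
multi_out = multi_head_attention(x, x, x, num_heads=2)
```

The only difference between self-attention and cross-attention here is where the keys and values come from.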
