What is DiT?
DiT (Diffusion Transformer) is a generative model architecture that replaces the traditional U-Net backbone in diffusion models with a transformer, enabling superior scaling properties and higher-quality image and video generation.
Quick Facts
| Full Name | Diffusion Transformer |
|---|---|
| Created | 2023 (ICCV; arXiv preprint December 2022) by William Peebles (UC Berkeley) and Saining Xie (New York University), with work done in part at Meta AI (FAIR) |
How It Works
Diffusion Transformers mark a major architectural shift in generative AI. Introduced by Peebles and Xie in 2023, DiT demonstrated that a transformer can serve as the denoiser in a latent diffusion model, and that sample quality improves steadily as model compute grows, outperforming comparable U-Net backbones. The architecture splits the noisy latent into patches, embeds them as tokens, and passes them through transformer blocks that are conditioned via adaptive layer normalization (adaLN). By 2026, DiT and its variants (such as MMDiT, the joint-attention design used in Stable Diffusion 3) underpin many state-of-the-art generation systems, including OpenAI's Sora, Stability AI's Stable Diffusion 3, Black Forest Labs' FLUX 2, and ByteDance's Seedance. The key insight is that the favorable scaling behavior of transformers carries over to diffusion backbones, so quality tends to improve predictably with additional compute.
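The adaLN conditioning can be sketched in a few lines. This is a simplified, standard-library illustration of the adaLN-Zero pattern (normalize, then scale and shift, then gate the residual), not the reference implementation: `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and the `shift`, `scale`, and `gate` vectors are passed in by hand, whereas a real model would regress them from the timestep and class embedding.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_block(x, shift, scale, gate, sublayer):
    """Apply one adaLN-Zero-modulated residual sublayer to a single token vector."""
    h = layer_norm(x)
    # Conditioning enters here: per-channel scale and shift of the normalized token.
    h = [hv * (1.0 + s) + b for hv, s, b in zip(h, scale, shift)]
    h = sublayer(h)
    # The gate scales the sublayer output before it is added back to the input.
    return [xv + g * hv for xv, g, hv in zip(x, gate, h)]

def toy_sublayer(h):
    """Hypothetical stand-in for self-attention or the MLP."""
    return [2.0 * v for v in h]

token = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * 4

# With gate = 0 (the "Zero" initialization) the block starts out as the identity.
at_init = adaln_zero_block(token, zeros, zeros, zeros, toy_sublayer)

# With a non-zero gate, the conditioning vectors change the output.
conditioned = adaln_zero_block(token, [0.1] * 4, [0.5] * 4, [1.0] * 4, toy_sublayer)
```

Starting every block as an identity function is a design choice that tends to make deep residual stacks easier to train; the conditioning then gradually learns how strongly each block should contribute.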
Key Characteristics
- Transformer backbone: replaces the U-Net with standard transformer blocks for better scaling
- Patch-based processing: divides the latent representation into patches, much as Vision Transformers do with images
- Adaptive layer normalization (adaLN): conditions generation on the timestep and class label in the original DiT, and on text embeddings in later variants
- Predictable scaling: sample quality improves steadily with model compute, echoing the scaling trends seen in language models
- Joint attention variants: MMDiT lets text and image tokens attend to each other in one shared sequence
- Flexible resolution: handles variable input sizes more naturally than a U-Net, since a different resolution mainly changes the number of tokens
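Patch-based processing and flexible resolution can be illustrated with a small, dependency-free sketch. The helper names `patchify` and `make_latent` are hypothetical, the latent is a plain nested list rather than a tensor, and the ordering of values inside each token is an arbitrary choice made for this example.

```python
def patchify(latent, patch_size):
    """Split a (channels, height, width) nested-list latent into flattened patch tokens."""
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # One token collects every channel of one patch_size x patch_size square.
            token = [
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            tokens.append(token)
    return tokens

def make_latent(channels, height, width):
    """Build a dummy latent whose values encode their own position."""
    return [[[c * 10000 + y * 100 + x for x in range(width)]
             for y in range(height)] for c in range(channels)]

# A 4 x 32 x 32 latent with patch size 2 gives (32 / 2) ** 2 = 256 tokens,
# each holding 2 * 2 * 4 = 16 values.
tokens_small = patchify(make_latent(4, 32, 32), patch_size=2)

# A larger latent only produces a longer token sequence; the token size is unchanged.
tokens_large = patchify(make_latent(4, 64, 64), patch_size=2)

print(len(tokens_small), len(tokens_small[0]), len(tokens_large))  # 256 16 1024
```

Because the transformer only sees a sequence of tokens, changing the resolution changes the sequence length rather than the architecture, although positional embeddings and attention cost still have to cope with the longer sequence.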
Common Use Cases
- High-resolution image generation: producing photorealistic images at high resolutions
- Video generation: powering temporally consistent video synthesis (Sora, Seedance)
- Text rendering: more legible text in images, as reported for MMDiT-style models such as Stable Diffusion 3
- Multi-modal generation: joint image-text generation with a unified architecture
- Image editing: enabling precise inpainting and outpainting with global context
- 3D generation: extending to multi-view consistent 3D asset creation
Example
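A minimal sketch of where a DiT sits in a generation pipeline: the sampler repeatedly asks a denoiser for the predicted noise and uses it to step toward a clean sample. Everything here is an assumption made for illustration. The sampler is a deterministic DDIM-style update (not necessarily what any given DiT system uses), the linear beta schedule is a common default, and `make_oracle_denoiser` is a hypothetical stand-in that already knows the answer, occupying the slot where a trained transformer would go.

```python
import math

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddim_sample(denoiser, x_start, alpha_bars):
    """Deterministic DDIM-style reverse process (eta = 0).

    denoiser(x_t, t) returns the predicted noise; in a DiT-based system
    this callable would be the transformer.
    """
    x = list(x_start)
    for t in reversed(range(len(alpha_bars))):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = denoiser(x, t)
        # Estimate the clean sample implied by the predicted noise.
        x0_pred = [(xv - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                   for xv, e in zip(x, eps)]
        # Move to the previous (less noisy) timestep.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_pred, eps)]
    return x

def make_oracle_denoiser(target, alpha_bars):
    """Hypothetical stand-in for a trained DiT: it already knows the clean sample."""
    def denoiser(x_t, t):
        a_t = alpha_bars[t]
        return [(xv - math.sqrt(a_t) * tv) / math.sqrt(1.0 - a_t)
                for xv, tv in zip(x_t, target)]
    return denoiser

alpha_bars = make_alpha_bars(50)
target = [1.0, -2.0, 0.5]   # the "clean latent" the oracle steers toward
start = [0.3, -0.7, 1.9]    # arbitrary starting point standing in for noise
result = ddim_sample(make_oracle_denoiser(target, alpha_bars), start, alpha_bars)
```

With a perfect noise prediction the loop lands on `target` (up to floating-point error); a real DiT only approximates that prediction, which is why model quality and step count matter.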
Frequently Asked Questions
Why did DiT replace U-Net in diffusion models?
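U-Nets were the original backbone for diffusion models, but their mix of convolutions, downsampling and upsampling stages, and skip connections makes them harder to scale in a uniform way. DiT showed that a standard transformer scales more cleanly: in the original paper, increasing model compute (larger models or more tokens) consistently lowered FID. Transformers also handle variable resolutions more naturally and benefit from years of optimization research and tooling developed for language and vision models.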
Which products use DiT architecture?
Products built on DiT variants include OpenAI Sora (video), Stable Diffusion 3 (images), FLUX 1 and FLUX 2 by Black Forest Labs (images), and ByteDance Seedance (video). Google Imagen 3 and Veo (images/video) and Midjourney V7 are often listed alongside them, although their architectures are not fully public. SDXL Turbo, by contrast, is a distilled version of SDXL and keeps its U-Net backbone. By 2026, most state-of-the-art generation systems use transformer-based diffusion backbones.
What is MMDiT?
MMDiT (Multi-Modal Diffusion Transformer) is the variant used in Stable Diffusion 3, in which text and image tokens attend to each other through joint self-attention. The original DiT was class-conditional and injected its conditioning through adaLN, while many text-to-image DiT variants bring in the prompt through cross-attention. MMDiT instead treats both modalities as first-class tokens in a shared sequence, with separate weights per modality, enabling deeper text-image alignment.
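The difference between cross-attention and joint attention comes down to which tokens appear as queries and keys. The sketch below is a deliberately stripped-down illustration using only the standard library: the token values are made up, and the learned query, key, and value projections (which MMDiT keeps separate per modality) are omitted so that only the attention pattern is visible.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    dim = len(keys[0])
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                 for key in keys])
        for query in queries
    ]

# Made-up 2-dimensional embeddings, for illustration only.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]               # 2 text tokens
image_tokens = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.2]]  # 3 image tokens

# Cross-attention: image tokens query the text tokens; text never sees the image.
cross = attention_weights(image_tokens, text_tokens)        # 3 rows x 2 columns

# Joint attention: one shared sequence, so every token attends to every token.
joint_sequence = text_tokens + image_tokens
joint = attention_weights(joint_sequence, joint_sequence)   # 5 rows x 5 columns
```

In the joint case the first two rows belong to text tokens and still place weight on the image columns, which is the two-way information flow that cross-attention alone does not provide.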
How large are DiT models?
DiT models range from about 33M parameters (DiT-S) to 30B+ parameters (FLUX 2). The original paper scaled up to DiT-XL/2 (about 675M parameters). OpenAI has not disclosed Sora's size; unofficial estimates put it in the 3-10B range. As with language models, larger DiT models tend to produce better results as compute increases.
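The sizes of the original DiT family can be approximated from depth and hidden size alone. The sketch below is a back-of-the-envelope estimate, not an exact count: it assumes each block costs roughly 18 times the hidden size squared (attention, a 4x MLP, and the adaLN-Zero modulation layer) and ignores biases, embeddings, and the final layer, so the totals come out slightly below the published figures.

```python
# (depth, hidden size) for the four configurations in the original DiT paper.
DIT_CONFIGS = {
    "DiT-S": (12, 384),
    "DiT-B": (12, 768),
    "DiT-L": (24, 1024),
    "DiT-XL": (28, 1152),
}

def approx_block_params(hidden):
    """Rough weight count of one DiT block, ignoring biases."""
    attention = 4 * hidden * hidden    # query, key, value, and output projections
    mlp = 8 * hidden * hidden          # two linear layers with 4x expansion
    modulation = 6 * hidden * hidden   # adaLN-Zero: conditioning -> 6 modulation vectors
    return attention + mlp + modulation

def approx_params(name):
    """Rough total for the transformer blocks of a named configuration."""
    depth, hidden = DIT_CONFIGS[name]
    return depth * approx_block_params(hidden)

for name in DIT_CONFIGS:
    print(f"{name}: ~{approx_params(name) / 1e6:.0f}M")
# DiT-S: ~32M
# DiT-B: ~127M
# DiT-L: ~453M
# DiT-XL: ~669M
```

The estimate shows that the adaLN modulation layers account for about a third of each block's weights, which is a notable cost of this style of conditioning.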
Can DiT generate video?
Yes. Video generation systems like Sora extend DiT from images to video by cutting the video latent into spacetime patches, so each token covers a small region across both space and time. The transformer then processes the spatial and temporal dimensions jointly with attention over these 3D patches. This enables temporally coherent video generation while inheriting DiT's scaling advantages.
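The cost of moving from images to video shows up directly in the token count. The following sketch uses a hypothetical helper and made-up latent sizes (they are not taken from Sora or any other system) to show how spacetime patching lengthens the sequence and how quickly full attention grows with it.

```python
def spacetime_token_count(frames, height, width, patch_t, patch_hw):
    """Tokens produced by cutting a latent video into patch_t x patch_hw x patch_hw blocks."""
    if frames % patch_t or height % patch_hw or width % patch_hw:
        raise ValueError("latent dimensions must be divisible by the patch sizes")
    return (frames // patch_t) * (height // patch_hw) * (width // patch_hw)

# Hypothetical latent sizes, chosen only for illustration.
image_tokens = spacetime_token_count(frames=1, height=32, width=32, patch_t=1, patch_hw=2)
video_tokens = spacetime_token_count(frames=16, height=32, width=32, patch_t=1, patch_hw=2)

# Full self-attention scores every token against every other token,
# so the number of pairwise scores grows with the square of the sequence length.
attention_cost_ratio = (video_tokens ** 2) // (image_tokens ** 2)

print(image_tokens, video_tokens, attention_cost_ratio)  # 256 4096 256
```

Sixteen frames give sixteen times as many tokens but 256 times as many attention scores, which is why video models typically rely on strong latent compression, larger temporal patches, or restricted attention patterns.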