TL;DR
Transformer is a revolutionary neural network architecture that uses self-attention mechanisms to process sequential data in parallel, fundamentally transforming natural language processing. This guide covers Transformer's core components (self-attention, positional encoding, encoder-decoder architecture), explains its advantages over RNN/LSTM, and explores how modern large models like GPT and BERT are built on Transformer.
Introduction
In 2017, Google's paper "Attention Is All You Need" introduced the Transformer architecture, an innovation that fundamentally changed the trajectory of artificial intelligence. From ChatGPT to BERT, from machine translation to code generation, virtually all modern AI large models are built on the Transformer foundation.
In this guide, you'll learn:
- Core design principles of Transformer architecture
- Mathematical principles and intuitive understanding of self-attention
- How positional encoding enables models to understand sequence order
- How encoder-decoder architecture works
- Comparative analysis of Transformer vs RNN/LSTM
- The relationship between GPT, BERT and Transformer
What is Transformer
Transformer is a sequence-to-sequence (Seq2Seq) model architecture based on attention mechanisms. Unlike traditional recurrent neural networks, Transformer completely abandons recurrent structures, relying solely on attention mechanisms to capture global dependencies between inputs and outputs.
Why Transformer is So Important
The emergence of Transformer solved several key problems with traditional sequence models:
- Parallel Computation: RNN must process sequentially, while Transformer can process the entire sequence in parallel
- Long-Range Dependencies: Attention mechanisms directly establish connections between any positions
- Scalability: Architecture design enables scaling to billions of parameters
Self-Attention Mechanism Explained
Self-Attention is Transformer's core innovation. It allows the model to attend to all other positions in the sequence when processing each position.
Query, Key, Value Concepts
Self-attention uses three vectors to compute attention:
- Query: What information the current position wants to find
- Key: What information each position contains
- Value: The actual information each position transmits
```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Scaled dot-product attention computation
    Q: Query matrix (seq_len, d_k)
    K: Key matrix (seq_len, d_k)
    V: Value matrix (seq_len, d_v)
    """
    d_k = K.shape[-1]
    # Compute attention scores
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)
    # Softmax normalization
    attention_weights = softmax(scores, axis=-1)
    # Weighted sum
    output = np.matmul(attention_weights, V)
    return output, attention_weights
```
Attention Computation Formula
The mathematical expression for self-attention is:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Where d_k is the dimension of the key vectors. Dividing by √d_k keeps the dot products from growing with the dimension; without it, large scores push softmax into saturated regions where gradients vanish.
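The effect of the scaling factor is easy to see numerically. For random d_k-dimensional vectors, dot products grow like √d_k, and without scaling the softmax collapses toward a one-hot distribution (the sizes here are arbitrary, chosen for illustration):

```python
import numpy as np

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)
keys = rng.standard_normal((4, d_k))

raw = keys @ q               # dot products with magnitude ~sqrt(d_k)
scaled = raw / np.sqrt(d_k)  # variance brought back to ~1

# Unscaled scores saturate the softmax: nearly all weight lands on one key,
# while the scaled scores give a softer, trainable distribution
print(softmax(raw).max(), softmax(scaled).max())
```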
Multi-Head Attention
To enable the model to attend to different types of information, Transformer uses Multi-Head Attention:
```python
import numpy as np

def linear_projection(x, d_out, rng):
    # Stand-in for a learned weight matrix (random here, for illustration;
    # real models train separate projection weights per head)
    W = rng.standard_normal((x.shape[-1], d_out)) / np.sqrt(x.shape[-1])
    return x @ W

def multi_head_attention(Q, K, V, num_heads, d_model, rng=None):
    """
    Multi-head attention mechanism
    """
    if rng is None:
        rng = np.random.default_rng(0)
    d_k = d_model // num_heads
    heads = []
    for i in range(num_heads):
        # Each head uses different linear projections
        Q_i = linear_projection(Q, d_k, rng)
        K_i = linear_projection(K, d_k, rng)
        V_i = linear_projection(V, d_k, rng)
        head_i, _ = scaled_dot_product_attention(Q_i, K_i, V_i)
        heads.append(head_i)
    # Concatenate outputs from all heads
    concat = np.concatenate(heads, axis=-1)
    # Final linear projection
    output = linear_projection(concat, d_model, rng)
    return output
```
Multi-head attention allows the model to simultaneously learn information from different representation subspaces—for example, one head focusing on grammatical structure while another focuses on semantic relationships.
Positional Encoding Principles
Since Transformer lacks recurrent structure, it cannot naturally perceive the position of elements in a sequence. Positional Encoding solves this problem.
Sinusoidal Positional Encoding
The original Transformer uses sine and cosine functions to generate positional encodings:
```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """
    Generate sinusoidal positional encoding
    """
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position * div_term)  # Odd dimensions
    return pe
```
Advantages of this design:
- Each position has a unique encoding
- Model can learn relative positional relationships
- Can extrapolate to sequence lengths unseen during training
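The first and third properties can be checked numerically (the encoding function is re-defined here so the snippet runs standalone):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

pe = positional_encoding(50, 64)

# Every position gets a distinct encoding vector
assert len({tuple(np.round(row, 6)) for row in pe}) == 50

# Values stay in [-1, 1], so the encoding never swamps the token embeddings
assert pe.min() >= -1.0 and pe.max() <= 1.0

# A longer sequence agrees with the shorter one on shared positions,
# which is what makes extrapolation possible
pe_long = positional_encoding(100, 64)
assert np.allclose(pe, pe_long[:50])
```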
Learnable Positional Encoding
Modern models like BERT and GPT use learnable position embeddings:
```python
import numpy as np

class LearnablePositionalEncoding:
    def __init__(self, max_seq_len, d_model):
        # Position embeddings as trainable parameters
        self.position_embeddings = np.random.randn(max_seq_len, d_model) * 0.02

    def __call__(self, x):
        # Add the embedding for each position to the token embeddings
        return x + self.position_embeddings[: x.shape[0]]
```
Encoder-Decoder Architecture
Transformer adopts an encoder-decoder architecture, a classic design for sequence-to-sequence tasks.
Encoder Structure
The encoder consists of N identical stacked layers, each containing:
- Multi-Head Self-Attention Layer: Allows each position to attend to all positions in the input sequence
- Feed-Forward Neural Network: Performs independent non-linear transformations at each position
- Residual Connections and Layer Normalization: Stabilizes the training process
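Putting these three components together, one encoder layer can be sketched in NumPy. The weight matrices below are random stand-ins for learned parameters, and `encoder_layer` is an illustrative name, not a library function:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    # 1) Self-attention: every position attends to every position
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    attn = softmax(scores) @ V
    x = layer_norm(x + attn)            # residual connection + layer norm
    # 2) Position-wise feed-forward network, applied independently per position
    ffn = np.maximum(0, x @ W1) @ W2    # ReLU nonlinearity
    return layer_norm(x + ffn)          # residual connection + layer norm

rng = np.random.default_rng(0)
d, d_ff, n = 16, 64, 5
x = rng.standard_normal((n, d))
out = encoder_layer(
    x,
    *(rng.standard_normal((d, d)) * 0.1 for _ in range(3)),
    rng.standard_normal((d, d_ff)) * 0.1,
    rng.standard_normal((d_ff, d)) * 0.1,
)
print(out.shape)  # (5, 16) — the layer preserves the sequence shape
```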
Decoder Structure
The decoder also consists of N stacked layers, but each layer has three sublayers:
- Masked Multi-Head Self-Attention: Can only attend to already generated positions, preventing information leakage
- Cross-Attention: Attends to encoder output to obtain source sequence information
- Feed-Forward Neural Network: Same as encoder
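The masking in the first sublayer can be sketched directly: a triangular mask sets the scores for future positions to a large negative value before the softmax, so their attention weights become effectively zero. This toy version uses random inputs and omits the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: True above the diagonal, i.e. for "future" positions
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)  # block attention to the future
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
_, w = masked_self_attention(x, x, x)
# The weight matrix is lower-triangular: position i only attends to j <= i
print(np.round(w, 2))
```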
Residual Connections and Layer Normalization
```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_sublayer(x, sublayer_fn):
    """
    Transformer sublayer: residual connection + layer normalization
    """
    # Sublayer computation
    sublayer_output = sublayer_fn(x)
    # Residual connection
    residual = x + sublayer_output
    # Layer normalization
    output = layer_norm(residual)
    return output
```
Transformer vs RNN/LSTM Comparison
| Feature | Transformer | RNN/LSTM |
|---|---|---|
| Parallel Computation | Fully parallel | Must process sequentially |
| Long-Range Dependencies | O(1) path length | O(n) path length |
| Computational Complexity | O(n²·d) | O(n·d²) |
| Training Speed | Fast (parallelizable) | Slow (sequential dependency) |
| Memory Usage | Higher (attention matrix) | Lower |
| Interpretability | Attention weights visualization | Harder to interpret |
Why Transformer Replaced RNN
- Training Efficiency: GPUs excel at parallel computation, and Transformer fully leverages this advantage
- Long Sequence Processing: RNN's vanishing gradient problem limits effective memory length
- Model Capacity: Transformer scales more easily to large parameter counts
Relationship Between GPT, BERT and Other Models
Modern large language models are all based on Transformer architecture but adopt different design choices:
GPT Series (Decoder-Only)
GPT uses Transformer's decoder component with autoregressive text generation:
- Training Objective: Predict next token
- Characteristics: Unidirectional attention, suitable for text generation
- Applications: Dialogue, writing, code generation
BERT (Encoder-Only)
BERT uses Transformer's encoder component with bidirectional attention:
- Training Objective: Masked Language Model (MLM) + Next Sentence Prediction
- Characteristics: Bidirectional context understanding
- Applications: Text classification, question answering, named entity recognition
T5 (Encoder-Decoder)
T5 retains the complete Transformer architecture:
- Training Objective: Text-to-text unified framework
- Characteristics: Flexible handling of various NLP tasks
- Applications: Translation, summarization, question answering
Practical Guide
Using Pre-trained Models
For most applications, using pre-trained models rather than training from scratch is recommended:
```python
from transformers import AutoModel, AutoTokenizer

# Load pre-trained model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Text encoding
text = "Transformer changed natural language processing"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
Fine-tuning Tips
- Learning Rate: Use smaller learning rates (1e-5 to 5e-5)
- Batch Size: Adjust based on GPU memory, typically 16-32
- Training Epochs: 2-4 epochs usually sufficient
- Gradient Accumulation: Use when GPU memory is limited
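The gradient-accumulation tip can be illustrated with a toy linear model (the data and model here are invented for the demonstration): summing scaled micro-batch gradients reproduces the full-batch gradient exactly, so an effective batch of 32 fits in the memory footprint of a batch of 8:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))   # one "full" batch of 32 examples
y = rng.standard_normal(32)
w = np.zeros(4)

def grad(Xb, yb, w):
    # Gradient of mean squared error for a linear model
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient
g_full = grad(X, y, w)

# The same gradient, accumulated over 4 micro-batches of 8
g_acc = np.zeros(4)
for i in range(0, 32, 8):
    g_acc += grad(X[i:i+8], y[i:i+8], w) / 4  # scale by the number of micro-batches

assert np.allclose(g_full, g_acc)
```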
Tool Recommendations
The following tools can improve efficiency during AI development and learning:
- JSON Formatter - Format model configuration files and API responses
- Base64 Encoder/Decoder - Handle encoding of model weights and embedding vectors
- Text Diff Tool - Compare model output differences
- Random Data Generator - Generate test datasets
Summary
Key points of Transformer architecture:
- Self-Attention Mechanism: Achieves global dependency modeling through Query-Key-Value computation
- Positional Encoding: Provides position information for models without recurrent structure
- Encoder-Decoder: Flexible architecture supports multiple task types
- Parallel Computation: Significantly improves training efficiency compared to RNN
- Scalability: Supports scaling to hundreds of billions of parameters
Understanding Transformer architecture is fundamental to mastering modern AI technology. Whether using large language models or developing AI applications, this knowledge is essential.
FAQ
What is the relationship between attention mechanism in Transformer and human attention?
Transformer's attention mechanism is a mathematical abstraction inspired by humans' ability to selectively focus on important information. In the model, attention weights represent the strength of correlation between different positions, similar to how humans focus on keywords when reading. However, this is a computational mechanism fundamentally different from biological neural system attention mechanisms.
Why does Transformer need positional encoding?
Because Transformer's self-attention mechanism is position-agnostic—it only considers relationships between elements without considering their positions in the sequence. Language understanding requires positional information ("dog bites man" and "man bites dog" have completely different meanings), so position information must be explicitly injected through positional encoding.
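This position-agnosticism is easy to verify numerically: without positional encoding, permuting the input rows merely permutes the self-attention output rows the same way. The toy `self_attention` below omits the learned projections for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))  # 5 "tokens" with no position information
perm = rng.permutation(5)

# Shuffling the input just shuffles the output identically:
# attention alone carries no notion of order
out = self_attention(x)
out_perm = self_attention(x[perm])
assert np.allclose(out[perm], out_perm)
```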
Which is better, GPT or BERT?
It depends on the specific task. GPT is suitable for generation tasks (writing, dialogue, code generation) because its autoregressive design naturally fits step-by-step generation. BERT is suitable for understanding tasks (classification, QA, information extraction) because its bidirectional attention better understands context. The modern trend shows GPT-class models can also perform understanding tasks well when scaled up.
Why is Transformer's computational complexity O(n²)?
Self-attention needs to compute attention scores between every pair of positions in the sequence. For a sequence of length n, n×n scores must be computed, hence O(n²) complexity. This is also the main bottleneck when processing very long texts, and much research focuses on developing linear-complexity attention variants.
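A back-of-the-envelope calculation shows how fast the attention matrix grows with sequence length (memory for one float32 score matrix, per head and per layer):

```python
# n*n attention scores, 4 bytes each (float32)
for n in (512, 2048, 8192):
    mib = n * n * 4 / 1024**2
    print(f"n={n}: {mib:.0f} MiB")
# 512 → 1 MiB, 2048 → 16 MiB, 8192 → 256 MiB
```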
How do I choose the right pre-trained model?
When choosing a pre-trained model, consider: 1) Task type (GPT-class for generation, BERT-class for understanding); 2) Language (choose language-specific pre-trained models for non-English tasks); 3) Model size (based on computational resources and latency requirements); 4) Domain (prefer domain-specific pre-trained models when available).