TL;DR

RLHF (Reinforcement Learning from Human Feedback) is the core technology for aligning large language models with human preferences. This guide details the three key stages of RLHF: Supervised Fine-Tuning (SFT), Reward Model training, and PPO policy optimization. It also provides in-depth analysis of reward model design, RLHF vs DPO comparison, and the RLHF practices behind products like ChatGPT.

Introduction

When you chat with ChatGPT, you might be amazed at how helpful and safe its responses are. The key technology behind this is RLHF—a training method that teaches AI to understand "what makes a good answer."

Traditional language model training simply teaches models to predict the next word, but this doesn't guarantee that model outputs align with human expectations. RLHF introduces human feedback to help models learn to generate content that humans actually want.

In this guide, you will learn:

  • The core principles of RLHF and why human feedback is needed
  • Complete analysis of the three-stage RLHF training process
  • Design and training of Reward Models
  • Specific applications of PPO algorithm in RLHF
  • Comparison between RLHF and alternatives like DPO
  • RLHF practices in InstructGPT and ChatGPT

What is RLHF

Definition of RLHF

RLHF is a model training method that combines reinforcement learning with human feedback. Its core idea is: through human preference judgments on model outputs, train a reward model to evaluate output quality, then use reinforcement learning to optimize the language model to generate content that receives high rewards.

mermaid
flowchart LR
    A[Pre-trained Model] --> B["Stage 1: SFT"]
    B --> C["Stage 2: Train Reward Model"]
    C --> D["Stage 3: PPO RL"]
    D --> E[Aligned Model]
    F[Human Demonstration Data] --> B
    G[Human Preference Data] --> C
    H[Reward Signal] --> D

Why Human Feedback is Needed

Although pre-trained language models are powerful, they have several key issues:

| Problem | Manifestation | RLHF Solution |
| --- | --- | --- |
| Harmful Content | May generate toxic, biased content | Learn human preferences for safe content |
| Hallucination | Fabricates non-existent facts | Reward honest responses that acknowledge uncertainty |
| Poor Instruction Following | Cannot accurately understand user intent | Learn correct instruction understanding through human feedback |
| Inappropriate Output Style | Too verbose or too brief | Learn human-preferred response styles |
| Value Misalignment | Not aligned with human values | Inject human values into the model |

Core Advantages of RLHF

Compared to traditional training methods, RLHF has unique advantages:

  1. Capturing Implicit Preferences: Humans find it hard to explicitly describe what makes a "good answer," but can easily compare two answers
  2. Continuous Improvement: Can continuously optimize the model based on new human feedback
  3. Safety Alignment: Effectively reduces harmful outputs and improves model safety
  4. User Experience Optimization: Generates content that better meets user expectations

RLHF Three Stages Explained

Stage One: Supervised Fine-Tuning (SFT)

Supervised fine-tuning is the first step of RLHF, aimed at teaching the pre-trained model to generate responses following instruction formats.

mermaid
flowchart TB
    subgraph SG_SFT_Training_Process["SFT Training Process"]
        A[Pre-trained Model] --> B[Collect Demonstration Data]
        B --> C[Human Annotators Write High-Quality Responses]
        C --> D[Supervised Learning Training]
        D --> E[SFT Model]
    end
    subgraph SG_Data_Example["Data Example"]
        F["Prompt: Explain what machine learning is"]
        G["Response: Machine learning is a branch of AI..."]
    end

SFT Data Preparation Key Points:

python
sft_data_example = {
    "prompt": "Explain quantum computing in simple terms",
    "response": """Quantum computing is a new type of computing that uses 
principles of quantum mechanics.

Unlike traditional computers that use bits of 0 and 1, quantum computers 
use quantum bits (qubits), which can exist in a superposition of 0 and 1 
simultaneously. This gives quantum computers exponential computational 
advantages for certain specific problems.

Main application areas include: cryptography, drug discovery, financial 
modeling, and more."""
}

Key Parameters for SFT Training:

| Parameter | Recommended Value | Description |
| --- | --- | --- |
| Learning Rate | 1e-5 ~ 5e-5 | Lower learning rate to avoid forgetting pre-trained knowledge |
| Training Epochs | 1-3 epochs | Prevent overfitting |
| Data Volume | 10K-100K | High-quality demonstration data |
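
The SFT loss itself is the same next-token cross-entropy used in pre-training, but it is usually computed only on the response tokens; the prompt positions are masked out of the labels. A minimal PyTorch sketch of that masking (the helper names are illustrative, not from any specific library):

```python
import torch
import torch.nn.functional as F

def build_sft_labels(input_ids, prompt_len):
    """Copy the inputs and mask prompt positions with -100,
    which cross_entropy ignores. (Illustrative helper; assumes
    input_ids is [prompt tokens | response tokens].)"""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100
    return labels

def sft_loss(logits, labels):
    """Next-token cross-entropy: position t predicts token t+1."""
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Toy shapes stand in for a real model's output
torch.manual_seed(0)
batch, seq, vocab, prompt_len = 2, 10, 50, 4
input_ids = torch.randint(0, vocab, (batch, seq))
logits = torch.randn(batch, seq, vocab)
labels = build_sft_labels(input_ids, prompt_len)
loss = sft_loss(logits, labels)
```

Masking the prompt keeps the model from spending capacity on reproducing user input, which is one reason low epoch counts suffice.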

Stage Two: Reward Model Training

The Reward Model (RM) is the core component of RLHF, learning human preferences to evaluate the quality of model outputs.

mermaid
flowchart TB
    subgraph SG_Data_Collection["Data Collection"]
        A[SFT Model] --> B[Generate Multiple Responses for Same Prompt]
        B --> C[Human Annotators Rank Preferences]
        C --> D["Preference Data: A > B > C"]
    end
    subgraph SG_Reward_Model_Trainin["Reward Model Training"]
        D --> E[Construct Comparison Pairs]
        E --> F["(prompt, chosen, rejected)"]
        F --> G[Bradley-Terry Model Training]
        G --> H[Reward Model RM]
    end

Mathematical Principles of Reward Model:

The reward model is based on the Bradley-Terry model, learning to predict which of two responses humans prefer:

code
P(y_chosen > y_rejected | x) = σ(r(x, y_chosen) - r(x, y_rejected))

Where:

  • r(x, y) is the reward model's score for input x and output y
  • σ is the sigmoid function
  • The training objective is to maximize this probability

Reward Model Training Code Example:

python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

class RewardModel(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name, 
            num_labels=1
        )
    
    def forward(self, input_ids, attention_mask):
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        return outputs.logits

def compute_reward_loss(reward_model, chosen_ids, rejected_ids, 
                        chosen_mask, rejected_mask):
    """Compute reward model loss function"""
    chosen_rewards = reward_model(chosen_ids, chosen_mask)
    rejected_rewards = reward_model(rejected_ids, rejected_mask)
    
    # logsigmoid(x) = log(sigmoid(x)), computed in a numerically stable way
    loss = -torch.nn.functional.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss
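
A quick standalone check (no model required) confirms the loss behaves as intended: the wider the reward margin in favor of the chosen response, the smaller the loss.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(chosen_rewards, rejected_rewards):
    # Same Bradley-Terry objective, applied to raw reward tensors
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

narrow = pairwise_loss(torch.tensor([0.1]), torch.tensor([0.0]))  # barely preferred
wide = pairwise_loss(torch.tensor([3.0]), torch.tensor([0.0]))    # clearly preferred
```

With a margin of 0.1 the loss is about 0.64; with a margin of 3.0 it drops to about 0.05, so gradient pressure concentrates on pairs the model still gets wrong.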

Best Practices for Preference Data Collection:

  1. Diverse Annotators: Avoid single-perspective bias
  2. Clear Annotation Guidelines: Define what makes a "better" response
  3. Quality Control: Regularly check annotation consistency
  4. Data Volume: Typically need 50K-500K comparison pairs

Stage Three: PPO Policy Optimization

PPO (Proximal Policy Optimization) is the reinforcement learning algorithm most commonly used in RLHF; it updates the language model so that its outputs earn higher scores from the reward model.

mermaid
flowchart TB
    subgraph SG_PPO_Training_Loop["PPO Training Loop"]
        A[Policy Model π] --> B[Generate Response]
        B --> C[Reward Model Scoring]
        C --> D[Compute Advantage Function]
        D --> E[PPO Loss Calculation]
        E --> F[Update Policy Model]
        F --> A
    end
    subgraph SG_Constraint_Mechanism["Constraint Mechanism"]
        G[Reference Model π_ref] --> H[KL Divergence Penalty]
        H --> E
    end

PPO Objective Function in RLHF:

code
maximize E[r(x, y) - β * KL(π || π_ref)]

Where:

  • r(x, y) is the reward model's score
  • KL(π || π_ref) is the KL divergence between current and reference policies
  • β is the KL penalty coefficient to prevent the model from deviating too far
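
In implementations this objective is typically turned into a per-token reward: a KL penalty at every token, plus the reward model's score added at the final token of the response. A sketch of that shaping (the tensor names are assumptions; the per-token pattern follows common practice, e.g. in the TRL library):

```python
import torch

def kl_shaped_rewards(rm_score, logprobs, ref_logprobs, beta=0.2):
    """Per-token RLHF reward: -beta * (log pi - log pi_ref) at every
    token, plus the scalar reward-model score at the last token."""
    rewards = -beta * (logprobs - ref_logprobs)  # per-token KL penalty estimate
    rewards[..., -1] += rm_score                 # RM scores the full response once
    return rewards

# Three response tokens; the policy has drifted on tokens 0 and 2
logprobs = torch.tensor([-1.0, -0.5, -2.0])
ref_logprobs = torch.tensor([-1.2, -0.5, -1.0])
r = kl_shaped_rewards(rm_score=1.5, logprobs=logprobs, ref_logprobs=ref_logprobs)
```

Note the sign: where the policy assigns *higher* log-probability than the reference (token 0), the penalty is negative; where it assigns lower (token 2), the term is positive, gently pulling the policy back toward π_ref.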

Why KL Divergence Constraint is Needed:

| Problems Without KL Constraint | Role of KL Constraint |
| --- | --- |
| Model may find "cheating" ways to get high rewards | Maintain output diversity |
| Outputs may become unnatural | Maintain language fluency |
| Reward model may be exploited | Prevent reward hacking |

Key Hyperparameters for PPO Training:

python
ppo_config = {
    "learning_rate": 1e-5,
    "batch_size": 64,
    "mini_batch_size": 16,
    "ppo_epochs": 4,
    "kl_penalty": "kl",
    "init_kl_coef": 0.2,
    "target_kl": 6.0,
    "clip_range": 0.2,
    "value_clip_range": 0.2,
    "gamma": 1.0,
    "lam": 0.95,
}
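
The `gamma` and `lam` entries above drive Generalized Advantage Estimation (GAE), which converts per-token rewards and value estimates into the advantages used by the PPO loss. A self-contained sketch of that backward recursion (variable names are illustrative):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation.
    `values` carries one extra bootstrap entry beyond `rewards`."""
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at position t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# In RLHF the only non-zero reward is typically at the final token
rewards = torch.tensor([0.0, 0.0, 1.0])
values = torch.tensor([0.5, 0.6, 0.7, 0.0])  # last entry bootstraps the final step
adv = gae_advantages(rewards, values)
```

`lam` trades bias for variance: `lam=0` uses only one-step TD errors, `lam=1` uses full returns; 0.95 is a common middle ground.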

Reward Model Deep Dive

Reward Model Architecture Design

The reward model is typically based on the same pre-trained model as the policy model, but with the output layer changed to a scalar reward value:

code
┌─────────────────────────────────────────────┐
│           Reward Model Architecture          │
├─────────────────────────────────────────────┤
│  Input: [prompt + response]                  │
│           ↓                                  │
│  Transformer Encoder (shared pre-trained)    │
│           ↓                                  │
│  Last Token Hidden State                     │
│           ↓                                  │
│  Linear Layer → Scalar Reward Value          │
└─────────────────────────────────────────────┘
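
The "last token hidden state" step in the diagram needs care with padding: for each sequence in a batch, the reward head should read the hidden state of the last *real* token, not the last position. A sketch of that indexing (shapes and names are illustrative):

```python
import torch
import torch.nn as nn

def last_token_reward(hidden_states, attention_mask, head):
    """Select each sequence's last non-padding hidden state and
    project it to a scalar reward. `head` is a Linear(d, 1)."""
    seq_lens = attention_mask.sum(dim=1) - 1          # index of last real token
    batch_idx = torch.arange(hidden_states.size(0))
    last_hidden = hidden_states[batch_idx, seq_lens]  # (batch, d)
    return head(last_hidden).squeeze(-1)              # (batch,)

torch.manual_seed(0)
hidden = torch.randn(2, 5, 8)                         # (batch, seq, d)
mask = torch.tensor([[1, 1, 1, 0, 0],                 # sequence 1 has 2 pad tokens
                     [1, 1, 1, 1, 1]])
head = nn.Linear(8, 1)
rewards = last_token_reward(hidden, mask, head)
```

Reading the reward off a padding position is a subtle but common bug, since pad embeddings carry no information about the response.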

Common Issues with Reward Models

1. Reward Hacking

The model may learn to exploit vulnerabilities in the reward model, generating outputs that receive high rewards but are actually low quality.

Solutions:

  • Increase KL divergence penalty
  • Use ensemble of multiple reward models
  • Regularly update the reward model

2. Out-of-Distribution Generalization

The reward model may perform poorly on inputs outside the training distribution.

Solutions:

  • Expand training data diversity
  • Use uncertainty estimation
  • Limit policy model exploration range
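
One concrete form of "use uncertainty estimation" is a reward-model ensemble: score each output with several independently trained reward models and subtract their disagreement, so the policy earns less reward where the ensemble is unsure. A minimal sketch (the `penalty` weighting is an assumption, not a standard value):

```python
import torch

def conservative_reward(scores, penalty=1.0):
    """Ensemble mean minus ensemble disagreement.
    scores: (num_models, batch) tensor of per-model rewards."""
    mean = scores.mean(dim=0)
    std = scores.std(dim=0)  # disagreement as a proxy for uncertainty
    return mean - penalty * std

# Two candidate responses: models agree on the first, disagree on the second
scores = torch.tensor([[1.0, 2.0],
                       [1.0, 0.0],
                       [1.0, 1.0]])
r = conservative_reward(scores)
```

Both responses have the same mean reward, but the contested one is discounted, which blunts reward hacking on inputs far from the training distribution.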

RLHF vs DPO Comparison

DPO (Direct Preference Optimization) is a simplified alternative to RLHF that directly optimizes the policy from preference data without training a separate reward model.

mermaid
flowchart TB
    subgraph SG_RLHF_Pipeline["RLHF Pipeline"]
        A1[SFT] --> B1[Train Reward Model]
        B1 --> C1[PPO Optimization]
        C1 --> D1[Aligned Model]
    end
    subgraph SG_DPO_Pipeline["DPO Pipeline"]
        A2[SFT] --> B2[Direct Preference Optimization]
        B2 --> D2[Aligned Model]
    end

Detailed Comparison

| Dimension | RLHF | DPO |
| --- | --- | --- |
| Training Complexity | High (three stages) | Low (two stages) |
| Computational Resources | Requires multiple models running simultaneously | Only needs one model |
| Stability | Difficult to tune, may be unstable | More stable, similar to supervised learning |
| Flexibility | Reward model can be reused | Requires retraining each time |
| Performance Ceiling | Theoretically higher | Approaches RLHF on some tasks |
| Use Cases | Large-scale production | Research and rapid iteration |

Mathematical Principles of DPO

The core insight of DPO is that the optimal policy can be directly expressed using preference data, without an explicit reward model:

code
L_DPO = -E[log σ(β * (log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))]

Where y_w is the preferred response and y_l is the non-preferred response.
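
The loss above takes only a few lines to implement once each response's sequence-level log-probability (the sum of its token log-probs) is available under both the policy and the reference model. A sketch using scalar stand-ins for those log-probabilities:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on sequence-level log-probabilities."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Policy already prefers y_w more strongly than the reference does
policy_w, policy_l = torch.tensor([-10.0]), torch.tensor([-14.0])
ref_w, ref_l = torch.tensor([-12.0]), torch.tensor([-12.0])
loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
```

At initialization (policy = reference) every ratio is zero and the loss is log 2 ≈ 0.693; as the implicit preference margin grows, the loss falls below that, which makes training feel much like supervised learning.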

How to Choose

Choose RLHF:

  • Need fine-grained control over reward function
  • Have sufficient computational resources
  • Pursuing best results
  • Reward model needs to be reused

Choose DPO:

  • Limited computational resources
  • Rapid iteration and validation
  • Sufficient preference data
  • Pursuing training stability

InstructGPT and ChatGPT RLHF Practices

InstructGPT's Three-Stage Training

OpenAI's 2022 InstructGPT paper described its RLHF pipeline in detail:

mermaid
flowchart TB
    subgraph SG_Stage1_SFT["Stage1-SFT"]
        A[GPT-3 175B] --> B[13K Demonstration Data]
        B --> C[Supervised Fine-tuning]
    end
    subgraph SG_Stage2_RM["Stage2-RM"]
        C --> D[Generate Multiple Responses]
        D --> E[33K Comparison Data]
        E --> F[Train 6B Reward Model]
    end
    subgraph SG_Stage3_PPO["Stage3-PPO"]
        F --> G[PPO Optimization]
        C --> G
        G --> H[InstructGPT]
    end

Key Data Statistics:

| Stage | Data Volume | Number of Annotators |
| --- | --- | --- |
| SFT | 13,000 samples | 40 people |
| RM | 33,000 comparisons | 40 people |
| PPO | 31,000 prompts | - |

ChatGPT Improvements

ChatGPT made several improvements based on InstructGPT:

  1. Dialogue Format Optimization: Specifically trained for multi-turn conversation scenarios
  2. Enhanced Safety: Stricter harmful content filtering
  3. Continuous Iteration: Constantly updated based on user feedback
  4. Scale Expansion: Larger models and more training data

RLHF Challenges and Limitations

Main Challenges

1. Quality and Consistency of Human Feedback

code
┌─────────────────────────────────────────────┐
│        Human Feedback Challenges             │
├─────────────────────────────────────────────┤
│  • Inconsistency between annotators          │
│  • Personal biases of annotators             │
│  • Complex tasks difficult to judge          │
│  • High annotation costs                     │
│  • Quality degradation from annotator fatigue│
└─────────────────────────────────────────────┘

2. Limitations of Reward Models

  • May not capture all dimensions of human preferences
  • Easily exploited by policy models (reward hacking)
  • Limited out-of-distribution generalization

3. Training Instability

  • PPO training requires fine-tuned parameters
  • Complexity of coordinating multiple model training
  • High computational resource requirements

Solutions and Future Directions

| Challenge | Current Solutions | Future Directions |
| --- | --- | --- |
| Annotation Consistency | Detailed guidelines, quality control | AI-assisted annotation |
| Reward Hacking | KL constraints, multiple reward models | More robust reward design |
| Training Instability | Careful tuning, progressive training | More stable algorithms (e.g., DPO) |
| High Costs | Active learning, data augmentation | Automated feedback collection |

FAQ

What's the difference between RLHF and regular fine-tuning?

Regular fine-tuning (like SFT) learns directly from annotated data, where the model learns "standard answers." RLHF learns through human preference comparisons, where the model learns "what kind of answer is better." RLHF can capture implicit preferences that are difficult to explicitly describe, such as helpfulness, safety, and style of responses.

How much preference data is needed to train a reward model?

The data volume depends on task complexity and expected results. OpenAI's InstructGPT used about 33,000 comparison pairs. Generally, it's recommended to prepare at least 10,000-50,000 high-quality preference comparison pairs. The key is data diversity and annotation quality, not just pursuing quantity.

What to do if PPO training is unstable?

PPO training instability is a common issue. Recommendations: 1) Use smaller learning rates (1e-6 to 1e-5); 2) Increase KL penalty coefficient; 3) Use gradient clipping; 4) Start with small-scale experiments for tuning; 5) Consider using more stable alternatives like DPO.

Can RLHF completely solve AI safety problems?

RLHF is an important technology for improving AI safety, but it cannot completely solve all safety problems. It relies on the quality of human feedback, and human annotators may have biases or miss certain risk scenarios. RLHF should be used in combination with other safety measures (such as content filtering, red team testing).

Can small teams implement RLHF?

Yes, but the approach needs to be adjusted based on resources. Recommendations: 1) Use smaller base models (7B-13B); 2) Consider using DPO instead of the full RLHF pipeline; 3) Leverage open-source tools like the TRL library; 4) Start with small-scale experiments; 5) Use techniques like QLoRA to reduce memory requirements.

Summary

RLHF is the core technology for aligning large language models with human preferences, achieving model alignment optimization through three stages:

  1. Supervised Fine-Tuning (SFT): Teach the model to follow instruction formats
  2. Reward Model Training: Convert human preferences into quantifiable reward signals
  3. PPO Optimization: Maximize rewards through reinforcement learning while maintaining model stability

Understanding RLHF not only helps you better use AI products like ChatGPT, but also lays the foundation for entering the field of AI alignment research. With the emergence of new technologies like DPO, model alignment is becoming more efficient and easier to implement.

By mastering RLHF technology, you will be able to train AI models that are safer, more useful, and better aligned with human expectations, gaining a technical advantage in AI application development.