What is a Reward Model?

A reward model is a model trained to assign scores to candidate responses based on preference data or human feedback.

How It Works

A reward model approximates human or policy preferences so an optimization method can improve a language model without asking humans to judge every generation. In classic RLHF, annotators compare pairs of responses to the same prompt, the reward model is trained to score the preferred response higher, and reinforcement learning (commonly PPO) then optimizes the policy model against that reward signal. Reward models are powerful but fragile: they can be biased, over-optimized, poorly calibrated, or exploited by responses that score well without actually helping users.
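
The pairwise training objective described above is often written as a Bradley-Terry style loss. The following is a minimal sketch, not a definitive implementation; the function name pairwise_preference_loss is an illustrative assumption, and the two scores stand in for the outputs of a real reward model.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    Hypothetical helper for illustration; the inputs are scalar scores the
    reward model assigned to the preferred and the rejected response.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Equal scores mean the model expresses no preference, so the loss is log(2).
print(round(pairwise_preference_loss(0.0, 0.0), 4))
# Scoring the human-preferred response higher gives a lower loss than the reverse.
print(pairwise_preference_loss(3.0, -1.0) < pairwise_preference_loss(-1.0, 3.0))
```

Minimizing this loss over many annotated pairs pushes the model to score preferred responses above rejected ones, which is the signal the policy is later optimized against.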

Key Characteristics

  • Learns to score candidate responses according to preference data
  • Commonly used in RLHF pipelines and sometimes in evaluation workflows
  • Can encode helpfulness, safety, factuality, style, or task-specific criteria
  • Vulnerable to reward hacking and distribution shift
  • Needs calibration, validation, and monitoring against human judgments

Common Use Cases

  1. Providing the reward signal in RLHF training
  2. Ranking multiple candidate model responses
  3. Filtering low-quality generations during dataset construction
  4. Measuring alignment regressions against a preference rubric
  5. Supporting model selection when human review is expensive
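
Ranking and filtering, two of the use cases listed above, can be sketched as follows. This is an illustration under stated assumptions: toy_reward is a hypothetical stand-in for a trained reward model (it only counts prompt words echoed in the response), and rank_candidates and filter_by_threshold are illustrative names.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    prompt: str,
    candidates: List[str],
    reward_fn: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate and return (score, text) pairs, best first."""
    scored = [(reward_fn(prompt, text), text) for text in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def filter_by_threshold(ranked: List[Tuple[float, str]], threshold: float) -> List[str]:
    """Keep only the responses whose score meets the threshold."""
    return [text for score, text in ranked if score >= threshold]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical scorer (assumption): number of prompt words echoed in the response."""
    prompt_words = set(prompt.lower().split())
    return float(len(prompt_words & set(response.lower().split())))


prompt = "how do i reverse a list in python"
candidates = [
    "use reversed or list slicing in python",
    "i do not know",
    "reverse a list in python with my_list[::-1]",
]
ranked = rank_candidates(prompt, candidates, toy_reward)
kept = filter_by_threshold(ranked, threshold=3.0)
print(ranked[0][1])
print(len(kept))
```

The toy scorer also shows why such pipelines need monitoring: a response that merely repeats the prompt would score highly without helping anyone, which is a small-scale version of reward hacking.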

Example

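As a rough illustration, the sketch below fits a tiny linear reward model to a handful of preference pairs using plain gradient descent on the pairwise loss. Everything here is an assumption made for the example: the three hand-made features, the toy data, and the function names. A production reward model would typically be a neural network scoring full text.

```python
import math

# Hypothetical toy features per response: [answers_question, cites_source, is_rude]
PREFERENCE_PAIRS = [
    # (features of the chosen response, features of the rejected response)
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),
    ([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 1.0]),
    ([0.0, 1.0, 0.0], [0.0, 1.0, 1.0]),
]


def score(weights, features):
    """Linear reward: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, steps=200, lr=0.5):
    """Minimize the mean of -log(sigmoid(score(chosen) - score(rejected)))."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for chosen, rejected in pairs:
            margin = score(weights, chosen) - score(weights, rejected)
            # Gradient of the pair loss w.r.t. the weights:
            # -(1 - sigmoid(margin)) * (chosen - rejected)
            coeff = -(1.0 - sigmoid(margin))
            for i in range(len(weights)):
                grad[i] += coeff * (chosen[i] - rejected[i])
        weights = [w - lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


weights = train(PREFERENCE_PAIRS)
helpful = [1.0, 1.0, 0.0]  # answers the question and cites a source
rude = [1.0, 0.0, 1.0]     # answers the question but is rude
print(score(weights, helpful) > score(weights, rude))
```

After training, the learned weights reward answering and citing sources and penalize rudeness, so the model ranks unseen feature combinations in line with the annotated preferences.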

Frequently Asked Questions

Is a reward model the same as an LLM judge?

Not exactly. A reward model is usually trained specifically to score responses according to preference data, while an LLM judge is typically a general-purpose model that is prompted to evaluate outputs without that specialized training.

Why can reward models be risky?

Reward models can learn shortcuts or biased preferences, and policy optimization can exploit those weaknesses to raise the score instead of improving real usefulness.

Does DPO require a reward model?

No. DPO (Direct Preference Optimization) avoids training a separate reward model by optimizing the policy directly on preference pairs.
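
A minimal sketch of the DPO loss for a single preference pair may make the contrast concrete. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed; the function name, argument names, and the default beta of 0.1 are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    # How much more likely each response is under the policy than under the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # softplus(-logits) equals -log(sigmoid(logits)), written in a stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# A policy identical to the reference has no learned preference: loss is log(2).
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))
# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0) < dpo_loss(-12.0, -15.0, -12.0, -15.0))
```

The log-ratios play the role of an implicit reward, which is why no separately trained scoring model appears in the objective.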

How should reward models be evaluated?

Compare their rankings against held-out human preferences, check calibration, and test for reward hacking and domain drift.
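
The first two checks can be sketched as below. This is a simplified illustration: length_reward is a hypothetical, deliberately length-biased scorer, the four held-out pairs are invented for the example, and the calibration check is a crude single-bin comparison of average confidence against accuracy rather than a full calibration analysis.

```python
import math


def pairwise_accuracy(pairs, reward_fn):
    """Fraction of held-out pairs where the model scores the human-chosen response higher."""
    correct = sum(
        1 for prompt, chosen, rejected in pairs
        if reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
    )
    return correct / len(pairs)


def mean_confidence(pairs, reward_fn):
    """Average confidence in the model's own pick, from sigmoid of the score margin."""
    total = 0.0
    for prompt, chosen, rejected in pairs:
        margin = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
        p_chosen = 1.0 / (1.0 + math.exp(-margin))
        total += max(p_chosen, 1.0 - p_chosen)
    return total / len(pairs)


def length_reward(prompt, response):
    """Hypothetical biased scorer (assumption): longer responses get higher scores."""
    return 0.5 * len(response.split())


# Invented held-out data: (prompt, human-chosen response, human-rejected response)
HELD_OUT = [
    ("Capital of France?", "Paris.", "I think it might be Lyon or maybe Marseille."),
    ("How to sort a list?", "Use sorted(items) to get a new sorted list.", "No idea."),
    ("Boiling point of water?", "Water boils at 100 degrees Celsius at sea level.", "It boils."),
    ("Why does f() return None?", "The function returns None because it has no return statement.", "Bug."),
]

accuracy = pairwise_accuracy(HELD_OUT, length_reward)
gap = abs(mean_confidence(HELD_OUT, length_reward) - accuracy)
print(accuracy)
print(gap > 0.1)
```

Here the length-biased scorer misranks the one pair where the short answer is correct, and its average confidence sits well above its accuracy, which is the kind of overconfidence a calibration check is meant to surface.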
