What is GRPO?
GRPO is a reinforcement learning optimization method for language models that uses relative rewards within groups of sampled responses instead of a separate value model.
Quick Facts
| Full Name | Group Relative Policy Optimization |
|---|
How It Works
GRPO became prominent in reasoning-model training discussions because it simplifies parts of PPO-style RL for LLMs. Instead of training a separate critic or value model, GRPO samples multiple responses for the same prompt and normalizes rewards within the group. This makes optimization depend on relative performance among candidate responses. The method can be useful for tasks with verifiable or rule-based rewards, but it still requires careful reward design, sampling control, KL management, and evaluation against over-optimization.
Key Characteristics
- Uses groups of responses for the same prompt to compute relative advantages
- Avoids a separate value model in common formulations
- Still belongs to reinforcement-learning-style policy optimization
- Often discussed for reasoning tasks with verifiable rewards
- Requires careful reward shaping and monitoring for over-optimization
Common Use Cases
- Training reasoning models with rule-based answer rewards
- Optimizing multiple sampled responses per prompt
- Reducing PPO pipeline complexity by avoiding a value model
- Experimenting with RL-style alignment for math or code tasks
- Comparing direct preference methods with group-relative RL
Example
Loading code...Frequently Asked Questions
How is GRPO different from PPO?
GRPO commonly uses group-relative rewards and avoids a separate value model, while PPO often uses a critic or value function.
Does GRPO require preference data?
Not always. It can use rule-based or verifiable rewards, though preference signals may also inform reward design.
Why is GRPO relevant for reasoning models?
Reasoning tasks often have verifiable outcomes, making grouped sampling and relative reward signals practical.
Can GRPO be over-optimized?
Yes. If the reward is incomplete or exploitable, the model may learn behaviors that score well but fail broader quality checks.