What is KTO?
KTO is a preference tuning method that optimizes a language model using examples labeled as desirable or undesirable rather than requiring paired comparisons.
Quick Facts
| Field | Value |
|---|---|
| Full Name | Kahneman-Tversky Optimization |
How It Works
KTO is motivated by the idea that collecting binary desirability feedback can be easier than collecting carefully paired preferences. Instead of requiring a chosen and rejected response for the same prompt, KTO can learn from examples labeled as good or bad. This can reduce data collection friction, but it shifts responsibility to label quality, class balance, and calibration. As with other alignment methods, KTO should be evaluated on real user tasks rather than only on training loss.
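Here is a minimal per-example sketch of the loss KTO is usually described with. It treats the log-ratio between the policy and a frozen reference model as an implied reward. That reward is compared to a reference point z0, an estimate of the KL divergence between policy and reference that is held constant. The gap goes through a sigmoid, with separate weights for desirable and undesirable examples. The function name and parameters (`kto_example_loss`, `beta`, `lambda_d`, `lambda_u`) are hypothetical. Real trainers compute these quantities over batches of token log-probabilities, so treat this as an illustration, not a reference implementation.

```python
import math


def stable_sigmoid(x: float) -> float:
    # Numerically stable logistic function (avoids overflow in math.exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(
    policy_logp: float,   # summed log-prob of the completion under the policy
    ref_logp: float,      # summed log-prob under the frozen reference model
    desirable: bool,      # the binary label: good (True) or bad (False)
    kl_ref_point: float,  # z0: estimate of KL(policy || reference), no gradient
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> float:
    # Implied reward: how much more likely the policy makes this completion
    # than the reference model does.
    reward = policy_logp - ref_logp
    # A KL divergence cannot be negative, so clamp the estimate at zero.
    z0 = max(kl_ref_point, 0.0)
    if desirable:
        # Desirable examples: value rises as the reward exceeds z0.
        value = lambda_d * stable_sigmoid(beta * (reward - z0))
        return lambda_d - value
    # Undesirable examples: value rises as the reward falls below z0.
    value = lambda_u * stable_sigmoid(beta * (z0 - reward))
    return lambda_u - value
```

Raising the implied reward lowers the loss for a desirable example and raises it for an undesirable one. That is the behavior the labels are meant to induce, without any paired comparison.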
Key Characteristics
- Uses desirable and undesirable examples rather than only paired comparisons
- Aims to simplify preference-data collection
- Can be useful when pairwise labels are expensive or unavailable
- Depends on clean labels, representative prompts, and balanced data
- Should be compared against DPO, ORPO, SFT, and RLHF baselines
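Class balance can be handled with separate loss weights for the two labels. One heuristic reported for KTO is to pick the weights so that `(lambda_d * n_desirable) / (lambda_u * n_undesirable)` falls between 1 and 4/3. The sketch below assumes that heuristic and uses hypothetical function names. It fixes the undesirable weight at 1.0 and solves for the desirable weight.

```python
def kto_class_weights(
    n_desirable: int, n_undesirable: int, target_ratio: float = 1.0
) -> tuple[float, float]:
    """Return (lambda_d, lambda_u) so the weighted class ratio equals target_ratio."""
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("need at least one desirable and one undesirable example")
    if not 1.0 <= target_ratio <= 4.0 / 3.0:
        raise ValueError("target_ratio is outside the assumed [1, 4/3] range")
    lambda_u = 1.0  # keep the undesirable weight fixed
    lambda_d = target_ratio * n_undesirable / n_desirable
    return lambda_d, lambda_u


def weighted_ratio(
    n_desirable: int, n_undesirable: int, lambda_d: float, lambda_u: float
) -> float:
    # Effective share of desirable vs. undesirable signal after weighting.
    return (lambda_d * n_desirable) / (lambda_u * n_undesirable)
```

For example, with 900 desirable and 100 undesirable examples and the default target of 1.0, the desirable weight comes out to about 0.11. That keeps the majority class from dominating the loss.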
Common Use Cases
- Training from thumbs-up and thumbs-down style feedback
- Using moderation or quality labels for preference tuning
- Aligning assistants when paired comparisons are hard to collect
- Experimenting with lower-friction preference datasets
- Improving behavior after SFT without a reward-model RL loop
Example
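A minimal sketch of the data-preparation side of KTO, assuming thumbs-up/thumbs-down logs stored with a hypothetical `FeedbackEvent` schema. Each rated response becomes one record with a `prompt`, a `completion`, and a boolean `label`. These field names mirror the unpaired format used by Hugging Face TRL's KTO trainer, but check the documentation of whichever library you train with.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FeedbackEvent:
    # Hypothetical log schema: one row per assistant response shown to a user.
    prompt: str
    response: str
    rating: Optional[str]  # "up", "down", or None if the user did not rate


def to_kto_records(events: Iterable[FeedbackEvent]) -> list[dict]:
    records = []
    for event in events:
        if event.rating not in ("up", "down"):
            continue  # skip unrated or unknown ratings rather than guess a label
        records.append(
            {
                "prompt": event.prompt,
                "completion": event.response,
                "label": event.rating == "up",  # True = desirable
            }
        )
    return records


events = [
    FeedbackEvent("Summarize this ticket.", "The customer reports a login error.", "up"),
    FeedbackEvent("Summarize this ticket.", "I cannot help with that.", "down"),
    FeedbackEvent("Translate 'hello' to French.", "Bonjour", None),
]
records = to_kto_records(events)  # 2 records; the unrated event is skipped
```

Dropping unrated events is a deliberate choice here. Treating silence as either approval or disapproval would inject label noise.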
Frequently Asked Questions
How is KTO different from DPO?
DPO typically uses paired chosen-rejected examples, while KTO can use examples labeled as desirable or undesirable.
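Because KTO consumes single labeled completions, a paired DPO-style dataset can be converted into KTO-style data by splitting each pair. The reverse is not generally possible. Unpaired labels only form a pair when the same prompt happens to have both a desirable and an undesirable completion. The sketch below assumes hypothetical `prompt`/`chosen`/`rejected` keys for the paired data.

```python
def unpair(pairs: list[dict]) -> list[dict]:
    """Split each chosen/rejected pair into two independently labeled examples."""
    unpaired = []
    for pair in pairs:
        # The chosen response becomes a desirable example...
        unpaired.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        # ...and the rejected response becomes an undesirable one.
        unpaired.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return unpaired
```

The conversion discards the relative judgment. A "rejected" response was only worse than its partner and may still be acceptable on its own, so the resulting undesirable labels can be noisier than labels collected directly.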
Why is KTO useful for data collection?
Binary desirability labels may be easier to collect from users, logs, or reviewers than carefully matched preference pairs.
Does KTO remove the need for evaluation?
No. It still needs held-out task evaluation, safety checks, and comparison with SFT or preference-optimization baselines.
What can go wrong with KTO data?
Noisy labels, class imbalance, narrow prompts, and unclear desirability criteria can all teach the model unreliable behavior.
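Several of these problems can be caught with a cheap audit before training. The sketch below assumes records with `prompt`, `completion`, and boolean `label` fields and uses a hypothetical `audit_kto_records` helper. It reports the label balance and flags identical prompt/completion pairs labeled both ways, a sign of noisy or inconsistent criteria. It also measures how concentrated the data is on a few prompts. Unclear desirability criteria cannot be fully detected this way, and the thresholds for "too imbalanced" or "too narrow" are left to you.

```python
from collections import Counter, defaultdict


def audit_kto_records(records: list[dict]) -> dict:
    labels = Counter(r["label"] for r in records)
    n_desirable, n_undesirable = labels[True], labels[False]

    # Collect every label given to each exact (prompt, completion) pair.
    labels_by_example = defaultdict(set)
    for r in records:
        labels_by_example[(r["prompt"], r["completion"])].add(r["label"])
    conflicts = [key for key, seen in labels_by_example.items() if len(seen) > 1]

    # Prompt concentration: a high share for one prompt suggests narrow coverage.
    prompt_counts = Counter(r["prompt"] for r in records)
    top_prompt_share = max(prompt_counts.values()) / len(records) if records else 0.0

    return {
        "n_desirable": n_desirable,
        "n_undesirable": n_undesirable,
        "desirable_fraction": n_desirable / len(records) if records else 0.0,
        "conflicting_examples": conflicts,
        "distinct_prompts": len(prompt_counts),
        "top_prompt_share": top_prompt_share,
    }
```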