A Plain-Language Explanation of DPO: How Can AI Learn Human Preferences Directly?

Written by Jasper Cole
Updated on: June 27, 2025

An in-depth look at a newer way for AI to learn human preferences: how does DPO make models understand you better?

Core content:
1. A comparison between the traditional RLHF method and DPO
2. The principle and advantages of DPO (direct preference optimization)
3. How DPO learns and optimizes its generation behavior


DPO (Direct Preference Optimization) is one way to make a large model's output better match human preferences. This article follows the path I took when learning about DPO, and I hope it helps you understand the principle in an accessible way~

1. The existing method: RLHF, complex but effective

In the past, if an AI model wanted to learn these preferences, it had to take a relatively complicated route called RLHF (Reinforcement Learning from Human Feedback). The process is roughly as follows:

  1. Collect a large amount of data on which answers humans prefer (preference comparisons)

  2. Train a reward model that imitates humans in judging which answer is better

  3. Use a reinforcement learning algorithm (such as PPO) to train the main language model to "please" the reward model

In other words, the AI does not listen to humans directly; instead, it learns what a good answer is indirectly, through a "rater."

Although this route can eventually achieve the goal, it has obvious drawbacks: complex implementation, high cost, and an unstable training process.

2. DPO Appears: Tell the Model Directly "Choose This One"

The change that DPO (Direct Preference Optimization) brings is essentially this:

I no longer train a reward model to score answers. You tell me you prefer A, and I simply make the model lean toward A.

Doesn’t this sound simpler?

DPO transforms the problem into a very intuitive task: given two answers, directly train the model to prefer the one that humans prefer.

It's like running continuous rounds of voting: whichever answer humans choose, the model adjusts itself to lean toward that side.

3. How does the model learn to “favor good answers”?

We already know that DPO guides the model to keep optimizing its generation behavior by telling it "I prefer A to B." So how exactly does it learn to "favor good answers"?

First, DPO borrows a classic preference-modeling framework, the Bradley-Terry model, whose core idea is:

If answer A has a higher "hidden quality score" than answer B, then A should be more likely to be selected.

This model mathematically transforms the judgment that "humans prefer a certain answer" into a probability expression. For example:

  • If A's score is 3 and B's score is 1,

  • Then the probability of A being selected is:

P(A ≻ B) = e^3 / (e^3 + e^1) ≈ 88%

In other words, the higher the score, the more likely the answer is to be preferred, and the larger the score gap, the more "certain" that preference becomes.
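To make the arithmetic concrete, here is a minimal Python snippet (not part of the original article) that reproduces the Bradley-Terry calculation above:

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that A is preferred over B under the Bradley-Terry model."""
    return math.exp(score_a) / (math.exp(score_a) + math.exp(score_b))

# The worked example from the text: hidden quality scores 3 and 1.
print(f"{bradley_terry_prob(3, 1):.2%}")  # ~88.08%

# A larger score gap makes the preference more certain.
print(f"{bradley_terry_prob(5, 1):.2%}")  # ~98.20%
```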

4. How does DPO get this “score”?

This is the key innovation of DPO: it does not train a separate "scoring function" or reward model; instead, it derives each answer's score directly from the probabilities the language model itself assigns.

What it does is:

  • Prepare a reference model πref (usually the model from an earlier training stage)

  • For the same prompt x and answer y, compare the probability that the target model πθ being trained assigns to y with the probability that the reference model assigns to it

  • Then compute the "relative improvement tendency" as a score:

r(x, y) = β × log[πθ(y|x) / πref(y|x)]

This r(x, y) is what we call the "implicit score." It measures whether the current model is more inclined toward this answer than the reference model is; if so, this answer is the "better" option from the model's point of view.
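To make this formula concrete, here is a minimal PyTorch sketch (not from the original article; β = 0.1 and the log-probabilities are placeholder assumptions). Since the log of a ratio is a difference of logs, the implicit score is just β times the gap between the two models' log-probabilities for the same answer:

```python
import torch

def implicit_reward(logp_theta: torch.Tensor, logp_ref: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """r(x, y) = beta * log[ pi_theta(y|x) / pi_ref(y|x) ],
    computed from the (summed) token log-probabilities of answer y."""
    return beta * (logp_theta - logp_ref)

# Placeholder values; in practice these come from scoring y with each model.
logp_theta = torch.tensor(-12.0)  # log pi_theta(y|x)
logp_ref = torch.tensor(-14.0)    # log pi_ref(y|x)

r = implicit_reward(logp_theta, logp_ref)
print(r)  # 0.2 > 0: the current model favors y more than the reference does
```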

The actual training of DPO happens on the comparison between answer pairs:

  • For a preference pair: winner A, loser B

  • DPO calculates the difference between the relative scores of two responses:

L = -log σ(r(x, A) - r(x, B))

This difference reflects whether the current model is "more inclined toward the answers humans prefer." By continuously minimizing this loss, the model gradually learns to prefer the kind of answers humans choose.
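Below is a minimal PyTorch sketch of this pairwise loss (again not from the original article; the log-probabilities and β = 0.1 are placeholder assumptions, and a real implementation would obtain them by summing token log-probs from πθ and πref over each answer):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_win: torch.Tensor, logp_ref_win: torch.Tensor,
             logp_theta_lose: torch.Tensor, logp_ref_lose: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """L = -log sigma( r(x, A) - r(x, B) ), with r the implicit score defined above."""
    r_win = beta * (logp_theta_win - logp_ref_win)     # implicit score of winner A
    r_lose = beta * (logp_theta_lose - logp_ref_lose)  # implicit score of loser B
    return -F.logsigmoid(r_win - r_lose)

# Placeholder sequence log-probabilities for one preference pair (winner A, loser B).
loss = dpo_loss(
    logp_theta_win=torch.tensor(-10.0), logp_ref_win=torch.tensor(-11.0),
    logp_theta_lose=torch.tensor(-9.0), logp_ref_lose=torch.tensor(-8.5),
)
print(loss)  # minimizing this widens the margin in favor of the human-preferred answer A
```

Gradient descent on this loss raises πθ's probability of A relative to B, without ever training a separate reward model.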

5. Why replace RLHF?

Although RLHF was very effective in the early days of preference-alignment work, it exposes many problems in practice, especially an unstable training process. This instability shows up in three main ways:

  1. Reinforcement learning itself is prone to instability: the last step of RLHF uses a reinforcement learning algorithm (such as PPO) to optimize the language model. However, the reward signal is often delayed and sparse, making it hard to attribute credit to each generated token, and mode collapse can occur during training (for example, the model ends up producing only templated sentences).

  2. The reward model's objective may not match real human intent: the reward model is fitted to human preferences but is not the same as genuine human intention. To "please" the scorer, the main model may generate content that scores highly but is actually empty. This phenomenon is called "reward hacking."

  3. The training process is complex and the hyperparameters are hard to tune: algorithms such as PPO have many hyperparameters, training depends heavily on experience and debugging skill, and the engineering implementation is complex and resource-intensive.

DPO was proposed precisely to avoid this detour: it uses no reinforcement learning and trains no reward model. Its training logic is clear and its pipeline is integrated, making preference alignment more stable, direct, and efficient.

6. What are the advantages of DPO?

  1. No reinforcement learning is needed, so training is simpler and the process is more stable

  2. No reward model is needed; training one fewer model saves a lot of computing power

  3. Its optimization objective is mathematically equivalent to RLHF's, but the implementation is simpler

Experiments show that on tasks such as dialogue, text summarization, and sentiment control, DPO performs on par with or even better than the traditional RLHF method, especially in terms of training cost and stability.

7. Frequently Asked Questions

Q1: Since DPO is simpler, is PPO-based RLHF no longer needed?

Not quite. DPO is a relatively new method proposed in recent years. It is theoretically equivalent to RLHF, but its behavior on multi-turn dialogue or strategic tasks still needs further study. In addition, RLHF's reward model can be reused to automatically generate preference data or to construct more complex behavioral feedback, which is a capability DPO does not have.

Q2: Where does the preference data come from? Is it manually labeled?

Yes. DPO still needs annotated data about "which answer is better." Currently, most of this data comes from human annotation, although some studies use stronger models to automatically synthesize high-quality preference comparisons.

8. Summary

DPO does not teach the AI how to score; it directly teaches it "what humans prefer." It simplifies preference learning from a three-stage pipeline (preference data → reward model → reinforcement learning) into direct preference optimization, which is effective, low-cost, and easier to implement.