Preference Tuning with DPO · Steven Foerster

Pre-training gives a language model the ability to predict tokens fluently. Supervised fine-tuning (Parts 2 and 4) teaches it to follow instructions. But an instruction-following model can still produce responses that are technically correct yet unhelpful, unnecessarily verbose, confidently wrong, or unsafe. The model was trained to predict likely text, not to produce text that humans actually prefer. Preference tuning is the third stage of the modern LLM training pipeline: aligning the model’s outputs with human judgments about what constitutes a good response.

This tutorial implements the key components of preference tuning. You will train a reward model from pairwise human preferences, visualize the objectives behind PPO-based RLHF, run DPO (Direct Preference Optimization) to align a model without reinforcement learning, and compare against best-of-N sampling as a simpler alternative.

All code is in the companion repository under 05-preference-tuning/.

cd stanford-transformers-llms-labs
pip install -e ".[hf,training]"

The three-stage pipeline

The standard recipe for building a modern LLM has three stages:

Stage	What it teaches the model	Data	Scale
Pre-training	Language structure, world knowledge	Trillions of tokens (web, books, code)	Weeks on clusters
SFT (Supervised Fine-Tuning)	Instruction following, format	Thousands of (prompt, response) pairs	Hours on GPUs
Preference tuning	Which responses humans prefer	Thousands of preference comparisons	Hours on GPUs

Each stage builds on the previous one. Pre-training is the foundation: without it, the model has no language competence. SFT adapts the model to produce assistant-style responses instead of raw next-token predictions. Preference tuning refines which of many possible correct responses the model should prefer.

The distinction between SFT and preference tuning is subtle but important. SFT says “given this prompt, produce this response.” Preference tuning says “given this prompt and two possible responses, the first one is better.” SFT teaches what to say. Preference tuning teaches how to say it: the style, tone, detail level, and safety characteristics that make a response genuinely helpful.

Preference data formats

The preference_data_explorer.py script visualizes how preference datasets are structured.

python 05-preference-tuning/preference_data_explorer.py   # saves preference_data.png

Three formats exist for collecting preference data:

Pairwise: show a rater two responses to the same prompt and ask which is better. This is the most common format because binary comparisons are easier for humans than absolute scoring. The data takes the form of (prompt, chosen, rejected) triples.

Pointwise: show a rater a single response and ask for a score on a scale (e.g., 1–5). Simpler to collect but noisier, because calibration across raters is difficult: one person’s 4 is another’s 3.

Listwise: show a rater multiple responses and ask for a full ranking. More informative per annotation but cognitively demanding and slower to collect.

In practice, pairwise comparisons dominate because they are cheap, fast, and the resulting signal is strong enough to train effective reward models and preference optimizers.
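To make the three formats concrete, here is a minimal sketch of how pointwise scores and listwise rankings can be reduced to the pairwise (prompt, chosen, rejected) triples described above. The helper names and the example strings are illustrative assumptions, not taken from preference_data_explorer.py.

```python
from itertools import combinations

# Pairwise: one (prompt, chosen, rejected) triple per comparison.
pairwise_example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe with many cities.",
}


def pairs_from_scores(prompt, scored_responses):
    """Turn pointwise (response, score) records into pairwise triples.

    Tied scores carry no preference signal, so they are skipped.
    """
    triples = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if score_a == score_b:
            continue
        chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        triples.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return triples


def pairs_from_ranking(prompt, ranked_responses):
    """Turn a listwise ranking (best first) into pairwise triples.

    combinations() preserves input order, so the first element of each
    pair is always the higher-ranked response.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]


ranked = ["Paris.", "It is Paris, I think.", "London."]
print(len(pairs_from_ranking("Capital of France?", ranked)))  # 3 pairs from a ranking of 3
```

A ranking of k responses yields k(k-1)/2 pairs, which is one way to see why listwise annotations are more informative per annotation.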

Reward models

A reward model learns a scalar “quality score” from pairwise preferences. Given a prompt and a response, it outputs a single number, where higher means better. The reward_model.py script trains a small one.

python 05-preference-tuning/reward_model.py

The Bradley-Terry model

The training signal comes from the Bradley-Terry model of pairwise comparison:

P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)

where sigmoid is the logistic function and r is the reward model’s score. The loss is:

L = -log(sigmoid(r_chosen - r_rejected))

This is a binary cross-entropy loss over the score difference. The model learns to assign higher scores to chosen responses and lower scores to rejected ones. The absolute scores do not matter, only the relative ordering.
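The loss above can be checked numerically with a few lines of plain Python. This is a sketch for scalar scores only; the function names are illustrative, and a real reward model would compute the same quantity over batches of tensors.

```python
import math


def sigmoid(z):
    """Logistic function (fine for the moderate values used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def bradley_terry_loss(r_chosen, r_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a numerically stable way.

    Uses the identity -log(sigmoid(m)) = log(1 + exp(-m)).
    """
    margin = r_chosen - r_rejected
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Equal scores: the model is undecided, so P = 0.5 and the loss is log(2).
print(sigmoid(0.0), bradley_terry_loss(0.0, 0.0))

# Only the difference matters: shifting both scores leaves the loss unchanged.
print(bradley_terry_loss(5.0, 3.0), bradley_terry_loss(105.0, 103.0))
```

The second pair of calls illustrates the point about absolute scores: adding the same constant to both rewards does not change the loss.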

The script uses distilbert-base-uncased as the backbone and adds a scalar head on top of the [CLS] token representation. It trains on synthetic preference pairs where chosen responses are concise and on-topic while rejected responses are verbose, off-topic, or wrong.

What makes a good reward model

A reward model is only as good as its training data. If the preference pairs consistently favor longer responses, the reward model learns that length equals quality. If the pairs favor a specific writing style, the reward model absorbs that bias. This is not a bug in the algorithm; it is a faithful compression of the preferences it was trained on. The question is whether those preferences actually represent what you want.

Note

Reward models are not objective. A reward model trained on one set of annotators may disagree with a model trained on a different set. Cultural background, expertise, and personal preference all influence which response a rater selects. This is why large-scale preference data collection uses hundreds of annotators with carefully designed guidelines.

PPO-based RLHF (conceptual)

The classical approach to preference tuning uses the reward model inside a reinforcement learning loop. The rlhf_objectives.py script visualizes the key objectives without running a full PPO trainer.

python 05-preference-tuning/rlhf_objectives.py   # saves ppo_objectives.png

How PPO-RLHF works

Generate: the policy model (the LLM being trained) generates a response to a prompt.
Score: the reward model scores the response.
Update: PPO updates the policy to increase the probability of high-reward responses while staying close to the reference model (the SFT checkpoint).

The objective combines two terms:

maximize E[R(x, y)] - beta * D_KL(pi || pi_ref)

The first term maximizes reward. The second penalizes the policy for drifting too far from the reference, measured by KL divergence between the policy’s token-level distributions and the reference’s. Without the KL penalty, the model would rapidly overfit to the reward model’s idiosyncrasies.
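For a single sampled response, the objective above is often estimated as the reward minus beta times the log-probability ratio between policy and reference. The sketch below assumes per-token log-probabilities are already available as Python lists; the function name and the numbers are illustrative, and this is not necessarily how rlhf_objectives.py computes it.

```python
def kl_penalized_reward(reward, policy_token_logps, ref_token_logps, beta=0.1):
    """Single-sample estimate of R(x, y) - beta * KL(pi || pi_ref).

    For a response y sampled from the policy, the sum over tokens of
    log pi(token) - log pi_ref(token) is an unbiased estimate of the
    sequence-level KL term.
    """
    if len(policy_token_logps) != len(ref_token_logps):
        raise ValueError("both models must score the same tokens")
    log_ratio = sum(p - q for p, q in zip(policy_token_logps, ref_token_logps))
    return reward - beta * log_ratio


# The policy assigns this response more probability than the reference does,
# so part of the reward is paid back as a KL penalty.
shaped = kl_penalized_reward(
    reward=1.0,
    policy_token_logps=[-1.0, -0.5],
    ref_token_logps=[-1.5, -1.0],
    beta=0.1,
)
print(shaped)
```

A larger beta keeps the policy closer to the SFT checkpoint at the cost of slower reward improvement.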

Why PPO is hard

PPO-based RLHF requires four models simultaneously in memory:

Policy model: the LLM being optimized
Reference model: a frozen copy of the SFT checkpoint (for KL computation)
Reward model: scores generated responses
Value model: estimates expected future reward (for advantage computation)

For a 7B model, this means roughly 4 × 14 GB = 56 GB in FP16, just for model weights, before optimizer states. PPO is also notoriously sensitive to hyperparameters: learning rate, KL coefficient, clipping epsilon, batch size, and the number of PPO epochs per batch all interact in non-obvious ways. Training can be unstable, with reward suddenly collapsing after seemingly good progress.
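The memory figure is simple arithmetic, reproduced here as a sketch. It assumes all four models have 7B parameters, that FP16 uses 2 bytes per parameter, and that 1 GB means 10^9 bytes; activations, gradients, and optimizer states are deliberately left out.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Memory for model weights only, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


per_model_gb = weights_gb(7e9)    # one 7B model in FP16
ppo_models = 4                    # policy, reference, reward, value
ppo_total_gb = ppo_models * per_model_gb

print(per_model_gb, ppo_total_gb)  # 14.0 56.0
```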

The script makes PPO clipping, KL penalties, and advantage estimation concrete with visualizations. That keeps this tutorial focused on the Lecture 5 material: classical RLHF, DPO, and best-of-N as the main preference-tuning options.
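As a point of reference for the clipping visualization, here is a minimal sketch of the standard PPO clipped surrogate for a single action. The function name and the default clip_eps=0.2 are assumptions for illustration; the script may use different values or a batched form.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate for one action: min(r * A, clip(r, 1-eps, 1+eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the estimated advantage.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
print(ppo_clipped_objective(1.5, 1.0))   # 1.2
# Negative advantage: the ratio moved the wrong way, so the unclipped
# (more pessimistic) term is kept.
print(ppo_clipped_objective(1.5, -1.0))  # -1.5
```

Taking the minimum means the objective never rewards the policy for moving far from the old policy in a single update, which is the stabilizing effect clipping is meant to provide.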

DPO: Direct Preference Optimization

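DPO (Rafailov et al., 2023) is an elegant alternative that sidesteps the entire RL apparatus. The key insight: under the KL-regularized RLHF objective, the reward can be written in closed form in terms of the optimal policy pi*:

r(x, y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)

where Z(x) is a normalizing term that depends only on the prompt. Because both responses in a preference pair share the same prompt, Z(x) cancels when this expression is substituted into the Bradley-Terry model. This means we do not need to train a separate reward model at all. We can directly optimize the policy using preference pairs: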

L_DPO = -log(sigmoid(beta * [
  log(pi(y_w|x) / pi_ref(y_w|x)) -
  log(pi(y_l|x) / pi_ref(y_l|x))
]))

where y_w is the chosen (winner) response and y_l is the rejected (loser). Minimizing the loss increases the log-probability of the chosen response relative to the reference and decreases that of the rejected response. Each update is an ordinary gradient step on a batch of preference pairs, with no RL loop required.
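The DPO loss for a single preference pair can be written in a few lines once the sequence log-probabilities are known. This is a plain-Python sketch with illustrative names and numbers; TRL's DPOTrainer computes the same quantity over batches of tensors.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is a sequence log-probability log pi(y|x), i.e. the sum
    of per-token log-probabilities of the response given the prompt.
    """
    chosen_log_ratio = policy_logp_w - ref_logp_w
    rejected_log_ratio = policy_logp_l - ref_logp_l
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(z)) = log(1 + exp(-z)), written to avoid overflow.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# After the policy shifts probability toward the chosen response, the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Note that the loss depends only on how far the policy has moved relative to the reference, not on the absolute log-probabilities.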

python 05-preference-tuning/dpo_trainer.py   # ~15 min on GPU

The script uses TRL’s DPOTrainer with LoRA on distilgpt2 and a small synthetic preference dataset:

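from trl import DPOConfig, DPOTrainer        # imported for the training setup (not shown in this excerpt)
from peft import LoraConfig, get_peft_model

# LoRA adapter settings: only small low-rank matrices are trained.
lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # the update is scaled by lora_alpha / r
    lora_dropout=0.1,                     # dropout applied on the LoRA path
    target_modules=["c_attn", "c_proj"],  # GPT-2 module names to wrap with LoRA
)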

DPO requires only two models: the policy and the reference (a frozen copy). No reward model, no value model. Memory footprint is roughly half of PPO-RLHF. Training is stable because there is no RL; it is standard supervised learning on preference pairs.

The tradeoff: DPO uses the same preference data throughout training (off-policy). PPO can generate new responses and get fresh reward signals during training (on-policy). In theory, on-policy data is higher quality because it reflects the current policy’s behavior. In practice, DPO’s simplicity and stability often outweigh PPO’s theoretical advantages.

Tip

When to use DPO vs PPO: DPO is the default choice for most practitioners. It is simpler to implement, easier to tune, and requires less hardware. PPO is worth considering when you have a very high-quality reward model and need the model to explore beyond the patterns in your static preference dataset, typically at frontier lab scale.

DPO variants worth knowing about

DPO is the canonical recipe but not the only one in the family. When tuning your own pipeline, three variants are worth being aware of:

IPO (Identity Preference Optimization, Azar et al. 2024). Reformulates the loss to drop the implicit Bradley-Terry assumption baked into DPO. In practice IPO regularises better when preference data is heavily imbalanced or contains many tied pairs, where DPO can over-fit and collapse to extreme logit margins. TRL exposes this via loss_type="ipo" on DPOConfig.
KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024). Replaces pairwise preferences with per-sample binary “good / bad” labels modelled with a prospect-theory utility. KTO is much easier to collect data for at scale (you don’t need pairs) and tends to be more robust to noisy labels. Use when you have lots of single-sample feedback rather than head-to-head comparisons.
RLAIF (RL from AI Feedback, Bai et al. 2022; Lee et al. 2023). Replaces human preference annotators with another LLM judging response pairs. The mechanics are identical to RLHF/DPO once the preferences are produced; only the data source changes. This is increasingly common in practice: Anthropic’s Constitutional AI is the foundational example, and most frontier post-training pipelines now blend human and AI preferences. It does NOT remove the need for human oversight of what the judge model is trained to value.

IPO and KTO are available in modern TRL (trl >= 0.11): pass loss_type="ipo" to DPOConfig for IPO, or use KTOTrainer with an unpaired dataset format for KTO. RLAIF needs no special trainer; it changes only where the preference labels come from.
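To see why IPO regularizes differently from DPO, it helps to write both losses as functions of the same quantity h, the difference between the chosen and rejected log-ratios. The sketch below uses the squared-error form from the IPO paper with its regularization strength written as beta; the function names are illustrative, and TRL's implementation may differ in details such as length averaging.

```python
import math


def dpo_loss_from_margin(h, beta=0.1):
    """DPO: -log(sigmoid(beta * h)). Keeps decreasing as h grows."""
    z = beta * h
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def ipo_loss_from_margin(h, beta=0.1):
    """IPO: (h - 1 / (2 * beta))^2. Minimized at a finite target margin."""
    return (h - 1.0 / (2.0 * beta)) ** 2


# h = [log pi(y_w|x) - log pi_ref(y_w|x)] - [log pi(y_l|x) - log pi_ref(y_l|x)]
for h in (0.0, 5.0, 20.0, 50.0):
    print(h, round(dpo_loss_from_margin(h), 4), round(ipo_loss_from_margin(h), 4))
```

With beta=0.1, the IPO loss is smallest at h=5 and grows again beyond it, while the DPO loss continues to reward ever larger margins. That finite target is what discourages the extreme margins mentioned above.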

Note

TRL / library versions: The DPO API in trl has moved more than once. The snippets in this tutorial assume trl >= 0.11 with the unified DPOConfig / DPOTrainer flow. Older tutorials passing kwargs directly to DPOTrainer.__init__ (rather than via DPOConfig) target trl < 0.8 and will not run on current versions. Pin trl, transformers, and accelerate together; pin to a single PyTorch line as well.

Best-of-N sampling

The best_of_n.py script demonstrates the simplest alignment technique: generate N candidate responses, score each with the reward model, and return the best.

python 05-preference-tuning/best_of_n.py   # uses local distilgpt2

Best-of-N requires no training at all; it uses the reward model only at inference time. The model weights are unchanged. This means it can be applied to any model with any reward model, without fine-tuning.

The cost is compute: you generate N candidates with the policy model and then score all N with the reward model. Runtime therefore grows roughly linearly with N, not counting batching optimizations. In practice, N=4–16 provides meaningful quality improvements. The quality curve has diminishing returns: going from N=1 to N=4 is a bigger jump than going from N=4 to N=16.

Best-of-N and DPO are complementary approaches. DPO improves the model’s default output (the N=1 case). Best-of-N improves output selection at test time. You can use both: DPO-tune the model, then apply best-of-N on top for the highest quality.

Reward hacking

When a model is optimized against a reward signal, it can discover shortcuts that maximize the score without actually improving quality. The lecture uses a vivid analogy: “If you optimize too much for the volume of clapping, you might end up making jokes instead of informative content.”

Common failure modes:

Verbosity hacking: the reward model weakly correlates length with quality (longer responses cover more ground). The policy learns to be verbose, padding responses with filler. The reward goes up. The actual quality goes down.
Style mimicry: the reward model was trained on preferences from annotators who favored a particular writing style. The policy learns to adopt that style regardless of whether it fits the question.
Sycophancy: the reward model scores agreement and helpfulness highly. The policy learns to agree with the user even when the user is wrong.

Mitigation strategies include KL penalties (limiting how far the policy drifts), length normalization (dividing reward by response length), and carefully diversifying the preference data.
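As a small illustration of length normalization, the sketch below divides a raw reward by a whitespace token count. The function name, the rewards, and the use of whitespace splitting are all assumptions for illustration; a real pipeline would count tokenizer tokens, and plain division can over-penalize answers that genuinely need to be long.

```python
def length_normalized_reward(reward, response, min_tokens=1):
    """Divide the raw reward by response length in whitespace tokens."""
    num_tokens = max(len(response.split()), min_tokens)
    return reward / num_tokens


concise = "Paris is the capital of France."
padded = concise + " " + " ".join(["Indeed, it truly is."] * 8)

# Hypothetical raw rewards from a length-biased reward model.
raw = {concise: 2.0, padded: 2.4}

best_raw = max(raw, key=raw.get)
best_normalized = max(raw, key=lambda r: length_normalized_reward(raw[r], r))

print(best_raw == padded, best_normalized == concise)
```

Under the raw reward the padded response wins; after normalization the concise one does, which is the correction this mitigation aims for.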

What to notice

Preference tuning changes style more than substance. The model already knows facts and can follow instructions. Preference tuning teaches it how to respond: the right level of detail, appropriate hedging on uncertain claims, and when to refuse.
DPO is remarkably simple. The entire training loop is standard cross-entropy-style optimization on pairs of responses. No RL, no value function, no policy gradients. The mathematical insight is that the reward function is implicit in the policy.
Reward models encode annotator preferences, not ground truth. The quality of preference tuning is bounded by the quality of the preference data. Garbage in, garbage out applies strongly here.
Best-of-N is a strong baseline. Before investing in DPO or PPO training, try best-of-N with a reward model. It often captures a surprising fraction of the alignment gains with zero training.
Over-optimization is a real risk. There is an optimal amount of preference tuning. Beyond that, the model starts exploiting the reward signal rather than genuinely improving. The KL penalty exists precisely to prevent this.

Next

In Part 6: Chain-of-Thought and Reasoning Evaluation, explore techniques that enhance LLM reasoning. Compare direct and chain-of-thought prompting on math problems, implement self-consistency via majority voting, and evaluate code generation with the Pass@K metric.
