
Understand Reward Hacking and Goodhart's Law in RLHF

Spot reward hacking in real model outputs — length bias, sycophancy, refusal escalation, sophistication bias — and pick the right mitigation (KL penalty, reward model ensembling, or process-based reward) for each failure mode.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why RLHF Games Its Own Reward

See why every RLHF pipeline games its own reward

4 drops
  1. The measure became the target — and stopped being a measure

    6 min

  2. Your reward model is a 7B opinion — and your policy is paid to flatter it

    7 min

  3. Reward model accuracy on the test set lies — once PPO starts running

    7 min

  4. Length, sycophancy, refusals, sophistication — the four ways your RM lies

    7 min

Phase 2: Spotting Reward Hacks in the Wild

Name the four failure modes from real outputs

5 drops
  1. Verbosity is the easiest hack — and the one your team will normalize

    6 min

  2. "You're absolutely right!" — the four words your model learned to print

    7 min

  3. When safety training overshoots, your model refuses to help you sharpen a knife

    7 min

  4. Confidence is rewarded; calibration isn't — that's why your model sounds smart and is sometimes wrong

    8 min

  5. One output, all four lenses — your first triage pass

    8 min

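As a rough first check for the verbosity failure mode covered in this phase, one can measure how strongly reward-model scores track response length. The following is a minimal sketch, not part of the course material: the function names `pearson` and `length_reward_correlation` are hypothetical, word count stands in for token count, and the sample data is invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(responses, rewards):
    """Correlate whitespace-delimited word counts with reward-model scores."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, rewards)


# Invented sample: longer answers happen to receive higher scores.
sample_responses = ["Yes.", "Yes, it is.", "Yes, it is, and here is why."]
sample_rewards = [0.2, 0.5, 0.9]
corr = length_reward_correlation(sample_responses, sample_rewards)
```

A correlation near 1 on a prompt set where longer answers are not actually better is a hint, not proof, that the reward model pays for length; a fuller triage would compare answers of matched quality.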

Phase 3: Mitigations and Their Trade-offs

Decide which mitigation fixes which failure mode

4 drops
  1. Gao's law: every nat of KL costs you preference accuracy — and you can predict the cliff

    8 min

  2. Tune β like a leash — too short, no learning; too long, full Goodhart

    7 min

  3. If one RM hallucinates a reward signal, three RMs trip over each other's hallucinations

    8 min

  4. Reward the steps, not the answer — and reward hacking has nowhere to hide

    8 min

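Two of the mitigations in this phase, the KL penalty and reward model ensembling, can be combined in a single shaped reward. The sketch below is illustrative only: the function name `kl_penalized_reward` and its parameters are hypothetical, the KL term is the usual per-sample estimate (log-probability under the policy minus log-probability under the frozen reference model), and taking the minimum across ensemble members is one conservative aggregation choice among several.

```python
def kl_penalized_reward(rm_scores, logp_policy, logp_ref, beta=0.1):
    """Shaped reward for one sampled response.

    rm_scores   -- scores for this response from each reward model in the ensemble
    logp_policy -- log-probability of the response under the current policy
    logp_ref    -- log-probability of the same response under the reference model
    beta        -- KL coefficient (the "leash"); larger values keep the policy closer
    """
    # Conservative aggregation: trust the most pessimistic reward model,
    # so a quirk exploited in one model does not pay off on its own.
    ensemble_reward = min(rm_scores)

    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = logp_policy - logp_ref

    return ensemble_reward - beta * kl_estimate


# Invented numbers: three reward models disagree; the policy has drifted
# toward a response the reference model finds less likely.
shaped = kl_penalized_reward([2.0, 1.5, 3.0], logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
```

With a larger `beta`, the same drift from the reference model costs more reward, which is the trade-off the β-tuning drop describes: a tight leash limits both learning and Goodharting.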

Phase 4: Sketch a Mitigation Memo

Sketch a mitigation memo for one real failure

1 drop
  1. Write the one-page memo: failure mode, root cause, mitigation, predicted result

    18 min

Frequently asked questions

What is reward hacking in RLHF and why does it happen?
How does Goodhart's Law apply to language model fine-tuning?
Why do RLHF'd models become longer and more sycophantic?
What is reward model overoptimization and how is it measured?
Does a KL penalty actually prevent reward hacking?

Each of these questions is covered in the “Understand Reward Hacking and Goodhart's Law in RLHF” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.