What is reward hacking in RLHF and why does it happen?

This is covered in the "Understand Reward Hacking and Goodhart's Law in RLHF" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

How does Goodhart's Law apply to language model fine-tuning?

This is covered in the "Understand Reward Hacking and Goodhart's Law in RLHF" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Why do RLHF'd models become longer and more sycophantic?

This is covered in the "Understand Reward Hacking and Goodhart's Law in RLHF" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

What is reward model overoptimization and how is it measured?

This is covered in the "Understand Reward Hacking and Goodhart's Law in RLHF" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Does a KL penalty actually prevent reward hacking?

This is covered in the "Understand Reward Hacking and Goodhart's Law in RLHF" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

🎯 Understand Reward Hacking and Goodhart's Law in RLHF

Spot reward hacking in real model outputs — length bias, sycophancy, refusal escalation, sophistication bias — and pick the right mitigation (KL penalty, reward model ensembling, or process-based reward) for each failure mode.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why RLHF Games Its Own Reward

See why every RLHF pipeline games its own reward

4 drops

The measure became the target — and stopped being a measure · 6 min
Your reward model is a 7B opinion — and your policy is paid to flatter it · 7 min
Reward model accuracy on the test set lies — once PPO starts running · 7 min
Length, sycophancy, refusals, sophistication — the four ways your RM lies · 7 min

Phase 2: Spotting Reward Hacks in the Wild

Name the four failure modes from real outputs

5 drops

Verbosity is the easiest hack — and the one your team will normalize · 6 min
"You're absolutely right!" — the four words your model learned to print · 7 min
When safety training overshoots, your model refuses to help you sharpen a knife · 7 min
Confidence is rewarded; calibration isn't — that's why your model sounds smart and is sometimes wrong · 8 min
One output, all four lenses — your first triage pass · 8 min

Phase 3: Mitigations and Their Trade-offs

Decide which mitigation fixes which failure mode

4 drops

Gao's law: every nat of KL costs you preference accuracy — and you can predict the cliff · 8 min
Tune β like a leash — too short, no learning; too long, full Goodhart · 7 min
If one RM hallucinates a reward signal, three RMs trip over each other's hallucinations · 8 min
Reward the steps, not the answer — and reward hacking has nowhere to hide · 8 min

Phase 4: Sketch a Mitigation Memo

Sketch a mitigation memo for one real failure

1 drop

Write the one-page memo: failure mode, root cause, mitigation, predicted result · 18 min

Related paths

🐍 Python Decorators Introduction

🦀 Rust Lifetimes Explained

☸️ Kubernetes Core Concepts

📈 Big O Intuition