🧪 Understand Benchmark Saturation and Contamination
MMLU has plateaued. HumanEval has likely leaked into training data. You'll separate saturation from contamination, run n-gram and perplexity checks on real test items, and design a holdout that's structurally hard to leak and defensible enough to put in front of a buyer.
Phase 1: What saturation and contamination actually mean
Saturation and contamination are not the same problem (6 min)
Saturation is a shape, not a score (7 min)
Contamination is a continuum, not a binary (7 min)
The last 5 points of progress are probably not real (7 min)
Phase 2: Detect contamination with n-grams, perplexity, canaries
N-gram overlap is the cheapest contamination probe (7 min)
Perplexity gaps reveal items the model has seen before (8 min)
Canary strings are a contamination smoke alarm you install once (6 min)
No single probe is enough: triangulate three (7 min)
Run the audit before you cite the score (7 min)
Phase 3: Goodhart, Arena pressure, and dynamic evals
A measure becomes a target and stops measuring (8 min)
The leaderboard shapes the training mix (7 min)
Moving targets defeat the optimization loop (8 min)
No single benchmark survives: portfolio them (7 min)
Phase 4: Audit your eval set and design a leak-resistant holdout
Audit and rebuild a real eval, end to end (25 min)
Frequently asked questions
- What's the difference between benchmark saturation and benchmark contamination?
- Why is MMLU no longer a useful signal of model progress?
- How do I detect if my eval set leaked into a model's pretraining data?
- What is a canary string and how do I use one in evals?
- Why do LiveBench and dynamic evals exist if static benchmarks are easier?
Each of these questions is covered in the “Understand Benchmark Saturation and Contamination” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.