🔬 Understand bf16, fp16, and Loss Scaling
Stop flipping the precision flag and praying. You'll read a float as sign-exponent-mantissa, see exactly why fp16 NaNs and bf16 doesn't, and prescribe the right fix — loss scaling, bf16, or a mixed policy — for any training run.
Phase 1. Inside a Float: Sign, Exponent, Mantissa
Read a float's bits and see range versus precision
A float is three knobs, not one number (7 min)
Exponent buys reach. Mantissa buys resolution. (7 min)
fp16 has a basement at 6e-5 (6 min)
fp16 has a ceiling at 65,504 (6 min)
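To make the Phase 1 ideas concrete, here is a minimal sketch that splits an fp16 value into its three fields and probes the basement and the ceiling. It assumes Python's standard `struct` module, whose `e` format stores IEEE half precision. The helper names `fp16_fields`, `to_fp16`, and `overflows_fp16` are made up for this illustration and are not part of the course material.

```python
import struct

def fp16_fields(x):
    """Round x to fp16 and split its 16 bits into (sign, exponent, mantissa)."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15                # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, stored with a bias of 15
    mantissa = bits & 0x3FF          # 10 mantissa bits
    return sign, exponent, mantissa

def to_fp16(x):
    """Round-trip x through fp16 storage and return what survives."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def overflows_fp16(x):
    """True if x is too large to store as a finite fp16 value."""
    try:
        struct.pack("<e", x)
        return False
    except (OverflowError, struct.error):
        return True

FP16_MAX = 65504.0            # the ceiling: (2 - 2**-10) * 2**15
FP16_MIN_NORMAL = 2.0 ** -14  # the basement for normal values, about 6.1e-5

print(fp16_fields(1.0))              # (0, 15, 0)
print(fp16_fields(FP16_MAX))         # (0, 30, 1023)
print(fp16_fields(FP16_MIN_NORMAL))  # (0, 1, 0)
print(to_fp16(1e-8))                 # 0.0, far below the basement
print(overflows_fp16(70000.0))       # True, above the ceiling
```

Values between the basement and roughly 6e-8 still survive as subnormals with reduced precision, so the basement marks where resolution starts to degrade rather than a hard cutoff.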
Phase 2. Watching Gradients Survive or Disappear
Walk gradients through fp16, bf16, and loss scaling
A gradient that lives in fp32 dies in fp16 (7 min)
Loss scaling is a range-shift, not a magic constant (7 min)
Mixed precision keeps an fp32 master copy of every weight (7 min)
bf16 needs no loss scaler at all (7 min)
Pick which gradient lives in which precision (7 min)
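The Phase 2 story can be sketched in a few lines: a tiny gradient flushes to zero in fp16, survives once it is shifted up by a loss scale, and survives in bf16 with no scaling at all. This is an illustration under stated assumptions, not the course's own code. The gradient value `1e-8` and the scale `2**16` are arbitrary choices, bf16 is simulated by truncating an fp32 encoding to its top 16 bits (real hardware typically rounds), and the unscale step runs in Python's double precision as a stand-in for fp32.

```python
import struct

def to_fp16(x):
    """Round-trip x through fp16 storage (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Simulate bf16 by keeping only the top 16 bits of the fp32 encoding.

    This truncates instead of rounding, which is enough to show the range.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

grad = 1e-8         # hypothetical tiny gradient, fine in fp32
scale = 2.0 ** 16   # hypothetical loss scale

lost = to_fp16(grad)            # underflows to zero in fp16
scaled = to_fp16(grad * scale)  # shifted into fp16's representable range
recovered = scaled / scale      # unscale in higher precision before the update
kept_bf16 = to_bf16(grad)       # bf16 shares fp32's 8 exponent bits

print(lost)        # 0.0
print(recovered)   # close to 1e-8
print(kept_bf16)   # close to 1e-8, with coarser precision
```

The scale only moves the gradient into range. It does not add precision, which is why the unscale and the weight update happen against a higher-precision copy.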
Phase 3. Why bf16 Took Over After the A100
Trace the post-A100 bf16 shift across real hardware
Your team trained fp16 last year because their GPU couldn't do bf16 (7 min)
Your inference service runs fp16: should it switch? (7 min)
A LayerNorm in bf16 makes your loss curve weird (7 min)
Google picked bf16 in 2017. Why did the industry wait? (7 min)
Phase 4. Diagnosing a NaN'ing Run
Diagnose a NaN'ing run and prescribe the fix
Prescribe a fix for a NaN'ing fp16 training run (8 min)
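One common prescription for an fp16 run that produces infs or NaNs is dynamic loss scaling: skip the step and shrink the scale when a gradient is non-finite, then grow the scale again after a streak of clean steps. The sketch below shows that control loop only. `DynamicLossScaler`, its defaults, and the toy gradient lists are hypothetical names and values chosen for illustration, not any framework's API.

```python
import math

class DynamicLossScaler:
    """Hypothetical minimal scaler: back off on overflow, grow after clean steps."""

    def __init__(self, scale=2.0 ** 16, growth_interval=3):
        self.scale = scale
        self.growth_interval = growth_interval  # clean steps needed before growing
        self.good_steps = 0

    def update(self, scaled_grads):
        """Return True if the optimizer step should run, False if it must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale /= 2.0      # overflow seen: halve the scale
            self.good_steps = 0
            return False           # skip this step so bad values never reach the weights
        self.good_steps += 1
        if self.good_steps == self.growth_interval:
            self.scale *= 2.0      # stable streak: try a larger scale again
            self.good_steps = 0
        return True

scaler = DynamicLossScaler()
toy_steps = [[1.0, 2.0], [float("inf"), 1.0], [0.5], [0.5], [0.5]]
history = [scaler.update(grads) for grads in toy_steps]

print(history)       # [True, False, True, True, True]
print(scaler.scale)  # 65536.0: halved once, then doubled after three clean steps
```

Production scalers usually wait far longer than three steps before growing. The short interval here only keeps the toy run readable.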
Frequently asked questions
- What's the actual difference between bf16 and fp16?
- Why does fp16 training NaN when bf16 doesn't?
- Do I still need loss scaling with bf16?
- Is bf16 always better than fp16 for training?
- What's mixed precision and why use fp32 master weights?

Each of these questions is covered in the “Understand bf16, fp16, and Loss Scaling” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.