🧮 Understand Tensor Cores and Mixed Precision

Stop hand-waving about '100x faster than CUDA cores.' You'll trace one 4x4 tile through a tensor core's registers, multipliers, and FP32 accumulator, then estimate the real FLOPS uplift from switching one layer of your favorite model to mixed precision.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Chips Built a Core Just for Matmul

See why a neural net is a stack of small matmuls

4 drops
  1. A neural net is a stack of small matmuls

    6 min

  2. CUDA cores do one FMA. Tensor cores do sixty-four.

    6 min

  3. FP16 inputs, FP32 accumulator — the whole magic

    7 min

  4. Mixed precision isn't 'half the bytes' — it's 'eight times the throughput'

    7 min

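The idea behind drop 3 can be checked without a GPU. The sketch below is an illustration only, not course material: it emulates FP16 and FP32 rounding with Python's `struct` module and sums 4,096 products of 1.0 in each format. The helper names (`round_fp16`, `round_fp32`, `accumulate`) are made up for this example, and real tensor cores may order and round their additions differently.

```python
import struct

def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision (FP32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(products, round_fn):
    """Sum the products one at a time, rounding the running total after each add."""
    acc = 0.0
    for p in products:
        acc = round_fn(acc + p)
    return acc

# Hypothetical workload: 4,096 partial products, each exactly 1.0.
products = [1.0] * 4096

fp16_sum = accumulate(products, round_fp16)
fp32_sum = accumulate(products, round_fp32)
print(fp16_sum, fp32_sum)  # prints "2048.0 4096.0"
```

Under these assumptions the FP16 accumulator stalls at 2048, because 2049 is not representable in FP16 and rounds back down, while the FP32 accumulator reaches the exact answer.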

Phase 2: Walking a 4x4 Tile Through the Silicon

Trace a 4x4 FMA tile through the silicon by hand

5 drops
  1. The operand is a 4x4 tile, not a vector

    7 min

  2. Step 1: A and B land in registers

    7 min

  3. Step 2: Sixty-four FP16 multiplies in one cycle

    6 min

  4. Step 3: The FP32 accumulator catches every product

    7 min

  5. Step 4: D goes back to registers — or back through another tile

    7 min

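The four steps of this phase can be sketched in plain Python as a software model of one tile operation, D = A x B + C. This is a hedged illustration, not a description of the actual circuit: the function name `tile_fma` and the rounding helpers are invented for the example, and the hardware's internal addition order is not modeled. The operands are rounded to FP16, each product is formed exactly, and the running sum is kept in FP32.

```python
import struct

def to_fp16(x: float) -> float:
    """Round to the nearest FP16 value (models loading an operand as half precision)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x: float) -> float:
    """Round to the nearest FP32 value (models the accumulator width)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def tile_fma(a, b, c, n=4):
    """Compute D = A @ B + C on n x n tiles and count the multiplies performed."""
    # Step 1: A and B land in "registers" as FP16 values.
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    multiplies = 0
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])              # accumulator starts from C
            for k in range(n):
                prod = a16[i][k] * b16[k][j]    # Step 2: an FP16 x FP16 product is exact in FP32
                acc = to_fp32(acc + prod)       # Step 3: accumulate in FP32
                multiplies += 1
            d[i][j] = acc                       # Step 4: D goes back out
    return d, multiplies

# Hypothetical inputs: A is the identity, so D should equal B + C.
A = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[float(4 * i + j) for j in range(4)] for i in range(4)]
C = [[0.5] * 4 for _ in range(4)]

D, multiplies = tile_fma(A, B, C)
print(multiplies)  # prints 64
print(D[1])        # prints [4.5, 5.5, 6.5, 7.5]
```

Each of the 16 outputs needs a 4-term dot product, which is where the count of 64 multiply-adds per tile comes from.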

Phase 3: Volta to Blackwell, Generation by Generation

Walk Volta to Blackwell through real workloads

4 drops
  1. The training NaN'd. Ampere shipped BF16.

    7 min

  2. Your transformer wants FP8. H100 says yes.

    7 min

  3. Half your weights are zero. The tile knows.

    7 min

  4. FP4 sounds insane. Blackwell did it anyway.

    7 min

Phase 4: Estimating Your Layer's FLOPS Uplift

Estimate one layer's mixed-precision FLOPS uplift

1 drop
  1. Estimate the FLOPS uplift on one real layer

    18 min

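A back-of-the-envelope version of this estimate might look like the sketch below. Everything specific in it is an assumption for illustration: the layer shape is hypothetical, the peak figures are NVIDIA's published numbers for the V100 SXM2 (15.7 TFLOPS FP32, 125 TFLOPS FP16 tensor), and the utilization fractions are placeholder guesses you would replace with your own measurements.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k

def uplift(fp32_peak, tensor_peak, fp32_util=1.0, tensor_util=1.0):
    """Ratio of achieved tensor-core throughput to achieved FP32 throughput."""
    return (tensor_peak * tensor_util) / (fp32_peak * fp32_util)

# Hypothetical linear layer: 4,096 tokens, 4,096 inputs, 4,096 outputs.
TOKENS, D_IN, D_OUT = 4096, 4096, 4096

# Published V100 SXM2 peak throughput, in TFLOPS.
V100_FP32_TFLOPS = 15.7
V100_TENSOR_TFLOPS = 125.0

flops = matmul_flops(TOKENS, D_OUT, D_IN)

# Best case: both paths run at their datasheet peak.
peak_uplift = uplift(V100_FP32_TFLOPS, V100_TENSOR_TFLOPS)

# Placeholder utilizations (assumptions, not measurements).
realistic_uplift = uplift(V100_FP32_TFLOPS, V100_TENSOR_TFLOPS,
                          fp32_util=0.9, tensor_util=0.6)

fp32_ms = flops / (V100_FP32_TFLOPS * 1e12) * 1e3
tensor_ms = flops / (V100_TENSOR_TFLOPS * 1e12) * 1e3

print(f"layer FLOPs:        {flops:.3e}")
print(f"FP32 time at peak:  {fp32_ms:.2f} ms")
print(f"tensor time at peak: {tensor_ms:.2f} ms")
print(f"peak uplift:        {peak_uplift:.2f}x")
print(f"assumed uplift:     {realistic_uplift:.2f}x")
```

The peak ratio comes out at roughly 8x. The uplift you actually see is usually lower, because this model ignores memory bandwidth, kernel launch overhead, and any layers that stay in FP32.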

Frequently asked questions

What is a tensor core and how is it different from a CUDA core?
This is covered in the “Understand Tensor Cores and Mixed Precision” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Why do tensor cores use FP16 inputs but accumulate in FP32?
This is covered in the “Understand Tensor Cores and Mixed Precision” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What does a 4x4 fused multiply-add actually do inside the chip?
This is covered in the “Understand Tensor Cores and Mixed Precision” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How much faster is mixed precision than FP32 on a real model layer?
This is covered in the “Understand Tensor Cores and Mixed Precision” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What changed between Volta, Ampere, Hopper, and Blackwell tensor cores?
This is covered in the “Understand Tensor Cores and Mixed Precision” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.