Question 1

What is a tensor core and how is it different from a CUDA core?

Accepted Answer

This is covered in the "Understand Tensor Cores and Mixed Precision" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Question 2

Why do tensor cores use FP16 inputs but accumulate in FP32?

Accepted Answer

This is covered in the "Understand Tensor Cores and Mixed Precision" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Question 3

What does a 4x4 fused multiply-add actually do inside the chip?

Accepted Answer

This is covered in the "Understand Tensor Cores and Mixed Precision" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Question 4

How much faster is mixed precision than FP32 on a real model layer?

Accepted Answer

This is covered in the "Understand Tensor Cores and Mixed Precision" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Question 5

What changed between Volta, Ampere, Hopper, and Blackwell tensor cores?

Accepted Answer

This is covered in the "Understand Tensor Cores and Mixed Precision" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

🧮 Understand Tensor Cores and Mixed Precision

Phase 1: Why Chips Built a Core Just for Matmul

A neural net is a stack of small matmuls

CUDA cores do one FMA. Tensor cores do sixty-four.

FP16 inputs, FP32 accumulator — the whole magic

Mixed precision isn't 'half the bytes' — it's 'eight times the throughput'
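The FP32 accumulator named in the Phase 1 lesson titles can be motivated with a small experiment. The sketch below is not how the hardware is implemented; it only emulates rounding with Python's `struct` module (`"e"` is IEEE 754 half precision, `"f"` is single precision). The helper names `round_fp16`, `round_fp32`, and `accumulate` are hypothetical, chosen for illustration.

```python
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, round_acc):
    """Sum values, rounding the running total after every addition."""
    acc = 0.0
    for v in values:
        acc = round_acc(acc + v)
    return acc


# Ten thousand identical small FP16 inputs, as in a long dot product.
step = round_fp16(0.001)
values = [step] * 10_000

fp16_total = accumulate(values, round_fp16)  # accumulator kept in FP16
fp32_total = accumulate(values, round_fp32)  # accumulator kept in FP32
reference = step * len(values)               # double-precision reference

print("FP16 accumulator:", fp16_total)
print("FP32 accumulator:", fp32_total)
print("reference:       ", reference)
```

With an FP16 accumulator the running total stops growing once each increment is smaller than half the spacing between adjacent FP16 values, so it falls far short of the reference. The FP32 accumulator stays close to it, which is the motivation for keeping inputs narrow and the accumulator wide.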

Phase 2: Walking a 4x4 Tile Through the Silicon

The operand is a 4x4 tile, not a vector

Step 1: A and B land in registers

Step 2: Sixty-four FP16 multiplies in one cycle

Step 3: The FP32 accumulator catches every product

Step 4: D goes back to registers — or back through another tile
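The four steps above can be sketched in plain Python as the operation D = A x B + C on 4x4 tiles. This is an arithmetic model only, assuming FP16 operands and an FP32 accumulator emulated through the `struct` module; real hardware performs the work in parallel and its internal summation order and rounding may differ. The names `to_fp16`, `to_fp32`, and `tile_mma` are hypothetical.

```python
import struct

N = 4  # tile edge length


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tile_mma(a, b, c):
    """Return (D, multiply_count) for D = A @ B + C on 4x4 tiles."""
    # Step 1: A and B are held as FP16 values.
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * N for _ in range(N)]
    multiplies = 0
    for i in range(N):
        for j in range(N):
            # Step 3: the accumulator starts from C and stays in FP32.
            acc = to_fp32(c[i][j])
            for k in range(N):
                # Step 2: one FP16 x FP16 product per (i, j, k) triple.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
                multiplies += 1
            # Step 4: the result tile D is written back.
            d[i][j] = acc
    return d, multiplies


identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
b = [[float(4 * i + j) for j in range(N)] for i in range(N)]
ones = [[1.0] * N for _ in range(N)]

d, count = tile_mma(identity, b, ones)
print(d[0], count)
```

A 4x4 by 4x4 product needs 4 multiplies for each of the 16 outputs, so the loop counts 64 multiply-adds per tile operation, the same figure as the "sixty-four" in the Phase 1 lesson title.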

Phase 3: Volta to Blackwell, Generation by Generation

The training NaN'd. Ampere shipped BF16.

Your transformer wants FP8. H100 says yes.

Half your weights are zero. The tile knows.

FP4 sounds insane. Blackwell did it anyway.
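One common reason behind the first lesson title in this phase is range: FP16 has 5 exponent bits and cannot represent values much above 65504, while BF16 keeps the 8 exponent bits of FP32 and gives up mantissa bits instead. The sketch below illustrates that trade-off using only the standard library. `fits_fp16` and `to_bf16` are hypothetical helpers, and `to_bf16` emulates BF16 by rounding an FP32 bit pattern to its top 16 bits; it is meant for ordinary finite values, not NaN or infinity.

```python
import struct


def fits_fp16(x: float) -> bool:
    """True if x can be packed as a finite IEEE 754 half-precision value."""
    try:
        struct.pack("e", x)
        return True
    except OverflowError:
        return False


def to_bf16(x: float) -> float:
    """Round x to BF16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 70000.0  # e.g. a large activation or gradient
print("fits in FP16:", fits_fp16(value))
print("as BF16:     ", to_bf16(value))
```

The value is out of range for FP16 but survives in BF16 with a coarser mantissa, so it comes back rounded rather than overflowing.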

Phase 4: Estimating Your Layer's FLOPS Uplift

Estimate the FLOPS uplift on one real layer
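A rough way to approach this estimate is a two-term model: a matmul layer takes at least as long as its arithmetic at peak throughput, and at least as long as moving its operands at peak memory bandwidth. The sketch below makes that assumption explicit. The function names are hypothetical, and the constants are the published peak figures for a V100 SXM2 (15.7 TFLOPS FP32, 125 TFLOPS tensor-core FP16, 900 GB/s), used here only as illustrative inputs. Measured speedups are usually lower than these upper bounds.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def estimate_uplift(m, n, k, peak_fp32, peak_tensor, bandwidth):
    """Estimated FP32-to-mixed-precision speedup for one matmul layer.

    Assumes each operand is read or written exactly once, with 4-byte
    elements on the FP32 path and 2-byte elements on the mixed path.
    """
    flops = matmul_flops(m, n, k)
    elements = m * k + k * n + m * n
    t_fp32 = max(flops / peak_fp32, 4 * elements / bandwidth)
    t_mixed = max(flops / peak_tensor, 2 * elements / bandwidth)
    return t_fp32 / t_mixed


# Illustrative peak numbers (V100 SXM2 datasheet values).
PEAK_FP32 = 15.7e12    # FLOP/s
PEAK_TENSOR = 125e12   # FLOP/s
BANDWIDTH = 900e9      # bytes/s

large = estimate_uplift(4096, 4096, 4096, PEAK_FP32, PEAK_TENSOR, BANDWIDTH)
small = estimate_uplift(1, 4096, 4096, PEAK_FP32, PEAK_TENSOR, BANDWIDTH)
print(f"large batch: {large:.2f}x, batch of one: {small:.2f}x")
```

Under this model a large square layer is compute-bound and approaches the roughly eightfold peak ratio, while a batch-of-one layer is memory-bound and gains only the twofold saving from halving the bytes.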

Frequently asked questions

Related paths

🐍 Python Decorators Introduction

🦀 Rust Lifetimes Explained

☸️ Kubernetes Core Concepts

📈 Big O Intuition