Mentor Handbook · 02 The Machine

Latency hiding: the short-order cook juggling tickets

By the end of this chapter you'll be able to stand at a whiteboard and teach the single most important idea in how a GPU actually works: that memory is painfully slow, and the GPU wins anyway — not by making memory faster, but by never sitting still while it waits. You'll teach latency hiding, and the number that measures it, occupancy. No electronics needed. One kitchen metaphor, one honest number, and a picture your students will never forget.

This chapter builds directly on the cafeteria idea from the CPU-vs-GPU chapter. There we said the real bottleneck isn't doing the math — it's feeding the cooks. Now we answer the obvious next question: if feeding is so slow, how does the GPU stay fast at all? The answer is the whole soul of the machine.

The one-sentence answer

Reaching out to the GPU's main memory to fetch a number takes roughly 400 cycles. That is an eternity — in that time the chip could have done hundreds of multiply-adds. A CPU would panic and spend huge effort trying to make that wait shorter. The GPU does the opposite. It accepts the 400-cycle wait, and simply makes sure it always has other work ready to do while it waits. It never reduces the wait. It hides it.

🧠 Metaphor

The short-order cook at a busy diner. One cook, one grill, dozens of tickets clipped up on the rail. Ticket #1 says "eggs" — the cook cracks them and they need four minutes to cook. Does the cook stand and stare at the eggs for four minutes? Never. She starts ticket #2's toast, flips ticket #3's bacon, plates ticket #4. By the time she circles back, the eggs are done. She's one worker, but because she keeps dozens of tickets in flight, the grill is never idle. The four-minute egg-wait is completely hidden behind other cooking. That is exactly what a GPU warp scheduler does — the eggs are a slow memory load, and the other tickets are other warps ready to run.

🎓 Teaching note

Draw the diner first, before any GPU words. Act it out: mime cracking eggs, then throw your hands up in mock boredom ("do I just... watch them?"), let the class laugh, then rapidly mime flipping other things. The physical comedy plants the idea. Only after the picture lands do you translate: the cook is the scheduler, a ticket is a warp, the slow eggs are a memory load. Never lead with "warp scheduler." Earn the word.

What a warp is, and what the scheduler does

To teach this you need exactly two new nouns, introduced gently.

A warp is a group of 32 threads that move together in lockstep: 32 workers all doing the same instruction at the same instant. Think of a warp as one ticket on the rail, the unit the cook picks up and works on. It's always exactly 32, on every CUDA-capable NVIDIA GPU to date. That number is baked into the hardware.

The warp scheduler is the cook. Each cycle, it looks at all the warps sitting on its rail, finds one whose next instruction is ready to go (all its ingredients have arrived), and runs that one. If the warp it just ran is now stuck waiting on a slow memory load — fine. It just picks a different ready warp next cycle. There's no cost to switching. Every warp's state is already sitting in the hardware, so "switching tickets" is free — the cook doesn't clean the grill between tickets, she just looks up and grabs the next ready one.

🎤 Say this at the board

"Here's the magic word, and it's free: on a GPU, switching from one warp to another costs nothing. No saving your place, no cleanup. All the warps are already loaded, all their numbers already sitting in the registers. So the scheduler just glances at the rail every single cycle and asks one question — 'who's ready?' — and runs that one. That's it. That's the whole trick that makes a GPU fast."

🔢 By hand

The tiniest concrete version. One scheduler. Warp 0 issues a memory load at cycle 0, so it'll be stuck for ~400 cycles. But warps 1, 2, and 3 each have plenty of independent arithmetic ready (for this toy example, assume they never need memory themselves). So: cycle 1, run warp 1's multiply-add. Cycle 2, run warp 2's. Cycle 3, warp 3's. Cycle 4, back to warp 1. As long as those three warps keep having ready work, the scheduler issues something useful every single cycle, and warp 0's giant 400-cycle wait is completely covered. Nobody ever sees the stall. Four warps hid it.

The whole idea in one frame: warp 0's 400-cycle memory wait is fully covered by useful work from warps 1–3. That is latency hiding.
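If you want to replay the four-warp example on a laptop before drawing it, here is a minimal Python sketch. It is a toy, not a model of real hardware: the function name issue_timeline and the round-robin pick order are illustrative choices, and it assumes warps 1 to 3 always have arithmetic ready and that warp 0 touches memory exactly once.

```python
def issue_timeline(num_cycles, num_warps=4, load_latency=400):
    """Return which warp the scheduler issues on each cycle (None = idle)."""
    # Cycle at which each warp is next allowed to issue.
    ready_at = [0] * num_warps
    timeline = []
    last_issued = -1
    for cycle in range(num_cycles):
        issued = None
        # Round-robin: start looking just after the warp that ran last.
        for step in range(1, num_warps + 1):
            warp = (last_issued + step) % num_warps
            if ready_at[warp] <= cycle:
                issued = warp
                break
        timeline.append(issued)
        if issued is not None:
            last_issued = issued
        # Toy assumption: only warp 0 ever touches memory, once, at cycle 0.
        if issued == 0 and cycle == 0:
            ready_at[0] = cycle + load_latency
    return timeline


timeline = issue_timeline(401)
print(timeline[:5])      # [0, 1, 2, 3, 1]
print(None in timeline)  # False: no issue slot went empty
print(timeline[400])     # 0: warp 0's load has landed and it runs again
```

The first five entries match the board version: warp 0's load, then warps 1, 2, 3, then back to warp 1.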

The napkin math: how many tickets do you need?

Here's where it stops being a vibe and becomes arithmetic your students can do. Ask the question straight: if a memory load takes ~400 cycles, and the scheduler can only run one warp per cycle, how many warps do I need on the rail so the scheduler always has someone ready?

The rough answer comes from a very old, very simple idea called Little's Law: the number of jobs you need in flight equals how long each job waits multiplied by how fast jobs get served. On the ticket rail that works out to the wait time divided by the cycles of independent work each warp brings to the grill before it blocks again. If the wait is ~400 cycles and each warp has ~30 cycles of work to issue between loads, you need about 400 ÷ 30 ≈ 13 warps in flight to keep the cook busy all the way through the wait.

Little's Law on a napkin: a 400-cycle wait needs only about a dozen warps to hide, not 400 — because each warp's own work overlaps.
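The same napkin, written as a few lines of Python so students can change the numbers themselves. The function name warps_to_hide is made up for this sketch, and the ~400 and ~30 cycle figures are the rough values used in this chapter, not measurements.

```python
def warps_to_hide(stall_cycles, work_cycles_between_stalls):
    """Napkin estimate: stall length divided by the work each warp brings."""
    return stall_cycles / work_cycles_between_stalls


print(round(warps_to_hide(400, 30), 1))  # 13.3
```

Have students try a warp that only has 10 cycles of work between loads, then one that has 100, and watch how the required rail length moves.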

✨ The click

Say the number and let it land: "You do NOT need 400 warps to hide a 400-cycle stall. You need about a dozen. Because each warp does a little independent work of its own before it needs memory again, and all those little bits of work overlap. A handful of warps, each carrying its own tickets, covers the whole wait." This is the number that makes occupancy feel achievable instead of hopeless.

⚠️ Where students trip

The trap students fall in: "so a 400-cycle stall needs 400 warps." No! That would only be true if each warp did exactly one instruction and then immediately blocked again. Real warps do several independent instructions between memory loads, and those overlap across warps. The magic word is ratio — you need (stall length) ÷ (work each warp does between stalls) warps, not (stall length) warps. Draw a dozen tickets on the rail, not four hundred.
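To let students see the ratio rather than take it on faith, here is a small Python simulation of one scheduler. It is a sketch under loud assumptions: every warp repeats the same pattern (a burst of work_cycles instructions, the last of which is a load that blocks it for stall_cycles), the scheduler always picks the lowest-numbered ready warp, and the name scheduler_utilization is hypothetical. Real schedulers and real kernels are messier.

```python
def scheduler_utilization(num_warps, work_cycles=30, stall_cycles=400,
                          total_cycles=20_000):
    """Fraction of cycles in which the scheduler found a ready warp."""
    ready_at = [0] * num_warps              # when each warp may issue again
    work_left = [work_cycles] * num_warps   # instructions left in this burst
    busy = 0
    for cycle in range(total_cycles):
        for warp in range(num_warps):
            if ready_at[warp] <= cycle:
                busy += 1
                work_left[warp] -= 1
                if work_left[warp] == 0:
                    # Burst finished with a load: block until it returns.
                    ready_at[warp] = cycle + stall_cycles
                    work_left[warp] = work_cycles
                break  # one issue slot per cycle
    return busy / total_cycles


for n in (2, 13, 16):
    print(n, "warps:", round(scheduler_utilization(n), 2))

# The trap case: one instruction per warp between loads.
print("13 warps, 1 instruction each:",
      round(scheduler_utilization(13, work_cycles=1), 2))
```

In this toy model two warps leave the cook idle most of the time, thirteen keep her busy roughly nine cycles in ten, and a couple more close the remaining gap (the napkin ratio ignores the cycles each warp spends issuing its own work). With only one instruction between loads, the same thirteen warps barely register, which is the "400 warps" case students imagine.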

Occupancy: how full is the rail?

Now name the thing. Occupancy is simply: how many warps did I manage to keep resident on the SM, compared to the maximum it can hold? It's a fraction.

On an H100, one SM (Streaming Multiprocessor — one processing unit on the chip, and it has four schedulers) can hold up to 64 warps at once. That's the maximum-length ticket rail. If your kernel manages to keep 32 warps resident, that's 32 ÷ 64 = 50% occupancy. Keep all 64 and you're at 100%.

More warps resident means more tickets on the rail means more chances the scheduler always finds a ready one — which means better latency hiding. That's the entire reason occupancy matters. It is a measure of how well you can hide the memory wait.

🧠 Metaphor

Occupancy is just how crowded the ticket rail is. A rail with two tickets: the moment both are waiting on eggs, the cook stands idle — the diner slows down. A rail packed with twelve tickets: there's always something to flip, the grill never cools. Occupancy is the length of your ticket rail as a percentage of the longest rail the diner allows.

Two tickets starves the cook; a dozen keeps the grill full. That fullness, as a fraction of the max, is occupancy.

Why you don't just get 64 warps: the three limits

Here's the honest catch, and it's a great teaching beat because it turns occupancy from a mystery into a simple calculation. You almost never reach 64 warps, because three finite resources on the SM get shared out among your warps, and whichever runs out first sets your limit. It's a min(), not an average — the scarcest resource wins.

The three resources:

Registers: the tiny ultra-fast scratch slots each thread uses to hold its live numbers. The SM has one big pool of 65,536 of them (each 32 bits wide). If each thread demands a lot of registers, fewer threads fit, so fewer warps go resident. Greedy threads → short rail.
Shared memory: a small fast on-chip scratchpad (up to ~228 KiB per SM) that a block of threads shares. If each block grabs a big chunk, only a few blocks fit.
Thread/block slots: hard ceilings of at most 1024 threads per block and 2048 threads (64 warps) total per SM.

Whichever of these you exhaust first caps your occupancy. That's the one mental model to drill: occupancy = min(register limit, shared-memory limit, thread limit).

🔢 By hand

Work one on the board — it takes thirty seconds and demystifies everything. Take the shared-memory GEMM kernel: 1024 threads per block, and the compiler reports it uses 37 registers per thread. Registers needed for one block: 37 × 1024 = 37,888. The SM has 65,536. Two blocks would need 75,776 — too many! So only one block fits. One block is 32 warps. 32 ÷ 64 = 50% occupancy, capped by registers. The shared memory could've fit two dozen blocks; the thread ceiling allowed two; but registers permitted only one. The scarcest resource won.

Three gauges, one verdict. Whichever resource is scarcest for your block caps how many warps go resident — here, registers.
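The three-gauge calculation can be sketched in Python so students can try their own kernels. This is a simplified model, not the official occupancy calculator: the function name occupancy is illustrative, the defaults are the H100 per-SM figures quoted in this chapter, register allocation granularity and any per-SM block-count cap are ignored, and the 8 KiB of shared memory per block is an assumed value chosen only so that roughly two dozen blocks would fit.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block_bytes,
              regs_per_sm=65_536, smem_per_sm_bytes=228 * 1024,
              max_threads_per_sm=2048, warp_size=32):
    """Return (occupancy fraction, blocks allowed by each resource)."""
    limits = {
        "registers": regs_per_sm // (regs_per_thread * threads_per_block),
        "shared_memory": (smem_per_sm_bytes // smem_per_block_bytes
                          if smem_per_block_bytes else float("inf")),
        "threads": max_threads_per_sm // threads_per_block,
    }
    # The scarcest resource wins: a min(), not an average.
    resident_blocks = min(limits.values())
    resident_warps = resident_blocks * threads_per_block // warp_size
    max_warps = max_threads_per_sm // warp_size
    return resident_warps / max_warps, limits


# The worked example: 1024 threads per block, 37 registers per thread,
# and an ASSUMED 8 KiB of shared memory per block.
fraction, limits = occupancy(1024, 37, 8 * 1024)
print(fraction)                      # 0.5
print(min(limits, key=limits.get))   # registers
```

Try dropping regs_per_thread to 32 and the register gauge allows two blocks, so the thread ceiling becomes the limit instead.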

The plot twist: more occupancy isn't always better

This is the sophisticated beat that separates a mentor who read about occupancy from one who gets it. State the trap first: "Once you know occupancy hides latency, your instinct screams maximize it — cram in every warp! That instinct is wrong."

Why? Because hiding latency has a sufficiency point, not a bottomless appetite. Remember the napkin math: you needed about a dozen warps to fully cover a 400-cycle stall. Once you have that dozen, the scheduler is already never idle. Adding a thirteenth, a fortieth — buys you nothing. The wait is already hidden. There's no one left waiting to be covered.

And extra warps aren't free. Every warp you add to the rail is paid for out of the same register pool: more resident warps means fewer registers per thread, and registers are precious. The fastest kernels do the opposite of cramming warps: they give each thread a big pile of registers to hold a tile of results, so each thread has many independent multiply-adds in flight at once. That's a second, sneakier way to hide latency: within a single warp, by having lots of independent work queued up in it. It's like one ticket that itself has eight things cooking at once.

✨ The click

The line that reframes the whole thing: "The best GEMM and FlashAttention kernels — the ones actually running in cuBLAS and serving models to millions of people right now — often run at single-digit occupancy on purpose. They keep the rail nearly empty. They hand each warp a huge job and let it hide latency all by itself. Occupancy was never the goal. Hiding the wait was the goal, and they found a better way to hide it." Watch students' models of the world reorganize.

Occupancy across warps and independent work within a warp are two routes to the same goal. Fast kernels deliberately pick the second.
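A crude way to put numbers on the two routes, as a Python sketch. The model is an assumption made for illustration only: it treats a warp carrying several independent chains of work as covering that many times more of the wait before it blocks, and the names warps_needed and independent_chains are hypothetical. It rounds up, so the thin-ticket case lands one above the napkin's ≈ 13.

```python
import math


def warps_needed(stall_cycles, work_cycles_between_stalls, independent_chains=1):
    """Rough warp count to cover a stall, given per-warp independent work."""
    covered_per_warp = work_cycles_between_stalls * independent_chains
    return math.ceil(stall_cycles / covered_per_warp)


print(warps_needed(400, 30))                        # 14: many thin tickets
print(warps_needed(400, 30, independent_chains=8))  # 2: a few fat tickets
```

The point for the board is only the direction of the trade: the more independent work each warp carries, the fewer warps the rail needs.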

The production link

Frame the stakes so it's not a toy. Everything on the GEMM optimization ladder your students will climb is, from this angle, one long campaign to keep the warp scheduler busy during those 400-cycle memory windows. The naive kernel sits at a humiliating 1.3% of cuBLAS for exactly this reason: it fires a fresh slow memory load for nearly every number, with almost no independent work between loads, so the scheduler runs out of ready warps and the issue slot goes empty cycle after cycle. The profiler would show it pinned on a stall reason literally named "Long Scoreboard" — waiting on memory.

🏭 In production today

Concrete and current. When DeepSeek or Meta serves a model to millions of users on racks of H100 and B200 GPUs, whether the scheduler's issue slot is full 40% of the time or 90% of the time is, directly, whether you need one cluster or two — half the electricity, half the machines. FlashAttention became the kernel the entire industry adopted within months precisely because it keeps the schedulers fed far better than what came before. And it does it not by maxing occupancy but by the fat-ticket trick: big register tiles, few warps, deep independent work. This exact idea — hide the wait, keep the cook busy — is where hardware money is won or lost every single day.

▶️ Live demo

The one live demo. Run a memory-bound kernel under Nsight Compute (ncu) and open "Warp State Statistics." Point at the biggest bar: "Stall Long Scoreboard, 62%." Say: "The profiler is telling us that 62% of the cycles our warps spent stalled were spent waiting on memory, and we didn't have enough other tickets to cover it. That one bar is our entire to-do list." Then show the same kernel after adding shared-memory reuse: the bar shrinks, the speed jumps. Nothing makes latency hiding real like watching that bar move.

That's the chapter. One cook, a rail full of tickets, a 400-cycle egg-wait hidden behind other cooking — and the twist that a nearly-empty rail can be the fastest of all. If a student leaves able to say why the GPU hides latency instead of reducing it, and why occupancy is a means and not the goal, you've given them the beating heart of how a GPU works.

You can now teach

Latency hiding as a short-order cook juggling tickets — the GPU accepts the ~400-cycle memory wait and covers it with other ready work instead of trying to make it shorter.
The warp (32 threads, one ticket) and the warp scheduler (the cook picking a ready warp every cycle for free), with the tiny four-warp by-hand example that hides a 400-cycle stall.
The napkin math (Little's Law): you need only about a dozen warps to hide a 400-cycle stall, not 400 — because independent work overlaps.
Occupancy as "how full is the ticket rail" — resident warps ÷ max warps — and the min() of three limits (registers, shared memory, thread slots) that caps it, worked by hand to 50%.
The plot twist: hiding latency has a sufficiency point, extra warps cost registers, and the fastest real kernels (cuBLAS, FlashAttention) run at single-digit occupancy on purpose.
The production hook: the whole GEMM ladder is a campaign to keep the scheduler fed, and "Long Scoreboard" in Nsight is the to-do list that decides half the electricity bill.
