Vizuara Kernel Engineering
Mentor Handbook · 02 The Machine

Coalescing: everybody boards the same bus

By the end of this chapter you'll be able to stand at a whiteboard and teach memory coalescing — why 32 threads reading neighbouring addresses fetch their data in a single trip, why 32 threads reading scattered addresses waste almost the whole trip, and why fixing this is the cheapest big speedup in the entire workshop. You need no electronics. You need one bus, one metaphor, and one honest number.

This is the first real optimization students will meet after the naive kernel. And it's the sweetest one to teach, because it changes nothing about the math. Same multiplies. Same adds. Same bytes we care about. We just read them in the right order — and the kernel runs several times faster. That contrast is the whole lesson: the pattern of access, not the amount, decides your speed.

The one idea: memory comes in busloads, not by the person

Start with the fact that reorganizes everything. A GPU does not fetch one number at a time from memory. When you ask for a float, the memory system doesn't hand you back four lonely bytes. It hands back a whole fixed-size chunk — a 128-byte line, which happens to be exactly 32 consecutive floats — whether you wanted all of them or just one.

So memory doesn't run like a taxi that carries one passenger to one door. It runs like a bus: it always makes the trip with a full 32-seat vehicle, and the only question that matters for speed is how many buses you have to dispatch.

🧠 Metaphor
The neighbourhood bus. Picture a warp — a group of 32 threads that move in lockstep — as 32 kids who all need a ride to school at the same moment. The city runs one kind of vehicle: a 32-seat bus. If all 32 kids live on the same street, one bus swings down that street, everyone climbs aboard, and it's a single trip. If the 32 kids are scattered one-per-street across 32 different streets, the city has to send 32 separate buses, each driving all the way out to pick up a single child and coming back 31 seats empty. Same 32 kids. Same destination. Thirty-two times the fuel. Coalescing is just: make the kids live on the same street.
The core metaphor: a warp is a busload. Neighbours ride together in one trip; scattered riders each need their own near-figure rendering · The core metaphor: a warp is a busload. Neighbours ride together in on
The core metaphor: a warp is a busload. Neighbours ride together in one trip; scattered riders each need their own near-empty bus.

That's the entire chapter in one picture. Everything from here is making "same street" precise and showing students where a real kernel accidentally scatters the kids.

The tiny by-hand number: 1 bus vs 32 buses

Put concrete numbers on the board so it stops being a vibe and becomes arithmetic.

A warp is 32 threads. A memory line is 128 bytes. A float is 4 bytes. So one line = 128 / 4 = 32 floats — one line holds exactly one warp's worth of floats. That coincidence is the whole reason this works out so cleanly, and it's worth pausing on.

🔢 By hand
Do both cases on the board with the same 32 threads. Case A — contiguous. Thread t reads the float at position t: thread 0 reads float 0, thread 1 reads float 1, … thread 31 reads float 31. All 32 addresses fall inside one 128-byte line. One trip. 128 bytes fetched, 128 bytes used. 100% useful. Case B — strided. Thread t reads the float at position t × 1000: thread 0 reads float 0, thread 1 reads float 1000, thread 2 reads float 2000… Every address lands in a different line. 32 trips. You fetch 32 × 128 = 4096 bytes and use 32 × 4 = 128 of them. That's about 3% useful; 97% thrown away.

Say that last number out loud and let it land. Same 32 floats wanted in both cases. Case A costs one trip; Case B costs thirty-two. The bytes you care about are identical — the bytes you're charged for differ by 32×.

The click
Here's the sentence that flips the room: "Coalescing doesn't read fewer bytes you want. It stops you paying for bytes you don't." Students assume a fast kernel must be doing less work. This one does the exact same math and requests the exact same useful data — it just stops dragging home 31 empty seats on every bus. The speedup is pure waste-removal, which is why it's free.

The real mechanism, built up gently

Now name the machinery, one term at a time, each defined as it arrives.

A warp is a group of 32 threads that execute the same instruction at the same time. When a warp hits a load instruction, the memory system does not see 32 independent requests. It gathers the 32 addresses those threads want and asks: how many fixed-size lines do these addresses touch?

The line is 128 bytes, aligned to a 128-byte boundary — 32 consecutive floats.1 On Hopper (H100) the 128-byte line is split into four 32-byte sectors, and the hardware can fetch just the sectors it needs. So a partial access wastes down to 32-byte granularity, not the full 128. For teaching, "128-byte busload" is the right first picture; mention sectors only if a sharp student asks why the waste isn't always exactly 32×. If the warp's 32 addresses all fall inside one line, the hardware runs one memory transaction — one bus. If they scatter across 32 lines, it runs up to 32 transactions — 32 buses.2 "Up to" because if two threads happen to land in the same 128-byte line, they share a bus. Worst case — a large stride — gives every thread its own line and hits the full 32× penalty squarely.

That's it. That's the whole rule: for one load by one warp, coalescing asks how many lines the 32 addresses touch. One line is the dream. Thirty-two is the disaster. Nothing about the amount of useful data changed — only the pattern.

Say a word about the road the data travels, because it explains why the line is the unit. A load walks a fixed path — from HBM (the far, huge memory) up through the L2 cache, into the fast on-chip memory near the cores, and finally into registers. Every rung of that path is priced in whole lines, never in single floats. So you pay per line touched, not per byte used. That single sentence is why a strided load is so ruinous: it touches many lines to use one float from each.

The technical translation of the bus picture: whether one load costs one transaction or thirty-two depends entirely on hfigure rendering · The technical translation of the bus picture: whether one load costs o
The technical translation of the bus picture: whether one load costs one transaction or thirty-two depends entirely on how thread indices map to addresses.
The path a load travels: HBM → L2 → L1/shared → registers, priced in fixed-size lines at every rung — which is exactly wfigure rendering · The path a load travels: HBM → L2 → L1/shared → registers, priced in f
The path a load travels: HBM → L2 → L1/shared → registers, priced in fixed-size lines at every rung — which is exactly why the pattern beats the volume.

Why does anyone ever scatter? Because of layout

The cruel part is that scattering happens by accident, in code that looks perfectly reasonable and computes the correct answer. To see why, students need one fact about how matrices sit in memory.

C and CUDA store a 2-D array in row-major order: the element A[i][j] lives at linear position i × N + j. Read that carefully with students. Elements along a row (increasing j) are right next to each other — positions …, i·N+j, i·N+j+1, …, one float apart. Elements down a column (increasing i) are N floats apart — a giant jump.

🧠 Metaphor
A matrix in memory is a book with no chapter breaks — just one long ribbon of numbers, row after row after row, taped end to end. Reading along a row is reading the ribbon left to right: the next number is right under your finger. Reading down a column means reading one word, then leaping a whole row's width to the next, then leaping again — same book, but you're pole-vaulting across the page instead of sliding along it. The ribbon (memory) rewards sliding and punishes leaping.
Row-major layout drawn as a ribbon: neighbouring columns are adjacent on the tape (coalesce), neighbouring rows are N flfigure rendering · Row-major layout drawn as a ribbon: neighbouring columns are adjacent
Row-major layout drawn as a ribbon: neighbouring columns are adjacent on the tape (coalesce), neighbouring rows are N floats apart (scatter).

So the entire coalescing verdict for any matrix access collapses to a single question you can teach students to ask every time:

As the thread number increases across a warp, does the column index change, or the row index?

Column-changing means neighbours on the ribbon — coalesced, one bus. Row-changing means leaping by N — scattered, many buses. That one question is the whole diagnostic skill, and it's the thing students should carry out of this workshop even if they forget every number: look at what moves as the thread index rises, and ask whether that motion slides along memory or leaps across it.

Where it bites the naive kernel

Now connect it to the kernel from the matmul chapters, because this is where the abstract idea becomes a real 6× on a real GPU.

In the naive kernel the two loads inside the inner loop are A[row·N + k] and B[k·N + col]. The threads of a warp differ in col (the fast axis) while sharing the same row. Trace it:

  • B[k·N + col] — as we step across the warp, col increases by 1 each thread, so we walk B[k][col], B[k][col+1], … along a row of B. Neighbours on the ribbon. Coalesced — one bus. Good.
  • A[row·N + k]row and k are the same for all 32 threads, so all 32 read the identical address. That's a broadcast, which the hardware also handles cheaply — one fetch, shared by all.
🎓 Teaching note
The board sequence that makes this click: (1) draw the ribbon and the "column changes → slide / row changes → leap" rule. (2) Write just the two load expressions, big. (3) Ask the room, for each load: "as the thread number goes up, which index moves — row or column?" Make them answer before you do. When they say "column" for B, cheer — that's the coalesced one. This turns a passive derivation into an active diagnosis, which is exactly the skill you want them to own. Don't front-load the SASS; the ribbon question is the transferable tool.

The honest point: the naive kernel isn't the worst case — B coalesces and A broadcasts. The trouble is that this happened by accident, from CUDA's default thread numbering, and it's fragile. The next kernel makes the good mapping on purpose, so you can reason about it and build on it.

The fix: one line, chosen deliberately

The remap is almost anticlimactically small — and that's the punchline you want students to feel. We flatten the thread block to 1-D and compute the 2-D position ourselves, so the fast-moving thread axis is guaranteed to land on the contiguous (column) axis:

const uint BLOCKSIZE = 32;
const uint row = blockIdx.y * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
const uint col = blockIdx.x * BLOCKSIZE + (threadIdx.x % BLOCKSIZE);

if (row < N && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}

The whole change is col = threadIdx.x % 32 and row = threadIdx.x / 32. Now a warp is exactly threadIdx.x = 0..31, so col runs 0..31 and row stays constant across the warp. Consecutive threads → consecutive columns → the B load is one clean 128-byte transaction per warp, and the A load is a tidy broadcast. We designed the good mapping instead of inheriting it by luck.

⚠️ Where students trip
The number-one confusion here: students think warps are built from threadIdx.y — "one warp per row of the block, right?" No. CUDA flattens threads with x fastest: the linear index is threadIdx.x + threadIdx.y·blockDim.x. So a warp is 32 threads with consecutive threadIdx.x. Get this backwards and your entire coalescing analysis inverts — you'll call the coalesced load strided and vice versa. The fix: on the board, physically number the threads 0..31 by walking left-to-right along x FIRST, only wrapping to the next y-row after 32. "x fills up before y moves." Make them chant it.
The whole change: deciding on purpose that the warp's fast axis lands on B's contiguous axis.figure rendering · The whole change: deciding on purpose that the warp's fast axis lands
The whole change: deciding on purpose that the warp's fast axis lands on B's contiguous axis.

The number that makes jaws drop

Here's the payoff to run as a live demo. The naive kernel sat at about 1.3% of cuBLAS (cuBLAS is NVIDIA's hand-tuned reference library — the speed everyone measures against). After the remap, the kernel jumps to about 8.5% of cuBLAS — roughly a 6.4× speedup. Write those two percentages on the board before you show the code change, so the size of the win is fixed in their minds when they see how small the change is.

▶️ Live demo
Run both kernels back to back under nvidia-smi/Nsight and show two numbers on screen: the naive time and the coalesced time. Then show a diff of the source — it's two lines. The room's reaction to "6× faster, two lines, zero fewer multiplies" is the emotional peak of the memory section. If you can, open Nsight Compute and point at the metric l1tex__t_sectors_per_request (sectors fetched per load): the naive number is bloated; the coalesced one drops toward the ideal floor of 4 sectors per warp. That single metric is "how full were the buses."
The change is invisible in the FLOP count and loud in the memory metrics: fuller buses, fewer trips, a 6× win.figure rendering · The change is invisible in the FLOP count and loud in the memory metri
The change is invisible in the FLOP count and loud in the memory metrics: fuller buses, fewer trips, a 6× win.
🏭 In production today
This is not a lab curiosity — it's the first thing any serious kernel gets right, everywhere money is spent on inference. Every production matmul and attention kernel in vLLM, in FlashAttention, in the kernels DeepSeek and Meta run to serve models to millions, is built so that consecutive threads read consecutive addresses. On an H100, HBM3 can move about 3.35 TB/s — but only if your accesses are coalesced. Scatter them and you might see a third of that. That gap — running your multi-million-dollar cluster at 30% vs 90% of its rated memory bandwidth — is decided by whether the kids ride one bus or thirty-two. Coalescing is the price of admission to fast; you can't skip it and be clever later.

Be honest about what it did not fix

Leave students with the right sense of proportion, or they'll think this was the whole game. Coalescing made each bus full — but it did nothing about the fact that we send far too many buses. The kernel still re-reads the same element of A from slow memory once for every thread that needs it. We're still hauling O(N³) floats from HBM to do O(N³) math — dreadful reuse.

🎤 Say this at the board
"Coalescing filled every bus. It did not stop us running the same route over and over. We're still fetching the same rice from the far pantry a thousand times because a thousand cooks each walked out to get it themselves. The next win is to fetch it once and share it — that's shared memory, kernel 3, and that's where the real climb begins." This frames coalescing correctly: not the summit, but the tax you must pay before any cleverness is worth attempting.

That's the frame to end on. Coalescing is the cheapest big win — ~6×, zero math removed — precisely because it fixes waste, not work. And it's mandatory first: there's no point staging data into fast memory if the loads that fill it are scattered. Get the warp's fast axis onto contiguous memory. Then you've earned the right to be clever.

You can now teach

  • The bus metaphor: a warp is a 32-seat busload, memory comes in 128-byte lines, and speed is just "how many buses did we dispatch?" — one for neighbours, up to 32 for scattered riders.
  • The by-hand number: contiguous = 1 transaction, 100% used; strided = up to 32 transactions, ~3% used — same useful floats either way.
  • The ribbon of row-major memory and the one diagnostic question: as the thread number rises, does the column index move (slide, coalesced) or the row index (leap, scattered)?
  • Where it bites the naive matmul (B coalesces, A broadcasts — but only by accident) and the two-line remap that makes the good mapping deliberate.
  • The warp-numbering trap (x fills before y moves) that flips students' whole analysis if they get it wrong — and how to drill it.
  • The jaw-drop demo and production stakes: ~6.4× from two lines and zero fewer multiplies, and why every real kernel — vLLM, FlashAttention, DeepSeek — is designed around coalescing to hit 3.35 TB/s instead of a third of it.