Vizuara Kernel Engineering

01 The GPU, From Silicon Up

A tour of the H100/B200 from the die down to the wire. Every component, what it costs, and why it exists — the vocabulary the rest of the site assumes.

The Streaming Multiprocessor (SM)
The unit of GPU compute. Sub-partitions, schedulers, and the on-chip memory that makes or breaks a kernel.
CUDA cores & the FP32/INT pipes
The scalar ALUs, what they can and can't do, and why they are no longer where the FLOPs are.
Tensor cores (TC)
The matrix-multiply-accumulate engines that hold ~95% of an H100's FLOPs. Shapes, precisions, and why kernels are written around them.
The warp scheduler & latency hiding
How an SM hides 400-cycle memory latency by juggling dozens of warps — the reason occupancy matters.
The register file (RMEM)
The fastest and most contended memory on the chip. Register pressure, spills, and why 255 registers per thread is a magic number.
Shared memory & L1 (SMEM)
The programmable scratchpad that shares one on-chip SRAM array with the L1 cache. Up to 228 KiB per SM on the H100, 32 banks, and the single most important optimization surface.
The L2 cache & partitions
50 MiB on the shipping H100 (60 MiB on the full GH100 die), split across two partitions joined by a crossbar, with data compression, residency control, and near/far latency.
HBM, global memory & the package
Stacked DRAM, through-silicon vias, the interposer, 3.35 TB/s — and why every kernel is ultimately a memory-traffic problem.
GPCs, TPCs & the chip floorplan
How 132 SMs are grouped, why defective SMs get fused off, and what thread-block clusters buy you.
A100 → H100 → B200: what changed
TMA, wgmma, DSMEM, FP8, then tcgen05, TMEM and NVFP4 — the architectural deltas that each spawned new kernels.
The roofline model in practice
Plotting your kernel on the roofline, reading the ridge point, and knowing when to stop optimizing.
Occupancy: the balancing act
Registers × threads × shared memory → how many warps fit, and why more occupancy is not always faster.
Arithmetic intensity (AI)
FLOPs per byte: the number that decides your regime and the target of every tiling trick.
Memory coalescing
How a warp's 32 loads become one 128-byte transaction — or thirty-two — and what that does to your bandwidth.
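
The roofline and arithmetic-intensity entries above come down to a small calculation, sketched here in Python. The 3.35 TB/s bandwidth is the figure quoted in the HBM entry. The peak compute figure is an assumption for illustration (roughly the H100's dense BF16 tensor-core peak), and the function names are hypothetical. The GEMM traffic estimate also assumes each matrix crosses HBM exactly once, which real kernels only approach.

```python
# Roofline sketch: attainable throughput = min(peak compute, AI * bandwidth).
HBM_BANDWIDTH = 3.35e12  # bytes/s, the figure quoted in the HBM entry
PEAK_FLOPS = 989e12      # FLOP/s, ASSUMED dense BF16 tensor-core peak


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from global memory."""
    return flops / bytes_moved


def ridge_point(peak=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Intensity at which a kernel stops being memory-bound."""
    return peak / bandwidth


def attainable_flops(ai, peak=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """Upper bound on throughput for a kernel with intensity `ai`."""
    return min(peak, ai * bandwidth)


def gemm_intensity(m, n, k, bytes_per_element=2):
    """Idealized GEMM intensity: A, B, and C each touch HBM exactly once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return arithmetic_intensity(flops, bytes_moved)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.0f} FLOP/byte")
    # FP32 vector add: 1 FLOP per 12 bytes (two loads, one store).
    vec_add = arithmetic_intensity(1, 12)
    print(f"vector add bound: {attainable_flops(vec_add) / 1e12:.2f} TFLOP/s")
    gemm = gemm_intensity(4096, 4096, 4096)
    print(f"4096^3 GEMM intensity: {gemm:.0f} FLOP/byte")
```

Under these assumptions, a kernel whose intensity sits below the ridge point is limited by memory traffic, so tiling to raise FLOPs per byte helps. Above the ridge point, further traffic reduction buys nothing and the remaining work is in keeping the tensor cores fed.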