01 The GPU, From Silicon Up

A tour of the H100/B200 from the die down to the wire. Every component, what it costs, and why it exists — the vocabulary the rest of the site assumes.

The Streaming Multiprocessor (SM)

The unit of GPU compute. Sub-partitions, schedulers, and the on-chip memory that makes or breaks a kernel.

CUDA cores & the FP32/INT pipes

The scalar ALUs, what they can and can't do, and why they are no longer where the FLOPs are.

Tensor Cores

The matrix-multiply-accumulate engines that hold ~95% of an H100's FLOPs. Shapes, precisions, and why kernels are written around them.

The warp scheduler & latency hiding

How an SM hides 400-cycle memory latency by juggling dozens of warps — the reason occupancy matters.

The register file (RMEM)

The fastest and most contended memory on the chip. Register pressure, spills, and why 255 is a magic number.

Shared memory & L1 (SMEM)

The programmable scratchpad that shares its on-chip SRAM with L1. Up to 228 KiB per SM, 32 banks, and the single most important optimization surface.

The L2 cache & partitions (L2)

50 MiB on the H100 SXM5 (60 MiB on the full GH100 die), split across two partitions joined by a crossbar, with data compression, residency control, and near/far latency.

HBM, global memory & the package (HBM)

Stacked DRAM, through-silicon vias, the interposer, 3.35 TB/s — and why every kernel is ultimately a memory-traffic problem.

GPCs, TPCs & the chip floorplan (GPC)

How 132 SMs are grouped, why defective SMs get fused off, and what thread-block clusters buy you.

A100 → H100 → B200: what changed

TMA, wgmma, DSMEM, FP8, then tcgen05, TMEM and NVFP4 — the architectural deltas that each spawned new kernels.

The roofline model in practice (ROOFLINE)

Plotting your kernel on the roofline, reading the ridge point, and knowing when to stop optimizing.
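
As a rough illustration of reading the ridge point, the sketch below evaluates the standard roofline bound, min(peak compute, intensity × bandwidth). The 3.35 TB/s figure is the HBM bandwidth quoted earlier on this page; the 989 TFLOP/s peak (dense BF16 on an H100 SXM) and all function names are assumptions made for this example.

```python
# Roofline sketch. Both peaks are illustrative assumptions, not measurements.
PEAK_FLOPS = 989e12  # assumed dense BF16 tensor-core peak, FLOP/s
PEAK_BW = 3.35e12    # HBM bandwidth, bytes/s


def ridge_point(peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_flops / peak_bw


def attainable_flops(ai, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Upper bound on FLOP/s for a kernel with arithmetic intensity `ai`."""
    return min(peak_flops, ai * peak_bw)


def regime(ai, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Classify a kernel by which roof limits it."""
    if ai < ridge_point(peak_flops, peak_bw):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    print(f"ridge point: {ridge_point():.1f} FLOP/byte")
    for ai in (0.25, 10, 1000):
        bound = attainable_flops(ai) / 1e12
        print(f"AI={ai:>7}: at most {bound:.1f} TFLOP/s ({regime(ai)})")
```

Under these assumed peaks, a kernel that already sits close to its roof has little left to gain from further tuning; moving it to the right on the plot (raising intensity) is the only way up.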

Occupancy: the balancing act

Registers × threads × shared memory → how many warps fit, and why more occupancy is not always faster.
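
The balancing act can be sketched as a small calculator: each resource caps the number of resident blocks, and the tightest cap wins. The per-SM limits below (64 warps, 32 blocks, 65,536 registers, 228 KiB of shared memory) are assumed Hopper-class values, and the model deliberately ignores register and shared-memory allocation granularity, so treat the results as an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Assumed per-SM limits for a Hopper-class GPU (illustrative)."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    smem_bytes: int = 228 * 1024
    warp_size: int = 32


def resident_warps(threads_per_block, regs_per_thread, smem_per_block,
                   sm=SMLimits()):
    """Warps resident on one SM, ignoring allocation granularity."""
    warps_per_block = -(-threads_per_block // sm.warp_size)  # ceiling division
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * sm.warp_size)
    # A block that uses no shared memory is not limited by it.
    by_smem = sm.smem_bytes // smem_per_block if smem_per_block else sm.max_blocks
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block


def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              sm=SMLimits()):
    """Resident warps as a fraction of the SM's maximum."""
    warps = resident_warps(threads_per_block, regs_per_thread,
                           smem_per_block, sm)
    return warps / sm.max_warps


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        print(regs, "regs/thread ->", occupancy(256, regs, 0))
```

In this model, a 256-thread block at 32 registers per thread fills the SM, while the same block at 128 registers per thread drops to a quarter of the warps. Whether that is slower depends on what the extra registers buy.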

Arithmetic intensity (AI)

FLOPs per byte: the number that decides your regime and the target of every tiling trick.
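
To make the number concrete, here is a minimal sketch that computes arithmetic intensity for two textbook cases. It assumes ideal traffic, where every operand is read from or written to global memory exactly once; real kernels move more bytes, which is the gap that tiling tries to close. The function names are hypothetical.

```python
def gemm_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte for C = A @ B with ideal (touch-once) traffic."""
    flops = 2 * m * n * k                       # one multiply and one add per term
    traffic = (m * k + k * n + m * n) * bytes_per_elem
    return flops / traffic


def elementwise_intensity(flops_per_elem, reads, writes, bytes_per_elem):
    """FLOPs per byte for an elementwise op such as c[i] = a[i] + b[i]."""
    return flops_per_elem / ((reads + writes) * bytes_per_elem)


if __name__ == "__main__":
    # Square BF16 GEMM: intensity grows linearly with the matrix size.
    for size in (256, 1024, 4096):
        print(size, round(gemm_intensity(size, size, size, 2), 1))
    # FP32 vector add: two reads and one write per FLOP.
    print("vector add", round(elementwise_intensity(1, 2, 1, 4), 4))
```

For a square GEMM of size n in 2-byte elements the ideal intensity works out to n/3, while a vector add stays fixed at 1/12 FLOP per byte regardless of size.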

Memory coalescing

How a warp's 32 loads become one 128-byte transaction — or thirty-two — and what that does to your bandwidth.
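
The one-versus-thirty-two contrast can be reproduced with a simplified model that counts how many distinct 128-byte segments a warp's addresses touch. This is a sketch under stated assumptions: real hardware also tracks 32-byte sectors and cache behavior, and the helper names here are hypothetical.

```python
def transactions(addresses, access_bytes=4, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp's accesses."""
    segments = set()
    for addr in addresses:
        first = addr // segment_bytes
        last = (addr + access_bytes - 1) // segment_bytes
        segments.update(range(first, last + 1))
    return len(segments)


def warp_addresses(base, stride_elems, elem_bytes=4, warp_size=32):
    """Byte address accessed by each lane for a strided pattern."""
    return [base + lane * stride_elems * elem_bytes for lane in range(warp_size)]


if __name__ == "__main__":
    print("contiguous, aligned:", transactions(warp_addresses(0, 1)))
    print("contiguous, offset by 4 bytes:", transactions(warp_addresses(4, 1)))
    print("stride of 32 floats:", transactions(warp_addresses(0, 32)))
```

In this model the aligned contiguous case needs one segment, a 4-byte misalignment needs two, and a stride of 32 floats needs thirty-two, so the strided warp moves 32 times the bytes for the same useful data.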