The Kernel Engineering Book

The complete, free knowledge base behind Vizuara's Kernel Engineering — 72 illustrated worklog chapters across seven parts, each written in the hypothesis → measure → figure rhythm and cross-linked to the chapters it needs. Start anywhere.

How to read this book →

Start Here

5 chapters ›

The mental models. Why kernels decide who wins, how to think in speed-of-light terms, and the exact skill map a GPU kernel engineer is hired for.

Why kernels run the world
The three regimes: compute, memory, overhead [BRRR]
Speed-of-light thinking & the roofline [ROOFLINE]
The kernel engineer's skill map [CAREER]
How to use this site

The GPU, From Silicon Up

14 chapters ›

A tour of the H100/B200 from the die down to the wire. Every component, what it costs, and why it exists — the vocabulary the rest of the site assumes.

The Streaming Multiprocessor [SM]
CUDA cores & the FP32/INT pipes
Tensor cores [TC]
The warp scheduler & latency hiding
The register file [RMEM]
Shared memory & L1 [SMEM]
The L2 cache & partitions [L2]
HBM, global memory & the package [HBM]
GPCs, TPCs & the chip floorplan [GPC]
A100 → H100 → B200: what changed
The roofline model in practice [ROOFLINE]
Occupancy: the balancing act
Arithmetic intensity [AI]
Memory coalescing

The CUDA Programming Model

12 chapters ›

From the abstract launch to the metal. Threads to grids, the compilation story, and the primitives you build every kernel out of.

Threads, warps, blocks, grids
SIMT & warp divergence
Anatomy of a kernel launch
The memory spaces
PTX vs SASS: the compilation story [PTX]
Compute capability & targeting
Shared-memory bank conflicts [SMEM]
Atomics & reductions
Streams, events & async
Your first kernel, end to end
GPU Puzzles, walkthrough I [PRACTICE]
GPU Puzzles, walkthrough II [PRACTICE]

The GEMM Worklog

15 chapters ›

The heart of the course. We rebuild matrix multiply from a 1.3%-of-cuBLAS naive kernel to a 94% warptiled monster — one optimization, one measurement, one figure at a time. Then we do it again on tensor cores.

Kernel 1: Naive [1.3%]
Kernel 2: Global memory coalescing [8.5%]
Kernel 3: Shared-memory tiling [12.8%]
Kernel 4: 1D block-tiling [36.5%]
Kernel 5: 2D block-tiling [68.7%]
Kernel 6: Vectorized memory access [78.4%]
Kernel 7: Autotuning the tiles [84.8%]
Kernel 8: Warptiling [93.7%]
Double buffering & cp.async
What cuBLAS is actually doing
Benchmarking without lying to yourself
The ladder, end to end [RECAP]
Tensor cores I: the WMMA GEMM [TC]
Tensor cores II: fragments & swizzling [TC]
Tensor cores III: to cuBLAS speed [TC]

Kernels for Inference

12 chapters ›

Where the GEMM skills meet real LLMs. Fusion, softmax, attention, FlashAttention, the KV cache and the quantized kernels that serve tokens at scale.

Prefill vs decode: two different machines
Operator fusion
Softmax from scratch (and online)
RMSNorm & LayerNorm kernels
Attention, the naive way
FlashAttention I: tiling attention [FA]
FlashAttention II: better work partitioning [FA2]
FlashAttention III: Hopper & async [FA3]
The KV cache & PagedAttention
Quantization kernels: FP8, INT4, W4A16 [FP8]
The SwiGLU kernel
Batched decode: the GEMV problem

The Frontier

10 chapters ›

The cutting edge, as of now. Hopper's async engine, Blackwell's tensor memory and NVFP4, DeepSeek's open kernels, CUTLASS the hard way, and how to debug when it all breaks.

Hopper's TMA: async bulk copy [TMA]
WGMMA & warp specialization [WGMMA]
Beating cuBLAS on an H100 [WORKLOG]
Blackwell: tcgen05 & tensor memory [B200]
NVFP4 & microscaling formats [NVFP4]
DeepSeek's open kernels: FlashMLA & DeepGEMM
DSpark: speculative decoding as a kernel problem [DSPARK]
CUTLASS the hard way [CUTLASS]
CuTe, the DSL landscape & Triton
Debugging kernels: the vLLM workflow [DEBUG]

AI × Kernels

4 chapters ›

The newest frontier: can models write the kernels? KernelBench, test-time search, RL, and the honest picture of where AI-generated kernels win and where they still fail.

KernelBench & measuring AI kernels
Monkeys & search: test-time scaling
The CRFM experiments
Kevin, RL & what's still human