Vizuara Kernel Engineering

03 The GEMM Worklog

The heart of the course. We rebuild matrix multiply from a 1.3%-of-cuBLAS naive kernel to a 94% warptiled monster — one optimization, one measurement, one figure at a time. Then we do it again on tensor cores.

Kernel 1: Naive (1.3%)
One thread per output element. The baseline, the profile, and why it leaves 98% of the GPU on the table.
Kernel 2: Global memory coalescing (8.5%)
A one-line remap of thread indices that turns 32 scattered loads into one — a 6× win for free.
Kernel 3: Shared-memory tiling (12.8%)
Blocking over K, staging tiles in SMEM, and cutting global-memory traffic by the tile width.
Kernel 4: 1D block-tiling (36.5%)
One thread computes many outputs. Register caching pushes arithmetic intensity past the memory wall.
Kernel 5: 2D block-tiling (68.7%)
A 2D grid of results per thread. Input reuse in registers doubles us again.
Kernel 6: Vectorized memory access (78.4%)
float4 / LDS.128 loads. The SASS moment where 8 instructions collapse into 2.
Kernel 7: Autotuning the tiles (84.8%)
BM, BN, BK, TM, TN — searching the parameter space instead of guessing it.
Kernel 8: Warptiling (93.7%)
Making every level of parallelism explicit — block, warp, thread — to reach cuBLAS territory.
Double buffering & cp.async
Overlapping the next tile's load with this tile's math to hide the memory latency entirely.
What cuBLAS is actually doing
The black box we've been chasing: heuristics, split-K, and why the last 6% is so hard.
Benchmarking without lying to yourself
Warmup, clocks, L2 flushing, and reporting TFLOP/s honestly — the discipline that makes the whole ladder trustworthy.
The ladder, end to end [RECAP]
All ten kernels on one roofline, the speedups stacked, and the general principles that transfer to any kernel.
Tensor cores I: the WMMA GEMM [TC]
Rebuilding matmul on tensor cores from scratch (after alexarmbr): fragments, the mma shape, and a first working kernel.
Tensor cores II: fragments & swizzling [TC]
Register-fragment layouts, SMEM swizzling to kill bank conflicts on the tensor-core path, and the profile that proves it.
Tensor cores III: to cuBLAS speed [TC]
mma.sync, double-buffered pipelines and the precision menu — a tensor-core GEMM that rivals the library.
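
The percentages next to each kernel in the ladder are throughput relative to cuBLAS on the same problem. As a minimal sketch of that arithmetic, assuming a GEMM of shape M×N×K costs 2·M·N·K floating-point operations (one multiply and one add per inner-product term), the conversion from a measured time to TFLOP/s and to a percentage of a reference might look like the following. The function names are hypothetical and are not taken from the course's benchmark harness; the timings are assumed to be mean per-launch times in milliseconds, measured elsewhere after warmup.

```cpp
#include <cassert>
#include <cstdint>

// FLOPs for one C = A * B with A of shape (M x K) and B of shape (K x N):
// each of the M*N outputs needs K multiplies and K adds.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved throughput in TFLOP/s, given the mean time of one launch in ms.
// (Hypothetical helper; the timing source is an assumption of this sketch.)
double tflops(std::int64_t m, std::int64_t n, std::int64_t k, double mean_ms) {
    assert(mean_ms > 0.0);
    const double seconds = mean_ms * 1e-3;
    return gemm_flops(m, n, k) / seconds / 1e12;
}

// Throughput of a kernel as a percentage of a reference (for example cuBLAS)
// on the same problem size. Because the FLOP count is identical, the ratio of
// throughputs reduces to the inverse ratio of times.
double percent_of_reference(double kernel_ms, double reference_ms) {
    assert(kernel_ms > 0.0 && reference_ms > 0.0);
    return 100.0 * reference_ms / kernel_ms;
}
```

For instance, a kernel that takes ten times as long as the reference on the same shapes would report 10% under this definition. The comparison only holds if both timings come from the same problem size and the same measurement procedure.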