03 The GEMM Worklog
The heart of the course. We rebuild matrix multiply from a 1.3%-of-cuBLAS naive kernel to a 94% warptiled monster — one optimization, one measurement, one figure at a time. Then we do it again on tensor cores.
Kernel 1: Naive (1.3% of cuBLAS)
One thread per output element. The baseline, the profile, and why it leaves almost 99% of cuBLAS's throughput on the table.
Kernel 2: Global memory coalescing (8.5% of cuBLAS)
A one-line remap of thread indices that turns 32 scattered loads into one — a 6× win for free.
Kernel 3: Shared-memory tiling (12.8% of cuBLAS)
Blocking over K, staging tiles in SMEM, and cutting global-memory traffic by a factor of the tile width.
Kernel 4: 1D block-tiling (36.5% of cuBLAS)
One thread computes many outputs. Register caching pushes arithmetic intensity past the memory wall.
Kernel 5: 2D block-tiling (68.7% of cuBLAS)
A 2D grid of results per thread. Reusing inputs in registers nearly doubles throughput again.
Kernel 6: Vectorized memory access (78.4% of cuBLAS)
float4 / LDS.128 loads. The SASS moment where 8 instructions collapse into 2.
Kernel 7: Autotuning the tiles (84.8% of cuBLAS)
BM, BN, BK, TM, TN — searching the parameter space instead of guessing it.
Kernel 8: Warptiling (93.7% of cuBLAS)
Making every level of parallelism explicit — block, warp, thread — to reach cuBLAS territory.
Double buffering & cp.async
Overlapping the next tile's load with this tile's math to hide the memory latency entirely.
What cuBLAS is actually doing
The black box we've been chasing: heuristics, split-K, and why the last 6% is so hard.
Benchmarking without lying to yourself
Warmup, clocks, L2 flushing, and reporting TFLOP/s honestly — the discipline that makes the whole ladder trustworthy.
The ladder, end to end (Recap)
All ten kernels on one roofline, the speedups stacked, and the general principles that transfer to any kernel.
Tensor cores I: the WMMA GEMM (TC)
Rebuilding matmul on tensor cores from scratch (after alexarmbr): fragments, the mma shape, and a first working kernel.
Tensor cores II: fragments & swizzling (TC)
Register-fragment layouts, SMEM swizzling to kill bank conflicts on the tensor-core path, and the profile that proves it.
Tensor cores III: to cuBLAS speed (TC)
mma.sync, double-buffered pipelines, and the precision menu: a tensor-core GEMM that rivals the library.
