03 The GEMM Worklog

The heart of the course. We rebuild matrix multiply from a 1.3%-of-cuBLAS naive kernel to a 94% warptiled monster — one optimization, one measurement, one figure at a time. Then we do it again on tensor cores.

Kernel 1: Naive (1.3% of cuBLAS)

One thread per output element. The baseline, the profile, and why it leaves almost 99% of cuBLAS's throughput on the table.

Kernel 2: Global memory coalescing (8.5% of cuBLAS)

A one-line remap of thread indices that turns 32 scattered loads into one: a roughly 6.5× win for free.
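To make the "32 loads into one" claim concrete, here is a small host-side sketch in Python rather than a GPU kernel. It assumes a deliberately simplified model in which a warp's load is served in aligned 128-byte segments. The matrix width `N` and the helper `segments_touched` are hypothetical names chosen for illustration, and real transaction granularity depends on the architecture.

```python
# Simplified model (assumption): a warp's load is served in aligned
# 128-byte segments. Real hardware granularity varies by architecture.
SEGMENT_BYTES = 128
WARP_SIZE = 32
FLOAT_BYTES = 4


def segments_touched(element_indices):
    """Count the distinct 128-byte segments covering the given float32 indices."""
    return len({(i * FLOAT_BYTES) // SEGMENT_BYTES for i in element_indices})


N = 4096  # hypothetical row length of a row-major float32 matrix

# Consecutive lanes walk down a column: neighbours are N elements apart.
strided = [lane * N for lane in range(WARP_SIZE)]

# Consecutive lanes walk along a row: neighbours are adjacent in memory.
contiguous = [lane for lane in range(WARP_SIZE)]

strided_segments = segments_touched(strided)
contiguous_segments = segments_touched(contiguous)
print(strided_segments, contiguous_segments)
```

Under this model the strided mapping needs one segment per lane, while the contiguous mapping fits the whole warp into a single segment, which is the effect the index remap is after.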

Kernel 3: Shared-memory tiling (12.8% of cuBLAS)

Blocking over K, staging tiles in SMEM, and cutting global-memory traffic by a factor of the tile width.
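A rough count shows where that factor comes from. This is a sketch under stated assumptions: square T×T tiles, an M×K matrix times a K×N matrix, each output element reading its full row and column from global memory in the untiled case, and no help from caches. The symbols M, N, K and T are introduced here for illustration.

$$
\text{loads}_{\text{untiled}} = 2MNK,
\qquad
\text{loads}_{\text{tiled}} = \frac{M}{T}\cdot\frac{N}{T}\cdot\frac{K}{T}\cdot 2T^{2} = \frac{2MNK}{T}
$$

Each of the (M/T)(N/T) blocks takes K/T steps along K and loads one T×T tile of each input per step, so the element-load count drops by a factor of T.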

Kernel 4: 1D block-tiling (36.5% of cuBLAS)

One thread computes many outputs. Register caching pushes arithmetic intensity past the memory wall.

Kernel 5: 2D block-tiling (68.7% of cuBLAS)

A 2D grid of results per thread. Input reuse in registers nearly doubles throughput again.

Kernel 6: Vectorized memory access (78.4% of cuBLAS)

float4 / LDS.128 loads. The SASS moment where 8 instructions collapse into 2.

Kernel 7: Autotuning the tiles (84.8% of cuBLAS)

BM, BN, BK, TM, TN — searching the parameter space instead of guessing it.

Kernel 8: Warptiling (93.7% of cuBLAS)

Making every level of parallelism explicit — block, warp, thread — to reach cuBLAS territory.

Double buffering & cp.async

Overlapping the next tile's load with this tile's math to hide the memory latency entirely.

What cuBLAS is actually doing

The black box we've been chasing: heuristics, split-K, and why the last 6% is so hard.

Benchmarking without lying to yourself

Warmup, clocks, L2 flushing, and reporting TFLOP/s honestly — the discipline that makes the whole ladder trustworthy.
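As a small illustration of the reporting step, the sketch below turns a list of kernel timings into a TFLOP/s figure using the standard 2·M·N·K operation count for a GEMM. The timings are made-up numbers, not measurements from the course, and the helper names `gemm_tflops` and `steady_state_seconds` are hypothetical. Discarding the first runs as warmup and taking the median is one reasonable convention, not necessarily the one the chapter adopts.

```python
from statistics import median


def gemm_tflops(m, n, k, seconds):
    """TFLOP/s for an (m x k) @ (k x n) GEMM, counting 2*m*n*k operations."""
    return 2 * m * n * k / seconds / 1e12


def steady_state_seconds(timings_s, warmup=2):
    """Drop the first `warmup` runs, then report the median of the rest."""
    return median(timings_s[warmup:])


# Made-up timings in seconds; the first two runs stand in for warmup.
timings = [0.030, 0.014, 0.010, 0.011, 0.010]

typical = steady_state_seconds(timings)
tflops = gemm_tflops(4096, 4096, 4096, typical)
print(f"{typical * 1e3:.1f} ms -> {tflops:.2f} TFLOP/s")
```

Reporting the median of the steady-state runs, rather than the single best run, keeps one lucky sample from inflating the headline number.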

The ladder, end to end (recap)

All ten kernels on one roofline, the speedups stacked, and the general principles that transfer to any kernel.

Tensor cores I: the WMMA GEMM (TC)

Rebuilding matmul on tensor cores from scratch (after alexarmbr): fragments, the mma shape, and a first working kernel.

Tensor cores II: fragments & swizzling (TC)

Register-fragment layouts, SMEM swizzling to kill bank conflicts on the tensor-core path, and the profile that proves it.

Tensor cores III: to cuBLAS speed (TC)

mma.sync, double-buffered pipelines and the precision menu — a tensor-core GEMM that rivals the library.