05 The Frontier

The cutting edge, as of now. Hopper's async engine, Blackwell's tensor memory and NVFP4, DeepSeek's open kernels, CUTLASS the hard way, and how to debug when it all breaks.

Hopper's TMA: async bulk copy [TMA]

The Tensor Memory Accelerator: hardware-managed multidimensional copies that free the threads to compute.

WGMMA & warp specialization [WGMMA]

Warpgroup matmul and the producer/consumer pattern that keeps the tensor cores saturated on Hopper.

Beating cuBLAS on an H100 [WORKLOG]

A full worklog (after hamzaelshafie & cudaforfun) that assembles TMA + wgmma + warp specialization into a library-beating GEMM.

Blackwell: tcgen05 & tensor memory [B200]

The 5th-gen tensor cores, the new Tensor Memory (TMEM), and CTA pairs — a different machine to write for.

NVFP4 & microscaling formats [NVFP4]

e2m1 with FP8 block scales, and the hackathon journey from a 2000 µs FP4 GEMV to a 22 µs one.

DeepSeek's open kernels: FlashMLA & DeepGEMM

MLA decode, FP8 MoE GEMMs and the DeepGEMM backend — production kernels behind a frontier model.

DSpark: speculative decoding as a kernel problem [DSPARK]

V4-Pro's speculative-decoding module — draft passes, parallel verification, acceptance kernels, and CSA/HCA sparse attention.

CUTLASS the hard way [CUTLASS]

From a naive GEMM to a real CUTLASS kernel (after kapilsh) — reading library-grade warptiling with our own vocabulary.

CuTe, the DSL landscape & Triton

Layouts and tensors in CuTe, Triton in 40 lines, TileLang — the abstractions above raw CUDA and when to use them.

Debugging kernels: the vLLM workflow [DEBUG]

compute-sanitizer, user-triggered core dumps, cuda-gdb, -lineinfo and nvdisasm — finding the illegal access at 3am.