05 The Frontier
The cutting edge, as of now. Hopper's async engine, Blackwell's tensor memory and NVFP4, DeepSeek's open kernels, CUTLASS the hard way, and how to debug when it all breaks.
Hopper's TMA: async bulk copy [TMA]
The Tensor Memory Accelerator: hardware-managed multidimensional copies that free the threads to compute.
WGMMA & warp specialization [WGMMA]
Warpgroup matmul and the producer/consumer pattern that keeps the tensor cores saturated on Hopper.
Beating cuBLAS on an H100 [WORKLOG]
A full worklog (after hamzaelshafie & cudaforfun) that assembles TMA + wgmma + warp specialization into a library-beating GEMM.
Blackwell: tcgen05 & tensor memory [B200]
The 5th-gen tensor cores, the new Tensor Memory (TMEM), and CTA pairs — a different machine to write for.
NVFP4 & microscaling formats [NVFP4]
e2m1 with FP8 block scales, and the hackathon journey from a 2000µs to a 22µs FP4 GEMV.
DeepSeek's open kernels: FlashMLA & DeepGEMM
MLA decode, FP8 MoE GEMMs and the DeepGEMM backend — production kernels behind a frontier model.
DSpark: speculative decoding as a kernel problem [DSPARK]
V4-Pro's speculative-decoding module — draft passes, parallel verification, acceptance kernels, and CSA/HCA sparse attention.
CUTLASS the hard way [CUTLASS]
From a naive GEMM to a real CUTLASS kernel (after kapilsh) — reading library-grade warptiling with our own vocabulary.
CuTe, the DSL landscape & Triton
Layouts and tensors in CuTe, Triton in 40 lines, TileLang — the abstractions above raw CUDA and when to use them.
Debugging kernels: the vLLM workflow [DEBUG]
compute-sanitizer, user-triggered core dumps, cuda-gdb, -lineinfo and nvdisasm — finding the illegal access at 3am.
