Vizuara's Kernel Engineering Workshop
Eight live foundational lectures and six deep-dive workshops on modern topics — from the three performance regimes to DeepSeek's DSpark and AI-generated kernels. Enrolled students get the complete 72-chapter book, the GPU-Puzzles track and quizzes, the guided projects, worklog assignments, and the "You vs the machine" capstone.
8 foundational live lectures · 2/week · 3 hours each · 4 weeks
L1
How fast can this go?
The three regimes, the roofline, and a top-down tour of the silicon. Live: predict-then-measure PyTorch ops.
L2
The CUDA programming model
Grids, warps, SIMT, and the nvcc→PTX→SASS story. Live: your first kernels + GPU Puzzles.
L3
The memory hierarchy in anger
Coalescing, bank conflicts, occupancy. Live: the matrix-transpose ladder under Nsight Compute.
L4
GEMM worklog I
Kernels 1–4: naive (1.3% of cuBLAS) to 1D block-tiling (36.5%). Hypothesis → profile → number, every step.
L5
GEMM worklog II
Kernels 5–10: 2D tiling, float4, autotuning, warptiling (93.7%). Live: the SASS '8 loads → 2 loads' moment.
L6
Tensor cores, the second worklog
mma.sync, fragments, swizzling, the precision menu. Live: a WMMA GEMM beating our best SIMT kernel.
L7
Profiling & debugging like a pro
Nsight Compute deep-read + the vLLM workflow: sanitizer, core dumps, cuda-gdb. Live: 3 sabotaged kernels.
L8
Attention: the kernel that ate the world
Online softmax, FlashAttention v1 built live, why decode is GEMV. Capstone kickoff.
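For a taste of the online softmax named in L8, here is a minimal plain-Python sketch, not the workshop's own code. It computes a softmax-weighted sum for one query in blocks, rescaling the running sum and running output whenever a larger maximum appears. The function names, the scalar values, and the default block size are illustrative assumptions, and the scores are assumed to be finite.

```python
import math


def softmax_weighted_sum_reference(scores, values):
    """Two-pass reference: find the global max first, then normalize."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def softmax_weighted_sum_online(scores, values, block_size=4):
    """Streaming version: one pass over blocks, no global max known up front.

    Hypothetical helper for illustration; assumes finite scores.
    """
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # sum of exp(score - running_max)
    running_out = 0.0            # sum of exp(score - running_max) * value

    for start in range(0, len(scores), block_size):
        s_block = scores[start:start + block_size]
        v_block = values[start:start + block_size]

        new_max = max(running_max, max(s_block))
        # Everything accumulated so far was expressed relative to the old
        # max, so rescale it to the new max before adding this block.
        scale = math.exp(running_max - new_max)
        p_block = [math.exp(s - new_max) for s in s_block]

        running_sum = running_sum * scale + sum(p_block)
        running_out = running_out * scale + sum(
            p * v for p, v in zip(p_block, v_block)
        )
        running_max = new_max

    return running_out / running_sum
```

Both functions should agree up to floating-point rounding. The streaming one never needs all scores at once, which is the property a FlashAttention-style kernel relies on when it walks over key/value tiles.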
6 deep-dive workshops · modern kernel inference topics
W1
FlashAttention from scratch
Full forward pass, online-softmax rescaling, causal masking; FA2/FA3 ideas.
W2
Beating cuBLAS on an H100
TMA + WGMMA + warp specialization, assembled into a library-beating GEMM.
W3
Triton → CUTLASS → TileLang
The abstraction ladder: Triton in 40 lines, then CUTLASS the hard way.
W4
Inference-serving kernels
Prefill vs decode, PagedAttention, fusion, and quantized (FP8/W4A16) kernels.
W5
Blackwell & NVFP4
tcgen05, Tensor Memory, CTA pairs, and the 2000µs→22µs FP4 GEMV journey.
W6
DeepSeek, DSpark & AI-written kernels
FlashMLA/DeepGEMM, speculative decoding, KernelBench and the human+AI+profiler loop.
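The roofline from L1 and the prefill-vs-decode split in W4 come down to a few lines of arithmetic. The sketch below is a rough model, not workshop material: it assumes each matrix is moved through memory exactly once, uses 2-byte elements, and plugs in made-up round hardware numbers (100 TFLOP/s peak compute, 1 TB/s bandwidth) that do not describe any specific GPU. The function names are hypothetical.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Idealized assumption: A, B, and C each cross the memory bus once.
    """
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_attainable(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s: capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * peak_bandwidth)


# Hypothetical round numbers for illustration only.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s

# Decode: one token against a 4096 x 4096 weight matrix, effectively a GEMV.
decode_intensity = gemm_arithmetic_intensity(1, 4096, 4096)
# Prefill: 4096 tokens against the same weights, a full GEMM.
prefill_intensity = gemm_arithmetic_intensity(4096, 4096, 4096)

decode_rate = roofline_attainable(decode_intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
prefill_rate = roofline_attainable(prefill_intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumptions the decode case lands at roughly 1 FLOP per byte, far below the 100 FLOP/byte ridge point of the made-up machine, so it is bandwidth-bound. The prefill case sits well above the ridge and is capped by compute instead.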
