Vizuara's Kernel Engineering Workshop

Eight live foundational lectures and six deep-dive workshops on modern topics, from the three performance regimes to DeepSeek's DSpark and AI-generated kernels. Enrolled students get the complete 72-chapter book, the GPU-Puzzles track and quizzes, the guided projects, worklog assignments, and the "You vs the machine" capstone.

8 foundational live lectures: 2/week · 3 hours each · 4 weeks

How fast can this go?

The three regimes, the roofline, and a top-down tour of the silicon. Live: predict-then-measure PyTorch ops.

The CUDA programming model

Grids, warps, SIMT, and the nvcc→PTX→SASS story. Live: your first kernels + GPU Puzzles.

The memory hierarchy in anger

Coalescing, bank conflicts, occupancy. Live: the matrix-transpose ladder under Nsight Compute.

GEMM worklog I

Kernels 1–4: naive (1.3% of cuBLAS throughput) to 1D block-tiling (36.5%). Hypothesis → profile → number, every step.

GEMM worklog II

Kernels 5–10: 2D tiling, float4, autotuning, warptiling (93.7% of cuBLAS throughput). Live: the SASS '8 loads → 2 loads' moment.

Tensor cores, the second worklog

mma.sync, fragments, swizzling, the precision menu. Live: a WMMA GEMM beating our best SIMT kernel.

Profiling & debugging like a pro

Nsight Compute deep-read + the vLLM workflow: sanitizer, core dumps, cuda-gdb. Live: 3 sabotaged kernels.

Attention: the kernel that ate the world

Online softmax, FlashAttention v1 built live, why decode is GEMV. Capstone kickoff.

6 deep-dive workshops: modern kernel inference topics

FlashAttention from scratch

Full forward pass, online-softmax rescaling, causal masking; FA2/FA3 ideas.

Beating cuBLAS on an H100

TMA + WGMMA + warp specialization, assembled into a library-beating GEMM.

Triton → CUTLASS → TileLang

The abstraction ladder: Triton in 40 lines, then CUTLASS the hard way.

Inference-serving kernels

Prefill vs decode, PagedAttention, fusion, and quantized (FP8/W4A16) kernels.

Blackwell & NVFP4

tcgen05, Tensor Memory, CTA pairs, and the 2000µs→22µs FP4 GEMV journey.

DeepSeek, DSpark & AI-written kernels

FlashMLA/DeepGEMM, speculative decoding, KernelBench and the human+AI+profiler loop.

Dates & pricing announced soon

Questions? team@vizuara.com