Interactive
Practice, not just reading. Test yourself against the core ideas, then work the guided GPU-Puzzles track. More hands-on kernel challenges are landing here as the workshop grows.
The GPU-Puzzles track
Fourteen tiny puzzles that build the kernel-writer's instincts — map, zip, guards, broadcasting, shared memory, pooling, dot product, convolution, prefix sum and a first matmul. Do them in the browser, then read our walkthroughs.
Quiz yourself
Twelve questions spanning the regimes, the GEMM ladder, inference kernels and the frontier. Pick an answer to see the explanation.
Q1
A kernel achieves 4% of the GPU's peak FLOP/s and near-peak HBM bandwidth. What is it?
Near-peak bandwidth with tiny FLOP utilisation is the signature of a memory-bound kernel — adding faster math won't help.
Q2
The H100's tensor cores do ~989 TFLOP/s bf16 and HBM3 gives ~3.35 TB/s. Roughly what arithmetic intensity must a kernel exceed to be compute-bound?
989e12 / 3.35e12 ≈ 295 FLOPs per byte — the ridge point. Below it you're memory-bound no matter what.
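The ridge-point arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the roofline test, using only the two peak figures quoted in the question; the function names `ridge_point` and `regime` are made up for illustration.

```python
# Roofline ridge point: the arithmetic intensity (FLOPs per byte) at which
# time spent on math equals time spent moving data from HBM.
H100_BF16_FLOPS = 989e12   # peak tensor-core throughput quoted in Q2, FLOP/s
H100_HBM3_BW = 3.35e12     # peak HBM3 bandwidth quoted in Q2, bytes/s


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Return the intensity above which a kernel can be compute-bound."""
    return peak_flops / peak_bandwidth


def regime(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Classify a kernel by comparing its intensity with the ridge point."""
    if intensity > ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound"
    return "memory-bound"


if __name__ == "__main__":
    print(round(ridge_point(H100_BF16_FLOPS, H100_HBM3_BW)))   # about 295
    print(regime(10.0, H100_BF16_FLOPS, H100_HBM3_BW))
```

A kernel doing 10 FLOPs per byte sits far to the left of the ridge, so this simple model classifies it as memory-bound.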
Q3
In the GEMM ladder, the jump from naive (1.3%) to 8.5% of cuBLAS comes from…
Kernel 2 is a one-line thread-index remap so a warp reads contiguous columns — coalesced access, ~6× for free.
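To see why the remap matters, here is a simplified Python model (not the kernel itself) that counts how many 128-byte memory segments one warp touches. It assumes a row-major float32 matrix, a 32-lane warp and 128-byte segments; the helper names and the 4096-column width are illustrative assumptions.

```python
# Simplified coalescing model: a warp's loads are served in aligned
# 128-byte segments, so fewer distinct segments means fewer transactions.
WARP_SIZE = 32
ELEM_BYTES = 4        # float32
SEGMENT_BYTES = 128   # assumed transaction granularity


def warp_addresses(stride_elems: int) -> list:
    """Byte address read by each lane when lanes are stride_elems apart."""
    return [lane * stride_elems * ELEM_BYTES for lane in range(WARP_SIZE)]


def segments_touched(addresses: list) -> int:
    """Number of distinct 128-byte segments covered by the addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


if __name__ == "__main__":
    n_cols = 4096  # hypothetical matrix width
    # Neighbouring lanes on neighbouring rows: one segment per lane.
    print(segments_touched(warp_addresses(n_cols)))
    # Neighbouring lanes on neighbouring columns: one segment for the warp.
    print(segments_touched(warp_addresses(1)))
```

The model gives 32 segments for the strided mapping and 1 for the contiguous one. The measured gain in the ladder is smaller than 32× because caches and other bottlenecks are not modelled here.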
Q4
Why must GEMM block over the K dimension when using shared memory?
Shared memory is only ~228 KiB per SM, so you can't stage whole K-length rows and columns; you tile K and accumulate partial products across tiles.
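The tile-and-accumulate pattern can be sketched in plain Python. This is an illustration of the loop structure only, not a GPU kernel: the slices `a_tile` and `b_tile` stand in for the chunks a thread block would stage in shared memory, and `tiled_matmul` is a made-up name.

```python
def tiled_matmul(a, b, tile_k):
    """Multiply a (M x K) by b (K x N), walking K in chunks of tile_k."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, tile_k):
        k1 = min(k0 + tile_k, k)
        # Stage one K-slice of each operand (the shared-memory stand-in).
        a_tile = [row[k0:k1] for row in a]
        b_tile = b[k0:k1]
        # Accumulate this slice's partial products into the running result.
        for i in range(m):
            for j in range(n):
                c[i][j] += sum(
                    a_tile[i][t] * b_tile[t][j] for t in range(k1 - k0)
                )
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b, tile_k=2))
```

Only `tile_k` columns of `a` and `tile_k` rows of `b` are live at any moment, which is what keeps the staged data within a fixed budget regardless of K.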
Q5
A float4 / LDS.128 load moves 128 bits per instruction. Its main benefit over four scalar loads is…
Same bytes, a quarter of the instructions — a win precisely when you're issue-bound, as in kernel 6.
Q6
On an H100, shared memory has 32 banks. A bank conflict happens when…
Multiple lanes addressing different words in the same bank serialise; padding (+1) or swizzling fixes it.
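A small Python model makes the padding fix concrete. It assumes 32 banks of 4-byte words, with word `w` living in bank `w % 32`; `conflict_degree` and `column_access` are illustrative helper names, and the tile shapes are examples rather than anything from a specific kernel.

```python
from collections import defaultdict

NUM_BANKS = 32
WARP_SIZE = 32


def conflict_degree(word_indices):
    """Worst-case number of distinct words requested from a single bank.

    1 means conflict-free; lanes reading the same word count once,
    because that case is served as a broadcast.
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


def column_access(row_stride, col=0):
    """Word index read by each lane when lane i reads element [i][col]."""
    return [lane * row_stride + col for lane in range(WARP_SIZE)]


if __name__ == "__main__":
    print(conflict_degree(column_access(32)))  # 32-wide tile
    print(conflict_degree(column_access(33)))  # same tile padded by +1
```

Reading a column of a 32-word-wide tile lands every lane in the same bank (degree 32), while padding each row to 33 words spreads the lanes across all 32 banks (degree 1).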
Q7
Why is LLM decode (one token at a time) usually memory-bound?
Decode is a skinny mat-vec dominated by reading the KV cache from HBM — the opposite regime from prefill.
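A back-of-the-envelope estimate shows how low the intensity of a mat-vec is. The sketch below assumes a bf16 matrix (standing in for weights or cached keys and values) that is streamed from HBM once and multiplied by `batch` vectors; the function name and the 4096 × 4096 shape are assumptions for illustration.

```python
def matvec_intensity(rows, cols, batch=1, bytes_per_elem=2):
    """Approximate FLOPs per byte for a (rows x cols) matrix times batch vectors."""
    flops = 2 * rows * cols * batch                        # multiply + add
    matrix_bytes = rows * cols * bytes_per_elem            # read once from HBM
    vector_bytes = batch * (rows + cols) * bytes_per_elem  # inputs and outputs
    return flops / (matrix_bytes + vector_bytes)


if __name__ == "__main__":
    print(matvec_intensity(4096, 4096, batch=1))    # decode-like: one token
    print(matvec_intensity(4096, 4096, batch=512))  # prefill-like: many tokens
```

At batch 1 the estimate is about 1 FLOP per byte, hundreds of times below the ~295 ridge point from Q2. Batching many tokens against the same matrix is what pushes the ratio past the ridge.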
Q8
FlashAttention's core trick is…
Tiling + online-softmax rescaling fuses attention into one kernel; the N×N score matrix never reaches HBM, so extra memory scales like N, not N².
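The rescaling step is easier to trust once you see it match the ordinary softmax. This is a minimal single-query sketch in Python with scalar values instead of d-dimensional vectors; it is not FlashAttention itself, and both function names are made up for illustration.

```python
import math


def naive_softmax_weighted_sum(scores, values):
    """Reference: softmax over all scores at once, then a weighted sum."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    return sum(pi * vi for pi, vi in zip(p, values)) / sum(p)


def online_softmax_weighted_sum(scores, values, block):
    """Same result, but visiting the scores one block at a time."""
    running_max = -math.inf
    denom = 0.0   # running sum of exp(score - running_max)
    acc = 0.0     # running weighted sum, on the same scale as denom
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        new_max = max(running_max, max(s))
        # Rescale earlier partial sums to the new maximum (0.0 on first block).
        scale = math.exp(running_max - new_max)
        p = [math.exp(x - new_max) for x in s]
        denom = denom * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v))
        running_max = new_max
    return acc / denom


if __name__ == "__main__":
    scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(naive_softmax_weighted_sum(scores, values))
    print(online_softmax_weighted_sum(scores, values, block=2))
```

Only the running maximum, the denominator and the accumulator persist between blocks, so the full row of scores never has to be stored.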
Q9
What is new in Hopper (sm_90a) that Ampere lacks?
Hopper adds the Tensor Memory Accelerator, warpgroup MMA, distributed shared memory and clusters.
Q10
DeepSeek-V4-Pro-DSpark is best described as…
DSpark is a speculative-decoding module bolted onto V4-Pro — draft tokens verified in parallel; a kernels problem.
Q11
In Stanford CRFM's AI-generated-kernel experiments, results were…
Branching search shone on less-tuned FP32 ops but still trailed badly on hard ones like FlashAttention.
Q12
Which is the right first move when a kernel hangs and Ctrl-C does nothing?
The vLLM workflow: CUDA_ENABLE_USER_TRIGGERED_COREDUMP via a named pipe, then cuda-gdb on the dump.
