Interactive

Practice, not just reading. Test yourself against the core ideas, then work the guided GPU-Puzzles track. More hands-on kernel challenges are landing here as the workshop grows.

The GPU-Puzzles track

Fourteen tiny puzzles that build the kernel-writer's instincts — map, zip, guards, broadcasting, shared memory, pooling, dot product, convolution, prefix sum and a first matmul. Do them in the browser, then read our walkthroughs.

Open GPU-Puzzles ↗ Walkthrough I → Walkthrough II →

Quiz yourself

Twelve questions spanning the regimes, the GEMM ladder, inference kernels and the frontier. Pick an answer to see the explanation.

A kernel achieves 4% of the GPU's peak FLOP/s and near-peak HBM bandwidth. What is it?

Near-peak bandwidth with tiny FLOP utilisation is the signature of a memory-bound kernel — adding faster math won't help.

The H100's tensor cores do ~989 TFLOP/s bf16 and HBM3 gives ~3.35 TB/s. Roughly what arithmetic intensity must a kernel exceed to be compute-bound?

989e12 / 3.35e12 ≈ 295 FLOPs per byte — the ridge point. Below it you're memory-bound no matter what.
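To make the ridge-point arithmetic concrete, here is a minimal roofline sketch. It uses only the two H100 figures quoted in the question; the function names are illustrative, not taken from any library.

```python
def ridge_point(peak_flops, peak_bandwidth):
    """Arithmetic intensity (FLOPs/byte) where compute time equals memory time."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, intensity * peak_bandwidth)


def regime(intensity, peak_flops, peak_bandwidth):
    """Classify a kernel by which roof it sits under."""
    if intensity > ridge_point(peak_flops, peak_bandwidth):
        return "compute-bound"
    return "memory-bound"


H100_BF16 = 989e12   # FLOP/s, figure quoted in the question
H100_HBM3 = 3.35e12  # bytes/s, figure quoted in the question

ridge = ridge_point(H100_BF16, H100_HBM3)
print(f"ridge point: {ridge:.0f} FLOPs/byte")   # ridge point: 295 FLOPs/byte
print(regime(10.0, H100_BF16, H100_HBM3))       # memory-bound
print(regime(1000.0, H100_BF16, H100_HBM3))     # compute-bound
```

Under this model, a kernel running at near-peak bandwidth but only 4% of peak FLOP/s has an intensity of roughly 0.04 × 295 ≈ 12 FLOPs per byte, far to the left of the ridge.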

In the GEMM ladder, the jump from naive (1.3%) to 8.5% of cuBLAS comes from…

Kernel 2 is a one-line thread-index remap so a warp reads contiguous columns — coalesced access, ~6× for free.
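A rough way to see why the remap helps is to count how many memory sectors one warp touches under each thread-to-element mapping. The sketch below is a host-side simulation, not the kernel itself; the 32-byte sector size and the matrix width N are assumptions chosen for illustration.

```python
WARP = 32
SECTOR_BYTES = 32   # assumed granularity of a global-memory transaction
FLOAT_BYTES = 4


def sectors_touched(element_indices):
    """Count the distinct 32-byte sectors hit by one warp-wide access."""
    return len({(i * FLOAT_BYTES) // SECTOR_BYTES for i in element_indices})


def naive_indices(n, col=0):
    # Consecutive lanes map to consecutive ROWS: lane t touches C[t][col].
    return [lane * n + col for lane in range(WARP)]


def remapped_indices(n, row=0):
    # Consecutive lanes map to consecutive COLUMNS: lane t touches C[row][t].
    return [row * n + lane for lane in range(WARP)]


N = 4096  # hypothetical row-major matrix width
print(sectors_touched(naive_indices(N)))     # 32
print(sectors_touched(remapped_indices(N)))  # 4
```

The 8× drop in sectors is an intuition for the direction of the effect, not a prediction of the measured speedup, which also depends on caching and on how A and B are read.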

Why must GEMM block over the K dimension when using shared memory?

Shared memory is only ~228 KiB per SM on an H100; whole K-length rows and columns won't fit, so you tile K and accumulate partial products across tiles.
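The accumulate-across-tiles idea can be sketched on the CPU in plain Python. This is a minimal illustration, assuming row-major lists of lists; the helper smem_bytes and the 128×128 block shape are hypothetical and only there to show the budget arithmetic.

```python
def matmul_k_tiled(a, b, tile_k):
    """C = A @ B, accumulating one K-tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, tile_k):
        k1 = min(k0 + tile_k, k)
        # On a GPU, these two slices are what a block would stage in shared memory.
        a_tile = [row[k0:k1] for row in a]
        b_tile = b[k0:k1]
        for i in range(m):
            for j in range(n):
                acc = c[i][j]  # partial sum carried over from earlier tiles
                for kk in range(k1 - k0):
                    acc += a_tile[i][kk] * b_tile[kk][j]
                c[i][j] = acc
    return c


def smem_bytes(block_m, block_n, tile_k, bytes_per_elem=4):
    """Bytes needed to stage one A tile and one B tile."""
    return (block_m * tile_k + tile_k * block_n) * bytes_per_elem


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_k_tiled(a, b, tile_k=2))  # [[58.0, 64.0], [139.0, 154.0]]
print(smem_bytes(128, 128, 4096))      # 4194304 bytes: far over a 228 KiB budget
print(smem_bytes(128, 128, 8))         # 8192 bytes: fits easily
```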

A float4 / LDS.128 load moves 128 bits per instruction. Its main benefit over four scalar loads is…

Same bytes, a quarter of the instructions — a win precisely when you're issue-bound, as in kernel 6.

On an H100, shared memory has 32 banks. A bank conflict happens when…

Lanes of one warp that address different words in the same bank are serialised (lanes reading the same word are served by a broadcast); padding (+1) or swizzling fixes it.
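The effect of the +1 padding is easy to check with a small simulation. The sketch below assumes 4-byte words interleaved across 32 banks and a warp reading one column of a row-major tile; the function names are illustrative.

```python
from collections import defaultdict

BANKS = 32  # 4-byte words are interleaved across 32 banks


def conflict_degree(word_addresses):
    """Worst-case number of distinct words that land in a single bank."""
    words_per_bank = defaultdict(set)
    for addr in word_addresses:
        words_per_bank[addr % BANKS].add(addr)
    return max(len(words) for words in words_per_bank.values())


def column_read(row_stride, col=0):
    # Lane t reads tile[t][col] from a row-major tile with the given row stride.
    return [lane * row_stride + col for lane in range(32)]


print(conflict_degree(column_read(32)))  # 32 (every lane hits bank 0)
print(conflict_degree(column_read(33)))  # 1  (padding spreads lanes over all banks)
print(conflict_degree([7] * 32))         # 1  (same word for all lanes: broadcast)
```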

Why is LLM decode (one token at a time) usually memory-bound?

Decode is a skinny mat-vec: every step rereads the weights and the KV cache from HBM for very few FLOPs per byte, the opposite regime from prefill.
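A back-of-the-envelope intensity estimate shows the gap between the two regimes. This sketch models a single weight matrix applied to a batch of token vectors and ignores the KV cache, attention and caching effects; the 4096×4096 layer and 2-byte elements are assumptions for illustration.

```python
def matmul_intensity(rows, cols, tokens, bytes_per_elem=2):
    """FLOPs per byte for a (rows x cols) weight applied to `tokens` input vectors."""
    flops = 2 * rows * cols * tokens  # one multiply and one add per weight per token
    # Read the weights and the inputs once, write the outputs once.
    bytes_moved = (rows * cols + cols * tokens + rows * tokens) * bytes_per_elem
    return flops / bytes_moved


RIDGE = 989e12 / 3.35e12  # about 295 FLOPs/byte, from the H100 figures quoted earlier

decode = matmul_intensity(4096, 4096, tokens=1)      # one token per step
prefill = matmul_intensity(4096, 4096, tokens=2048)  # whole prompt at once

print(f"decode:  {decode:.2f} FLOPs/byte")   # decode:  1.00 FLOPs/byte
print(f"prefill: {prefill:.0f} FLOPs/byte")  # prefill: 1024 FLOPs/byte
print(decode < RIDGE < prefill)              # True
```

Batching more sequences per decode step raises the intensity of the weight reads in the same way, which is one reason serving systems batch aggressively.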

FlashAttention's core trick is…

Tiling plus online-softmax rescaling fuses attention into one kernel; the N×N score matrix never lands in HBM, so the memory footprint scales like N·d, not N².
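The rescaling step is the part that tends to look like magic, so here is a minimal scalar sketch of it. It computes a softmax-weighted sum for a single query with scalar values (d = 1), visiting scores one tile at a time; it illustrates the running-max trick only and is not the FlashAttention kernel.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: materialise every score, then normalise."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def online_softmax_weighted_sum(scores, values, tile=2):
    """Same result, keeping only a running max, denominator and accumulator."""
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
diff = online_softmax_weighted_sum(scores, values) - softmax_weighted_sum(scores, values)
print(abs(diff) < 1e-12)  # True
```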

What is new in Hopper (sm_90a) that Ampere lacks?

Hopper adds the Tensor Memory Accelerator (TMA), warpgroup MMA (wgmma), distributed shared memory and thread block clusters.

DeepSeek-V4-Pro-DSpark is best described as…

DSpark is a speculative-decoding module bolted onto V4-Pro — draft tokens verified in parallel; a kernels problem.
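For readers new to the idea, "draft tokens verified in parallel" can be sketched with the generic greedy form of speculative decoding. This is not DSpark's actual scheme, which the explanation above does not detail; verify_draft and its inputs are hypothetical, and the target model's predictions are assumed to come from a single batched forward pass.

```python
def verify_draft(draft_tokens, target_predictions):
    """Greedy verification of a drafted continuation.

    target_predictions[i] is the target model's argmax token at position i,
    given the context plus draft_tokens[:i]. It has len(draft_tokens) + 1
    entries, all produced by one parallel pass over the draft.
    """
    accepted = []
    for i, token in enumerate(draft_tokens):
        if token != target_predictions[i]:
            break
        accepted.append(token)
    # The target's own token at the first mismatch (or the bonus position
    # after a fully accepted draft) is always emitted.
    accepted.append(target_predictions[len(accepted)])
    return accepted


print(verify_draft([5, 9, 2, 7], [5, 9, 4, 1, 8]))  # [5, 9, 4]
print(verify_draft([5, 9], [5, 9, 3]))              # [5, 9, 3]
print(verify_draft([1], [2, 6]))                    # [2]
```

Each call emits at least one token, so the scheme is never slower in tokens per target pass than plain decoding; the kernel work lies in making that one wide verification pass cheap.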

In Stanford CRFM's AI-generated-kernel experiments, results were…

Branching search shone on less-tuned FP32 ops but still trailed badly on hard ones like FlashAttention.

Which is the right first move when a kernel hangs and Ctrl-C does nothing?

The vLLM workflow: run with CUDA_ENABLE_USER_TRIGGERED_COREDUMP enabled, trigger a core dump by writing to its named pipe, then open the dump in cuda-gdb.