Projects

Reading is not enough: kernels are learned by writing them. Each project is a hands-on build, linked to the exact chapters that support it. Work top to bottom, or jump to your level. Every project produces something you can put in a worklog and show an employer.

beginner

GPU Puzzles: the on-ramp

Solve Sasha Rush's 14 GPU-Puzzles to internalise the one-thread-one-element model, indexing, guards, shared memory and reductions — the fastest way from zero to writing correct kernels.

Walkthrough I → Walkthrough II →
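
To give a feel for the one-thread-one-element model and why guards matter, here is a minimal plain-Python sketch that needs no GPU. The names `launch` and `add_ten_kernel` are hypothetical stand-ins made up for illustration; they are not part of GPU-Puzzles, and the threads run sequentially here rather than in parallel.

```python
def add_ten_kernel(out, a, size, thread_idx):
    # Each simulated thread is responsible for exactly one element.
    i = thread_idx
    # Guard: more threads than elements may be launched,
    # so threads whose index is out of range must do nothing.
    if i < size:
        out[i] = a[i] + 10


def launch(kernel, num_threads, *args):
    # Hypothetical stand-in for a GPU launch: call the kernel once per thread id.
    # A real GPU runs these concurrently; this loop is sequential.
    for thread_idx in range(num_threads):
        kernel(*args, thread_idx)


a = [0, 1, 2, 3, 4]
out = [0] * len(a)
launch(add_ten_kernel, 8, out, a, len(a))  # 8 threads for only 5 elements
print(out)  # [10, 11, 12, 13, 14]
```

Removing the `if i < size` guard makes thread 5 index past the end of `a`. Python raises an `IndexError`; on a real GPU the same mistake is an out-of-bounds memory access.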

core

GEMM to 94% of cuBLAS

Build matrix-multiply from a 1.3%-of-cuBLAS naive kernel up the full ten-step ladder — coalescing, SMEM tiling, register tiling, vectorization, autotuning, warptiling — profiling every step.

Start: Kernel 1 (naive) → The ladder, end to end →
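
The tiling idea behind the SMEM step of the ladder can be sketched on the CPU. This is only a loop-structure illustration under the assumption that one output tile corresponds to one thread block; `matmul_naive` and `matmul_tiled` are hypothetical names, and none of this reflects real GPU performance.

```python
def matmul_naive(A, B):
    # Reference: every output element walks a full row of A and column of B.
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(A, B, tile=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # one output tile, roughly one thread block
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):  # march along K one tile at a time
                # On a GPU these two sub-blocks would be staged in shared memory
                # and reused by every thread in the block.
                a_tile = [row[p0:p0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[p0:p0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile)))
    return C


A = [[i + j for j in range(5)] for i in range(3)]      # 3 x 5
B = [[i * j + 1 for j in range(4)] for i in range(5)]  # 5 x 4
assert matmul_tiled(A, B, tile=2) == matmul_naive(A, B)
```

The sizes are deliberately not multiples of the tile width, so the slices at the edges are shorter. A CUDA kernel handles the same edge cases with explicit bounds checks.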

core

Matmul on tensor cores

Rebuild GEMM a second time on the tensor cores: wmma fragments, SMEM swizzling to kill bank conflicts, and mma.sync to library-class speed.

Tensor cores I: WMMA → Tensor cores III: fast →

advanced

FlashAttention from scratch

Fuse the whole attention into one kernel with online softmax so the N×N scores never touch HBM — then benchmark it against PyTorch's SDPA.

FlashAttention I → Online softmax →
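
The online-softmax trick that lets the scores stay out of HBM can be checked in a few lines. The sketch below assumes a single query and scalar values for brevity, and processes the scores block by block while keeping only a running max, a running normaliser and a running weighted sum. The function names are illustrative, not taken from the chapters.

```python
import math


def softmax_weighted_sum(scores, values):
    # Reference: two passes over the full score vector.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


def online_softmax_weighted_sum(scores, values, block=2):
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        # Rescale the old partial results to the new max.
        # On the first block exp(-inf) is 0.0, so nothing is carried over.
        scale = math.exp(m - m_new)
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
ref = softmax_weighted_sum(scores, values)
got = online_softmax_weighted_sum(scores, values, block=2)
assert abs(ref - got) < 1e-12
```

Only `m`, `l` and `acc` survive between blocks, which is why the full row of scores never has to be stored.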

expert

Beat cuBLAS on an H100

Assemble TMA, WGMMA and warp specialization into a Hopper GEMM that matches or beats NVIDIA's own library — the full frontier worklog.

Beating cuBLAS on H100 → WGMMA & warp specialization →

capstone

You vs. the machine (capstone)

Pick a kernel (SwiGLU, a FlashAttention variant, histogram…), optimize it by hand, then run an LLM-in-the-loop optimizer against your own kernel and document, CS149 × KernelBench style, what each found that the other missed.

KernelBench & fast_p → The SwiGLU kernel →
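
For scoring the comparison, here is a small sketch of the fast_p metric. It assumes the KernelBench definition as the fraction of tasks whose kernel is both correct and more than p times faster than the baseline. The `(is_correct, speedup)` tuple layout and the sample numbers are made up for illustration.

```python
def fast_p(results, p):
    # results: list of (is_correct, speedup) pairs,
    # where speedup = baseline_time / kernel_time (assumed convention).
    if not results:
        return 0.0
    hits = sum(1 for is_correct, speedup in results
               if is_correct and speedup > p)
    return hits / len(results)


# Illustrative data only, not measured results.
results = [(True, 1.5), (True, 0.8), (False, 3.0), (True, 2.2)]

print(fast_p(results, 0.0))  # 0.75: the plain correctness rate
print(fast_p(results, 1.0))  # 0.5: correct and faster than the baseline
print(fast_p(results, 2.0))  # 0.25: correct and more than 2x faster
```

The incorrect kernel with a 3.0x speedup never counts, whatever the threshold: a fast wrong answer scores zero.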

skill

Debug a broken kernel

Take a kernel with a race, a misaligned vector load and a silent NaN, and hunt each down with compute-sanitizer, user-triggered core dumps, cuda-gdb and nvdisasm — the vLLM workflow.

The vLLM debugging workflow →

Want these projects reviewed live, plus a certificate?

See Vizuara's Kernel Engineering Workshop →