Projects
Reading is not enough: kernels are learned by writing them. Each project is a hands-on build, linked to the chapters that support it. Work top to bottom, or jump to your level. Every project produces something you can put in a worklog and show an employer.
GPU Puzzles: the on-ramp
Solve Sasha Rush's 14 GPU-Puzzles to internalise the one-thread-one-element model, indexing, guards, shared memory and reductions — the fastest way from zero to writing correct kernels.
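As a rough illustration of the one-thread-one-element model and the role of a guard, here is a minimal sketch in plain Python. It only simulates threads with a sequential loop; `launch` and `add_ten_kernel` are hypothetical names chosen for this example, not part of the puzzles or of any GPU API.

```python
def add_ten_kernel(out, a, size, thread_idx):
    """Body run by ONE simulated thread; thread_idx stands in for threadIdx.x."""
    i = thread_idx
    if i < size:  # guard: surplus threads must not touch memory past the end
        out[i] = a[i] + 10


def launch(kernel, num_threads, *args):
    """Hypothetical launcher: runs the kernel body once per thread, sequentially."""
    for thread_idx in range(num_threads):
        kernel(*args, thread_idx)


a = [0, 1, 2, 3, 4]
out = [0] * len(a)
launch(add_ten_kernel, 8, out, a, len(a))  # 8 threads for only 5 elements
print(out)  # [10, 11, 12, 13, 14]
```

Launching more threads than elements is the normal case on a GPU, because threads come in fixed-size blocks. Removing the `if i < size` line makes threads 5 to 7 index past the end of `a`, which Python reports as an `IndexError` and a real GPU may silently turn into corrupted memory.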
GEMM to 94% of cuBLAS
Build matrix-multiply from a 1.3%-of-cuBLAS naive kernel up the full ten-step ladder — coalescing, SMEM tiling, register tiling, vectorization, autotuning, warptiling — profiling every step.
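The tiling step of that ladder can be sketched on the CPU. The plain-Python example below is an illustration under stated assumptions, not the project's kernel: each `(i0, j0)` pair plays the role of a thread block, and the `a_tile` / `b_tile` slices stand in for the tiles a block would stage in shared memory. The function names and the `tile` parameter are invented for this sketch.

```python
def matmul_naive(A, B):
    """Reference: every output element re-reads a full row of A and column of B."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(A, B, tile=2):
    """Same result, computed one output tile at a time from staged input tiles."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # one "thread block" per output tile
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):      # march along the shared K dimension
                # Stage one tile of A and one of B (the SMEM loads on a GPU).
                a_tile = [row[p0:p0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[p0:p0 + tile]]
                # Accumulate the partial product of the two staged tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[i0 + i][j0 + j] += sum(
                            a_row[p] * b_tile[p][j] for p in range(len(b_tile)))
    return C


A = [[i + j for j in range(5)] for i in range(3)]      # 3 x 5
B = [[i * j + 1 for j in range(4)] for i in range(5)]  # 5 x 4
print(matmul_tiled(A, B, tile=2) == matmul_naive(A, B))  # True
```

The point of the restructuring is reuse: each staged tile element is read from slow memory once and then used `tile` times, which is what the shared-memory step buys on real hardware.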
Matmul on tensor cores
Rebuild GEMM a second time on the tensor cores: wmma fragments, SMEM swizzling to kill bank conflicts, and mma.sync to library-class speed.
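To see why swizzling removes bank conflicts, the sketch below models shared memory in plain Python. It assumes 32 banks of 4-byte words and a 32×32 tile of floats, and uses the simplest XOR swizzle (store column `col` at `col ^ row`). Real tensor-core layouts typically swizzle larger chunks with different patterns, so treat this as a toy model only; all names are hypothetical.

```python
from collections import Counter

BANKS = 32  # assumption: 32 banks, one 4-byte word per bank per cycle
TILE = 32   # 32 x 32 tile of 4-byte elements


def bank_plain(row, col):
    """Bank of element (row, col) in a plain row-major tile."""
    return (row * TILE + col) % BANKS


def bank_swizzled(row, col):
    """Bank of element (row, col) when its column is XOR-ed with the row index."""
    return (row * TILE + (col ^ row)) % BANKS


def worst_conflict(bank_of, col):
    """Largest number of threads hitting one bank when thread t reads (t, col)."""
    hits = Counter(bank_of(t, col) for t in range(TILE))
    return max(hits.values())


print(worst_conflict(bank_plain, 0))     # 32: the column read is fully serialised
print(worst_conflict(bank_swizzled, 0))  # 1: every thread lands in its own bank
```

In the plain layout every element of a column sits in the same bank, so a warp reading down a column is serialised 32 ways. XOR-ing the column with the row spreads the same column across all 32 banks while keeping each row a permutation of itself, so row reads stay conflict-free too.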
FlashAttention from scratch
Fuse the entire attention computation into one kernel with online softmax so the N×N score matrix never touches HBM, then benchmark it against PyTorch's SDPA.
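The online-softmax idea behind that fusion can be sketched for a single query row in plain Python. This is a minimal illustration, assuming scalar values per key for brevity; the function names and the `block` parameter are invented for this example. The streaming version never holds the full row of scores at once: it keeps only a running maximum, a running normaliser and a running weighted sum, rescaling the last two whenever the maximum changes.

```python
import math


def attention_row_reference(scores, values):
    """Two-pass softmax over the full score row, then the weighted sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def attention_row_online(scores, values, block=2):
    """Single pass over key blocks with running max, normaliser and accumulator."""
    running_max = -math.inf
    denom = 0.0  # running sum of exp(score - running_max)
    acc = 0.0    # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old max (factor is 0.0 on block 1).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]
        denom = denom * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, v_blk))
        running_max = new_max
    return acc / denom


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(math.isclose(attention_row_online(scores, values),
                   attention_row_reference(scores, values)))  # True
```

Because each block only needs the three running quantities, a fused kernel can compute scores tile by tile in on-chip memory and discard them immediately, which is why the full N×N matrix never has to be written out.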
Beat cuBLAS on an H100
Assemble TMA, WGMMA and warp specialization into a Hopper GEMM that matches or beats NVIDIA's own library — the full frontier worklog.
You vs. the machine (capstone)
Pick a kernel (SwiGLU, a FlashAttention variant, histogram…), optimize it by hand, then run an LLM-in-the-loop optimizer against your own kernel and document, CS149 × KernelBench style, what each found that the other missed.
Debug a broken kernel
Take a kernel with a race, a misaligned vector load and a silent NaN, and hunt each down with compute-sanitizer, user-triggered core dumps, cuda-gdb and nvdisasm — the vLLM workflow.
