Vizuara Kernel Engineering

Projects

Reading is not enough: kernels are learned by writing them. Each project is a hands-on build, linked to the exact chapters that support it. Work top to bottom, or jump to your level. Each one produces something you can put in a worklog and show an employer.

beginner

GPU Puzzles: the on-ramp

Solve Sasha Rush's 14 GPU-Puzzles to internalise the one-thread-one-element model, indexing, guards, shared memory and reductions — the fastest way from zero to writing correct kernels.
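To give a feel for the one-thread-one-element model before touching a GPU, here is a minimal CPU-side sketch in plain Python. It is not taken from the puzzles themselves: the `launch` helper and `add_ten_kernel` are hypothetical names, and the "threads" run one after another rather than in parallel. What it does show is the global index computation and the bounds guard that every puzzle leans on.

```python
def launch(kernel, num_blocks, threads_per_block, *args):
    """Hypothetical launcher: run `kernel` once per simulated thread, sequentially."""
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, thread_idx, threads_per_block, *args)


def add_ten_kernel(block_idx, thread_idx, block_dim, out, a, n):
    # Each thread computes the one global element it owns.
    i = block_idx * block_dim + thread_idx
    # Guard: the grid is rounded up, so some threads fall past the end.
    if i < n:
        out[i] = a[i] + 10


n = 10
a = list(range(n))
out = [0] * n
threads_per_block = 4
# Ceiling division: 3 blocks of 4 threads cover 10 elements, with 2 idle threads.
num_blocks = (n + threads_per_block - 1) // threads_per_block
launch(add_ten_kernel, num_blocks, threads_per_block, out, a, n)
print(out)  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

Removing the `if i < n` guard makes the last two simulated threads index past the end of the list, which is the same class of bug the guard prevents in a real kernel.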

core

GEMM to 94% of cuBLAS

Build matrix-multiply from a 1.3%-of-cuBLAS naive kernel up the full ten-step ladder — coalescing, SMEM tiling, register tiling, vectorization, autotuning, warptiling — profiling every step.
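Why SMEM tiling pays off can be seen with simple counting. The sketch below is a plain-Python model, not a kernel: it counts how many values are read from the input matrices ("global memory") by a naive triple loop and by a version that stages T×T tiles first (the staged lists stand in for shared memory). The function names and the load counter are illustrative assumptions; the point is only that tiling with tile size T cuts global reads by a factor of T.

```python
def matmul_naive(A, B, M, N, K):
    """Every multiply-add reads one value of A and one of B from 'global memory'."""
    loads = 0
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
                loads += 2
            C[i][j] = acc
    return C, loads


def matmul_tiled(A, B, M, N, K, T):
    """Stage T x T tiles once, then reuse each staged value T times."""
    assert M % T == 0 and N % T == 0 and K % T == 0
    loads = 0
    C = [[0.0] * N for _ in range(M)]
    for bi in range(0, M, T):          # one "thread block" per output tile
        for bj in range(0, N, T):
            for bk in range(0, K, T):  # march along the K dimension
                # Stand-ins for shared-memory tiles.
                a_tile = [[A[bi + r][bk + c] for c in range(T)] for r in range(T)]
                b_tile = [[B[bk + r][bj + c] for c in range(T)] for r in range(T)]
                loads += 2 * T * T
                for r in range(T):
                    for c in range(T):
                        acc = C[bi + r][bj + c]
                        for k in range(T):
                            acc += a_tile[r][k] * b_tile[k][c]
                        C[bi + r][bj + c] = acc
    return C, loads


M = N = K = 8
T = 4
A = [[float(i + k) for k in range(K)] for i in range(M)]
B = [[float(k - j) for j in range(N)] for k in range(K)]
C_naive, loads_naive = matmul_naive(A, B, M, N, K)
C_tiled, loads_tiled = matmul_tiled(A, B, M, N, K, T)
print(loads_naive, loads_tiled)  # 1024 256
```

The two results are identical and the load count drops by exactly T = 4. Register tiling applies the same reuse idea one level closer to the arithmetic units.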

core

Matmul on tensor cores

Rebuild GEMM a second time on the tensor cores: wmma fragments, SMEM swizzling to kill bank conflicts, and mma.sync to library-class speed.
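The bank-conflict problem that swizzling addresses can be modelled in a few lines. The sketch below assumes 32 shared-memory banks of 4-byte words and a 32×32 float tile stored row-major, and uses a simple XOR of column with row as the swizzle. That XOR layout is one common illustrative pattern, not necessarily the one this project arrives at; real tensor-core kernels typically swizzle at a coarser granularity.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: 32 banks, each one 4-byte word wide
TILE = 32       # assumption: 32 x 32 tile of 4-byte elements, row-major


def bank(row, col, swizzle):
    """Bank that holds logical element (row, col) of the tile."""
    phys_col = (col ^ row) if swizzle else col  # XOR swizzle permutes columns per row
    word = row * TILE + phys_col
    return word % NUM_BANKS


def worst_conflict(col, swizzle):
    """A warp reads one column: thread t reads (row=t, col). Return the max threads per bank."""
    banks = Counter(bank(t, col, swizzle) for t in range(32))
    return max(banks.values())


print(worst_conflict(5, swizzle=False))  # 32
print(worst_conflict(5, swizzle=True))   # 1
```

Without the swizzle, all 32 threads of the warp land in the same bank and the access serializes 32 ways. With it, each thread lands in a different bank, because `col ^ row` takes 32 distinct values as `row` runs from 0 to 31.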

advanced

FlashAttention from scratch

Fuse the whole attention computation into one kernel with online softmax so the N×N score matrix never touches HBM, then benchmark it against PyTorch's SDPA.
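The online-softmax trick at the heart of this project can be sketched for a single query row in plain Python. This is a minimal illustration under assumptions, not the project's kernel: the function names and `block` parameter are hypothetical, and the usual 1/sqrt(d) score scaling is left out to keep it short. The loop keeps a running maximum `m`, a running normalizer `l` and an unnormalized output `acc`, and rescales the old state whenever a new block raises the maximum, so the full row of scores is never held at once.

```python
import math


def attention_reference(q, keys, values):
    """Two-pass softmax attention for one query: materializes every score."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(values[0])
    return [sum(w[j] * values[j][c] for j in range(len(keys))) / z for c in range(d)]


def attention_online(q, keys, values, block=2):
    """One pass over key/value blocks, keeping only m, l and acc as state."""
    d = len(values[0])
    m = float("-inf")  # running max of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = [0.0] * d    # running unnormalized output
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        s = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(s))
        scale = math.exp(m - m_new)  # rescales old state; 0.0 on the first block
        p = [math.exp(x - m_new) for x in s]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pj * v[c] for pj, v in zip(p, vb))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
        [-1.0, -1.0, 1.5], [0.2, 0.3, 0.4]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_reference(q, keys, values)
out = attention_online(q, keys, values, block=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, out)))  # True
```

In the fused kernel the same three pieces of state live in registers or shared memory per query tile, which is what lets the scores stay on chip.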

expert

Beat cuBLAS on an H100

Assemble TMA, WGMMA and warp specialization into a Hopper GEMM that matches or beats NVIDIA's own library — the full frontier worklog.

capstone

You vs. the machine (capstone)

Pick a kernel (SwiGLU, a FlashAttention variant, histogram…), optimize it by hand, then run an LLM-in-the-loop optimization against your own kernel and document, CS149 × KernelBench style, what each found that the other missed.

skill

Debug a broken kernel

Take a kernel with a race, a misaligned vector load and a silent NaN, and hunt each down with compute-sanitizer, user-triggered core dumps, cuda-gdb and nvdisasm — the vLLM workflow.
