Vizuara Kernel Engineering

02 The CUDA Programming Model

From the abstract launch to the metal. Threads to grids, the compilation story, and the primitives you build every kernel out of.

Threads, warps, blocks, grids
The execution hierarchy and how it maps onto SMs, plus the indexing arithmetic you will write a thousand times.
SIMT & warp divergence
Why 32 threads share a program counter, what happens at an if-statement, and how divergence can quietly halve your throughput, or worse.
Anatomy of a kernel launch
Grid/block dims, launch overhead, streams, and what actually happens between <<<>>> and the first instruction.
The memory spaces
Global, shared, local, constant, register, texture — the full map, with latencies and when to reach for each.
PTX vs SASS: the compilation story
nvcc → PTX → ptxas → SASS. What is virtual, what is real, and why you read SASS to find the truth.
Compute capability & targeting
sm_90a and friends — how to compile for the right architecture and why the 'a' matters on Hopper.
Shared-memory bank conflicts
32 banks, the conflict rule, and the padding/swizzling fixes that recover the bandwidth you paid for.
Atomics & reductions
Warp-shuffle reductions, atomicAdd, and how to sum a million numbers without serializing your GPU.
Streams, events & async
Overlapping copy and compute, cp.async, and the concurrency that keeps the SMs fed.
Your first kernel, end to end
SAXPY to RGB→grayscale: launch config, boundary checks, and benchmarking done right, from scratch.
GPU Puzzles, walkthrough I (practice)
Solving the first set of Sasha Rush's GPU Puzzles — map, zip, broadcast, and the mental model of one-thread-one-element.
GPU Puzzles, walkthrough II (practice)
Pooling, dot product, convolution and prefix sum — where shared memory and cooperation enter the picture.