02 The CUDA Programming Model
From the abstract launch to the metal. Threads to grids, the compilation story, and the primitives you build every kernel out of.
→
Threads, warps, blocks, grids
The execution hierarchy and how it maps onto SMs, plus the indexing arithmetic you will write a thousand times.
→
SIMT & warp divergence
Why 32 threads share a program counter, what happens at an if-statement, and how divergence can quietly halve your throughput, or worse.
→
Anatomy of a kernel launch
Grid/block dims, launch overhead, streams, and what actually happens between <<<>>> and the first instruction.
→
The memory spaces
Global, shared, local, constant, register, texture — the full map, with latencies and when to reach for each.
→
PTX vs SASS: the compilation story [PTX]
nvcc → PTX → ptxas → SASS. What is virtual, what is real, and why you read SASS to find the truth.
→
Compute capability & targeting
sm_90a and friends — how to compile for the right architecture and why the 'a' matters on Hopper.
→
Shared-memory bank conflicts [SMEM]
32 banks, the conflict rule, and the padding/swizzling fixes that recover the bandwidth you paid for.
→
Atomics & reductions
Warp-shuffle reductions, atomicAdd, and how to sum a million numbers without serializing your GPU.
→
Streams, events & async
Overlapping copy and compute, cp.async, and the concurrency that keeps the SMs fed.
→
Your first kernel, end to end
SAXPY to RGB→grayscale: launch config, boundary checks, and benchmarking done right, from scratch.
→
GPU Puzzles, walkthrough I [PRACTICE]
Solving the first set of Sasha Rush's GPU Puzzles — map, zip, broadcast, and the mental model of one-thread-one-element.
→
GPU Puzzles, walkthrough II [PRACTICE]
Pooling, dot product, convolution and prefix sum — where shared memory and cooperation enter the picture.
