02 The CUDA Programming Model

From the abstract launch to the metal. Threads to grids, the compilation story, and the primitives you build every kernel out of.

Threads, warps, blocks, grids

The execution hierarchy and how it maps onto SMs, plus the indexing arithmetic you will write a thousand times.

SIMT & warp divergence

Why the 32 threads of a warp execute one instruction in lockstep, what happens at an if-statement, and how divergence can quietly halve your throughput or worse.

Anatomy of a kernel launch

Grid/block dims, launch overhead, streams, and what actually happens between <<<>>> and the first instruction.

The memory spaces

Global, shared, local, constant, register, texture — the full map, with latencies and when to reach for each.

PTX vs SASS: the compilation story [PTX]

nvcc → PTX → ptxas → SASS. What is virtual, what is real, and why you read SASS to find the truth.

Compute capability & targeting

sm_90a and friends — how to compile for the right architecture and why the 'a' matters on Hopper.

Shared-memory bank conflicts [SMEM]

32 banks, the conflict rule, and the padding/swizzling fixes that recover the bandwidth you paid for.

Atomics & reductions

Warp-shuffle reductions, atomicAdd, and how to sum a million numbers without serializing your GPU.

Streams, events & async

Overlapping copy and compute, cp.async, and the concurrency that keeps the SMs fed.

Your first kernel, end to end

SAXPY to RGB→grayscale: launch config, boundary checks, and benchmarking done right, from scratch.

GPU Puzzles, walkthrough I [PRACTICE]

Solving the first set of Sasha Rush's GPU Puzzles — map, zip, broadcast, and the mental model of one-thread-one-element.

GPU Puzzles, walkthrough II [PRACTICE]

Pooling, dot product, convolution and prefix sum — where shared memory and cooperation enter the picture.