NCCL Collectives & Traffic Patterns
Most distributed training traffic can be described using a small set of collectives.
AllReduce
What it does: Everyone contributes a buffer; everyone gets the reduced result (sum/avg/max…).
Where it appears: Gradient synchronization in data-parallel training.
Traffic pattern intuition: Many-to-many; NCCL implements it with ring, tree, or hybrid algorithms depending on message size and topology.
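As a toy model of the semantics (plain Python, not the NCCL API), AllReduce can be pictured as: every rank contributes a buffer, and every rank receives an identical copy of the elementwise reduction.

```python
# Toy model of AllReduce semantics (not the NCCL API): each inner list
# is one rank's buffer; every rank receives the elementwise sum.
def all_reduce(buffers):
    """buffers: list of per-rank lists, all the same length."""
    reduced = [sum(vals) for vals in zip(*buffers)]
    # Every rank ends up with an identical copy of the reduced result.
    return [list(reduced) for _ in buffers]

ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(all_reduce(ranks)[0])  # [111, 222, 333] on every rank
```

In real NCCL the reduction and the data movement are fused and pipelined over the interconnect; this sketch only captures the input/output contract.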
ReduceScatter + AllGather
Many modern optimizers and sharding approaches rely on combinations of:
- ReduceScatter (reduce, then leave each rank holding one shard of the result)
- AllGather (collect all shards so every rank holds the full buffer)
Chained together they are equivalent to an AllReduce, but keeping only the sharded intermediate (as in ZeRO/FSDP-style sharded optimizers) can save memory and avoid redundant traffic.
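The equivalence is easy to see in a toy model (plain Python, not the NCCL API): ReduceScatter reduces and leaves rank i with shard i; a following AllGather reassembles the full reduced buffer on every rank.

```python
# Toy model (not the NCCL API): ReduceScatter + AllGather == AllReduce.
def reduce_scatter(buffers):
    n = len(buffers)
    shard = len(buffers[0]) // n  # assumes length divisible by world size
    reduced = [sum(vals) for vals in zip(*buffers)]
    # Rank i keeps only shard i of the reduced result.
    return [reduced[i * shard:(i + 1) * shard] for i in range(n)]

def all_gather(shards):
    full = [x for shard in shards for x in shard]
    # Every rank receives the concatenation of all shards.
    return [list(full) for _ in shards]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
shards = reduce_scatter(ranks)  # rank 0: [11, 22], rank 1: [33, 44]
print(all_gather(shards)[0])    # [11, 22, 33, 44] -- same as an AllReduce
```

A sharded optimizer stops after `reduce_scatter`: each rank updates only its shard of the parameters, and the AllGather happens later (or on demand).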
Broadcast
Used for distributing parameters, initialization, or control messages.
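In the same toy model (plain Python, not the NCCL API), Broadcast simply replaces every rank's buffer with a copy of the root's buffer:

```python
# Toy model (not the NCCL API): Broadcast copies the root rank's buffer
# to every other rank; non-root buffers are overwritten.
def broadcast(buffers, root=0):
    src = buffers[root]
    return [list(src) for _ in buffers]

ranks = [[1, 2, 3], [0, 0, 0], [9, 9, 9]]
print(broadcast(ranks, root=0))  # every rank now holds [1, 2, 3]
```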
Stream semantics
NCCL operations are asynchronous with respect to the host: they are launched into a CUDA stream and obey stream ordering, so the launch call returns before the collective finishes. To measure correctly, synchronize the stream (or use CUDA events) around the region you time; otherwise you measure only the launch overhead.
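The pitfall can be illustrated without a GPU. In the host-side analogy below, a single-worker thread pool stands in for an in-order CUDA stream and `time.sleep` stands in for a kernel: timing the launch alone undercounts, while waiting on the result (the analogue of `cudaStreamSynchronize`) captures the real duration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Host-side analogy for stream semantics (no GPU involved): submitting
# work returns immediately, like launching a NCCL op into a CUDA stream.
stream = ThreadPoolExecutor(max_workers=1)  # one worker ~ one in-order stream

t0 = time.perf_counter()
fut = stream.submit(time.sleep, 0.2)    # "launch" returns right away
launch_time = time.perf_counter() - t0  # tiny: measures only the enqueue

fut.result()                            # analogue of cudaStreamSynchronize
total_time = time.perf_counter() - t0   # now includes the actual work

print(f"launch-only: {launch_time:.3f}s, synchronized: {total_time:.3f}s")
```

The same discipline applies to CUDA: bracket the timed region with a stream synchronization (or CUDA events recorded on that stream), not just the launch calls.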
Next: Topology & Performance