Topology & Performance
NCCL performance is heavily shaped by physical topology.
Inside a node
Key links:
- GPU↔GPU: NVLink / NVSwitch is usually best
- GPU↔NIC: PCIe placement and NUMA affinity matter
- PCIe hierarchy: extra hops across CPU root complexes often hurt
Practical check:
nvidia-smi topo -m
- Look for NVLink connectivity (NV# entries) and GPU↔NIC proximity (PIX/PXB paths are closer than PHB/SYS)
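To cross-check what NCCL itself detects at init time, its debug output can be enabled; these are standard NCCL environment variables (the dump file path is just an example):

```shell
# Print NCCL's view of topology and graph search during init
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
# Optionally dump the detected topology to an XML file for inspection
export NCCL_TOPO_DUMP_FILE=/tmp/nccl_topo.xml
```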
Across nodes
Key dimensions:
- Fabric bandwidth (100/200/400 Gb/s, etc.)
- Oversubscription and bisection bandwidth
- Congestion control behavior (especially for RoCE)
- Routing/ECMP balance and hotspots
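Oversubscription can be estimated by comparing host-facing capacity to uplink capacity at a leaf switch. A minimal sketch; the port counts and speeds below are made-up examples, not a recommendation:

```python
def oversubscription(hosts_per_leaf: int, host_gbps: float,
                     uplinks_per_leaf: int, uplink_gbps: float) -> float:
    """Ratio of host-facing bandwidth to uplink bandwidth at a leaf switch.

    A ratio > 1 means the fabric is oversubscribed: if every host bursts
    at line rate toward other leaves, the uplinks become the bottleneck.
    """
    downlink = hosts_per_leaf * host_gbps
    uplink = uplinks_per_leaf * uplink_gbps
    return downlink / uplink

# Example: 32 hosts at 200 Gb/s behind 8 x 400 Gb/s uplinks -> 2:1
print(oversubscription(32, 200, 8, 400))  # 2.0
```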
Scaling intuition
As you increase GPUs/nodes, collective time often grows because:
- More participants → more data movement / coordination
- More congestion opportunities
- More sensitivity to a few slow paths (stragglers)
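This growth can be made concrete with a standard alpha-beta cost model for ring allreduce: 2(n-1) steps each pay a per-step latency, and the bandwidth term approaches 2·S/B as n grows. The constants below are illustrative defaults, not measurements:

```python
def ring_allreduce_time(n_gpus: int, size_bytes: float,
                        alpha_s: float = 10e-6,
                        bw_bytes_per_s: float = 40e9) -> float:
    """Alpha-beta estimate for ring allreduce time.

    Latency term: 2*(n-1) steps, each costing alpha_s seconds.
    Bandwidth term: 2*(n-1)/n * S / B, which approaches 2*S/B for large n.
    Both terms rise with n, so collective time grows as you scale.
    """
    steps = 2 * (n_gpus - 1)
    latency = steps * alpha_s
    bandwidth = (2 * (n_gpus - 1) / n_gpus) * size_bytes / bw_bytes_per_s
    return latency + bandwidth

for n in (2, 8, 64):
    print(n, ring_allreduce_time(n, 1e9))
```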
Measuring correctly
Separate:
- Compute time (kernels)
- Communication time (collectives)
- Input pipeline and CPU overhead
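When timing collectives in isolation, it helps to normalize raw algorithm bandwidth into the "bus bandwidth" convention used by nccl-tests, which for allreduce multiplies by 2(n-1)/n so results are comparable across rank counts. A sketch of the arithmetic:

```python
def allreduce_bandwidths(size_bytes: float, time_s: float, n_ranks: int):
    """Return (algbw, busbw) in bytes/s, following the nccl-tests
    convention for allreduce: busbw = algbw * 2*(n-1)/n."""
    algbw = size_bytes / time_s
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Example: 4 GB allreduced across 8 ranks in 50 ms
algbw, busbw = allreduce_bandwidths(4e9, 0.05, 8)
print(algbw / 1e9, busbw / 1e9)  # 80.0 GB/s algorithm bw, 140.0 GB/s bus bw
```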
Next: Debugging