Topology & Performance

NCCL performance is heavily shaped by physical topology, both inside a node and across the network fabric.

Inside a node

Key links:

  • GPU↔GPU: NVLink / NVSwitch is usually best
  • GPU↔NIC: PCIe placement and NUMA affinity matter
  • PCIe hierarchy: extra hops across CPU root complexes often hurt bandwidth and latency

Practical check:

  • nvidia-smi topo -m
  • Look for NVLink connectivity and GPU↔NIC proximity
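The matrix check can be automated. A minimal sketch, assuming a simplified `nvidia-smi topo -m`-style matrix (real output has extra columns for NIC/CPU/NUMA affinity, which this ignores); `SAMPLE` and `non_nvlink_pairs` are illustrative names, not part of any tool:

```python
# Hypothetical, simplified topology matrix in the style of `nvidia-smi topo -m`.
# "NV#" = NVLink with # lanes, "SYS" = traversal across CPU root complexes.
SAMPLE = """\
        GPU0  GPU1  GPU2  GPU3
GPU0     X    NV2   NV2   SYS
GPU1    NV2    X    SYS   NV2
GPU2    NV2   SYS    X    NV2
GPU3    SYS   NV2   NV2    X
"""

def non_nvlink_pairs(matrix_text):
    """Return GPU pairs whose link type is not NVLink (no 'NV' prefix)."""
    lines = matrix_text.strip().splitlines()
    header = lines[0].split()
    pairs = []
    for row in lines[1:]:
        cells = row.split()
        src, links = cells[0], cells[1:]
        for dst, link in zip(header, links):
            # src < dst keeps each unordered pair once; 'X' is the diagonal.
            if src < dst and link != "X" and not link.startswith("NV"):
                pairs.append((src, dst, link))
    return pairs

print(non_nvlink_pairs(SAMPLE))
```

Pairs reported here (e.g. GPU0↔GPU3 over SYS) are the ones where collectives will fall back to slower PCIe/system paths.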

Across nodes

Key dimensions:

  • Fabric bandwidth (100/200/400 Gb/s, etc.)
  • Oversubscription and bisection bandwidth
  • Congestion control behavior (especially for RoCE)
  • Routing/ECMP balance and hotspots
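Oversubscription is easy to quantify. A back-of-the-envelope sketch (function names are illustrative, not from any library), assuming the same link speed on host-facing and uplink ports of a leaf switch:

```python
def leaf_oversubscription(down_ports: int, up_ports: int) -> float:
    """Host-facing bandwidth divided by uplink bandwidth.
    1.0 = non-blocking; >1 means contention when all hosts
    send cross-leaf at once."""
    return down_ports / up_ports

def worst_case_per_host_gbps(down_ports: int, up_ports: int,
                             link_gbps: float) -> float:
    """Per-host cross-leaf bandwidth if every host transmits at line rate."""
    return link_gbps / leaf_oversubscription(down_ports, up_ports)

# Example: 32 host ports, 8 uplinks, 400 Gb/s links.
print(leaf_oversubscription(32, 8))          # 4.0 -> 4:1 oversubscribed
print(worst_case_per_host_gbps(32, 8, 400))  # 100.0 Gb/s per host
```

A 4:1 oversubscribed fabric caps each host at a quarter of its NIC speed for all-to-all traffic, which is exactly the pattern large collectives create.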

Scaling intuition

As you increase GPUs/nodes, collective time often grows because:

  • More participants → more data movement / coordination
  • More congestion opportunities
  • More sensitivity to a few slow paths (stragglers)
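The first point can be made concrete with the standard ring all-reduce cost model (a textbook model, applied here as an illustration, not NCCL's actual scheduler):

```python
def ring_allreduce_time(n_gpus: int, bytes_per_gpu: int,
                        bw_bytes_per_s: float,
                        latency_s: float = 0.0) -> float:
    """Ring all-reduce cost model: 2*(n-1) steps, each moving
    bytes_per_gpu / n per GPU.  The bandwidth term approaches
    2 * bytes_per_gpu / bw as n grows, while the latency term
    grows linearly with n."""
    steps = 2 * (n_gpus - 1)
    return steps * latency_s + steps * (bytes_per_gpu / n_gpus) / bw_bytes_per_s

# 1 GiB gradient buffer, 100 GB/s effective per-GPU bandwidth.
for n in (2, 8, 64):
    print(n, round(ring_allreduce_time(n, 2**30, 100e9, latency_s=5e-6), 5))
```

Even with perfect links, the modeled time rises with participant count; add nonzero per-step latency and a large n, and the coordination term alone becomes visible, before any congestion or stragglers.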

Measuring correctly

Separate:

  • Compute time (kernels)
  • Communication time (collectives)
  • Input pipeline and CPU overhead
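One way to keep these buckets separate is a small wall-clock accumulator. A sketch (the `timed` helper and the stand-in step functions are hypothetical, not a NCCL or framework API); note the caveat in the docstring about asynchronous GPU work:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

buckets = defaultdict(float)

@contextmanager
def timed(bucket: str):
    """Accumulate wall time into a named bucket.
    NOTE: GPU work is asynchronous; synchronize (e.g.
    torch.cuda.synchronize()) before and after the region,
    or you measure launch overhead, not execution time."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        buckets[bucket] += time.perf_counter() - t0

def train_step():
    # sleep() stands in for the real input pipeline, kernels, collectives.
    with timed("input"):
        time.sleep(0.001)    # load_batch()
    with timed("compute"):
        time.sleep(0.002)    # forward_backward()
    with timed("comm"):
        time.sleep(0.001)    # allreduce_grads()

for _ in range(3):
    train_step()
print(dict(buckets))
```

With per-bucket totals in hand, "slow training" decomposes into a concrete question: which bucket grew when you scaled up?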

Next: Debugging