NCCL Internals (High Level)

NCCL (NVIDIA Collective Communication Library) is the workhorse behind many multi‑GPU training jobs.

This page gives a high‑level mental model, not a line‑by‑line implementation tour.

1. What NCCL does

NCCL implements collectives such as:

  • AllReduce
  • AllGather
  • ReduceScatter
  • Broadcast

using GPU‑aware transport paths:

  • NVLink / NVSwitch
  • PCIe
  • network (InfiniBand, RoCE, TCP sockets)
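To make the most common of these collectives concrete, here is a minimal pure-Python sketch of the two-phase ring allreduce (reduce-scatter, then allgather) that NCCL uses on ring topologies. Plain lists stand in for GPU buffers and assignments stand in for transfers; this is an illustration of the algorithm, not NCCL's implementation.

```python
def ring_allreduce(chunks):
    """Simulate a ring allreduce over p ranks.

    chunks[r][c] is rank r's contribution to chunk c (one scalar per
    chunk for clarity). Returns the per-rank state afterwards: every
    rank ends up holding the elementwise sum across all ranks.
    """
    p = len(chunks)
    data = [row[:] for row in chunks]  # each rank's local buffer

    # Phase 1: reduce-scatter. In each of p-1 steps, rank r sends one
    # chunk to its ring neighbor (r+1) % p, which accumulates it.
    # Afterwards, rank r holds the fully reduced chunk (r+1) % p.
    for s in range(p - 1):
        for r in range(p):
            c = (r - s) % p
            data[(r + 1) % p][c] += data[r][c]

    # Phase 2: allgather. The reduced chunks travel around the ring
    # for another p-1 steps, overwriting stale values.
    for s in range(p - 1):
        for r in range(p):
            c = (r + 1 - s) % p
            data[(r + 1) % p][c] = data[r][c]

    return data
```

Note that each rank only ever talks to its neighbor and only moves one chunk (1/p of the buffer) per step, which is why rings use link bandwidth so well.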

2. Topology awareness

NCCL inspects:

  • GPU connectivity (NVLink, PCIe hierarchy)
  • proximity of GPUs to NICs (NUMA)
  • available network interfaces

It then builds one or more of:

  • rings – good for bandwidth
  • trees – good for latency
  • hybrid structures – for large clusters
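The bandwidth-vs-latency tradeoff between rings and trees can be sketched with a standard alpha-beta cost model (alpha = per-message latency, beta = seconds per byte). The tree formula below assumes a simple binomial-tree reduce-plus-broadcast; NCCL's actual tree algorithm (double binary trees) is more bandwidth-efficient, so treat this as a first-order model only.

```python
from math import log2

def ring_allreduce_time(p, n, alpha, beta):
    # 2(p-1) steps, but each moves only n/p bytes: the latency term
    # grows linearly in p, while the per-byte cost stays near 2*n*beta.
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n * beta

def tree_allreduce_time(p, n, alpha, beta):
    # Binomial-tree reduce then broadcast: only ~2*log2(p) steps, but
    # in this simple model each step moves the full n bytes.
    return 2 * log2(p) * (alpha + n * beta)
```

Plugging in numbers shows the crossover: for small messages at high rank counts the tree wins (fewer latency hops), while for large messages the ring wins (less redundant data movement). This is why NCCL keeps both and picks per message size.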

As an infra engineer, you can read NCCL’s debug output to see:

  • how it mapped your GPUs into rings
  • which NICs it chose
  • whether some GPUs are on “worse” paths
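To get that debug output, NCCL reads the documented environment variables `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` at initialization time. A launcher-side helper might prepare them like this (the `GRAPH` subsystem logs the rings/trees NCCL built, `INIT` logs transport and NIC selection; the helper function itself is hypothetical):

```python
import os

def enable_nccl_debug(env=None):
    """Return a copy of the environment with NCCL debug logging on.

    NCCL_DEBUG and NCCL_DEBUG_SUBSYS are real, documented NCCL env
    vars; they must be set before the training process initializes
    NCCL, so apply this in the launcher, not inside the job.
    """
    env = dict(os.environ if env is None else env)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"
    return env
```

You would pass the returned mapping to whatever spawns the training processes (e.g. `subprocess.Popen(..., env=...)`), then read the ring/tree and NIC lines NCCL prints to stderr on each rank.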

3. Why it matters

When you scale a training job out and it doesn’t speed up proportionally:

  • sometimes the issue is pure algorithmic / ML
  • often, the issue is that NCCL can’t find efficient paths given topology and fabric

Knowing how to inspect NCCL choices lets you:

  • refine node and GPU placement
  • improve how the scheduler packs jobs
  • provide useful feedback to networking and hardware teams