
NCCL Debugging Workflow

When NCCL collectives are slow or fail intermittently, work through these steps in order:

1) Confirm the mapping

Is there exactly one process per GPU?
Is CUDA_VISIBLE_DEVICES set correctly for every process?
Does each rank map to the GPU you expect?
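These checks can be partly automated. The sketch below assumes each process selects its CUDA device by local rank (for example from a LOCAL_RANK variable, as launchers such as torchrun set it) and that CUDA_VISIBLE_DEVICES is a comma-separated list. The function names are hypothetical, and nothing here is part of NCCL.

```python
def physical_gpu_for_rank(local_rank, visible_devices):
    """Return the GPU identifier that CUDA device index `local_rank` resolves to.

    `visible_devices` is the raw CUDA_VISIBLE_DEVICES string, or None if unset.
    Assumption: the process calls the equivalent of cudaSetDevice(local_rank).
    """
    if visible_devices is None:
        # Unset: index i is simply the i-th GPU in CUDA's enumeration order.
        return str(local_rank)
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if local_rank >= len(devices):
        raise ValueError(
            f"local rank {local_rank} has no device in CUDA_VISIBLE_DEVICES={visible_devices!r}"
        )
    return devices[local_rank]


def find_shared_gpus(rank_to_gpu):
    """Given {rank: gpu_id} for one node, return {gpu_id: [ranks]} for GPUs used by more than one rank."""
    by_gpu = {}
    for rank, gpu in sorted(rank_to_gpu.items()):
        by_gpu.setdefault(gpu, []).append(rank)
    return {gpu: ranks for gpu, ranks in by_gpu.items() if len(ranks) > 1}
```

Calling physical_gpu_for_rank in every process and collecting the results per node gives a mapping that find_shared_gpus can check for two ranks landing on the same GPU.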

2) Turn on logs

Start by setting these environment variables:

NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,COLL

Look for:

Which transport NCCL selected (IB/RoCE/TCP)
Ring/tree construction
Topology graph decisions
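A quick way to skim a large log for the transport decision is to count lines that contain well-known markers. The sketch below is only a heuristic: the marker strings are ones that commonly appear in NCCL INFO output, but exact wording varies between NCCL versions, so treat the list as an assumption to adjust against your own logs. RoCE traffic normally shows up under the same verbs marker as InfiniBand.

```python
from collections import Counter

# Assumed marker substrings; verify them against the log of your NCCL version.
MARKERS = (
    "NET/IB",      # InfiniBand / RoCE verbs transport
    "NET/Socket",  # TCP sockets transport
    "via P2P",     # direct GPU-to-GPU path inside a node
    "via SHM",     # host shared-memory path inside a node
)


def summarize_nccl_log(lines):
    """Count how many NCCL log lines mention each marker in MARKERS."""
    counts = Counter()
    for line in lines:
        if "NCCL" not in line:
            continue  # skip application output mixed into the same stream
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return dict(counts)
```

A run that was expected to use InfiniBand but reports only NET/Socket lines is a strong hint that NCCL fell back to TCP.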

3) Reduce the problem

A/B test:

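Single node, multi-GPU: takes the inter-node fabric out of the picture, so a problem that persists here is not a fabric issue
Two nodes only: the smallest setup that exercises the fabric, so a problem that appears only beyond this point is scale-related
Disable IB (NCCL_IB_DISABLE=1) or P2P (NCCL_P2P_DISABLE=1): forces NCCL onto a different path, which helps localize the fault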
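The A/B runs are easier to compare when every variant is launched with an explicitly constructed environment. The sketch below builds one environment per variant using NCCL_IB_DISABLE and NCCL_P2P_DISABLE together with the debug settings from step 2. The variant names and the build_env helper are hypothetical.

```python
import os

# Debug settings from step 2, applied to every variant.
BASE_DEBUG = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET,GRAPH,COLL",
}

# Hypothetical variant names mapped to the extra variables they set.
VARIANTS = {
    "baseline": {},
    "no_ib": {"NCCL_IB_DISABLE": "1"},    # forces the network path off IB/RoCE verbs
    "no_p2p": {"NCCL_P2P_DISABLE": "1"},  # forces intra-node traffic off direct GPU P2P
}


def build_env(variant, base_env=None):
    """Return a copy of the base environment with debug and variant variables applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(BASE_DEBUG)
    env.update(VARIANTS[variant])
    return env
```

Each environment can then be passed to subprocess.run(launch_cmd, env=env), where launch_cmd stands for whatever command starts your job. If the problem disappears in one variant, the path that variant disabled is the first suspect.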

4) Check the fabric

Depending on your environment:

NIC counters (drops, ECN marks, PFC pauses)
Switch counters (congestion, buffer utilization)
Routing symmetry and hotspot links
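For the NIC side, counters are most useful as a difference taken across one run rather than as absolute values. The sketch below assumes a Linux host that exposes RDMA port counters under /sys/class/infiniband/<device>/ports/<port>/counters/. ECN and PFC counters are vendor-specific and are often exposed elsewhere (for example through ethtool -S or a hw_counters directory), so this covers only part of the list above. The function names are hypothetical.

```python
from pathlib import Path


def read_counters(root="/sys/class/infiniband"):
    """Return {"device/port/counter": value} for every readable port counter under root."""
    snapshot = {}
    for path in Path(root).glob("*/ports/*/counters/*"):
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry: skip it
        device, port = path.parts[-5], path.parts[-3]
        snapshot[f"{device}/{port}/{path.name}"] = value
    return snapshot


def diff_counters(before, after):
    """Return only the counters that changed between two snapshots, with their deltas."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }
```

Take one snapshot before the job and one after, then inspect the diff: growing discard or error counters during a slow run point at the fabric rather than at NCCL.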

5) Reproduce with a minimal benchmark

Use a minimal AllReduce example (in the examples repo) to confirm whether the issue lies in NCCL, the topology, or the fabric rather than in the model code.
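To compare benchmark results across different GPU counts, it helps to convert a measured time into bandwidth. The sketch below follows the convention used by the nccl-tests suite, where the AllReduce bus bandwidth is the algorithm bandwidth scaled by 2(n-1)/n. The function name and the choice of GB/s (10^9 bytes per second) are assumptions made for illustration.

```python
def allreduce_bandwidth(size_bytes, seconds, n_ranks):
    """Return (algbw, busbw) in GB/s for one AllReduce of `size_bytes` per rank.

    algbw is message size divided by elapsed time.
    busbw rescales it by 2 * (n - 1) / n so results are comparable across rank counts.
    """
    if n_ranks < 1 or seconds <= 0:
        raise ValueError("n_ranks must be >= 1 and seconds must be positive")
    algbw = size_bytes / seconds / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw
```

If the bus bandwidth from the minimal benchmark is already far below what the links should deliver, the model code is unlikely to be the cause.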

Next: go to the GPU Networking Examples repo for runnable NCCL examples.
