NCCL Guide
NCCL (NVIDIA Collective Communications Library) is the communication backbone behind most multi-GPU training at scale. It provides highly optimized collective communication primitives (AllReduce, AllGather, ReduceScatter, Broadcast) that frameworks use to synchronize gradients and parameters.
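To make the semantics concrete, here is a small Python sketch (plain Python, not the NCCL API) of what a sum-AllReduce guarantees: after the collective, every rank holds the elementwise sum of all ranks' input buffers. The function name and data layout are illustrative only.

```python
# Conceptual sketch of sum-AllReduce semantics (not the NCCL API).
# After the collective, every rank holds the elementwise sum of all inputs.
def all_reduce_sum(rank_buffers):
    """rank_buffers: one equal-length list per rank."""
    summed = [sum(vals) for vals in zip(*rank_buffers)]
    # Every rank receives an identical copy of the reduced result.
    return [list(summed) for _ in rank_buffers]

buffers = [[1, 2], [10, 20], [100, 200]]  # 3 ranks, 2 elements each
print(all_reduce_sum(buffers))  # every rank ends with [111, 222]
```

AllGather, ReduceScatter, and Broadcast can be reasoned about the same way: they differ only in which ranks contribute data and which ranks end up holding it.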
This section is written for ML infra / GPU networking engineers who want a practical understanding of:
- What NCCL is responsible for (and what it is not)
- How NCCL builds communication patterns from your GPU + NIC topology
- How to reason about performance bottlenecks
- How to debug the common “NCCL is slow / flaky” failure modes
When NCCL matters
NCCL becomes critical when:
- You use multiple GPUs per node (PCIe + NVLink / NVSwitch topology matters)
- You use multiple nodes (fabric bandwidth, congestion control, and topology matter)
- You care about scaling efficiency (time spent in collectives vs compute)
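A useful back-of-the-envelope model for the last point is the classic ring-AllReduce cost: each rank moves roughly 2(N-1)/N of the buffer over its slowest link. The sketch below uses that model with illustrative numbers (the 100 GB/s link bandwidth and 100 ms compute step are assumptions, not measurements).

```python
# Back-of-the-envelope ring-AllReduce timing model (numbers are illustrative).
# Each rank sends and receives about 2*(N-1)/N of the buffer over its slowest link.
def ring_allreduce_seconds(num_ranks, bytes_per_rank, link_bw_bytes_per_s):
    return 2 * (num_ranks - 1) / num_ranks * bytes_per_rank / link_bw_bytes_per_s

def scaling_efficiency(compute_s, collective_s):
    # Fraction of wall-clock time spent computing, assuming no compute/comm overlap.
    return compute_s / (compute_s + collective_s)

# Example: 1 GiB of gradients, 8 ranks, 100 GB/s effective link bandwidth.
t_coll = ring_allreduce_seconds(8, 1 << 30, 100e9)
print(f"collective time: {t_coll * 1e3:.2f} ms")
print(f"efficiency at 100 ms compute: {scaling_efficiency(0.100, t_coll):.3f}")
```

Real NCCL performance also depends on protocol and algorithm selection, channel count, and overlap with compute, so treat this as a sanity check on "how bad could the collective be", not a prediction.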
What this guide covers
- Installation & environment expectations (Linux + NVIDIA GPUs)
- Communicators and rank management
- Collectives and how they map to traffic patterns
- Topology & performance mental models
- Debugging workflow and logs
- Practical code examples (separate examples repo)
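As a preview of the debugging workflow, NCCL's behavior is largely controlled through environment variables; the three below are a common starting point for log collection (verify the exact set against the documentation for your NCCL version):

```shell
# Common NCCL debugging knobs (a starting point, not an exhaustive list).
export NCCL_DEBUG=INFO                       # log ring/tree setup and transport selection
export NCCL_DEBUG_SUBSYS=INIT,NET            # restrict logging to init and network paths
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log   # one log file per host (%h) and pid (%p)
```

The resulting per-rank logs are the primary input to the debugging workflow covered later in this guide.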