Distributed Training & Communication
Distributed training spreads work across multiple GPUs and machines so you can train faster, fit bigger models, or process larger datasets.
From an infrastructure point of view, distributed training is mostly about:
- how we split the work across GPUs and nodes
- how we move tensors between those GPUs and nodes
- how we keep everything synchronized without wasting too much time
This page gives a practical mental model for the common strategies and what they mean for networking and GPU communication.
1. Data parallelism (most common)
In pure data parallelism:
- every worker has a full copy of the model
- each worker processes a different slice of the batch
- gradients are averaged across all workers each step
The key operation is usually AllReduce on gradients.
High‑level loop:
- Each worker does forward + backward on its local data.
- Workers participate in an AllReduce to combine gradients.
- Each worker updates its model parameters.
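The loop above can be simulated in a few lines of plain Python. The helper names (`local_backward`, `all_reduce_mean`) are illustrative stand-ins, not a real framework API:

```python
# Minimal simulation of one data-parallel step: each worker computes a
# local "gradient", then an AllReduce-style average combines them.

def local_backward(worker_id, batch_slice):
    # Stand-in for forward + backward: produce a toy gradient per worker.
    return [x * (worker_id + 1) for x in batch_slice]

def all_reduce_mean(grads_per_worker):
    # Element-wise average across workers -- what an AllReduce computes
    # (up to the sum-vs-average convention chosen by the framework).
    n = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]

# One synchronous step with 4 workers.
local_grads = [local_backward(w, [1.0, 2.0]) for w in range(4)]
avg_grad = all_reduce_mean(local_grads)
# Every worker applies the same averaged gradient, so replicas stay in sync.
```

Because each worker applies the identical averaged gradient, the model replicas never drift apart between steps.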
From the network’s perspective:
- you have large, regular, bursty communication at each step
- the collective is logically all‑to‑all, but NCCL typically structures the transfers as rings or trees over the physical topology
- the cost of the collective grows with number of GPUs and model size
2. Model parallelism
In model parallelism:
- the model is too big to fit on a single GPU
- different layers or tensor slices live on different GPUs
This creates:
- more frequent communication of activations and gradients
- tighter dependencies between GPUs (can’t compute without remote data)
Networking impact:
- more fine‑grained, latency‑sensitive communication
- NVLink / NVSwitch quality inside a node becomes crucial
- inter‑node links can easily become the bottleneck
Model parallelism often appears in:
- large language models (LLMs)
- giant vision or multimodal models
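The core idea (shard a layer's weights across devices, then reassemble the output through communication) can be sketched in plain Python. No real GPUs or framework APIs here; the two "devices" are just variables:

```python
# Toy tensor parallelism: a linear layer's weight columns are split across
# two "devices"; each computes a partial output, and an AllGather-style
# concatenation reassembles the full result.

def matvec(weight_cols, x):
    # weight_cols: list of columns; output has one entry per column.
    return [sum(w_i * x_i for w_i, x_i in zip(col, x)) for col in weight_cols]

# Full weight: 4 output columns, input dim 2 (stored column-wise).
full_weight = [[1, 0], [0, 1], [1, 1], [2, 0]]
x = [3.0, 4.0]

# Shard the columns across two devices.
shard0, shard1 = full_weight[:2], full_weight[2:]
partial0 = matvec(shard0, x)   # computed on "GPU 0"
partial1 = matvec(shard1, x)   # computed on "GPU 1"

# The communication step: gather partial outputs from both devices.
output = partial0 + partial1
assert output == matvec(full_weight, x)
```

Note that the gather has to happen before the next layer can run, which is exactly the tight dependency described above.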
3. Pipeline parallelism
In pipeline parallelism:
- the model is split into stages
- micro‑batches flow through the pipeline
Think of it like an assembly line:
- Stage 0 runs on some GPUs
- Stage 1 runs on others
- Activations stream between stages
Networking impact:
- steady flow of activations between specific GPU groups
- both bandwidth and latency matter
- imbalanced stages or slow links create bubbles in the pipeline
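The fill-and-drain bubble can be quantified: for a simple GPipe-style schedule with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). A pure-Python check (the schedule model is deliberately simplified; real schedules such as 1F1B change the constants):

```python
# Toy GPipe-style schedule: with p stages and m micro-batches, each stage
# is idle while the pipeline fills and drains. Counting idle slots
# directly reproduces the classic formula (p - 1) / (m + p - 1).

def bubble_fraction(p, m):
    total_slots = (m + p - 1) * p   # time slots across all stages
    busy_slots = m * p              # each stage processes every micro-batch
    return (total_slots - busy_slots) / total_slots

# More micro-batches amortize the fill/drain bubble.
for m in (4, 16, 64):
    print(f"p=4, m={m}: bubble = {bubble_fraction(4, m):.2%}")
```

This is why pipeline schedules push for many small micro-batches, and why a slow link between two stages hurts every micro-batch that crosses it.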
In practice, many large‑scale systems combine:
- data parallelism across nodes
- model / pipeline parallelism within and across nodes
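One way to picture the combination is a rank grid. The mapping below is a simplified sketch; the axis order and group sizes are illustrative assumptions, not any specific framework's layout:

```python
# Sketch of how a combined scheme maps global ranks onto a
# (data, pipeline, tensor) coordinate grid.

def rank_to_coords(rank, dp, pp, tp):
    # Tensor-parallel ranks are innermost (fastest-varying), so they land
    # on the same node and can use NVLink; data-parallel is outermost and
    # can tolerate the slower inter-node links.
    t = rank % tp
    p = (rank // tp) % pp
    d = rank // (tp * pp)
    return d, p, t

# 16 GPUs as dp=2 x pp=2 x tp=4: ranks 0-3 form one tensor-parallel group.
for rank in range(8):
    print(rank, rank_to_coords(rank, dp=2, pp=2, tp=4))
```

The placement choice matters: the innermost axis does the most latency-sensitive communication, so it should sit on the fastest interconnect.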
4. Collectives: the core communication patterns
Most distributed training communication can be expressed in a handful of collectives:
- AllReduce – everyone contributes, everyone gets the sum/avg
- AllGather – everyone shares their tensor, everyone gets the concatenation
- ReduceScatter – like AllReduce, but each worker keeps only one shard of the reduced result
- Broadcast – one sender, many receivers
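Their semantics are easy to pin down in plain Python, with one list element per rank. This is a reference model only; real collectives operate on device buffers in parallel:

```python
# Reference semantics of the four collectives. Each function takes the
# per-rank inputs and returns the per-rank outputs.

def all_reduce(per_rank):
    total = sum(per_rank)
    return [total] * len(per_rank)           # every rank gets the sum

def all_gather(per_rank):
    return [list(per_rank)] * len(per_rank)  # every rank gets all values

def reduce_scatter(per_rank_vectors):
    # Rank i ends up with the sum of everyone's i-th shard
    # (each input vector has one shard per rank).
    n = len(per_rank_vectors)
    return [sum(v[i] for v in per_rank_vectors) for i in range(n)]

def broadcast(per_rank, root=0):
    return [per_rank[root]] * len(per_rank)  # everyone gets root's value

assert all_reduce([1, 2, 3]) == [6, 6, 6]
assert reduce_scatter([[1, 2], [3, 4]]) == [4, 6]
assert broadcast([7, 0, 0]) == [7, 7, 7]
```

A useful identity falls out of these definitions: AllReduce is exactly a ReduceScatter followed by an AllGather, which is how the large-message algorithms below are built.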
Libraries like NCCL implement these efficiently using:
- rings
- trees
- hybrids optimized for topology and message sizes
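For large messages the standard trick is a ring: a ReduceScatter phase followed by an AllGather phase, each taking N - 1 steps and moving only one chunk per rank per step. A minimal single-process simulation (real NCCL pipelines and overlaps these transfers):

```python
# Minimal ring AllReduce: a ReduceScatter phase then an AllGather phase.
# N ranks each hold a vector of N chunks; the 2*(N-1) steps move one
# chunk per rank each, which is why rings are bandwidth-efficient.

def ring_all_reduce(vectors):
    n = len(vectors)                 # ranks == chunks, for simplicity
    bufs = [list(v) for v in vectors]

    # ReduceScatter: in step s, rank r sends chunk (r - s) mod n to rank
    # r + 1, which accumulates it. After n - 1 steps, rank r holds the
    # fully reduced chunk (r + 1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            bufs[(r + 1) % n][c] += bufs[r][c]

    # AllGather: the reduced chunks circulate once more around the ring,
    # overwriting instead of accumulating.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            bufs[(r + 1) % n][c] = bufs[r][c]
    return bufs

# Three ranks, three chunks each: every rank ends with the element-wise sum.
result = ring_all_reduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
assert all(buf == [6, 6, 6] for buf in result)
```

Each step only uses the link to the next neighbor, so total bytes sent per rank approach 2x the tensor size regardless of N; the price is a latency term that grows linearly with N, which is why tree algorithms win for small messages.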
For infra / networking engineers, these operations show up as:
- bursts of large flows at synchronization points
- traffic that wants high bisection bandwidth and low tail latency
5. Scaling and where it breaks
Ideal scaling: doubling the number of GPUs roughly halves the training time.
In reality, as you add more GPUs:
- more time is spent waiting at collective operations
- fabric congestion and oversubscription show up
- step time stops shrinking and can even increase
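A toy cost model makes this concrete: compute shrinks as 1/N, but the communication term does not, so step time eventually flattens and then grows. All the constants below are made up purely for illustration:

```python
# Toy step-time model: compute divides perfectly across N GPUs, while the
# AllReduce cost follows the ring model 2*(N-1)/N * (bytes/bandwidth)
# plus a latency term that grows with ring length.

def step_time(n_gpus, compute=1.0, grad_bytes=1.0, bandwidth=8.0,
              latency_per_hop=0.01):
    comm = 2 * (n_gpus - 1) / n_gpus * (grad_bytes / bandwidth)
    comm += latency_per_hop * (n_gpus - 1)   # ring fill / latency term
    return compute / n_gpus + comm

# Step time shrinks at first, then the communication term dominates.
for n in (1, 2, 8, 64, 512):
    print(f"{n:4d} GPUs: step time {step_time(n):.3f}")
```

The exact crossover point depends on model size, fabric bandwidth, and latency, which is why measuring step time versus node count on your own cluster beats any formula.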
Common failure points:
- not enough network bandwidth between racks or nodes
- poor topology awareness in the job scheduler or NCCL
- noisy neighbors on shared clusters
- misconfigured RDMA / flow control / ECN
6. What to watch as an ML infra engineer
When you’re responsible for the platform, key questions include:
- How does step time change as we add nodes?
- What fraction of time is spent inside collectives (NCCL) vs compute?
- Are we using RDMA / InfiniBand efficiently, or falling back to TCP?
- Are GPU‑to‑NIC and GPU‑to‑GPU topologies well‑balanced?
The next docs dive into:
- GPU Networking 101
- RDMA for ML infrastructure
- NCCL internals and debugging