
Why Networking Matters for ML

When people try to speed up ML, they usually think about:

  • bigger or newer GPUs
  • smarter optimizers
  • mixed precision or quantization

But in distributed systems, networking is often the invisible bottleneck.

If GPUs are the engines, the network is the gearbox and the road.
You can have the best engine in the world and still go nowhere if the road is narrow, congested, or full of potholes.

This page explains why networking is central to:

  • distributed training
  • high‑QPS, low‑latency inference
  • scalable ML platforms

1. How ML changes the networking game

On a single‑GPU workstation, the network barely matters.
In real systems, you often have:

  • multiple GPUs per node (4, 8, 16…)
  • many nodes in a cluster (tens, hundreds, thousands)
  • huge models and datasets
  • long‑running jobs and strict SLOs

Now:

  • GPUs must exchange gradients and parameters every step
  • nodes must stream data efficiently
  • model servers must respond within tight latency budgets

All of this is mediated by:

  • PCIe / NVLink / NVSwitch inside a node
  • NICs + switches + fabric between nodes

If those paths are slow or unstable, GPU utilization drops, training time increases, and inference latency spikes.


2. Networking in distributed training

In typical data‑parallel training, each worker:

  1. Computes gradients on a batch
  2. Exchanges gradients/parameters with other workers (collectives)
  3. Updates its local copy of the model
  4. Repeats

Step 2 is mostly network traffic:

  • AllReduce, AllGather, ReduceScatter, Broadcast
  • Implemented by libraries like NCCL, Gloo, MPI
  • Running over TCP, RDMA/RoCE, or InfiniBand
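The collective in step 2 can be sketched as a toy, pure-Python AllReduce (illustrative only; NCCL implements this with ring or tree algorithms over the actual links):

```python
def all_reduce_sum(worker_grads):
    """Toy AllReduce: after the call, every worker holds the
    elementwise sum of all workers' gradients."""
    summed = [sum(vals) for vals in zip(*worker_grads)]
    return [summed[:] for _ in worker_grads]

# 3 workers, each with a 2-parameter gradient:
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_sum(grads)
# every worker now holds [9.0, 12.0] -- the traffic needed to reach
# this state is what AllReduce/ReduceScatter/AllGather move over the fabric
```

In a real framework this is a single call (e.g. `torch.distributed.all_reduce(tensor)`), but whatever the API, the bytes still have to cross NVLink, PCIe, or the inter-node fabric every step.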

Problems show up as:

  • GPUs sitting idle at synchronization barriers
  • step time increasing as you add more GPUs or more nodes
  • mysterious “doesn’t scale” behavior

Very often, the root cause is one of:

  • topology issues
  • oversubscribed links
  • misconfigured NICs or switches
  • RDMA not being used properly

3. Networking in inference

For online inference, the focus shifts to latency and jitter:

  • Request path: client → edge/gateway → model server → dependencies (feature store, cache, DB)
  • Fan‑out: one user request might hit multiple services or models
  • SLOs: p95/p99 latency budgets

Even small networking overheads add up:

  • extra hops
  • unnecessary TLS handshakes
  • poor connection reuse
  • bad load‑balancer configuration

On the network side this is all about:

  • minimizing tail latency
  • controlling queueing and retries
  • keeping critical paths short and predictable
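The fan-out point deserves a number. If each backend call independently stays under its latency budget 99% of the time, a request that waits on N such calls in parallel misses the budget whenever any one of them does (a back-of-the-envelope sketch; the 99% figure and independence assumption are illustrative):

```python
def prob_request_slow(fanout, per_call_fast=0.99):
    """Probability that at least one of `fanout` parallel calls is slow,
    assuming each call is independently fast with probability 0.99."""
    return 1 - per_call_fast ** fanout

for n in (1, 10, 50):
    print(n, prob_request_slow(n))
# a fan-out of 10 already makes roughly 1 in 10 requests slow,
# even though each individual call looks fine at p99
```

This is why p99 per dependency is the number that matters once requests fan out: the tails compound.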

4. Why networking engineers have an advantage

If you already understand:

  • TCP handshakes, retransmissions, congestion control
  • latency, bandwidth, and bandwidth‑delay product (BDP)
  • queues, buffers, and tail latency
  • routing, ECMP, VLANs, overlays
  • counters, packet captures, and flow logs

…you’re already equipped to reason about many ML infra problems:

  • TCP vs RDMA/RoCE trade‑offs
  • QoS and traffic isolation for training vs inference vs “noise”
  • how topologies and oversubscription affect AllReduce
  • how to debug “training is slow” when the model code looks fine
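One of those concepts is worth a worked example: the bandwidth-delay product is how much data must be in flight to keep a link busy, and it directly sizes windows and buffers (a quick sketch; the link speed and RTT below are illustrative):

```python
def bdp_bytes(bandwidth_gbit_s, rtt_us):
    """Bandwidth-delay product: link rate (bits/s) times RTT (s), in bytes."""
    return bandwidth_gbit_s * 1e9 * rtt_us * 1e-6 / 8

# A 100 Gbit/s cluster link with a 10 microsecond RTT:
print(bdp_bytes(100, 10))  # about 125 KB must be in flight to saturate it
```

If windows, buffers, or outstanding RDMA operations can't cover the BDP, the link idles no matter how fast it is rated.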

ML infra is not a separate universe; it is distributed systems + networking + GPU hardware.


5. Typical networking failure modes in ML

Some patterns you’ll see repeatedly:

5.1 AllReduce is slow

  • Training step time is dominated by gradient exchange
  • scaling from 2 → 8 → 64 GPUs gives poor speedup

Likely causes:

  • weak topology (too many slow links)
  • oversubscribed fabric
  • misconfigured MTU, PFC, ECN, or queues
  • RDMA misconfigurations or fallback to TCP
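A simple cost model shows why the fabric dominates here: a bandwidth-optimal ring AllReduce moves roughly 2*(N-1)/N of the gradient buffer through each GPU's slowest link, so communication time scales with buffer size over link bandwidth (an illustrative sketch; real timings depend on topology, compute/comm overlap, and NCCL's algorithm choice):

```python
def ring_allreduce_seconds(grad_bytes, n_gpus, link_gbyte_s):
    """Bandwidth term of a ring AllReduce: each GPU sends (and receives)
    2 * (N - 1) / N of the buffer over its link."""
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic_bytes / (link_gbyte_s * 1e9)

# 1 GB of gradients across 8 GPUs: one slow link in the ring
# multiplies the communication time of every step.
print(ring_allreduce_seconds(1e9, 8, 25))  # ~0.07 s on a 25 GB/s link
print(ring_allreduce_seconds(1e9, 8, 5))   # ~0.35 s on a 5 GB/s link
```

Note that the per-GPU traffic barely grows with N, which is why a healthy fabric scales well and an oversubscribed one does not.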

5.2 Performance is inconsistent

  • some runs are fast, others slow
  • only certain node allocations show good performance

Likely causes:

  • noisy neighbors
  • poor placement across racks or AZs
  • uneven hashing or flows on the fabric
  • transient network issues

5.3 GPUs are underutilized

  • GPU utilization < 50–60%
  • CPUs or NICs are hot instead

Likely causes:

  • data pipeline bottlenecks
  • too many small RPCs
  • serialization/compression overhead
  • lack of kernel‑bypass / RDMA where it would help
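The "too many small RPCs" failure mode also has a simple model: every message pays a fixed overhead (syscalls, headers, handshakes) on top of its bytes, so the same payload split into many messages can be overhead-bound (illustrative numbers; the 50 us per-message cost is an assumption, not a measured value):

```python
def transfer_seconds(total_bytes, n_messages, per_msg_overhead_s, gbyte_s):
    """Fixed per-message overhead paid n times, plus the payload itself."""
    return n_messages * per_msg_overhead_s + total_bytes / (gbyte_s * 1e9)

# Moving 100 MB over a 10 GB/s link, 50 us overhead per message:
print(transfer_seconds(100e6, 10_000, 50e-6, 10))  # ~0.51 s: overhead-bound
print(transfer_seconds(100e6, 10, 50e-6, 10))      # ~0.011 s: bandwidth-bound
```

Batching requests, or using kernel-bypass transports that shrink the per-message overhead, attacks the first term directly.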

6. What this track will give you

By the end of the “Networking for ML” section of this guide, you should be able to:

  • explain how distributed training depends on the network
  • distinguish model issues from infra/network issues
  • talk confidently about topology, RDMA vs TCP, and QoS in ML systems
  • reason about where to invest: more GPUs vs better fabric vs better configuration

Next up: GPU Networking 101.