RDMA for ML Infrastructure Engineers
RDMA (Remote Direct Memory Access) lets one machine read or write memory on another machine without involving the remote CPU and with minimal kernel involvement.
For ML infrastructure this means:
- lower latency
- lower CPU overhead
- higher throughput for distributed training and inference
This doc gives a practical, infra‑oriented view of RDMA, RoCE, and how they relate to ML workloads.
1. Why RDMA instead of TCP?
TCP is:
- general‑purpose
- ubiquitous
- relatively heavy on CPU and kernel involvement
For massive tensor transfers in training:
- copying data into kernel buffers
- handling interrupts
- managing TCP congestion control
all add overhead.
RDMA offers:
- kernel‑bypass for data path
- zero‑copy transfers between user‑space buffers
- offloaded congestion control (native to the fabric on InfiniBand, DCQCN over RoCE)
- lower per‑message latency
Result: at scale, using RDMA can be the difference between:
- “training finishes in days” vs “takes weeks”
- GPUs at 80–90% utilization vs 40–50%
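To make the "days vs weeks" intuition concrete, here is a back-of-envelope sketch of per-step gradient exchange time for a ring all-reduce. All numbers (model size, GPU count, link speed, and especially the efficiency fractions) are illustrative assumptions, not measurements:

```python
# Back-of-envelope: time for one ring all-reduce of gradients.
# All concrete numbers below are illustrative assumptions.

def ring_allreduce_seconds(grad_bytes: float, n_gpus: int,
                           link_gbps: float, efficiency: float) -> float:
    """A ring all-reduce moves 2*(N-1)/N * grad_bytes per GPU over the wire."""
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    effective_bytes_per_s = link_gbps * 1e9 / 8 * efficiency
    return bytes_on_wire / effective_bytes_per_s

grads = 10e9   # assume ~10 GB of fp16 gradients (roughly a 5B-param model)
gpus = 64
# Assumed efficiencies: RDMA paths often sustain a large fraction of line
# rate; kernel TCP typically sustains much less at 100 Gb/s speeds.
t_rdma = ring_allreduce_seconds(grads, gpus, link_gbps=100, efficiency=0.9)
t_tcp = ring_allreduce_seconds(grads, gpus, link_gbps=100, efficiency=0.4)
print(f"RDMA-ish: {t_rdma:.2f}s per step, TCP-ish: {t_tcp:.2f}s per step")
```

Multiply the per-step difference by a few hundred thousand steps and the gap compounds into the days-vs-weeks picture above.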
2. Transport flavors
At a high level, you’ll see:
- InfiniBand – a dedicated, lossless-ish fabric with its own link and transport layers
- RoCEv2 – RDMA over Converged Ethernet, relying on an Ethernet fabric with features like PFC and ECN
- iWARP – RDMA layered over TCP; less common in ML these days
From an ML perspective:
- InfiniBand often delivers the highest performance but requires its own fabric
- RoCEv2 runs on Ethernet gear, but demands careful configuration
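A quick way to tell which flavor a Linux node is running: the kernel exposes a `link_layer` string per RDMA port under `/sys/class/infiniband/<device>/ports/<port>/link_layer`. A minimal sketch of interpreting it (the sysfs path and the two string values are standard; telling RoCEv1 from v2 additionally requires inspecting GID types, which is not shown):

```python
def classify_rdma_transport(link_layer: str) -> str:
    """Map the sysfs link_layer string to a transport family.
    On Linux this string comes from
    /sys/class/infiniband/<device>/ports/<port>/link_layer and is
    "InfiniBand" or "Ethernet". RDMA over an Ethernet link layer is
    RoCE; v1 vs v2 is determined by the port's GID types."""
    ll = link_layer.strip().lower()
    if ll == "infiniband":
        return "InfiniBand"
    if ll == "ethernet":
        return "RoCE (check GID types for v1 vs v2)"
    return f"unknown link layer: {link_layer}"

print(classify_rdma_transport("Ethernet"))
```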
3. Where RDMA shows up in ML stacks
- NCCL on top of InfiniBand/RoCE for gradient exchange
- parameter servers or sharded parameter stores
- high‑throughput data loaders or feature services
- some inference serving paths where latency is extremely tight
As an infra engineer, you don’t necessarily write RDMA verbs yourself, but you:
- enable and configure RDMA in the cluster
- ensure routing, QoS, and congestion control are sane
- debug issues where performance doesn’t match expectations
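Much of that "enable and configure" work for NCCL happens through environment variables. A sanity-check sketch using variables from NCCL's documented set (`NCCL_IB_DISABLE`, `NCCL_IB_HCA`, `NCCL_DEBUG`); exact behavior varies by NCCL version, so treat this as a first-pass check, not a guarantee:

```python
import os

def check_nccl_rdma_env(env=None):
    """Flag NCCL settings that commonly cause silent TCP fallback.
    Variable names are from NCCL's documented environment variables;
    semantics can shift between NCCL versions."""
    env = os.environ if env is None else env
    warnings = []
    if env.get("NCCL_IB_DISABLE") == "1":
        warnings.append("NCCL_IB_DISABLE=1: IB/RoCE transport is off; "
                        "NCCL will fall back to sockets")
    if not env.get("NCCL_IB_HCA"):
        warnings.append("NCCL_IB_HCA unset: NCCL auto-selects HCAs, "
                        "which may pick the wrong ports on multi-NIC hosts")
    if not env.get("NCCL_DEBUG"):
        warnings.append("set NCCL_DEBUG=INFO once to confirm the chosen "
                        "transport in the logs")
    return warnings

for w in check_nccl_rdma_env({"NCCL_IB_DISABLE": "1"}):
    print("WARN:", w)
```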
4. Failure and misconfiguration patterns
Common problems include:
- RDMA silently falling back to TCP
- link‑level flow control (PFC) causing head‑of‑line blocking
- lack of ECN/RED leading to bufferbloat and large queues
- routing asymmetries causing uneven load
Symptoms:
- NCCL collectives much slower than expected
- GPU utilization low despite enough nominal bandwidth
- strong dependence on which nodes / racks are chosen for a job
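PFC trouble in particular shows up as pause counters climbing rapidly between snapshots. A sketch of a delta check over two counter snapshots (e.g. scraped from `ethtool -S` or switch telemetry); the counter name `pfc_pause_rx` and the threshold are placeholders, since real names and sane thresholds vary by NIC vendor and environment:

```python
def pfc_pause_anomalies(before: dict, after: dict, threshold: int = 1000):
    """Compare two per-port counter snapshots and flag ports whose PFC
    pause counters grew by more than `threshold` between samples.
    Counter names here are placeholders; real names vary by vendor."""
    flagged = []
    for port, counters in after.items():
        prev = before.get(port, {}).get("pfc_pause_rx", 0)
        delta = counters.get("pfc_pause_rx", 0) - prev
        if delta > threshold:
            flagged.append((port, delta))
    return flagged

before = {"eth0": {"pfc_pause_rx": 100}, "eth1": {"pfc_pause_rx": 50}}
after = {"eth0": {"pfc_pause_rx": 250_100}, "eth1": {"pfc_pause_rx": 60}}
print(pfc_pause_anomalies(before, after))  # eth0 saw a pause storm
```

A port that is continuously pausing its upstream neighbor is a strong hint of the head-of-line blocking pattern described above.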
5. What you should be comfortable with
As an ML infra / GPU networking engineer, aim to:
- explain at a high level how RDMA differs from TCP
- know whether your environment uses InfiniBand or RoCE
- understand the roles of PFC, ECN, and DCQCN in RoCE fabrics
- be able to read basic NIC and switch counters related to RDMA traffic
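One counter-reading gotcha worth internalizing: the `port_xmit_data` / `port_rcv_data` counters under `/sys/class/infiniband/<device>/ports/<port>/counters` count 4-byte words, not bytes, so throughput math needs a factor of four. A sketch (the sysfs path is standard rdma-core; the sample values are made up):

```python
def ib_port_throughput_gbps(data_before: int, data_after: int,
                            seconds: float) -> float:
    """Throughput from two samples of port_xmit_data (or port_rcv_data).
    These counters tick in 4-byte words, so multiply by 4 for bytes."""
    bytes_moved = (data_after - data_before) * 4
    return bytes_moved * 8 / seconds / 1e9

# Two hypothetical samples taken one second apart:
print(f"{ib_port_throughput_gbps(0, 3_000_000_000, 1.0):.1f} Gb/s")
```

Forgetting the factor of four makes a healthy link look 4x slower than it is, which is a classic way to misdiagnose "RDMA is underperforming".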
You don’t need to be a firmware engineer, but you do need to be able to say:
“We’re not hitting our scaling targets because collectives are saturating links here, here, and here — and RDMA is not configured optimally.”