RDMA for ML Infrastructure Engineers
RDMA (Remote Direct Memory Access) lets one machine read or write memory on another machine without involving the remote CPU and with minimal kernel involvement.
For ML infrastructure this means:
- lower latency
- lower CPU overhead
- higher throughput for distributed training and inference
This doc gives a practical, infra‑oriented view of RDMA, RoCE, and how they relate to ML workloads.
1. Why RDMA instead of TCP?
TCP is:
- general‑purpose
- ubiquitous
- relatively heavy on CPU and kernel involvement
For massive tensor transfers in training:
- copying data into kernel buffers
- handling interrupts
- managing TCP congestion control
all add overhead.
RDMA offers:
- kernel‑bypass for data path
- zero‑copy transfers between user‑space buffers
- offloaded congestion control (native to the fabric on InfiniBand, DCQCN over RoCE)
- lower per‑message latency
Result: at scale, using RDMA can be the difference between:
- “training finishes in days” vs “takes weeks”
- GPUs at 80–90% utilization vs 40–50%
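To make the "days vs weeks" intuition concrete, here is a back-of-envelope sketch of per-step gradient exchange time for a ring all-reduce. All numbers (model size, GPU count, link speed, and especially the efficiency fractions) are illustrative assumptions, not measurements:

```python
# Back-of-envelope: time for one ring all-reduce of gradients.
# All concrete numbers below are illustrative assumptions.

def ring_allreduce_seconds(grad_bytes: float, n_gpus: int,
                           link_gbps: float, efficiency: float) -> float:
    """A ring all-reduce moves 2*(N-1)/N * grad_bytes per GPU over the wire."""
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    effective_bytes_per_s = link_gbps * 1e9 / 8 * efficiency
    return bytes_on_wire / effective_bytes_per_s

grads = 10e9   # assume ~10 GB of fp16 gradients (roughly a 5B-param model)
gpus = 64
# Assumed efficiencies: RDMA paths often sustain a large fraction of line
# rate; kernel TCP typically sustains much less at 100 Gb/s speeds.
t_rdma = ring_allreduce_seconds(grads, gpus, link_gbps=100, efficiency=0.9)
t_tcp = ring_allreduce_seconds(grads, gpus, link_gbps=100, efficiency=0.4)
print(f"RDMA-ish: {t_rdma:.2f}s per step, TCP-ish: {t_tcp:.2f}s per step")
```

Multiply the per-step difference by a few hundred thousand steps and the gap compounds into the days-vs-weeks picture above.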
2. Transport flavors
At a high level, you’ll see:
- InfiniBand – a dedicated, lossless-ish fabric with its own link and transport layers
- RoCEv2 – RDMA over Converged Ethernet, relying on an Ethernet fabric with features like PFC and ECN
- iWARP – RDMA layered over TCP; less common in ML these days
From an ML perspective:
- InfiniBand often delivers the highest performance but requires its own fabric
- RoCEv2 runs on Ethernet gear, but demands careful configuration
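A quick way to tell which flavor a Linux node is running: the kernel exposes a `link_layer` string per RDMA port under `/sys/class/infiniband/<device>/ports/<port>/link_layer`. A minimal sketch of interpreting it (the sysfs path and the two string values are standard; telling RoCEv1 from v2 additionally requires inspecting GID types, which is not shown):

```python
def classify_rdma_transport(link_layer: str) -> str:
    """Map the sysfs link_layer string to a transport family.
    On Linux this string comes from
    /sys/class/infiniband/<device>/ports/<port>/link_layer and is
    "InfiniBand" or "Ethernet". RDMA over an Ethernet link layer is
    RoCE; v1 vs v2 is determined by the port's GID types."""
    ll = link_layer.strip().lower()
    if ll == "infiniband":
        return "InfiniBand"
    if ll == "ethernet":
        return "RoCE (check GID types for v1 vs v2)"
    return f"unknown link layer: {link_layer}"

print(classify_rdma_transport("Ethernet"))
```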
3. Where RDMA shows up in ML stacks
- NCCL on top of InfiniBand/RoCE for gradient exchange
- parameter servers or sharded parameter stores
- high‑throughput data loaders or feature services
- some inference serving paths where latency is extremely tight
As an infra engineer, you don’t necessarily write RDMA verbs yourself, but you:
- enable and configure RDMA in the cluster
- ensure routing, QoS, and congestion control are sane
- debug issues where performance doesn’t match expectations
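Much of that "enable and configure" work for NCCL happens through environment variables. A sanity-check sketch using variables from NCCL's documented set (`NCCL_IB_DISABLE`, `NCCL_IB_HCA`, `NCCL_DEBUG`); exact behavior varies by NCCL version, so treat this as a first-pass check, not a guarantee:

```python
import os

def check_nccl_rdma_env(env=None):
    """Flag NCCL settings that commonly cause silent TCP fallback.
    Variable names are from NCCL's documented environment variables;
    semantics can shift between NCCL versions."""
    env = os.environ if env is None else env
    warnings = []
    if env.get("NCCL_IB_DISABLE") == "1":
        warnings.append("NCCL_IB_DISABLE=1: IB/RoCE transport is off; "
                        "NCCL will fall back to sockets")
    if not env.get("NCCL_IB_HCA"):
        warnings.append("NCCL_IB_HCA unset: NCCL auto-selects HCAs, "
                        "which may pick the wrong ports on multi-NIC hosts")
    if not env.get("NCCL_DEBUG"):
        warnings.append("set NCCL_DEBUG=INFO once to confirm the chosen "
                        "transport in the logs")
    return warnings

for w in check_nccl_rdma_env({"NCCL_IB_DISABLE": "1"}):
    print("WARN:", w)
```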
4. Failure and misconfiguration patterns
Common problems include:
- RDMA silently falling back to TCP
- link‑level flow control (PFC) causing head‑of‑line blocking
- lack of ECN/RED leading to bufferbloat and large queues
- routing asymmetries causing uneven load
Symptoms:
- NCCL collectives much slower than expected
- GPU utilization low despite enough nominal bandwidth
- strong dependence on which nodes / racks are chosen for a job
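PFC trouble in particular shows up as pause counters climbing rapidly between snapshots. A sketch of a delta check over two counter snapshots (e.g. scraped from `ethtool -S` or switch telemetry); the counter name `pfc_pause_rx` and the threshold are placeholders, since real names and sane thresholds vary by NIC vendor and environment:

```python
def pfc_pause_anomalies(before: dict, after: dict, threshold: int = 1000):
    """Compare two per-port counter snapshots and flag ports whose PFC
    pause counters grew by more than `threshold` between samples.
    Counter names here are placeholders; real names vary by vendor."""
    flagged = []
    for port, counters in after.items():
        prev = before.get(port, {}).get("pfc_pause_rx", 0)
        delta = counters.get("pfc_pause_rx", 0) - prev
        if delta > threshold:
            flagged.append((port, delta))
    return flagged

before = {"eth0": {"pfc_pause_rx": 100}, "eth1": {"pfc_pause_rx": 50}}
after = {"eth0": {"pfc_pause_rx": 250_100}, "eth1": {"pfc_pause_rx": 60}}
print(pfc_pause_anomalies(before, after))  # eth0 saw a pause storm
```

A port that is continuously pausing its upstream neighbor is a strong hint of the head-of-line blocking pattern described above.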
5. What you should be comfortable with
As an ML infra / GPU networking engineer, aim to:
- explain at a high level how RDMA differs from TCP
- know whether your environment uses InfiniBand or RoCE
- understand the roles of PFC, ECN, and DCQCN in RoCE fabrics
- be able to read basic NIC and switch counters related to RDMA traffic
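One counter-reading gotcha worth internalizing: the `port_xmit_data` / `port_rcv_data` counters under `/sys/class/infiniband/<device>/ports/<port>/counters` count 4-byte words, not bytes, so throughput math needs a factor of four. A sketch (the sysfs path is standard rdma-core; the sample values are made up):

```python
def ib_port_throughput_gbps(data_before: int, data_after: int,
                            seconds: float) -> float:
    """Throughput from two samples of port_xmit_data (or port_rcv_data).
    These counters tick in 4-byte words, so multiply by 4 for bytes."""
    bytes_moved = (data_after - data_before) * 4
    return bytes_moved * 8 / seconds / 1e9

# Two hypothetical samples taken one second apart:
print(f"{ib_port_throughput_gbps(0, 3_000_000_000, 1.0):.1f} Gb/s")
```

Forgetting the factor of four makes a healthy link look 4x slower than it is, which is a classic way to misdiagnose "RDMA is underperforming".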
You don’t need to be a firmware engineer, but you do need to be able to say:
“We’re not hitting our scaling targets because collectives are saturating links here, here, and here — and RDMA is not configured optimally.”