GPU Networking 101
Modern machine learning workloads depend heavily on how GPUs talk to each other.
You can have the fastest GPU in the world — but if data can't move quickly between GPUs, your training job will stall.
This page explains the fundamentals of GPU connectivity, both inside a server and across multiple servers, and how these choices affect performance.
🚀 1. Three Ways GPUs Communicate
GPUs exchange tensors, gradients, activations, and parameters over three primary channels:
1.1 PCIe (Peripheral Component Interconnect Express)
- Standard CPU ↔ GPU interface
- Used in nearly all servers
- Shared with NICs, NVMe drives, and other peripherals
- High bandwidth, but not the fastest for GPU-to-GPU traffic
- Topology matters (which GPU is plugged into which PCIe root complex)
Typical bandwidth:
- PCIe Gen4 x16 → ~32 GB/s per direction (~64 GB/s bidirectional)
- PCIe Gen5 x16 → ~64 GB/s per direction (~128 GB/s bidirectional)
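These figures follow directly from the link rate, lane count, and line encoding. A quick back-of-the-envelope sketch (the helper name is ours, not from any library):

```python
# Theoretical PCIe bandwidth: Gen4 signals at 16 GT/s per lane, Gen5 at
# 32 GT/s, both using 128b/130b line encoding. Ballpark math only.

def pcie_bandwidth_gbps(transfer_rate_gt: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    encoding_efficiency = 128 / 130  # 128b/130b encoding overhead
    bits_per_second = transfer_rate_gt * 1e9 * lanes * encoding_efficiency
    return bits_per_second / 8 / 1e9  # bits -> bytes -> GB/s

print(f"Gen4 x16: ~{pcie_bandwidth_gbps(16):.1f} GB/s per direction")
print(f"Gen5 x16: ~{pcie_bandwidth_gbps(32):.1f} GB/s per direction")
```

Real-world throughput lands a little below this because of packet headers and flow-control overhead.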
1.2 NVLink (NVIDIA-only interconnect)
- High-speed GPU ↔ GPU path
- Much lower latency than PCIe
- 3–10× more bandwidth depending on generation
- Used heavily in multi-GPU training (ResNet, LLMs, diffusion models)
Typical bandwidth:
- NVLink 3.0 (A100) → ~600 GB/s aggregate per GPU (12 links × 50 GB/s)
- NVLink 4.0 (H100) → ~900 GB/s aggregate per GPU (18 links × 50 GB/s)
- NVSwitch → full NVLink bandwidth between any GPU pair in the system
NVLink is a mesh, ring, or full crossbar (NVSwitch) depending on the system.
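Aggregate NVLink bandwidth is simply links-per-GPU times per-link speed. A sketch using the commonly cited per-link figure (treat the numbers as illustrative, not a spec sheet):

```python
# Aggregate NVLink bandwidth per GPU = number of links x per-link speed.
# 50 GB/s bidirectional per link is the commonly cited figure.

def nvlink_aggregate_gbps(links: int, per_link_gbps: float = 50.0) -> float:
    """Total bidirectional NVLink bandwidth per GPU in GB/s."""
    return links * per_link_gbps

a100 = nvlink_aggregate_gbps(links=12)  # NVLink 3.0: 12 links -> 600 GB/s
h100 = nvlink_aggregate_gbps(links=18)  # NVLink 4.0: 18 links -> 900 GB/s
print(a100, h100)
```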
1.3 Network Fabric (NICs → Switches → NICs)
When GPUs are in different nodes, they talk over the network:
- InfiniBand (HDR: 200 Gb/s, NDR: 400 Gb/s)
- RoCEv2 (RDMA over Converged Ethernet)
- Ethernet TCP (slowest option for training)
The network layer is what matters most for:
- Multi-node distributed training
- Scaling beyond 8 or 16 GPUs
- Large clusters running many jobs
- Model-parallel workloads
Typical NIC bandwidth:
| NIC Type | Speed | Notes |
|---|---|---|
| Ethernet TCP | 100–400 Gb/s | Highest overhead |
| RoCEv2 RDMA | 100–400 Gb/s | Low latency, kernel-bypass |
| InfiniBand | 100–400 Gb/s | Most efficient for ML |
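Note the units: NIC speeds are quoted in Gb/s (bits), while PCIe and NVLink are quoted in GB/s (bytes). Dividing by 8 makes them comparable, and shows why even the fastest NIC is far slower than intra-node NVLink:

```python
# NICs are rated in gigabits/s; NVLink and PCIe in gigabytes/s.
# Divide by 8 to compare them on the same scale.

def gbit_to_gbyte(gbits: float) -> float:
    return gbits / 8

nic_400g = gbit_to_gbyte(400)  # 50 GB/s
nic_100g = gbit_to_gbyte(100)  # 12.5 GB/s
print(f"400 Gb/s NIC = {nic_400g} GB/s, vs ~900 GB/s aggregate NVLink on H100")
```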
🧠 2. Understanding GPU Topology
A GPU topology describes which GPU can talk to which other GPU, and over what path:
GPU ↔ GPU (NVLink or PCIe)
GPU ↔ NIC (PCIe)
NIC ↔ Switch (Network)
Switch ↔ NIC (Network)
NIC ↔ GPU (PCIe)
Good topology = efficient training.
Bad topology = GPUs stall waiting for communication.
There are two kinds of topologies to understand:
2.1 Intra-node GPU topology (inside one server)
Example for an 8-GPU server:
GPU0 ─┬─ NVLink ── GPU1
      └─ PCIe ─── NIC
GPU2 ─┬─ NVLink ── GPU3
      └─ PCIe ─── NIC
GPU4 ─┬─ NVLink ── GPU5
      └─ PCIe ─── NIC
GPU6 ─┬─ NVLink ── GPU7
      └─ PCIe ─── NIC
Key questions:
- Which GPUs are connected by NVLink vs PCIe only?
- How many NVLink hops exist between GPU0 and GPU7?
- Do GPUs share PCIe root complexes with NICs?
Systems with NVSwitch eliminate many of these issues by creating a full crossbar between GPUs:
GPU0 ─┐
GPU1 ─┼─ NVSwitch ── full bandwidth to all GPUs
GPU2 ─┤
GPU3 ─┘
...
2.2 Inter-node topology (between servers)
A simple 2-node setup:
Node A (8 GPUs)        Node B (8 GPUs)
      |                      |
     NIC                    NIC
       \                    /
        —— Switch / Fabric ——
In a cluster:
- Each node may have 1 or 2 NICs
- NICs may connect to different leaf switches
- Switches connect to spine switches
- The overall design may be CLOS/Leaf-Spine
Training communication performance depends heavily on:
- oversubscription
- ECMP hashing
- RDMA configuration
- congestion control (DCQCN, ECN, PFC)
- cable layout
- switch buffers
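Oversubscription, the first item above, is easy to quantify: it is a leaf switch's total downlink capacity (toward servers) divided by its total uplink capacity (toward spines). A minimal sketch with hypothetical port counts:

```python
# Leaf-switch oversubscription ratio = downlink capacity / uplink capacity.
# A 3:1 ratio means worst-case cross-leaf traffic gets one third of line
# rate. Port counts below are hypothetical.

def oversubscription(downlinks: int, downlink_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# 48 x 200 Gb/s server-facing ports, 8 x 400 Gb/s uplinks -> 3:1
ratio = oversubscription(48, 200, 8, 400)
print(f"oversubscription = {ratio}:1")
```

A 1:1 (non-blocking) fabric is the ideal for all-to-all collective traffic; anything above that means multi-node collectives contend for uplink bandwidth.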
🔥 3. NCCL: How GPUs Use the Network
The NVIDIA Collective Communications Library (NCCL) is the backbone of multi-GPU training.
NCCL selects communication paths based on:
- GPU topology (NVLink availability)
- NIC affinities (which GPU is closest to which NIC)
- Network bandwidth and latency
- Number of rings needed for the collective
For example, in AllReduce:
GPU → NVLink → GPU → NIC → Switch → NIC → GPU → NVLink → GPU
NCCL builds:
- Rings (for bandwidth)
- Trees (for latency)
- Hybrid topologies (for large clusters)
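The ring algorithm has a well-known cost: in a ring AllReduce over N ranks, each rank sends and receives 2·(N−1)/N times the payload size, i.e. (N−1)/N in the reduce-scatter phase plus (N−1)/N in the all-gather phase. A quick sketch (helper name is ours):

```python
# Per-rank wire traffic for a ring AllReduce:
# (N-1)/N of the payload in reduce-scatter + (N-1)/N in all-gather.

def ring_allreduce_traffic_gb(payload_gb: float, n_ranks: int) -> float:
    """Per-rank bytes on the wire for a ring AllReduce, in GB."""
    return 2 * (n_ranks - 1) / n_ranks * payload_gb

# A 10 GB gradient payload across 8 GPUs:
print(ring_allreduce_traffic_gb(10, 8))  # 17.5 GB per GPU
```

Notice the per-rank traffic approaches 2× the payload as N grows, independent of rank count, which is exactly why rings are bandwidth-optimal but latency-poor at scale.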
Understanding NCCL’s topology logs is one of the most underrated skills in ML infra engineering.
⚡ 4. PCIe vs NVLink vs Network: What actually matters?
PCIe
- Good for CPU → GPU transfers
- Shared bus
- Higher latency than NVLink
NVLink
- Optimal for GPU ↔ GPU communication
- Up to ~10× the bandwidth of PCIe, depending on generation
- Markedly lower latency than PCIe
- Crucial for multi-GPU model training (LLMs, diffusion)
Network (RoCE or InfiniBand)
- Crucial once you cross servers
- Performance bottleneck if:
- You use TCP instead of RDMA
- The cluster fabric is oversubscribed
- ECMP hashing is uneven
- NICs are misaligned with GPUs
Rule of thumb:
Inside a node → NVLink dominates.
Across nodes → RDMA/InfiniBand dominates.
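The rule of thumb falls out of the numbers. A bandwidth-bound AllReduce takes roughly traffic ÷ bandwidth; comparing an intra-node (NVLink) bucket to an inter-node (400 Gb/s RDMA) one shows why the fabric dominates once you cross servers. Ballpark only: this ignores latency, overlap, and protocol overhead, and the bandwidth values are illustrative assumptions.

```python
# Rough bandwidth-bound AllReduce time = per-rank ring traffic / bandwidth.
# Bandwidth figures below are illustrative assumptions, not measurements.

def allreduce_seconds(payload_gb: float, n_ranks: int, bw_gbps: float) -> float:
    traffic = 2 * (n_ranks - 1) / n_ranks * payload_gb  # ring traffic per rank
    return traffic / bw_gbps

intra = allreduce_seconds(10, 8, bw_gbps=450)  # assumed usable NVLink bandwidth
inter = allreduce_seconds(10, 16, bw_gbps=50)  # 400 Gb/s NIC = 50 GB/s
print(f"intra-node ~{intra * 1000:.0f} ms, inter-node ~{inter * 1000:.0f} ms")
```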
🧰 5. How to inspect GPU topology
Run:
nvidia-smi topo --matrix
Output shows:
- NVLink connectivity
- PCIe hops
- NIC proximity
Terms:
- NV# = connected via # NVLink links (e.g. NV2 = two links)
- PIX = connection traverses at most a single PCIe bridge
- PHB = traverses the CPU's PCIe host bridge (slower)
- SYS = crosses the inter-socket link between CPUs (slowest)
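The matrix is easy to post-process. Here is a small sketch that scans a hardcoded sample matrix (hardcoded because real output needs a GPU box; the parser and sample are ours) and flags GPU pairs that only reach each other through the CPU, the slow paths:

```python
# Flag GPU pairs in an `nvidia-smi topo` style matrix whose only path is
# through the CPU (PHB) or across sockets (SYS). Sample matrix is made up.

SAMPLE_TOPO = """\
       GPU0  GPU1  GPU2  GPU3
GPU0   X     NV2   PHB   SYS
GPU1   NV2   X     SYS   PHB
GPU2   PHB   SYS   X     NV2
GPU3   SYS   PHB   NV2   X
"""

def slow_pairs(matrix_text: str) -> list:
    lines = matrix_text.strip().splitlines()
    cols = lines[0].split()
    pairs = []
    for row in lines[1:]:
        fields = row.split()
        src = fields[0]
        for dst, link in zip(cols, fields[1:]):
            if link in ("PHB", "SYS") and src < dst:  # count each pair once
                pairs.append((src, dst))
    return pairs

print(slow_pairs(SAMPLE_TOPO))
```

In this sample, GPU0/GPU1 and GPU2/GPU3 are NVLink islands, and any traffic between the islands pays the CPU-path penalty.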
📉 6. Common GPU Networking Bottlenecks
1. GPUs share a PCIe root with the NIC
→ Causes contention → reduces bandwidth.
2. NVLink rings are unbalanced
→ One GPU becomes a choke point.
3. Oversubscribed spine switches
→ Multi-node scaling falls apart.
4. RDMA fallback to TCP
→ Training becomes 3–10× slower.
5. Wrong NUMA alignment
→ NIC connected to a GPU on another CPU socket.
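Several of these bottlenecks (especially #4) surface in NCCL's own logs. The variables below are real NCCL knobs, but the right values depend entirely on your fabric, and the interface name here is a hypothetical example; set them before initializing the process group:

```python
# A few NCCL environment variables useful when chasing the bottlenecks
# above. Set these before initializing e.g. torch.distributed.
import os

nccl_env = {
    "NCCL_DEBUG": "INFO",             # print topology + transport selection
    "NCCL_DEBUG_SUBSYS": "INIT,NET",  # focus logs on init and networking
    "NCCL_IB_DISABLE": "0",           # keep InfiniBand/RoCE transport enabled
    "NCCL_SOCKET_IFNAME": "eth0",     # hypothetical interface name
}
os.environ.update(nccl_env)

# With NCCL_DEBUG=INFO, the selected transport appears in the logs --
# a silent fallback to TCP sockets is the first thing to check for #4.
print(os.environ["NCCL_DEBUG"])
```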
🏁 7. Summary
GPU networking is the foundation of ML infrastructure.
Understanding:
- PCIe
- NVLink
- NIC placement
- Network fabrics
- NCCL behavior
is essential for diagnosing performance issues and designing scalable training clusters.