Observability for ML Infrastructure
Observability is how you see what your ML infrastructure is actually doing:
- Are GPUs busy or idle?
- Is the network saturated or fine?
- Are bottlenecks in data, compute, or communication?
This doc is a high‑level entry point to thinking about metrics, logs, and traces for ML systems.
1. Key categories of signals
For most ML infra, you’ll care about:
- GPU metrics – utilization, memory, SM occupancy, kernels in flight
- CPU metrics – load, run queues, context switches
- Network metrics – throughput, drops, retransmits, RTT, queue depth
- Storage metrics – IOPS, latency, throughput
- Application metrics – step time, loss curves, throughput, latency SLOs
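As a concrete example of the application-level category, here is a minimal sketch (hypothetical function and field names, not a specific library's API) that derives step time and throughput from per-step wall-clock timestamps:

```python
from statistics import mean

def step_metrics(step_start_times, samples_per_step):
    """Derive step time and throughput from per-step timestamps.

    step_start_times: monotonically increasing timestamps (seconds),
        one per training step.
    samples_per_step: global batch size (samples processed per step).
    """
    # Step time = gap between consecutive step starts.
    step_times = [b - a for a, b in zip(step_start_times, step_start_times[1:])]
    avg_step = mean(step_times)
    return {
        "avg_step_time_s": avg_step,
        "throughput_samples_per_s": samples_per_step / avg_step,
    }
```

In practice these derived numbers would be exported alongside the raw GPU, CPU, network, and storage counters so that a regression in throughput can be lined up against the lower layers.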
The job of an infra engineer is to connect these signals: a regression at the application layer (say, step time) almost always traces back to one of the layers below it.
2. Typical questions observability should answer
- Why is training slower today than yesterday?
- Why is p99 latency high only for a subset of requests?
- Why does scaling to more GPUs stop improving performance?
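The p99 question illustrates why per-subset breakdowns matter: an aggregate p99 can look healthy while one class of requests has a terrible tail. A minimal sketch (nearest-rank percentile, hypothetical labels) of grouping latencies before computing the percentile:

```python
import math
from collections import defaultdict

def p99(latencies):
    """Nearest-rank p99 over a list of latency samples."""
    s = sorted(latencies)
    return s[max(0, math.ceil(0.99 * len(s)) - 1)]

def p99_by_label(samples):
    """samples: iterable of (label, latency_ms) pairs.
    Returns the p99 latency computed separately per label."""
    groups = defaultdict(list)
    for label, latency in samples:
        groups[label].append(latency)
    return {label: p99(v) for label, v in groups.items()}
```

For example, with 99 "small" requests at 10 ms and one "large" request at 500 ms, the aggregate p99 is 10 ms, but the per-label view exposes the 500 ms tail for "large" requests.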
Good observability lets you say whether the primary constraint is:
- compute
- memory
- network
- storage
- coordination / scheduling
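One way to make that classification concrete is a toy triage heuristic: look at coarse utilization numbers and pick the first saturated resource, falling back to coordination when the GPUs are idle but nothing else is saturated. This is a sketch with illustrative thresholds and hypothetical metric names, not tuned production logic:

```python
def primary_constraint(metrics):
    """Toy triage heuristic over coarse utilizations (all in [0, 1]).

    Thresholds are illustrative assumptions, not tuned values.
    """
    if metrics["gpu_util"] > 0.9:
        return "compute"       # GPUs saturated: compute-bound
    if metrics["gpu_mem_util"] > 0.9:
        return "memory"        # GPU memory nearly full
    if metrics["net_util"] > 0.8:
        return "network"       # interconnect saturated
    if metrics["disk_util"] > 0.8:
        return "storage"       # input pipeline starved by I/O
    # GPUs idle with no saturated resource: likely waiting on
    # coordination (scheduler, stragglers, synchronization barriers).
    return "coordination"
```

A real system would replace these point-in-time checks with windowed aggregates and per-rank breakdowns, but the shape of the decision is the same.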