Observability for ML Infrastructure
Observability is how you see what your ML infrastructure is actually doing:
- Are GPUs busy or idle?
- Is the network saturated or fine?
- Are bottlenecks in data, compute, or communication?
This doc is a high‑level entry point to thinking about metrics, logs, and traces for ML systems.
1. Key categories of signals
For most ML infra, you’ll care about:
- GPU metrics – utilization, memory, SM occupancy, kernels in flight
- CPU metrics – load, run queues, context switches
- Network metrics – throughput, drops, retransmits, RTT, queue depth
- Storage metrics – IOPS, latency, throughput
- Application metrics – step time, loss curves, throughput, latency SLOs
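As a concrete example of the application-level category, here is a minimal sketch (hypothetical function and field names, not a specific library's API) that derives step time and throughput from per-step wall-clock timestamps:

```python
from statistics import mean

def step_metrics(step_start_times, samples_per_step):
    """Derive step time and throughput from per-step timestamps.

    step_start_times: monotonically increasing timestamps (seconds),
        one per training step.
    samples_per_step: global batch size (samples processed per step).
    """
    # Step time = gap between consecutive step starts.
    step_times = [b - a for a, b in zip(step_start_times, step_start_times[1:])]
    avg_step = mean(step_times)
    return {
        "avg_step_time_s": avg_step,
        "throughput_samples_per_s": samples_per_step / avg_step,
    }
```

In practice these derived numbers would be exported alongside the raw GPU, CPU, network, and storage counters so that a regression in throughput can be lined up against the lower layers.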
The job of an infra engineer is to connect these signals: a regression at the application layer (say, step time) almost always traces back to one of the layers below it.
2. Typical questions observability should answer
- Why is training slower today than yesterday?
- Why is p99 latency high only for a subset of requests?
- Why does scaling to more GPUs stop improving performance?
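The p99 question illustrates why per-subset breakdowns matter: an aggregate p99 can look healthy while one class of requests has a terrible tail. A minimal sketch (nearest-rank percentile, hypothetical labels) of grouping latencies before computing the percentile:

```python
import math
from collections import defaultdict

def p99(latencies):
    """Nearest-rank p99 over a list of latency samples."""
    s = sorted(latencies)
    return s[max(0, math.ceil(0.99 * len(s)) - 1)]

def p99_by_label(samples):
    """samples: iterable of (label, latency_ms) pairs.
    Returns the p99 latency computed separately per label."""
    groups = defaultdict(list)
    for label, latency in samples:
        groups[label].append(latency)
    return {label: p99(v) for label, v in groups.items()}
```

For example, with 99 "small" requests at 10 ms and one "large" request at 500 ms, the aggregate p99 is 10 ms, but the per-label view exposes the 500 ms tail for "large" requests.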
Good observability lets you say whether the primary constraint is:
- compute
- memory
- network
- storage
- coordination / scheduling
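One way to make that classification concrete is a toy triage heuristic: look at coarse utilization numbers and pick the first saturated resource, falling back to coordination when the GPUs are idle but nothing else is saturated. This is a sketch with illustrative thresholds and hypothetical metric names, not tuned production logic:

```python
def primary_constraint(metrics):
    """Toy triage heuristic over coarse utilizations (all in [0, 1]).

    Thresholds are illustrative assumptions, not tuned values.
    """
    if metrics["gpu_util"] > 0.9:
        return "compute"       # GPUs saturated: compute-bound
    if metrics["gpu_mem_util"] > 0.9:
        return "memory"        # GPU memory nearly full
    if metrics["net_util"] > 0.8:
        return "network"       # interconnect saturated
    if metrics["disk_util"] > 0.8:
        return "storage"       # input pipeline starved by I/O
    # GPUs idle with no saturated resource: likely waiting on
    # coordination (scheduler, stragglers, synchronization barriers).
    return "coordination"
```

A real system would replace these point-in-time checks with windowed aggregates and per-rank breakdowns, but the shape of the decision is the same.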