GPU Metrics to Watch
For GPU‑heavy ML workloads, a few metrics tell most of the story:
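- GPU utilization (%): the share of time at least one kernel is running
- memory usage (GB)
- SM (compute) utilization: how busy the streaming multiprocessors actually are
- PCIe / NVLink utilization (interconnect bandwidth in use)
- kernel launch stats (if available)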
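As one way to collect the first two metrics, the sketch below parses the CSV output of `nvidia-smi --query-gpu`. The `GpuSample` type and the `parse_gpu_csv` and `sample_gpus` helpers are hypothetical names used for illustration. SM-level, interconnect, and kernel statistics generally need other tooling (for example DCGM or a profiler) and are not covered here.

```python
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi; the order must match the parser below.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    index: int
    utilization_pct: float
    memory_used_gib: float
    memory_total_gib: float


def parse_gpu_csv(text: str) -> list[GpuSample]:
    """Parse output produced with --format=csv,noheader,nounits.

    Assumes every field is numeric; some devices report "[N/A]" for a
    field, which this sketch does not handle.
    """
    samples = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        idx, util, used_mib, total_mib = [f.strip() for f in line.split(",")]
        samples.append(
            GpuSample(
                index=int(idx),
                utilization_pct=float(util),
                # With nounits, nvidia-smi reports memory in MiB.
                memory_used_gib=float(used_mib) / 1024,
                memory_total_gib=float(total_mib) / 1024,
            )
        )
    return samples


def sample_gpus() -> list[GpuSample]:
    """Run nvidia-smi once and return one sample per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `sample_gpus()` on a fixed interval and shipping the results to your metrics system is enough for a first dashboard. Keeping the parser separate from the subprocess call makes it testable on machines without a GPU.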
As an infra engineer, watch for:
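- low GPU utilization while work queues up elsewhere (data loading, CPU preprocessing, the job scheduler), which suggests the GPUs are starved rather than saturated
- utilization that plateaus as you add GPUs, a sign that something other than compute is limiting scaling
- drops in utilization that line up with collective operations (e.g. all-reduce), which suggest communication is stalling compute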
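The last two patterns can be checked numerically. The sketch below uses hypothetical helper names and assumes you also record throughput (e.g. samples/s) per GPU count, plus a per-interval series of the fraction of time spent in collectives alongside GPU utilization. How you obtain those series (profiler traces, framework logs) is left open.

```python
from math import sqrt


def scaling_efficiency(throughput_by_gpus: dict[int, float]) -> dict[int, float]:
    """Per-GPU throughput relative to the smallest GPU count measured.

    A value of 1.0 means perfect linear scaling; values that fall as the
    GPU count grows indicate a plateau.
    """
    base_n = min(throughput_by_gpus)
    base_per_gpu = throughput_by_gpus[base_n] / base_n
    return {
        n: (t / n) / base_per_gpu
        for n, t in sorted(throughput_by_gpus.items())
    }


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A strongly negative correlation between the collective-time fraction and utilization is consistent with communication stalling compute, though it does not prove causation on its own.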