GPU Metrics to Watch

For GPU‑heavy ML workloads, a few metrics tell most of the story:

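GPU utilization (%): as reported by nvidia-smi, the share of the sample period in which at least one kernel was executing
memory usage (GB): device memory in use versus total capacity
SM (compute) utilization: how busy the streaming multiprocessors actually are, a finer signal than GPU utilization
PCIe / NVLink utilization: interconnect bandwidth in use between host and GPU and between GPUs
kernel launch stats (if available)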
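As a starting point, the first two metrics can be sampled from nvidia-smi. The sketch below is one possible approach, not a recommended collector: it assumes nvidia-smi is on the PATH and that every queried field returns a number (fields reported as "[N/A]" are not handled). The function names and dictionary keys are hypothetical, chosen for illustration. SM, PCIe, and NVLink metrics usually need a profiling tool such as DCGM and are not covered here.

```python
import subprocess

# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    With `nounits`, utilization is a percentage and memory is in MiB.
    """
    samples = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, used_mib, total_mib = [f.strip() for f in line.split(",")]
        samples.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_used_gib": float(used_mib) / 1024,   # MiB -> GiB
            "mem_total_gib": float(total_mib) / 1024,
        })
    return samples


def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

On a host with GPUs, calling sample_gpus() in a loop with a short sleep gives a coarse time series. Keeping the parser separate from the subprocess call makes it possible to test the parsing on machines without a GPU.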

As an infra engineer, watch for:

low GPU utilization while work is queueing elsewhere in the pipeline
utilization that plateaus as you add GPUs
drops in utilization that coincide with collective operations (for example, all-reduce)
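Two of these patterns can be checked with simple arithmetic once the data is collected. The sketch below assumes you also record job throughput per GPU count (the original list mentions only utilization) and that you have per-interval samples of time spent in collectives alongside GPU utilization. Both function names and the input shapes are hypothetical.

```python
from math import sqrt


def scaling_efficiency(throughput_by_gpus):
    """Per-GPU throughput relative to the smallest measured configuration.

    Input maps GPU count -> total throughput (e.g. samples/sec).
    A value near 1.0 means near-linear scaling; a falling value
    indicates a plateau as GPUs are added.
    """
    base_n = min(throughput_by_gpus)
    base_per_gpu = throughput_by_gpus[base_n] / base_n
    return {
        n: (tput / n) / base_per_gpu
        for n, tput in sorted(throughput_by_gpus.items())
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

For example, scaling_efficiency({1: 100.0, 8: 320.0}) reports 0.4 for the 8-GPU run, meaning each GPU delivers 40% of its single-GPU throughput. A strongly negative pearson(collective_fraction, util_pct) suggests utilization falls when collectives dominate, though correlation alone does not show that the collectives are the cause.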