Metrics
Time-series charts for GPU, CPU, and vLLM metrics across the cluster, organized into sections:
- GPU metrics: utilization, memory usage, temperature, power draw, and PyTorch-level metrics like torch_allocated_gb, per GPU. You can filter by role (trainer vs inference) and group by node
- CPU metrics: CPU utilization and system memory usage per node
- vLLM metrics: per-server stats like requests running/waiting, KV cache usage, token throughput, and latencies
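
As a rough illustration of the role filter and node grouping described for GPU metrics, the sketch below averages torch_allocated_gb per node over a handful of samples. The GpuSample record, its field names, the helper function, and the sample values are all hypothetical and are not taken from the dashboard's actual data model. The sketch only assumes that each sample carries a node, a role, and a GPU index.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GpuSample:
    """Hypothetical per-GPU sample; field names are illustrative only."""
    node: str
    role: str  # "trainer" or "inference"
    gpu_index: int
    torch_allocated_gb: float


def mean_allocated_by_node(samples, role: Optional[str] = None):
    """Keep samples matching `role` (all samples if None), then average per node."""
    per_node = defaultdict(list)
    for s in samples:
        if role is None or s.role == role:
            per_node[s.node].append(s.torch_allocated_gb)
    return {node: sum(vals) / len(vals) for node, vals in per_node.items()}


# Made-up values, for illustration only.
samples = [
    GpuSample("node-0", "trainer", 0, 40.0),
    GpuSample("node-0", "trainer", 1, 44.0),
    GpuSample("node-1", "inference", 0, 20.0),
    GpuSample("node-1", "inference", 1, 22.0),
]

trainer_view = mean_allocated_by_node(samples, role="trainer")
all_view = mean_allocated_by_node(samples)
```

Here trainer_view contains only the nodes that host trainer GPUs, while all_view covers every node. The same pattern would apply to any other per-GPU series, such as utilization or power draw.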

