The Infra page shows the hardware and system-level state of the cluster. It has three tabs: Metrics, Topology, and Model.
Metrics
Time-series charts for GPU, CPU, and vLLM metrics across the cluster, organized into sections:

- GPU metrics: utilization, memory usage, temperature, power draw, and PyTorch-level metrics like torch_allocated_gb, per GPU. You can filter by role (trainer vs inference) and group by node
- CPU metrics: CPU utilization and system memory usage per node
- vLLM metrics: per-server stats like requests running/waiting, KV cache usage, token throughput, and latencies
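
To make the role filter and node grouping for GPU metrics concrete, here is a minimal Python sketch of the same idea applied to a handful of per-GPU samples. It is illustrative only and not how the page is implemented. The record layout, the node names, the sample values, and the helper functions `filter_by_role` and `group_by_node` are all hypothetical; only torch_allocated_gb and the trainer/inference roles come from the description above. Averaging within a node is also an assumption, since the page may simply arrange per-GPU charts by node.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-GPU samples. Field names other than torch_allocated_gb
# are assumptions made for this sketch, and the values are invented.
SAMPLES = [
    {"node": "node-0", "gpu": 0, "role": "trainer", "utilization": 90.0, "torch_allocated_gb": 60.0},
    {"node": "node-0", "gpu": 1, "role": "trainer", "utilization": 80.0, "torch_allocated_gb": 62.0},
    {"node": "node-1", "gpu": 0, "role": "inference", "utilization": 50.0, "torch_allocated_gb": 40.0},
    {"node": "node-1", "gpu": 1, "role": "trainer", "utilization": 70.0, "torch_allocated_gb": 58.0},
]


def filter_by_role(samples, role=None):
    """Keep only samples for the given role; role=None keeps everything."""
    if role is None:
        return list(samples)
    return [s for s in samples if s["role"] == role]


def group_by_node(samples, metric):
    """Return {node: mean of `metric` across that node's GPUs}."""
    per_node = defaultdict(list)
    for s in samples:
        per_node[s["node"]].append(s[metric])
    return {node: mean(values) for node, values in per_node.items()}


# Mean utilization of trainer GPUs, one value per node.
trainer_util = group_by_node(filter_by_role(SAMPLES, "trainer"), "utilization")
print(trainer_util)
```

Swapping "utilization" for "torch_allocated_gb" gives the same per-node view of PyTorch-allocated memory.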

