Telescope tracks two kinds of metrics during training, both visible in real time on the Metrics page.

Step metrics are logged once per training step. By default Telescope logs entropy, KL divergence, gradient norm, and learning rate. You can add your own by returning extra keys from train_step() using a section/group/metric naming convention (e.g., custom/loss/auxiliary), and they will automatically appear as charts organized by their section and group.

Sample metrics are logged per sample from your environment’s compute_reward(). You can return anything — reward components, response length, number of reasoning steps, tool call counts, time spent computing the reward — via the sample_metrics dict in RewardResult. The UI aggregates them per step into mean, std, min, and max.

Both kinds of metrics support range declarations. reward_min / reward_max set the expected reward range per environment; metrics_ranges on the environment class sets expected ranges for individual sample metrics. These declarations don’t affect training — they only tell the UI the effective min and max so charts can be normalized to a meaningful scale. When ranges are not declared, the observed min/max from the run’s data is used instead.

Step metrics

The trainer backend returns a metrics dictionary from train_step() every step. By default this includes:
| Key | Description |
| --- | --- |
| entropy | Policy entropy over masked token positions |
| kl_divergence_inference | KL divergence between the current policy and the rollout policy (from vLLM logprobs) |
| grad_norm | Gradient norm (averaged across minibatch groups) |
| learning_rate | Current learning rate from the scheduler |
Keys use / as a delimiter to create a section/group/metric hierarchy. Keys without / (like the defaults above) are placed under a General section. Keys with one / (e.g., loss/policy) become section=loss, metric=policy. Keys with two or more / (e.g., timing/forward/total) become section=timing, group=forward, metric=total. You can log additional step metrics by returning extra keys from train_step() in a custom backend:
def train_step(self, trainer_data: dict) -> dict:
    metrics = super().train_step(trainer_data)
    metrics["custom/my_metric"] = some_value       # section="custom", metric="my_metric"
    metrics["debug/grads/layer_norm"] = other_value # section="debug", group="grads", metric="layer_norm"
    return metrics
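The key-to-hierarchy mapping described above can be sketched as a small parser. This is an illustrative sketch only; the actual grouping logic lives in the Metrics UI:

```python
def parse_metric_key(key: str):
    """Split a step-metric key into (section, group, metric).

    Keys without "/" fall under the "General" section; one "/" gives
    section/metric; two or more give section/group/metric, with any
    remaining parts joined back into the metric name.
    """
    parts = key.split("/")
    if len(parts) == 1:
        return ("General", None, parts[0])
    if len(parts) == 2:
        return (parts[0], None, parts[1])
    return (parts[0], parts[1], "/".join(parts[2:]))
```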
Step metrics appear in the Metrics page organized into their sections and groups. The Custom view lets you build your own dashboard layout by creating sections, adding metrics from the catalog, and reordering with drag-and-drop. The layout persists across sessions.

Sample metrics

Environments return per-sample metrics alongside the total reward via the sample_metrics field of RewardResult.
def compute_reward(self, completion: str, sample: Sample, eos_token: str = "") -> RewardResult:
    is_correct = check_answer(completion, sample.answer)
    return RewardResult(
        total_reward=float(is_correct),
        sample_metrics={
            "correct": float(is_correct),
            "reasoning_steps": count_steps(completion),
            "response_words": len(completion.split()),
            "reward_compute_time": elapsed,
        },
    )
No pre-registration is required — any key you include in sample_metrics is automatically tracked. The UI aggregates them per step (mean, std, min, max) and computes a Gini coefficient per prompt group, measuring how concentrated the metric values are within each group.
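The per-step aggregation and the Gini coefficient can be sketched roughly as follows. This is a hedged sketch of the statistics the UI reports, not Telescope's actual implementation:

```python
import statistics

def aggregate(values):
    """Aggregate one sample metric across a step: mean, std, min, max."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def gini(values):
    """Gini coefficient of non-negative values within a prompt group.
    0 = all values equal; values near 1 = highly concentrated."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formula based on the sorted cumulative distribution.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```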

Tool metrics

For tool-calling environments, get_tool_metrics(state) returns a dict with tool usage stats that can be merged into sample_metrics:
{
    "total_tool_calls": 3,
    "tool_success_count": 2,
    "tool_error_count": 1,
    "tool_success_rate": 0.67,
    "unique_tools_used": 2,
    "add_calls": 2,
    "subtract_calls": 1,
}
See Tool Calling for details.
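Stats of this shape could be derived from a per-rollout call log. The sketch below is hypothetical — the log format (a list of `(tool_name, succeeded)` pairs) is an assumption, not the actual structure of `state`:

```python
from collections import Counter

def tool_metrics_from_log(calls):
    """Derive tool usage stats from a list of (tool_name, succeeded) pairs.

    A sketch of the kind of dict get_tool_metrics(state) returns, assuming
    the state records each call's tool name and success flag.
    """
    total = len(calls)
    successes = sum(1 for _, ok in calls if ok)
    per_tool = Counter(name for name, _ in calls)
    metrics = {
        "total_tool_calls": total,
        "tool_success_count": successes,
        "tool_error_count": total - successes,
        "tool_success_rate": round(successes / total, 2) if total else 0.0,
        "unique_tools_used": len(per_tool),
    }
    # One per-tool counter key per tool, e.g. "add_calls".
    for name, n in per_tool.items():
        metrics[f"{name}_calls"] = n
    return metrics
```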

Ranges

Reward ranges

reward_min and reward_max declare the expected reward range for an environment. These don’t affect training — they tell the UI the effective bounds so reward charts are normalized to a meaningful scale. This is especially useful when training with multiple environments whose rewards have different scales (e.g., 0–2 for one, 0–1 for another).
environments:
  - name: "countdown"
    weight: 0.5
    reward_min: 0.0
    reward_max: 2.0
  - name: "hendrycks_math"
    weight: 0.5
    reward_min: 0.0
    reward_max: 1.0
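The normalization these declarations enable amounts to a min-max rescale into a shared [0, 1] axis. A minimal sketch, assuming the UI falls back to observed bounds when no range is declared:

```python
def normalize(value, declared_min=None, declared_max=None, observed=None):
    """Map a metric value into [0, 1] for charting.

    Declared bounds win; otherwise fall back to the observed min/max
    from the run's data.
    """
    lo = declared_min if declared_min is not None else min(observed)
    hi = declared_max if declared_max is not None else max(observed)
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)
```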

Metrics ranges

metrics_ranges declares expected ranges for individual sample metrics. Like reward ranges, these don’t affect training — they tell the UI the effective bounds so each sample metric chart is normalized to a meaningful scale.
class MyEnvironment(BaseEnvironment):
    metrics_ranges = {
        "correct": {"min": 0.0, "max": 1.0},
        "reasoning_steps": {"min": 0.0, "max": 20.0},
    }
When neither reward_min/reward_max nor metrics_ranges are set, the observed min/max from the run’s data is used instead.

Reward

The total reward for each sample comes from RewardResult.total_reward returned by the environment’s compute_reward(). The UI computes per-step statistics (mean, std, min, max) from all samples in each training step. The UI also computes a Gini coefficient per prompt group and averages it across groups; this measures reward sparsity (0 = all samples got the same reward, 1 = extreme concentration).

Advantage

The UI shows per-step advantage statistics (mean, std, min, max). How advantages are normalized depends on the algorithm — see Algorithms.

Rollouts

The UI computes token length distributions from the prompts and completions in each step (mean, std, min, max for each):
  • Tokens (Prompt) — prompt token counts
  • Tokens (Completion) — completion token counts (sum across turns for multi-turn)
  • Tokens (Total) — total tokens per sample
Additional metrics:
| Metric | Description |
| --- | --- |
| stop_reason_length_pct | Percentage of samples that hit the token length limit. A high value may indicate that max_token_len is too low. |
| group_length_gini_mean | Gini coefficient of completion lengths within prompt groups, averaged across groups. Measures within-group length inequality. |
| group_length_max_median_ratio_mean | Max/median ratio of completion lengths within prompt groups, averaged across groups. A straggler indicator: values close to 1 mean completions are roughly equal length. |
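The max/median straggler indicator is simple to state precisely. A sketch for a single prompt group (the per-step metric averages this across groups):

```python
import statistics

def max_median_ratio(lengths):
    """Max/median completion length within one prompt group.

    Values near 1 mean roughly equal lengths; large values flag a
    straggler completion that is much longer than the rest.
    """
    return max(lengths) / statistics.median(lengths)
```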

Discarded rollouts

Metrics for rollouts that were discarded and not used for training. Rollouts can be discarded for two reasons:
  • max_async — the rollout was generated with weights that are too many steps behind (max_off_policy_steps)
  • zero_advantage — all samples in the group received the same reward, so the advantage is zero (controlled by discard_group_zero_advantage)
Tracked metrics include discard counts and percentages by reason, a zero-advantage breakdown (all rewards = 0 vs all rewards > 0 vs mean reward), canceled rollouts, and token length distributions for discarded samples.
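The zero_advantage condition above can be sketched as an equality check over a group's rewards. A hedged sketch (the epsilon tolerance is an assumption, not Telescope's documented behavior):

```python
def is_zero_advantage(group_rewards, eps=1e-8):
    """True when all rewards in a prompt group are (numerically) equal.

    Equal rewards make every advantage in the group zero, so the group
    carries no learning signal and may be discarded.
    """
    return max(group_rewards) - min(group_rewards) < eps
```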

Timeline

Step timing breakdowns from the trainer’s GPU timeline. The first step is typically much slower due to compilation. Full step (total time per operation):
| Metric | Description |
| --- | --- |
| timing_step_total | Total wall time per training step |
| timing_step_active | Active time excluding waiting for data |
| timing_forward_total | Total forward pass time |
| timing_backward_total | Total backward pass time |
| timing_loss_computation_total | Loss computation time |
| timing_compute_kl_total | KL divergence computation time |
| timing_compute_entropy_total | Entropy computation time |
| timing_data_to_device_total | Time moving data to GPU |
| timing_prepare_tensors_total | Tensor preparation time |
| timing_waiting_for_data | Time spent waiting for rollout data |
| timing_weight_sync_trainer_total | Weight sync time (trainer side) |
| timing_weight_sync_inference_total | Weight sync time (inference side) |
Per microbatch (mean time): same operations as above averaged per microbatch (e.g., timing_forward_microbatch_mean).

W&B scalar metrics

Step metrics are also logged to W&B as scalar metrics (via wandb.log()), so they appear as charts in the W&B dashboard. This is useful for quick debugging.