Configuration
Add evals to your training config under the `evals` key:
| Field | Default | Description |
|---|---|---|
| `name` | (required) | Eval or environment name |
| `eval_every` | 10 | Run every N training steps |
| `num_samples` | -1 (all) | Number of samples to evaluate |
| `temperature` | inherited | Override sampling temperature |
| `top_p` | inherited | Override nucleus sampling |
| `max_tokens` | inherited | Override max generation tokens |
| `pass_k` | {} | pass@k / pass^k configuration |
| `separate_eval_samples` | false | Exclude eval samples from training data |
| `kwargs` | {} | Extra kwargs passed to the eval/environment |
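Putting the fields together, an `evals` entry might look like the following sketch (the exact key layout around `evals` is an assumption based on the table above, not a confirmed config schema):

```yaml
evals:
  - name: math500          # built-in eval or environment name
    eval_every: 10         # run every 10 training steps
    num_samples: 100       # subsample instead of the full 500
    temperature: 0.0       # override the inherited sampling temperature
    separate_eval_samples: true  # keep eval samples out of training data
```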
Eval servers
Evals run on dedicated inference servers that are reserved from the inference pool. Setting `eval_num_servers: 0` disables periodic evals during training.
Baseline and final evals
In addition to periodic evals during training, Telescope can run evals before the first training step (baseline) and after the last step (final). When `eval_start_end_use_all_servers` is true, baseline and final evals use every inference server (not just the dedicated eval servers), since no training is happening at those points.
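The server-related settings might be combined like this (a sketch assuming both keys sit at the top level of the training config):

```yaml
eval_num_servers: 1                   # dedicated eval servers; 0 disables periodic evals
eval_start_end_use_all_servers: true  # baseline/final evals use the whole inference pool
```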
Eval sources
When Telescope resolves an eval `name`, it checks two locations in order:
- Dedicated eval — `src/telescope/evals/<name>/eval.py`
- Environment fallback — `src/telescope/environments/<name>/environment.py`
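The two-step lookup can be sketched as follows (`resolve_eval` and its signature are illustrative, not Telescope's actual API):

```python
from pathlib import Path


def resolve_eval(name: str, root: Path) -> Path:
    """Resolve an eval name using the order described above:
    dedicated eval first, then environment fallback.

    Hypothetical sketch; `root` stands in for src/telescope/.
    """
    dedicated = root / "evals" / name / "eval.py"
    if dedicated.exists():
        return dedicated
    fallback = root / "environments" / name / "environment.py"
    if fallback.exists():
        return fallback
    raise FileNotFoundError(f"No eval or environment named {name!r}")
```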
Built-in evals
Telescope ships with several standalone evals that can be used out of the box:

| Eval | Samples | Dataset | Description |
|---|---|---|---|
| `math500` | 500 | HuggingFaceH4/MATH-500 | MATH benchmark test set with symbolic answer verification |
| `aime_2024` | 30 | HuggingFaceH4/aime_2024 | 2024 American Invitational Mathematics Examination |
| `aime_2025` | 30 | opencompass/AIME2025 | 2025 AIME (Parts I and II) |
| `gpqa_diamond` | 198 | Idavidrein/gpqa | Graduate-level multiple-choice scientific reasoning (physics, chemistry, biology) |
`math500`, `aime_2024`, and `aime_2025` require `math_verify` (`uv add math-verify`). `gpqa_diamond` may require Hugging Face authentication (`huggingface-cli login`).
Building a dedicated eval
Create a folder under `src/telescope/evals/` with an `eval.py` file. The simplest approach is wrapping an existing environment:
Setting `environment_name` gives the eval access to the environment's prompt formatting, reward function, and `compute_eval_metrics`. You only need to override what you want to change.
For fully standalone evals (no training environment), leave `environment_name` as `None` and implement everything directly:
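A sketch of a fully standalone eval, with prompting, scoring, and metrics in one place (the method names here are illustrative, not Telescope's confirmed interface):

```python
class StandaloneEval:
    """Hypothetical standalone eval: no wrapped training environment,
    so prompt formatting, scoring, and metrics are implemented directly."""

    environment_name = None  # no environment fallback

    def format_prompt(self, sample: dict) -> str:
        return f"Question: {sample['question']}\nAnswer:"

    def score(self, sample: dict, completion: str) -> float:
        # Exact-match scoring against a reference answer.
        return float(completion.strip() == sample["answer"])

    def compute_eval_metrics(self, scores: list[float]) -> dict:
        return {"accuracy": sum(scores) / len(scores)}
```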
pass@k
For tasks where you want to measure whether the model can solve a problem within k attempts, configure `pass_k`:
- pass@k — probability that at least 1 of k completions is correct (unbiased estimator from Chen et al., 2021)
- pass^k — probability that all k completions are correct
Telescope generates `max(k)` completions per sample and computes the pass@k estimates from those.
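The two metrics can be sketched as a small reference implementation. The pass@k formula is the unbiased estimator from Chen et al., 2021; treating pass^k as (c/n)^k is a common convention and an assumption here, not necessarily Telescope's exact code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    (drawn without replacement from n samples, c of them correct) is
    correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k: probability that all k independent completions are
    correct, estimated from the empirical success rate c/n."""
    return (c / n) ** k
```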

