Countdown
A mathematical reasoning task where the model must create an equation from a set of numbers to reach a target value. This is the simplest example and a good starting point.

add_thinking_prefix: true prepends a <think> tag to the assistant's response, nudging the model to learn chain-of-thought reasoning.
Reward: format_reward = 1.0 (correct <think>...<answer> format) + equation_reward = 1.0 (uses exactly [1, 2, 3, 5] and evaluates to 12) = 2.0
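The reward above can be sketched roughly as follows. This is an illustration only, not the example's actual reward code: the function name, the exact tag layout, and the use of eval are all assumptions.

```python
import re

def countdown_reward(response: str, numbers: list[int], target: int) -> float:
    """Sketch: format reward (1.0) plus equation reward (1.0), as described above."""
    reward = 0.0
    # Format reward: <think>...</think> followed by <answer>...</answer>.
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    reward += 1.0
    equation = m.group(1).strip()
    # Equation reward: uses exactly the given numbers and evaluates to the target.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return reward
    try:
        # Illustration only; production code should parse the expression safely.
        value = eval(equation, {"__builtins__": {}}, {})
    except Exception:
        return reward
    if abs(value - target) < 1e-6:
        reward += 1.0
    return reward
```

With numbers [1, 2, 3, 5] and target 12, a well-formed response whose answer evaluates to 12 would score 2.0; a well-formed response with a wrong equation would keep only the format reward of 1.0.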
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

Hendrycks MATH
Competition-level math problems. The model must solve problems and provide the final answer inside \boxed{}. Answers are verified symbolically using math_verify, so equivalent forms like \frac{1}{2} and 0.5 are both accepted.
Reward: correct = 1.0 (answer matches ground truth via symbolic verification) = 1.0
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

Hendrycks MATH with Evals
The same math training setup as above, but with periodic evaluations enabled. Runs a MATH-500 eval and a held-out Hendrycks MATH eval every 10 steps, plus baseline and final evals. This is a good starting point for understanding how evals work.

Config
- math500 — 100 samples from the MATH-500 test set (a dedicated eval benchmark)
- hendrycks_math — 100 held-out samples from the training environment, with separate_eval_samples: true so these samples are excluded from training data

Periodic evals run on a single inference server (eval_num_servers: 1). Baseline and final evals use all servers for faster completion (eval_start_end_use_all_servers: true).
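A hypothetical sketch of what such an eval config fragment might look like. Only the key names mentioned above are taken from this example; the surrounding structure and field names are assumptions, not the framework's actual schema:

```yaml
# Hypothetical sketch; consult the example's real config for the exact schema.
evals:
  eval_every_n_steps: 10
  eval_num_servers: 1                    # periodic evals use one inference server
  eval_start_end_use_all_servers: true   # baseline/final evals use all servers
  benchmarks:
    - name: math500
      num_samples: 100                   # from the MATH-500 test set
    - name: hendrycks_math
      num_samples: 100
      separate_eval_samples: true        # held out from training data
```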
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

Wordle
A multi-turn interactive game. The model plays Wordle by guessing words, receiving letter-by-letter feedback each turn (G = correct position, Y = wrong position, X = not in word), and refining its guesses. This example demonstrates the multi-turn environment loop.

Reward: correct_answer = 1.0 + partial_answer = 0.0 + length_bonus = 0.33 (won in 3 guesses) + format_reward = 0.16 (score 0.8, weight 0.2) = 1.49
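The G/Y/X feedback rule described above can be sketched as follows. wordle_feedback is a hypothetical helper, not the environment's actual implementation; it shows the standard two-pass rule that keeps duplicate letters consistent:

```python
def wordle_feedback(guess: str, answer: str) -> str:
    """Per-letter feedback: G (correct position), Y (in word, wrong position), X (absent)."""
    feedback = ["X"] * len(guess)
    remaining = {}  # counts of answer letters not consumed by exact matches
    # First pass: mark exact matches (G) and tally unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] = remaining.get(a, 0) + 1
    # Second pass: mark misplaced letters (Y), respecting remaining counts
    # so a letter is not flagged more times than it appears in the answer.
    for i, g in enumerate(guess):
        if feedback[i] == "X" and remaining.get(g, 0) > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)
```

For example, guessing "crane" against the answer "cable" yields "GXYXG": c and e are in place, a is misplaced, and r and n are absent.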
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

DAPO Math
Mathematical reasoning using the DAPO Math 17K dataset — a diverse collection of ~17K math problems. The model must solve problems and provide the final answer inside \boxed{}. Answers are verified symbolically using math_verify.
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

DeepScaler
Mathematical reasoning for iterative context lengthening, using the DeepScaleR dataset. The model must solve problems and provide the final answer inside \boxed{}. Answers are verified symbolically using math_verify.
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

Code Generation
Code generation with sandboxed test execution. The model generates Python code that is executed against hidden test cases in an isolated sandbox.

Reward: passed = 1.0 (all test cases pass) = 1.0
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires a sandbox provider. Several providers are pre-configured in environments/_sandbox/ (prime, modal, daytona, e2b) — see Sandbox execution.
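Conceptually, the test-execution step looks like the sketch below: write the candidate program out, run it once per (stdin, expected stdout) test case, and grant the all-or-nothing passed reward. This is an illustration only; a bare subprocess is NOT an isolation boundary, which is why the real examples delegate execution to one of the sandbox providers above.

```python
import os
import subprocess
import sys
import tempfile

def run_against_tests(code: str, test_cases: list[tuple[str, str]], timeout: float = 5.0) -> float:
    """Run candidate code against (stdin, expected_stdout) pairs; 1.0 only if all pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        passed = 0
        for stdin, expected in test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, path],
                    input=stdin, capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                continue  # a hung test case counts as a failure
            if result.returncode == 0 and result.stdout.strip() == expected.strip():
                passed += 1
        # All-or-nothing, matching the passed = 1.0 reward above.
        return 1.0 if passed == len(test_cases) else 0.0
    finally:
        os.unlink(path)
```

A sandbox provider would replace the subprocess.run call with a remote execution request, keeping untrusted model output off the training machine entirely.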
