Countdown
A mathematical reasoning task where the model must create an equation from a set of numbers to reach a target value. This is the simplest example and a good starting point.

add_thinking_prefix: true prepends a <think> tag to the assistant's response, nudging the model to learn chain-of-thought reasoning.
Reward: format_reward = 1.0 (correct <think>...<answer> format) + equation_reward = 1.0 (uses exactly [1, 2, 3, 5] and evaluates to 12) = 2.0
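The reward above can be sketched roughly as follows. This is an illustration only, not the example's actual reward code: the function name, the exact tag layout, and the use of eval are all assumptions.

```python
import re

def countdown_reward(response: str, numbers: list[int], target: int) -> float:
    """Sketch: format reward (1.0) plus equation reward (1.0), as described above."""
    reward = 0.0
    # Format reward: <think>...</think> followed by <answer>...</answer>.
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    reward += 1.0
    equation = m.group(1).strip()
    # Equation reward: uses exactly the given numbers and evaluates to the target.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return reward
    try:
        # Illustration only; production code should parse the expression safely.
        value = eval(equation, {"__builtins__": {}}, {})
    except Exception:
        return reward
    if abs(value - target) < 1e-6:
        reward += 1.0
    return reward
```

With numbers [1, 2, 3, 5] and target 12, a well-formed response whose answer evaluates to 12 would score 2.0; a well-formed response with a wrong equation would keep only the format reward of 1.0.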
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

Hendrycks MATH
Competition-level math problems. The model must solve problems and provide the final answer inside \boxed{}. Answers are verified symbolically using math_verify, so equivalent forms like \frac{1}{2} and 0.5 are both accepted.
Reward: correct = 1.0 (answer matches ground truth via symbolic verification) = 1.0
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

Hendrycks MATH with Evals
The same math training setup as above, but with periodic evaluations enabled. Runs a MATH-500 eval and a held-out Hendrycks MATH eval every 10 steps, plus baseline and final evals. This is a good starting point for understanding how evals work.

Config
- math500 — 100 samples from the MATH-500 test set (a dedicated eval benchmark)
- hendrycks_math — 100 held-out samples from the training environment, with separate_eval_samples: true so these samples are excluded from training data

Periodic evals run on a single inference server (eval_num_servers: 1). Baseline and final evals use all servers for faster completion (eval_start_end_use_all_servers: true).
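A hypothetical sketch of what such an eval config fragment might look like. Only the key names mentioned above are taken from this example; the surrounding structure and field names are assumptions, not the framework's actual schema:

```yaml
# Hypothetical sketch; consult the example's real config for the exact schema.
evals:
  eval_every_n_steps: 10
  eval_num_servers: 1                    # periodic evals use one inference server
  eval_start_end_use_all_servers: true   # baseline/final evals use all servers
  benchmarks:
    - name: math500
      num_samples: 100                   # from the MATH-500 test set
    - name: hendrycks_math
      num_samples: 100
      separate_eval_samples: true        # held out from training data
```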
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

Wordle
A multi-turn interactive game. The model plays Wordle by guessing words, receiving letter-by-letter feedback each turn (G = correct position, Y = wrong position, X = not in word), and refining its guesses. This example demonstrates the multi-turn environment loop.

Reward: correct_answer = 1.0 + partial_answer = 0.0 + length_bonus = 0.33 (won in 3 guesses) + format_reward = 0.16 (score 0.8, weight 0.2) = 1.49
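The G/Y/X feedback rule described above can be sketched as follows. wordle_feedback is a hypothetical helper, not the environment's actual implementation; it shows the standard two-pass rule that keeps duplicate letters consistent:

```python
def wordle_feedback(guess: str, answer: str) -> str:
    """Per-letter feedback: G (correct position), Y (in word, wrong position), X (absent)."""
    feedback = ["X"] * len(guess)
    remaining = {}  # counts of answer letters not consumed by exact matches
    # First pass: mark exact matches (G) and tally unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "G"
        else:
            remaining[a] = remaining.get(a, 0) + 1
    # Second pass: mark misplaced letters (Y), respecting remaining counts
    # so a letter is not flagged more times than it appears in the answer.
    for i, g in enumerate(guess):
        if feedback[i] == "X" and remaining.get(g, 0) > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)
```

For example, guessing "crane" against the answer "cable" yields "GXYXG": c and e are in place, a is misplaced, and r and n are absent.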
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

DAPO Math
Mathematical reasoning using the DAPO Math 17K dataset — a diverse collection of ~17K math problems. The model must solve problems and provide the final answer inside \boxed{}. Answers are verified symbolically using math_verify.
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

DeepScaler
Mathematical reasoning for iterative context lengthening, using the DeepScaleR dataset. The model must solve problems and provide the final answer inside \boxed{}. Answers are verified symbolically using math_verify.
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires math_verify for symbolic answer verification. Install it with uv add math-verify.

Code Generation
Code generation with sandboxed test execution. The model generates Python code that is executed against hidden test cases in an isolated sandbox.

Reward: passed = 1.0 (all test cases pass) = 1.0
Config
This config uses 2 GPUs for training and 2 for inference (4 total). Adjust trainer_num_workers and inference_num_workers to match your setup.

This example requires a sandbox provider. Several providers are pre-configured in environments/_sandbox/ (prime, modal, daytona, e2b) — see Sandbox execution.
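Conceptually, the test-execution step looks like the sketch below: write the candidate program out, run it once per (stdin, expected stdout) test case, and grant the all-or-nothing passed reward. This is an illustration only; a bare subprocess is NOT an isolation boundary, which is why the real examples delegate execution to one of the sandbox providers above.

```python
import os
import subprocess
import sys
import tempfile

def run_against_tests(code: str, test_cases: list[tuple[str, str]], timeout: float = 5.0) -> float:
    """Run candidate code against (stdin, expected_stdout) pairs; 1.0 only if all pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        passed = 0
        for stdin, expected in test_cases:
            try:
                result = subprocess.run(
                    [sys.executable, path],
                    input=stdin, capture_output=True, text=True, timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                continue  # a hung test case counts as a failure
            if result.returncode == 0 and result.stdout.strip() == expected.strip():
                passed += 1
        # All-or-nothing, matching the passed = 1.0 reward above.
        return 1.0 if passed == len(test_cases) else 0.0
    finally:
        os.unlink(path)
```

A sandbox provider would replace the subprocess.run call with a remote execution request, keeping untrusted model output off the training machine entirely.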
