Telescope uses a flat, Pydantic-validated configuration schema. Every field name is globally unique and self-descriptive. Typos in YAML keys are caught at load time (extra="forbid").
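A minimal sketch of what `extra="forbid"` buys you, using a hypothetical `TrainConfig` stand-in (not Telescope's actual schema): an unknown key fails validation at load time instead of being silently ignored.

```python
# Illustrative Pydantic v2 model with extra="forbid"; TrainConfig and its
# fields are placeholders, not Telescope's real schema.
from pydantic import BaseModel, ConfigDict, ValidationError

class TrainConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")
    learning_rate: float = 1.0e-6
    group_size: int = 8

# A misspelled key raises instead of being dropped on the floor.
try:
    TrainConfig(**{"learnig_rate": 5e-7})
except ValidationError as e:
    print(e.errors()[0]["type"])  # extra_forbidden
```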

## Config loading

Configuration is resolved by merging three layers:

1. Defaults — `configs/defaults/default_train.yaml` (ships with Telescope)
2. Run config — your YAML file passed via `--config`
3. CLI overrides — individual flags like `--learning_rate 5e-7`

Later layers override earlier ones, so you only need to specify the fields you want to change.
```shell
uv run train.py --config configs/my_run.yaml --learning_rate 5e-7
```
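The three-layer resolution above behaves like a key-by-key dict merge where later layers win. A sketch with made-up values:

```python
# Later layers override earlier ones, key by key. Illustrative only —
# Telescope's loader additionally validates the merged result with Pydantic.
defaults = {"learning_rate": 1.0e-6, "group_size": 8, "number_of_steps": 300}
run_config = {"learning_rate": 2.0e-6, "model": "Qwen/Qwen2.5-7B"}  # from --config
cli_overrides = {"learning_rate": 5e-7}                             # from flags

resolved = {**defaults, **run_config, **cli_overrides}
print(resolved["learning_rate"])  # 5e-07 — the CLI flag wins
print(resolved["group_size"])     # 8 — untouched defaults pass through
```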

## Summary

The most commonly configured parameters for a training run:
```yaml
# Model
model: "Qwen/Qwen2.5-7B"        # HuggingFace model ID or local path
model_dtype: "bfloat16"         # model weight precision

# Environments
environments:
  - name: "hendrycks_math"      # registered environment name
    weight: 1.0                 # sampling weight for multi-env training
    reward_min: 0.0             # expected min reward (for UI charts)
    reward_max: 1.0             # expected max reward (for UI charts)

# Workers
inference_num_workers: 4        # number of vLLM inference servers
trainer_num_workers: 4          # number of FSDP/Megatron trainer workers

# Training
algorithm: "grpo"               # RL algorithm (grpo, rloo, reinforce_pp, dr_grpo, cispo, gspo, sapo)
group_size: 8                   # completions sampled per prompt
learning_rate: 1.0e-6           # optimizer learning rate
prompts_batch_size_for_trainer: 16      # number of groups per training batch
number_of_steps: 300            # total training steps

# Sampling
temperature: 1.0                # sampling temperature
max_tokens: 3700                # max generation tokens per completion
seq_len: 4096                   # max sequence length for packing

# Async
max_async_rollout: 2            # how many steps inference can run ahead (0 = synchronous)

# Checkpointing
checkpoint_every: 50            # save a checkpoint every N steps

# Logging
wandb_project: "my-project"     # Weights & Biases project name
wandb_run_name: "my-run"        # Weights & Biases run name
```
For the full list of parameters, see the sections below.

## General

| Parameter | Type | Default | Description |
|---|---|---|---|
| `debug` | `bool` | `false` | Enable debug mode |

## Model

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `"Qwen/Qwen2.5-3B"` | HuggingFace model identifier or local path |
| `model_dtype` | `"float32" \| "float16" \| "bfloat16"` | `"float32"` | Model weight precision |
| `mixed_precision_dtype` | `"float32" \| "float16" \| "bfloat16" \| null` | `"bfloat16"` | FSDP mixed-precision compute dtype. `null` to disable |

## Environments

At least one environment must be configured. Each entry is an object with these fields:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Environment name (must match a registered environment) |
| `weight` | `float` | `1.0` | Sampling weight for multi-environment training (must be > 0) |
| `reward_min` | `float \| null` | `null` | Expected minimum reward (used by the UI for normalized charts) |
| `reward_max` | `float \| null` | `null` | Expected maximum reward (used by the UI for normalized charts) |
| `kwargs` | `dict` | `{}` | Environment-specific keyword arguments |
```yaml
environments:
  - name: "hendrycks_math"
    weight: 0.5
    reward_min: 0.0
    reward_max: 1.0
  - name: "countdown"
    weight: 0.5
    reward_min: 0.0
    reward_max: 2.0
```
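A plausible reading of the `weight` field is proportional sampling: each prompt is drawn from environment *i* with probability `weight_i / sum(weights)`. A sketch (the exact sampler is internal to Telescope):

```python
# Weighted draw over environments, mirroring the example config above.
# Illustrative only — not Telescope's actual sampling code.
import random

envs = [("hendrycks_math", 0.5), ("countdown", 0.5)]
names = [n for n, _ in envs]
weights = [w for _, w in envs]

random.seed(0)
draws = random.choices(names, weights=weights, k=10_000)
print(draws.count("hendrycks_math") / len(draws))  # ≈ 0.5
```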

## Workers

| Parameter | Type | Default | Description |
|---|---|---|---|
| `inference_num_workers` | `int` | `2` | Number of vLLM inference workers. Must be divisible by `inference_tensor_parallel_size` |
| `inference_tensor_parallel_size` | `int` | `1` | Tensor-parallelism degree for each inference server |
| `trainer_num_workers` | `int` | `2` | Number of FSDP/Megatron trainer workers |

## Orchestrator

| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_concurrent_prompts_per_server` | `int` | `32` | Max concurrent prompts per inference server |
| `prompts_batch_size_for_trainer` | `int` | `16` | Number of prompt groups per training batch |
| `number_of_steps` | `int` | `300` | Total training steps |
| `max_async_rollout` | `int` | `2` | How many training steps inference can run ahead. `0` = fully synchronous |
| `discard_group_zero_advantage` | `bool` | `true` | Drop groups where all samples got identical rewards (no learning signal) |
| `enable_prompt_prefetch` | `bool` | `true` | Prefetch prompts from the dataset |
| `prompt_prefetch_buffer_size` | `int` | `24` | Number of prompts to prefetch |
| `enable_individual_sample_lanes` | `bool` | `true` | Each sample gets its own lane slot instead of one per group |
| `free_lane_after_generation` | `bool` | `true` | Free the inference lane after generation completes (before reward computation). Improves throughput when reward computation is slow (e.g., sandbox environments). Requires `enable_individual_sample_lanes` |
| `max_off_policy_steps` | `int` | `8` | Cancel in-flight rollouts after this many weight updates |
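One way to picture the `max_async_rollout` gate: inference may begin generating rollouts for a future step only while it stays within the configured lookahead of the trainer. A sketch under that assumption (the real orchestrator's bookkeeping is more involved):

```python
# Hypothetical lookahead check: inference at step s may proceed while the
# trainer is on step t only if s - t <= max_async_rollout.
def can_start_rollout(inference_step: int, trainer_step: int,
                      max_async_rollout: int = 2) -> bool:
    return inference_step - trainer_step <= max_async_rollout

print(can_start_rollout(5, 3))  # True — exactly 2 steps ahead is allowed
print(can_start_rollout(6, 3))  # False — would exceed the async window
print(can_start_rollout(3, 3, max_async_rollout=0))  # True — fully synchronous
```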

## Trainer

| Parameter | Type | Default | Description |
|---|---|---|---|
| `learning_rate` | `float` | `1.0e-6` | Learning rate (must be > 0) |
| `weight_decay` | `float` | `0.0` | Weight decay coefficient |
| `grad_clip` | `float` | `1.0` | Maximum gradient norm for clipping |
| `lr_scheduler` | `"none" \| "constant" \| "linear" \| "cosine"` | `"constant"` | LR schedule. `"none"` = fixed LR; `"constant"` = warmup then constant; `"linear"` / `"cosine"` = warmup then decay |
| `warmup_steps` | `int` | `0` | Number of linear warmup steps |
| `min_lr_ratio` | `float` | `0.0` | Minimum LR as a fraction of `learning_rate` for linear/cosine decay. Range: [0, 1] |
| `train_backend` | `"fsdp" \| "megatron"` | `"fsdp"` | Training backend |
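The scheduler fields compose as linear warmup followed by a decay toward `learning_rate * min_lr_ratio`. A sketch of that shape (illustrative, not Telescope's scheduler code):

```python
# LR as a function of step: linear warmup, then constant / linear / cosine
# decay down to learning_rate * min_lr_ratio. Assumed semantics.
import math

def lr_at(step, total_steps, learning_rate=1.0e-6, warmup_steps=10,
          min_lr_ratio=0.1, schedule="cosine"):
    if warmup_steps and step < warmup_steps:
        return learning_rate * (step + 1) / warmup_steps
    if schedule == "constant":
        return learning_rate
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = learning_rate * min_lr_ratio
    if schedule == "linear":
        return min_lr + (learning_rate - min_lr) * (1 - frac)
    # cosine: smooth anneal from learning_rate to min_lr
    return min_lr + (learning_rate - min_lr) * 0.5 * (1 + math.cos(math.pi * frac))

print(lr_at(9, 300))    # end of warmup → 1e-06
print(lr_at(0, 300))    # first warmup step: learning_rate / warmup_steps
print(lr_at(299, 300))  # near the min_lr floor
```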

## Algorithm

| Parameter | Type | Default | Description |
|---|---|---|---|
| `algorithm` | `"grpo" \| "rloo" \| "reinforce_pp" \| "dr_grpo" \| "cispo" \| "gspo" \| "sapo"` | `"grpo"` | RL algorithm. See Algorithms |
| `number_of_minibatches` | `int` | `1` | PPO-style gradient steps per rollout batch. >1 enables multiple update passes |
| `advantage_norm` | `"group" \| "batch"` | `"group"` | Normalize advantages within each prompt group or across the training batch |
| `use_ppo_clip` | `bool` | `false` | Apply PPO ratio clipping. Incompatible with `cispo`, `gspo`, `sapo` |
| `ppo_clip_ref_logprobs` | `"rollout" \| "batch"` | `"rollout"` | Reference logprobs for PPO clipping. `"rollout"` = vLLM logprobs; `"batch"` = recompute with trainer |
| `clip_low` | `float` | `0.4` | PPO clip lower bound: ratio is clamped to `1 - clip_low`. Range: [0, 1] |
| `clip_high` | `float` | `0.5` | PPO clip upper bound: ratio is clamped to `1 + clip_high` |
| `sapo_tau_pos` | `float` | `1.0` | SAPO sigmoid sharpness for positive advantages |
| `sapo_tau_neg` | `float` | `1.05` | SAPO sigmoid sharpness for negative advantages |
| `dr_grpo_loss_agg_mode` | `"token_mean" \| "token_sum_norm"` | `"token_mean"` | DR-GRPO loss aggregation. `"token_sum_norm"` removes response-level length bias |
| `use_tis` | `bool` | `false` | Truncated importance sampling for async vLLM/trainer logprob correction |
| `tis_cap` | `float` | `2.0` | Max importance weight for TIS |
| `tis_logprob_clamp` | `float` | `20.0` | Logprob clamping threshold for TIS |
| `entropy_chunk_size` | `int` | `1024` | Chunk size for entropy calculation |
> **Warning:** `use_tis: true` + `use_ppo_clip: true` + `ppo_clip_ref_logprobs: "rollout"` is invalid — it double-counts the importance-sampling correction. Use `ppo_clip_ref_logprobs: "batch"` instead.
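The asymmetric `clip_low` / `clip_high` bounds can be sketched per token: the importance ratio is clamped to `[1 - clip_low, 1 + clip_high]` and the objective takes the pessimistic surrogate, as in standard PPO. Illustrative only, not Telescope's loss code:

```python
# Per-token PPO-clipped surrogate with asymmetric bounds (defaults match
# the table above: clip_low=0.4, clip_high=0.5). Assumed semantics.
import math

def ppo_clipped_objective(logp_new, logp_ref, advantage,
                          clip_low=0.4, clip_high=0.5):
    ratio = math.exp(logp_new - logp_ref)
    clipped = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # PPO takes the minimum of the unclipped and clipped surrogates.
    return min(ratio * advantage, clipped * advantage)

# A ratio of ~2.0 with positive advantage is capped at 1 + clip_high = 1.5:
print(ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0))  # 1.5
```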

## Rollout / Sampling

| Parameter | Type | Default | Description |
|---|---|---|---|
| `group_size` | `int` | `8` | Number of completions sampled per prompt |
| `temperature` | `float` | `1.0` | Sampling temperature |
| `top_p` | `float \| null` | `null` | Nucleus sampling probability. `null` = disabled |
| `max_tokens` | `int` | `3700` | Maximum generation tokens per completion |
| `interleaved_rollouts` | `bool` | `true` | Exact token reuse across turns — avoids re-tokenization mismatches in multi-turn environments |

## Sequence Packing

| Parameter | Type | Default | Description |
|---|---|---|---|
| `seq_len` | `int` | `4096` | Maximum sequence length for packing |
| `pad_to_multiple_of` | `int` | `64` | Pad packed sequences to this alignment |
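A plausible sketch of the alignment arithmetic: a packed sequence is rounded up to the next multiple of `pad_to_multiple_of`, capped at `seq_len`. (Assumed behavior — Telescope's packer may differ in detail.)

```python
# Round a packed length up to the next multiple, capped at seq_len.
def padded_length(packed_len, pad_to_multiple_of=64, seq_len=4096):
    padded = -(-packed_len // pad_to_multiple_of) * pad_to_multiple_of  # ceil
    return min(padded, seq_len)

print(padded_length(1000))  # 1024
print(padded_length(4096))  # 4096 — already aligned, capped at seq_len
```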

## Inference Server

| Parameter | Type | Default | Description |
|---|---|---|---|
| `inference_host` | `str` | `"0.0.0.0"` | Host address for inference servers |
| `inference_base_port` | `int` | `8100` | Base port for inference servers |
| `gpu_memory_utilization` | `float` | `0.9` | Fraction of GPU memory allocated to vLLM. Range: (0, 1] |
| `max_model_len` | `int` | `4096` | Maximum context length for vLLM |
| `vllm_scheduling_policy` | `"priority" \| "fcfs"` | `"priority"` | `"priority"` = turn-aware scheduling for multi-turn; `"fcfs"` = first-come, first-served |
| `enable_thinking` | `bool` | `false` | Pass `enable_thinking=True` to `apply_chat_template()` for models with native thinking support |
| `chat_template` | `str \| null` | `null` | Jinja2 chat template string to override the tokenizer's built-in template. Required for base models that don't include a chat template in their tokenizer |

## Checkpointing

See Checkpointing for detailed usage.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `checkpoint_every` | `int \| bool` | `false` | Save every N steps. `0` or `false` to disable |
| `checkpoint_save_training_state` | `bool` | `true` | Include optimizer/scheduler state for resume. `false` = weights-only |
| `resume_from_checkpoint` | `bool \| int` | `false` | `true` = resume from latest; integer = resume from specific step |
| `checkpoint_dir` | `str \| null` | `null` | Custom checkpoint path. Default: `RUN_DIR/checkpoints` |
| `checkpoint_keep_last` | `int \| null` | `null` | Keep only the N most recent checkpoints |
| `checkpoint_keep_every` | `int \| null` | `null` | Always keep checkpoints at these step multiples |
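One way the two retention knobs could compose: keep the N most recent checkpoints, plus any step that is a multiple of `checkpoint_keep_every`. A sketch under that assumption (Telescope's pruning logic may differ):

```python
# Hypothetical retention policy combining keep_last and keep_every.
def steps_to_keep(saved_steps, keep_last=3, keep_every=100):
    recent = set(sorted(saved_steps)[-keep_last:]) if keep_last else set(saved_steps)
    pinned = {s for s in saved_steps if keep_every and s % keep_every == 0}
    return sorted(recent | pinned)

print(steps_to_keep([50, 100, 150, 200, 250, 300]))
# [100, 200, 250, 300] — the last three plus the multiples of 100
```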

## Evals

See Evals for detailed usage.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `eval_before_training` | `bool` | `true` | Run evaluations before training starts |
| `eval_after_training` | `bool` | `true` | Run evaluations after training ends |
| `eval_num_servers` | `int` | `1` | Number of dedicated eval servers reserved from the inference pool. `0` = no periodic evals during training |
| `eval_start_end_use_all_servers` | `bool` | `true` | Use all inference servers for start/end evals |

Each eval entry supports:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Eval name (environment name or dedicated eval like `math500`) |
| `eval_every` | `int` | `10` | Run this eval every N training steps |
| `num_samples` | `int` | `-1` | Number of samples to evaluate. `-1` = full dataset |
| `separate_eval_samples` | `bool` | `false` | Use a separate set of eval samples |
| `kwargs` | `dict` | `{}` | Environment-specific keyword arguments |
| `pass_k` | `dict` | `{}` | pass@k / pass^k configuration |
| `temperature` | `float \| null` | `null` | Override sampling temperature (`null` = inherit from training config) |
| `top_p` | `float \| null` | `null` | Override nucleus sampling (`null` = inherit) |
| `max_tokens` | `int \| null` | `null` | Override max tokens (`null` = inherit) |
```yaml
evals:
  - name: "math500"
    eval_every: 10
    num_samples: 500
    kwargs:
      dataset_name: "HuggingFaceH4/MATH-500"
      dataset_split: "test"
    temperature: 1.0
    max_tokens: 3000
```

## Logging

| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_wandb` | `bool` | `true` | Enable Weights & Biases logging |
| `wandb_project` | `str` | `"telescope"` | W&B project name |
| `wandb_run_name` | `str` | `"telescope_run"` | W&B run name |
| `wandb_tags` | `list[str]` | `["telescope"]` | W&B tags |
| `wandb_upload_code` | `bool` | `true` | Upload source code to W&B |
| `wandb_upload_logs` | `bool` | `true` | Upload trainer/inference logs to W&B |
| `wandb_upload_logs_detailed` | `bool` | `false` | Include detailed logs in W&B upload |
| `wandb_upload_logs_stdout` | `bool` | `false` | Include stdout logs in W&B upload |
| `wandb_code_max_file_size_mb` | `float` | `2.0` | Max file size (MB) for code upload |
| `wandb_code_exclude_patterns` | `list[str]` | `[".git", ".venv", ...]` | Glob patterns to exclude from code upload |
| `system_metrics_collection_interval_seconds` | `float` | `1.0` | System metrics collection interval |
| `torch_memory_sample_interval_seconds` | `float` | `0.1` | PyTorch memory sampling interval |
| `event_tail_window_seconds` | `int` | `60` | Tail window for event aggregation |
| `event_block_duration_seconds` | `int` | `1800` | Duration of event blocks |
| `event_upload_interval_seconds` | `int` | `5` | Event upload interval |
| `metrics_logger_interval_seconds` | `float` | `2.0` | Metrics logger flush interval |
| `ray_torch_memory_drain_interval_seconds` | `float` | `0.5` | Interval for draining Ray torch memory metrics |
| `rollout_block_size` | `int` | `500` | Block size for rollout event grouping |
| `track_gpu_events` | `bool` | `true` | Track GPU utilization events |

## Ray Cluster

| Parameter | Type | Default | Description |
|---|---|---|---|
| `ray_address` | `str` | `"auto"` | Ray cluster address |
| `ray_auto_start_local` | `bool` | `true` | Auto-start a local Ray cluster if none is running |
| `ray_namespace` | `str` | `"telescope"` | Ray namespace |
| `ray_log_to_driver` | `bool` | `true` | Forward worker logs to the driver process |
| `ray_runtime_env` | `dict \| null` | `null` | Custom Ray runtime environment |
| `ray_disable_runtime_env_hook` | `bool` | `true` | Disable Ray's runtime environment hook |
| `ray_pin_py_executable` | `bool` | `true` | Pin the Python executable path in workers |
| `ray_propagate_active_venv` | `bool` | `true` | Propagate the active virtualenv to workers |
| `ray_propagate_run_dir` | `bool` | `true` | Propagate the run directory to workers |
| `ray_broadcast_init_timeout_s` | `int` | `300` | Timeout for weight broadcast initialization |
| `ray_broadcast_prefer_loopback_if_single_node` | `bool` | `true` | Prefer loopback interface for single-node broadcasts |
| `ray_shutdown_on_exit` | `bool` | `false` | Shut down the Ray cluster on exit |
| `ray_inference_cpus_per_worker` | `float` | `4.0` | CPU cores allocated per inference worker |
| `ray_trainer_cpus_per_worker` | `float` | `4.0` | CPU cores allocated per trainer worker |
| `ray_inference_placement_strategy` | `"PACK" \| "SPREAD" \| "STRICT_PACK" \| "STRICT_SPREAD"` | `"PACK"` | Placement strategy for inference workers |
| `ray_trainer_placement_strategy` | `"PACK" \| "SPREAD" \| "STRICT_PACK" \| "STRICT_SPREAD"` | `"PACK"` | Placement strategy for trainer workers |
| `ray_placement_timeout_s` | `int` | `900` | Timeout for Ray placement group creation |

## Weight Broadcasting

| Parameter | Type | Default | Description |
|---|---|---|---|
| `weight_broadcast_mode` | `"flattened_bucket" \| "per_tensor"` | `"flattened_bucket"` | Strategy for broadcasting updated weights to inference servers |
| `weight_broadcast_bucket_mb` | `int` | `256` | Bucket size in MB for flattened broadcast |
| `weight_broadcast_cpu_staging` | `bool` | `false` | Use CPU staging buffer for weight transfers |
| `weight_broadcast_pin_memory` | `bool` | `true` | Use pinned CPU memory for faster H2D transfers |
| `weight_broadcast_free_grad_buffers` | `bool` | `true` | Free Megatron gradient buffers during broadcast (saves ~14 GB) |

## Megatron

These settings apply only when `train_backend: "megatron"`; they are ignored under `train_backend: "fsdp"`.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `megatron_tensor_parallel_size` | `int` | `1` | Tensor parallelism degree |
| `megatron_pipeline_parallel_size` | `int` | `1` | Pipeline parallelism degree |
| `megatron_context_parallel_size` | `int` | `1` | Context parallelism degree |
| `megatron_expert_parallel_size` | `int` | `1` | Expert parallelism degree (MoE models) |
| `megatron_global_batch_size` | `int \| null` | `null` | Global batch size. Defaults to `prompts_batch_size_for_trainer` |
| `megatron_disable_unified_memory_jit` | `bool` | `true` | Disable unified memory JIT |
| `megatron_optimizer_cpu_offload` | `bool` | `false` | Offload optimizer states to CPU |
| `megatron_optimizer_offload_fraction` | `float` | `1.0` | Fraction of optimizer state to offload. Range: [0, 1] |
| `megatron_overlap_cpu_optimizer_d2h_h2d` | `bool` | `true` | Overlap CPU optimizer D2H/H2D transfers |
| `megatron_use_precision_aware_optimizer` | `bool` | `false` | Use precision-aware optimizer |
| `megatron_main_grads_dtype` | `str` | `"float32"` | Data type for main gradients |
| `megatron_main_params_dtype` | `str` | `"float32"` | Data type for main parameters |
| `megatron_exp_avg_dtype` | `str` | `"float32"` | Data type for the Adam exponential average |
| `megatron_exp_avg_sq_dtype` | `str` | `"float32"` | Data type for the Adam squared exponential average |
| `megatron_grad_reduce_in_fp32` | `bool` | `true` | Reduce gradients in FP32. `false` halves grad buffer memory (~14 GB savings) |
| `megatron_gradient_checkpointing` | `bool` | `true` | Recompute activations in the backward pass (~15-20 GB memory savings) |
| `megatron_sequence_parallel` | `bool` | `true` | Shard the sequence dimension in LayerNorm/dropout (requires TP > 1) |
| `megatron_use_distributed_optimizer` | `bool` | `true` | Shard optimizer states across DP ranks (requires DP > 1) |
| `megatron_overlap_grad_reduce` | `bool` | `false` | Overlap gradient allreduce with backward compute |
| `megatron_use_transformer_engine` | `bool` | `false` | Use the TransformerEngine layer spec (fused attention, FP8-ready) |
| `megatron_fp8` | `bool` | `false` | Enable FP8 compute (requires TransformerEngine + Hopper GPU) |

## vLLM Tracing

| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_vllm_tracing` | `bool` | `true` | Enable OpenTelemetry tracing for vLLM |
| `otlp_receiver_port` | `int` | `4318` | OTLP HTTP receiver port. Range: [1, 65535] |