Async training
Inference and training run concurrently on separate GPU pools. While the trainer updates weights, inference servers keep generating samples for the next batch. This eliminates idle GPU time and significantly increases throughput. Controlled by max_async_rollout. See Async Training.
Stale rollout cancellation
When training runs ahead of inference, older in-flight rollouts become off-policy. Telescope automatically cancels rollouts that are too many weight updates behind, avoiding wasted compute on stale samples. Controlled by max_off_policy_steps.
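The staleness check can be sketched as follows. This is an illustrative snippet, not Telescope's implementation: each in-flight rollout is assumed to record the weight version it was sampled with, and is cancelled once the trainer advances more than max_off_policy_steps updates past it.

```python
# Illustrative stale-rollout check: a rollout is cancelled when it falls
# more than max_off_policy_steps weight updates behind the trainer.

def is_stale(rollout_version: int, trainer_version: int,
             max_off_policy_steps: int = 2) -> bool:
    """True if the rollout was sampled too many weight updates ago."""
    return trainer_version - rollout_version > max_off_policy_steps

in_flight = [{"id": 0, "version": 5}, {"id": 1, "version": 7}, {"id": 2, "version": 8}]
trainer_version = 8
kept = [r["id"] for r in in_flight if not is_stale(r["version"], trainer_version)]
print(kept)  # → [1, 2]: rollout 0 is 3 updates behind and gets cancelled
```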
Truncated importance sampling (TIS)
Corrects for the logprob mismatch between inference-time and training-time weights caused by async training. Importance sampling weights are clamped to a maximum value, trading a small bias for a bounded-variance gradient estimate. Controlled by use_tis and tis_cap.
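A minimal sketch of the TIS weight computation, assuming per-token logprobs from both policies are available (values and the cap are illustrative, not Telescope defaults):

```python
# Truncated importance sampling: per-token ratio between training-time and
# inference-time probabilities, clamped at tis_cap before weighting the loss.
import math

def tis_weights(train_logprobs, rollout_logprobs, tis_cap=2.0):
    """Importance weights exp(logp_train - logp_rollout), clamped to tis_cap."""
    return [min(math.exp(lt - lr), tis_cap)
            for lt, lr in zip(train_logprobs, rollout_logprobs)]

# A token the updated policy now finds much more likely has its weight capped:
w = tis_weights([-0.1, -2.0], [-1.5, -1.9])
print(w)  # first weight capped at tis_cap=2.0; second ≈ exp(-0.1)
```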
Sequence packing
Multiple samples are packed into a single sequence up to seq_len tokens, so the trainer processes dense batches without wasting compute on padding. Attention masks ensure samples don’t attend to each other.
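The masking idea can be sketched in plain Python (a toy block-diagonal causal mask; real implementations build this on the GPU or pass sequence boundaries to a varlen attention kernel):

```python
# Sequence packing sketch: samples are concatenated into one row, and a
# block-diagonal causal mask keeps tokens from attending across samples.

def block_diagonal_mask(sample_lengths):
    """mask[i][j] is True iff token i may attend to token j (same sample, j <= i)."""
    # Assign each packed position the index of the sample it came from.
    sample_id = [s for s, n in enumerate(sample_lengths) for _ in range(n)]
    total = len(sample_id)
    return [[sample_id[i] == sample_id[j] and j <= i for j in range(total)]
            for i in range(total)]

mask = block_diagonal_mask([2, 3])  # two samples packed into 5 positions
assert mask[1][0] is True    # causal attention within a sample is allowed
assert mask[2][1] is False   # the second sample cannot see the first
```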
Minibatches
Each rollout batch can be split into multiple minibatches for multiple gradient steps per batch (PPO-style). This extracts more learning signal from each set of rollouts, useful when rollout generation is expensive. Controlled by number_of_minibatches.
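The splitting itself is simple; a sketch (not Telescope's code) that divides a batch into near-equal contiguous minibatches:

```python
# Minibatch splitting sketch: one gradient step is then taken per chunk.

def split_minibatches(samples, number_of_minibatches=4):
    """Split samples into contiguous, near-equal minibatches."""
    size, extra = divmod(len(samples), number_of_minibatches)
    out, start = [], 0
    for i in range(number_of_minibatches):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        out.append(samples[start:end])
        start = end
    return out

print([len(b) for b in split_minibatches(list(range(10)), 4)])  # → [3, 3, 2, 2]
```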
Prompt prefetch
The orchestrator pre-tokenizes and prepares upcoming prompts in a background buffer while the current batch is being generated. This eliminates prompt preparation latency between rollout groups. Controlled by enable_prompt_prefetch and prompt_prefetch_buffer_size.
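The pattern is a standard producer/consumer buffer. A sketch under simplifying assumptions (the tokenizer is a stand-in, and the buffer size mirrors prompt_prefetch_buffer_size for illustration only):

```python
# Prompt prefetch sketch: a background thread prepares prompts into a
# bounded queue while the current batch is still generating.
import queue
import threading

def start_prefetcher(prompts, tokenize, prompt_prefetch_buffer_size=8):
    buffer = queue.Queue(maxsize=prompt_prefetch_buffer_size)

    def worker():
        for p in prompts:
            buffer.put(tokenize(p))  # blocks when the buffer is full
        buffer.put(None)             # sentinel: no more prompts

    threading.Thread(target=worker, daemon=True).start()
    return buffer

buf = start_prefetcher(["a b", "c d e"], tokenize=str.split)
print(buf.get())  # → ['a', 'b'], already prepared in the background
print(buf.get())  # → ['c', 'd', 'e']
```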
Zero-advantage filtering
Groups where all samples received identical rewards carry no learning signal (zero advantage). Telescope drops these from the training batch so the trainer only processes informative samples. Controlled by discard_group_zero_advantage.
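The filter reduces to a one-liner: with group-relative advantages (reward minus group mean), a group whose rewards are all identical contributes zero gradient. An illustrative sketch:

```python
# Zero-advantage filtering sketch: drop groups whose rewards are all equal.

def filter_zero_advantage(groups):
    """Keep only groups whose rewards are not all identical."""
    return [g for g in groups if len(set(g["rewards"])) > 1]

batch = [
    {"prompt": "p0", "rewards": [1.0, 1.0, 1.0, 1.0]},  # all correct: no signal
    {"prompt": "p1", "rewards": [0.0, 1.0, 0.0, 1.0]},  # mixed: informative
    {"prompt": "p2", "rewards": [0.0, 0.0, 0.0, 0.0]},  # all wrong: no signal
]
print([g["prompt"] for g in filter_zero_advantage(batch)])  # → ['p1']
```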
Multi-environment training
Train on multiple environments simultaneously with weighted sampling. Each environment contributes prompts proportionally to its weight, and rewards are normalized per-environment. See Environments.
FSDP and Megatron backends
Two training backends cover models from 1B to 100B+ parameters. FSDP (data parallel) is simpler and works well up to ~14B. Megatron adds tensor, pipeline, context, and expert parallelism for larger models. See Architecture.
In-place weight synchronization
After each training step, updated weights are broadcast from the trainer to all vLLM inference servers via NCCL. The vLLM workers update their model in-place without restarting, so inference immediately uses the latest weights with minimal overhead.
Multi-turn environments
Environments can define interactive loops where the model and environment exchange messages over multiple rounds. The orchestrator manages the full trajectory, and reward is computed over the complete interaction. See Environments.
Tool environment base class
ToolEnvironment extends MultiTurnEnvironment with built-in tool calling support. Define Python functions as tools, and the environment automatically converts them to OpenAI-compatible schemas, parses tool calls from model output (XML tags by default), executes them, and returns results. Handles error recovery and tracks tool usage metrics. See Tool Calling & Agentic Training.
Sandbox execution
For code generation and agentic environments, Telescope provides a pluggable sandbox system for safe code execution. Supports multiple providers (Prime, Modal, Daytona, E2B) behind a unified API, so environment code works the same regardless of infrastructure. Each sandbox gets configurable CPU, memory, disk, and timeout limits. See Tool Calling — Sandbox execution.
Interleaved rollouts
In multi-turn environments, token IDs are reused exactly across turns instead of re-tokenizing from text. This avoids subtle tokenization mismatches between turns that could corrupt logprob computation. Enabled by default with interleaved_rollouts.
7 RL algorithms
GRPO, RLOO, REINFORCE++, DR-GRPO, CISPO, GSPO, and SAPO — each with different tradeoffs for advantage estimation and gradient weighting. All can be combined with PPO clipping for trust-region updates. See Algorithms.
PPO clipping
PPO-style ratio clipping can be layered on top of GRPO, RLOO, REINFORCE++, or DR-GRPO for trust-region updates. Clipping bounds are configurable with clip_low and clip_high, and reference logprobs can come from the rollout or be recomputed by the trainer. Controlled by use_ppo_clip.
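A per-token sketch of the clipped surrogate with asymmetric bounds, as the clip_low / clip_high options suggest (the values are illustrative, not defaults):

```python
# PPO clipping sketch: the policy ratio r is clipped to
# [1 - clip_low, 1 + clip_high], and the objective takes the pessimistic
# (minimum) of the clipped and unclipped terms.

def ppo_clip_objective(ratio, advantage, clip_low=0.2, clip_high=0.2):
    """Clipped surrogate objective (to be maximized) for one token."""
    clipped = max(1.0 - clip_low, min(ratio, 1.0 + clip_high))
    return min(ratio * advantage, clipped * advantage)

# A large ratio with positive advantage is capped, bounding the update size:
print(ppo_clip_objective(1.8, advantage=1.0))  # → 1.2
print(ppo_clip_objective(0.5, advantage=1.0))  # → 0.5 (unclipped term is smaller)
```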
Group and batch advantage normalization
Advantages can be normalized within each prompt group (advantage_norm: "group") or across the full training batch (advantage_norm: "batch"). Group normalization is the default and works well for most setups; batch normalization is required by REINFORCE++ and can stabilize training with large batch sizes.
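The difference between the two modes can be shown with standard z-score normalization (a sketch, not Telescope's exact numerics):

```python
# advantage_norm sketch: "group" normalizes within each prompt group,
# "batch" normalizes across all samples, preserving cross-group scale.
import statistics

def normalize(values, eps=1e-8):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

groups = [[1.0, 0.0], [10.0, 0.0]]  # two prompt groups with different scales

group_adv = [normalize(g) for g in groups]   # advantage_norm: "group"
batch_adv = normalize([r for g in groups for r in g])  # advantage_norm: "batch"

print(group_adv)  # both groups map to [+1, -1]: scale differences removed
print(batch_adv)  # the 10.0 reward dominates the whole batch
```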
Periodic evals
Run evaluations on dedicated inference servers during training without interrupting rollout generation. Supports pass@k metrics, custom eval benchmarks, and baseline/final evals. See Evals.
Separate eval samples
When using the same environment for both training and evaluation, separate_eval_samples reserves the first N samples exclusively for eval and excludes them from training. This prevents data leakage between training and evaluation.
Checkpointing and resume
Save checkpoints periodically with configurable retention policies. Resume training from any checkpoint with full state restoration (model, optimizer, scheduler, orchestrator counters, dataset position). Convert checkpoints to HuggingFace format for deployment. See Checkpointing.
HuggingFace checkpoint conversion
A built-in converter transforms native training checkpoints (FSDP DCP or Megatron dist_checkpointing) into standard HuggingFace format (safetensors + config.json + tokenizer). Works for both single and batch conversion, and handles large models with automatic sharding. See Checkpointing.
FP8 training
On Hopper GPUs (H100), the Megatron backend supports FP8 compute via Transformer Engine for faster training with lower memory usage. Controlled by megatron_fp8.
Mixed precision
FSDP uses bfloat16 mixed precision by default for efficient training. The model dtype and mixed precision dtype are independently configurable. Controlled by model_dtype and mixed_precision_dtype.
Flash Attention 2
The FSDP backend uses Flash Attention 2 for memory-efficient packed-sequence training when available, falling back to PyTorch SDPA if flash-attn is not installed. Flash Attention’s varlen kernel handles variable-length sequences correctly through the attention mask, which is critical for sequence packing.
Gradient checkpointing
Both backends support gradient checkpointing (activation recomputation) to reduce GPU memory usage by ~15–20 GB at the cost of additional compute. Enabled by default in FSDP and configurable for Megatron via megatron_gradient_checkpointing.
CPU optimizer offload
Megatron can offload optimizer states to CPU memory, freeing GPU memory for larger models or batch sizes. Overlaps D2H/H2D transfers with compute to minimize overhead. Controlled by megatron_optimizer_cpu_offload.
Distributed optimizer
Shards optimizer states across data-parallel ranks so each GPU only stores a fraction of the full optimizer. Reduces per-GPU memory proportionally to the number of DP ranks. Controlled by megatron_use_distributed_optimizer.
Sequence parallelism
Shards the sequence dimension in LayerNorm and dropout regions across tensor-parallel ranks, reducing activation memory. Requires TP > 1. Controlled by megatron_sequence_parallel.
Context parallelism
Shards the sequence dimension across multiple GPUs for training on very long sequences that don’t fit in a single GPU’s memory. Controlled by megatron_context_parallel_size.
Expert parallelism
For Mixture-of-Experts models, shards expert layers across GPUs so each GPU handles a subset of experts. Controlled by megatron_expert_parallel_size.
LR scheduling
Supports constant, linear decay, and cosine decay schedules with configurable warmup steps and minimum LR ratio. Controlled by lr_scheduler, warmup_steps, and min_lr_ratio.
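The cosine variant with warmup and a floor can be sketched as follows (parameter values are illustrative, not defaults):

```python
# Cosine LR schedule sketch: linear warmup to base_lr, then cosine decay
# down to base_lr * min_lr_ratio.
import math

def cosine_lr(step, base_lr=1e-5, warmup_steps=10, total_steps=100,
              min_lr_ratio=0.1):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    min_lr = base_lr * min_lr_ratio
    return min_lr + (base_lr - min_lr) * cosine

print(cosine_lr(9))    # → 1e-05, end of warmup: full base LR
print(cosine_lr(100))  # → 1e-06, fully decayed to base_lr * min_lr_ratio
```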
GPU timeline logging
Records fine-grained GPU events during each training step using CUDA events for accurate timing. Events include forward pass, backward pass, gradient reduction, optimizer step, and weight broadcast — visualized as a Gantt chart in the Telescope UI.
vLLM tracing
Captures per-request OpenTelemetry traces from vLLM inference servers, including queue time, time to first token, prefill time, and decode time. Traces are collected via a built-in OTLP receiver and logged alongside training events. Controlled by enable_vllm_tracing.
System metrics collection
Continuously monitors GPU utilization, memory, temperature, power draw, clock speeds, and CPU/system memory across all nodes. Metrics are sampled every second and uploaded to W&B for real-time hardware monitoring.
vLLM metrics collection
Scrapes Prometheus metrics from each vLLM inference server, including running/waiting requests, KV cache utilization, prefix cache hit rate, token throughput, and request latencies. Provides real-time visibility into inference server health during training.
Placement strategies
Control how Ray distributes inference and trainer workers across nodes. Pack workers together to maximize NVLink usage, or spread them for better fault isolation. See Multi-Node Training.
Priority scheduling
vLLM inference uses priority-aware scheduling that prioritizes earlier turns in multi-turn rollouts, reducing head-of-line blocking and improving multi-turn throughput. Controlled by vllm_scheduling_policy.
Configurable weight broadcast
Weight synchronization supports bucketed flattening for throughput or per-tensor mode for memory efficiency. Options for CPU staging, pinned memory, and freeing Megatron grad buffers during broadcast to save ~14 GB of GPU memory.
Individual sample lanes
Each sample gets its own concurrency slot instead of one per prompt group, improving throughput for environments with high variance in completion time. Enabled by default. Controlled by enable_individual_sample_lanes.
Free lane after generation
When enabled, inference lane slots are freed as soon as generation completes — before reward computation starts. This allows queued samples to begin generating immediately, improving throughput when reward computation is slow (e.g., sandbox-based environments like code generation). Enabled by default. Requires enable_individual_sample_lanes. Controlled by free_lane_after_generation.
Log upload
Trainer and inference process logs are captured via file tailers and uploaded to W&B as compressed archives. The Telescope UI can then display these logs in its Logs page, with filtering by component, level, and source. Controlled by wandb_upload_logs, wandb_upload_logs_detailed, and wandb_upload_logs_stdout.
Reward normalization
When training on multiple environments with different reward scales, reward_min and reward_max per environment allow Telescope to normalize rewards to a common range. This prevents environments with larger reward magnitudes from dominating the training signal. See Metrics.
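A sketch of min/max reward normalization, with hypothetical per-environment bounds (the environment names and values below are illustrative):

```python
# Per-environment reward normalization sketch: each environment's rewards
# are mapped to [0, 1] using its reward_min / reward_max bounds.

ENV_BOUNDS = {"math": (0.0, 1.0), "code": (-10.0, 10.0)}  # hypothetical

def normalize_reward(env: str, reward: float) -> float:
    lo, hi = ENV_BOUNDS[env]
    return (reward - lo) / (hi - lo)

print(normalize_reward("math", 1.0))  # → 1.0
print(normalize_reward("code", 0.0))  # → 0.5: mid-range, now on a common scale
```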

