Telescope supports training models that call tools during rollouts via the ToolEnvironment base class, which provides environment-level tool calling.

ToolEnvironment

ToolEnvironment extends MultiTurnEnvironment with built-in tool calling support. You define tools as Python functions, and the class handles schema generation, prompt injection, tool call parsing, execution, and result formatting.

Defining tools

Tools are regular Python functions with type hints and docstrings:
def add(a: float, b: float) -> float:
    """Add two numbers together."""
    return a + b

def subtract(a: float, b: float) -> float:
    """Subtract b from a."""
    return a - b
Pass them to the constructor:
class MyToolEnvironment(ToolEnvironment):
    def __init__(self, **kwargs):
        super().__init__(
            tools=[add, subtract, multiply, divide],
            max_turns=5,
            tool_call_format="xml",
            system_prompt="You are a math assistant. Use tools for calculations.",
            **kwargs,
        )
Telescope automatically converts each function to an OpenAI-compatible tool schema using func_to_tool_schema(). The type hints map to JSON Schema types (float → number, str → string, int → integer, bool → boolean), and the docstring becomes the tool description.
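For the add function above, the generated schema might look like the following sketch (the field layout assumes the standard OpenAI tool-schema convention; the exact output of func_to_tool_schema() may differ):

```python
# Sketch of what func_to_tool_schema(add) might produce — an assumption based
# on the OpenAI tool-schema convention, not verified Telescope output.
add_schema = {
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two numbers together.",  # from the docstring
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number"},  # float → number
                "b": {"type": "number"},
            },
            "required": ["a", "b"],
        },
    },
}
```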

How tool calls are processed

Each turn follows this cycle:
Model output → parse_tool_calls() → execute_tool() → format_tool_result() → next turn
  1. Parse — The model’s completion is scanned for tool calls. By default, XML format is used with a JSON object inside the tags:
    <tool_call>
    {"name": "add", "arguments": {"a": 15, "b": 7}}
    </tool_call>
    
  2. Execute — Each parsed ToolCall is executed by calling the corresponding Python function with the parsed arguments.
  3. Format — Results are formatted as tool response messages and appended to the conversation.
  4. Check — is_final_answer() determines if the model is providing a final answer (no tool calls) or wants to continue using tools.
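The parse step above can be sketched with a small regex-based extractor (illustrative only; the real parse_tool_calls() implementation may differ):

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract JSON tool-call objects from <tool_call>...</tool_call> tags."""
    calls = []
    for body in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL):
        try:
            calls.append(json.loads(body))
        except json.JSONDecodeError:
            pass  # skip malformed tool calls
    return calls

completion = '<tool_call>\n{"name": "add", "arguments": {"a": 15, "b": 7}}\n</tool_call>'
print(parse_tool_calls(completion))
# → [{'name': 'add', 'arguments': {'a': 15, 'b': 7}}]
```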

Building a ToolEnvironment

The minimal subclass needs load_dataset and compute_reward:
from telescope.environments.tool_env import ToolEnvironment
from telescope.environments.base import Sample, RewardResult, RolloutState
from telescope.environments.parsers import extract_xml_tag


class MyToolEnv(ToolEnvironment):
    def __init__(self, **kwargs):
        super().__init__(
            tools=[my_tool_a, my_tool_b],
            max_turns=8,
            system_prompt="Use tools to solve the task.",
            **kwargs,
        )

    def load_dataset(self, num_samples=-1, **kwargs):
        # Load and return list[Sample]
        ...

    def is_final_answer(self, completion, state):
        """Check if the model is giving a final answer instead of a tool call."""
        answer = extract_xml_tag(completion, "answer")
        if answer:
            return True
        return len(self.parse_tool_calls(completion)) == 0

    def compute_reward(self, state, eos_token=""):
        tool_metrics = self.get_tool_metrics(state)
        correct = check_answer(state)  # your verification logic
        return RewardResult(
            total_reward=1.0 if correct else 0.0,
            sample_metrics={**tool_metrics, "correct": float(correct)},
        )

Override points

| Method | Default behavior | When to override |
| --- | --- | --- |
| parse_tool_calls(text) | Parses XML-tagged tool calls | Custom tool call format |
| execute_tool(tool_call) | Calls the matching Python function | Tools need side effects, async I/O, or sandbox execution |
| format_tool_result(result) | Formats as XML tool response | Custom result formatting |
| is_final_answer(completion, state) | True if no tool calls found | Custom completion detection (e.g., <answer> tags) |

Tool metrics

get_tool_metrics(state) returns a dict with usage stats from the trajectory:
{
    "total_tool_calls": 3,
    "tool_success_count": 2,
    "tool_error_count": 1,
    "tool_success_rate": 0.67,
    "unique_tools_used": 2,
    "add_calls": 2,         # per-tool call counts
    "subtract_calls": 1,
}
These are useful both for reward computation (e.g., penalizing excessive tool use) and for monitoring via sample_metrics. See Metrics for details on how sample metrics are tracked and displayed in the UI.
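A reward that penalizes excessive tool use might combine correctness with these metrics as follows (a hypothetical shaping scheme, not part of Telescope; the budget and penalty values are illustrative):

```python
def reward_with_tool_penalty(correct: bool, tool_metrics: dict,
                             budget: int = 5, penalty: float = 0.05) -> float:
    """Base reward for correctness, minus a small penalty for each tool
    call beyond the budget. Floors at 0.0 so the reward stays non-negative."""
    base = 1.0 if correct else 0.0
    excess = max(0, tool_metrics.get("total_tool_calls", 0) - budget)
    return max(0.0, base - penalty * excess)

print(round(reward_with_tool_penalty(True, {"total_tool_calls": 8}), 2))
# → 0.85
```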

Sandbox execution

For environments that need to execute code (not just call Python functions), Telescope provides a pluggable sandbox system.

SandboxConfig

from telescope.environments._sandbox import SandboxConfig, get_provider

config = SandboxConfig(
    image="python:3.11-slim",
    cpu=2,
    memory_mb=4096,
    disk_size_gb=10,
    gpu_count=0,
    timeout_seconds=300,
    environment_vars={"MY_VAR": "value"},
    name="my-sandbox",             # optional identifier
    extra={"template": "custom"},  # provider-specific overrides
)

provider = get_provider("prime")  # or "modal", "daytona", "e2b"
handle = await provider.create(config)
result = await provider.execute(handle, "python -c 'print(1+1)'", timeout=30)
# result.exit_code, result.stdout, result.stderr
await provider.destroy(handle)

Supported providers

Telescope is agnostic to which sandbox provider is used — any provider that implements the SandboxProvider interface (create, execute, upload_bytes, upload_file, destroy) will work. For convenience, the following providers come pre-configured:
| Provider | Description | Credentials |
| --- | --- | --- |
| prime | Prime infrastructure | PRIME_API_KEY env var or prime login |
| modal | Cloud-based sandboxes with fast cold starts | MODAL_TOKEN_ID env var or Modal SDK auth |
| daytona | Self-hosted sandbox environments | DAYTONA_API_KEY or DAYTONA_JWT_TOKEN env var |
| e2b | Cloud sandboxes for prototyping and development | E2B_API_KEY env var |
All providers validate credentials at startup and fail fast if the required SDK package is missing or credentials are invalid.
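To illustrate the interface shape, a hypothetical host-local provider might run commands via a subprocess (this is not one of the bundled providers, and only the create/execute/destroy subset is sketched; upload_bytes/upload_file are omitted for brevity):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ExecResult:
    exit_code: int
    stdout: str
    stderr: str

class LocalProvider:
    """Hypothetical provider that runs commands on the host machine,
    mirroring the create/execute/destroy shape of the provider interface."""

    async def create(self, config) -> str:
        # A real provider would provision an isolated sandbox here.
        return "local-0"

    async def execute(self, handle: str, command: str, timeout: int = 30) -> ExecResult:
        proc = await asyncio.create_subprocess_shell(
            command,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        out, err = await asyncio.wait_for(proc.communicate(), timeout)
        return ExecResult(proc.returncode, out.decode(), err.decode())

    async def destroy(self, handle: str) -> None:
        pass  # nothing to tear down for a host-local "sandbox"

async def main():
    provider = LocalProvider()
    handle = await provider.create(None)
    result = await provider.execute(handle, "echo hello", timeout=10)
    print(result.exit_code, result.stdout.strip())  # → 0 hello
    await provider.destroy(handle)

asyncio.run(main())
```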

Using sandboxes in environments

A typical sandbox environment follows this pattern:
  1. Create sandboxes in create_initial_state() with concurrency control via semaphores
  2. Execute commands in env_response() by parsing tool calls and running them in the sandbox
  3. Clean up in a destroy hook when the rollout completes
Sandbox environments use async I/O throughout. The sandbox provider handles timeout enforcement, error translation, and resource cleanup.

Multi-turn configuration for agentic tasks

Key config parameters for tool-using and agentic environments:
# Scheduling: prioritize earlier turns to reduce head-of-line blocking
vllm_scheduling_policy: "priority"

# Reuse exact token IDs across turns (avoids tokenization mismatches)
interleaved_rollouts: true

# Limit concurrent requests per server (important for multi-turn)
max_concurrent_prompts_per_server: 32

# Maximum turns before stopping
# Set in your environment's __init__ via max_turns parameter
Priority scheduling is important for multi-turn environments: it ensures the model completes earlier turns before starting new ones, preventing scenarios where later turns queue behind a flood of first-turn requests. interleaved_rollouts (enabled by default) reuses token IDs from previous turns exactly, avoiding subtle tokenization differences that could corrupt logprob computation across turns.