Telescope uses Ray to distribute work across machines. On a single node, Ray starts automatically when you run train.py. For multi-node training, you start a Ray cluster first and then launch training against it.
Starting a Ray cluster
Start the head node on the first machine, then join each remaining machine to it. In these commands, replace <head-node-ip> with the IP of the head node, and <this-node-ip> with each machine's cluster-reachable IP address. Once all nodes have joined, launch training from the head node.
With ray_address: "auto" (the default), Telescope connects to the running cluster, discovers all available GPUs, and places workers across nodes automatically.
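As a rough illustration of the startup sequence described above, the sketch below builds the Ray CLI commands for the head node and for each joining node. It assumes the standard ray start flags (--head, --port, --address, --node-ip-address) and Ray's default port 6379; the IP addresses are hypothetical, and the commands in your own setup may differ.

```python
import shlex

RAY_PORT = 6379  # Ray's default head-node port; an assumption, adjust for your cluster


def head_command(head_ip: str, port: int = RAY_PORT) -> str:
    """Command to run on the first machine to start the Ray head node."""
    args = ["ray", "start", "--head", f"--node-ip-address={head_ip}", f"--port={port}"]
    return shlex.join(args)


def worker_command(head_ip: str, this_ip: str, port: int = RAY_PORT) -> str:
    """Command to run on every other machine to join the head node's cluster."""
    args = ["ray", "start", f"--address={head_ip}:{port}", f"--node-ip-address={this_ip}"]
    return shlex.join(args)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    head_ip = "10.0.0.1"
    print(head_command(head_ip))
    for ip in ["10.0.0.2", "10.0.0.3"]:
        print(worker_command(head_ip, ip))
```

Before launching training, running ray status on the head node is a quick way to confirm that every node and GPU has joined.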
Worker placement
Telescope places inference servers and trainer workers using Ray placement groups. Two config options control how workers are distributed across nodes. The available placement strategies are:
- PACK: pack workers onto as few nodes as possible. Minimizes cross-node communication.
- SPREAD: spread workers across nodes evenly.
- STRICT_PACK: like PACK, but fails if workers can't fit on a single node.
- STRICT_SPREAD: like SPREAD, but requires each worker on a different node.
PACK is usually the right choice: it keeps NCCL communication within a node, where NVLink is available, and goes cross-node only when necessary.
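To make the difference between the four strategies concrete, here is a toy model that assigns one GPU per worker across nodes. It is a simplified sketch, not Ray's scheduler: the place function and the node names are hypothetical. In Ray itself the strategy is passed as the strategy argument of ray.util.placement_group, where PACK and SPREAD are best-effort.

```python
from collections import Counter
from typing import Dict, List

STRATEGIES = ("PACK", "SPREAD", "STRICT_PACK", "STRICT_SPREAD")


def place(num_workers: int, free_gpus: Dict[str, int], strategy: str) -> List[str]:
    """Return the node chosen for each worker, one GPU per worker (toy model)."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    free = dict(free_gpus)  # copy so the caller's dict is left untouched

    if strategy == "STRICT_PACK":
        # Every worker must land on the same node, or placement fails.
        for node, gpus in free.items():
            if gpus >= num_workers:
                return [node] * num_workers
        raise RuntimeError("STRICT_PACK: no single node can hold all workers")

    if strategy == "STRICT_SPREAD":
        # Every worker must land on a different node, or placement fails.
        nodes = [node for node, gpus in free.items() if gpus >= 1]
        if len(nodes) < num_workers:
            raise RuntimeError("STRICT_SPREAD: fewer usable nodes than workers")
        return nodes[:num_workers]

    placement: List[str] = []
    for _ in range(num_workers):
        candidates = [node for node, gpus in free.items() if gpus > 0]
        if not candidates:
            raise RuntimeError("not enough free GPUs in the cluster")
        if strategy == "PACK":
            # Keep filling the first node that still has room.
            node = candidates[0]
        else:
            # SPREAD: choose the node with the most free GPUs.
            node = max(candidates, key=lambda n: free[n])
        free[node] -= 1
        placement.append(node)
    return placement


if __name__ == "__main__":
    cluster = {"node-a": 8, "node-b": 8}  # hypothetical 2-node, 16-GPU cluster
    for strategy in ("PACK", "SPREAD"):
        print(strategy, dict(Counter(place(12, cluster, strategy))))
```

With 12 workers on two 8-GPU nodes, the toy PACK fills the first node with 8 workers and puts the remaining 4 on the second, while SPREAD puts 6 on each.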
Example: 2-node setup
With 2 nodes of 8 GPUs each (16 GPUs total), you might configure:
With PACK placement, Ray will try to place related workers together on the same node, minimizing cross-node traffic.
Example: 4-node Megatron setup
For a large model across 4 nodes of 8 GPUs (32 GPUs total):
Docker on multiple nodes
Run the same Docker container on each node, then start Ray inside:
Head node:
--network=host is required for multi-node so that Ray and NCCL can communicate between containers across machines.
Shared filesystem
All nodes need access to the same model weights and checkpoint directory. The simplest approach is a shared filesystem (NFS, Lustre, etc.) mounted at the same path on all nodes. If you don't have a shared filesystem, make sure:
- The model is cached on every node (HuggingFace downloads it on first use)
- checkpoint_dir points to a shared location, or checkpoints are only saved from one node
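One way to check that a directory really is shared is to have every node write a small marker file into it and then confirm that each node can see all the others' markers. The sketch below is a minimal pre-flight check under that assumption; the function names and the marker suffix are hypothetical and not part of Telescope.

```python
import socket
import tempfile
from pathlib import Path
from typing import Iterable, Optional, Set

MARKER_SUFFIX = ".node-marker"  # hypothetical suffix used only by this check


def write_marker(shared_dir: str, node_name: Optional[str] = None) -> Path:
    """Create a marker file named after this node inside shared_dir."""
    name = node_name or socket.gethostname()
    directory = Path(shared_dir)
    directory.mkdir(parents=True, exist_ok=True)
    marker = directory / f"{name}{MARKER_SUFFIX}"
    marker.write_text(name)
    return marker


def visible_nodes(shared_dir: str) -> Set[str]:
    """Return the node names whose markers are visible from this machine."""
    markers = Path(shared_dir).glob(f"*{MARKER_SUFFIX}")
    return {p.name[: -len(MARKER_SUFFIX)] for p in markers}


def missing_nodes(shared_dir: str, expected: Iterable[str]) -> Set[str]:
    """Return the expected nodes whose markers cannot be seen from here."""
    return set(expected) - visible_nodes(shared_dir)


if __name__ == "__main__":
    # Local simulation: a temporary directory stands in for the shared mount.
    with tempfile.TemporaryDirectory() as shared:
        write_marker(shared, "node-a")
        write_marker(shared, "node-b")
        print(sorted(visible_nodes(shared)))
        print(sorted(missing_nodes(shared, ["node-a", "node-b", "node-c"])))
```

On a real cluster, each node would call write_marker on the checkpoint directory and then missing_nodes with the full node list; any non-empty result suggests the path is not mounted from the same filesystem everywhere.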

