load_dataset, compute_reward), see Environments.
Dataset format
Every environment implements load_dataset(), which returns a list of Sample objects. A Sample has three fields:
- prompt — the question or task (string or list of chat messages)
- answer — ground truth for reward computation
- metadata — any extra data needed by your reward function
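A minimal sketch of the Sample shape, assuming a dataclass-style container; the real Sample class comes from the framework and may differ in details:

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Sample:
    # Stand-in for the framework's Sample; mirrors the three documented fields.
    prompt: str | list[dict]  # the question or task (string or chat messages)
    answer: str               # ground truth used for reward computation
    metadata: dict[str, Any] = field(default_factory=dict)  # extras for the reward fn

samples = [
    Sample(prompt="What is 7 * 8?", answer="56"),
    Sample(
        prompt=[{"role": "user", "content": "Name the capital of France."}],
        answer="Paris",
        metadata={"topic": "geography"},
    ),
]
```

Note that prompt accepts either a bare string or a list of chat messages, so the same container covers both plain-text and chat-formatted tasks.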
Loading from HuggingFace
The most common pattern is to load directly from the HuggingFace Hub:
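A sketch of the Hub pattern; the dataset id and column names ("openai/gsm8k", "question", "answer") are illustrative assumptions, and the real implementation would construct Sample objects rather than plain dicts:

```python
from typing import Iterable

def rows_to_samples(rows: Iterable[dict]) -> list[dict]:
    # Map each HuggingFace row onto the Sample fields
    # (Sample(prompt=..., answer=..., metadata=...) in the real framework).
    return [
        {"prompt": r["question"], "answer": r["answer"], "metadata": {}}
        for r in rows
    ]

# Inside load_dataset() you would typically fetch rows from the Hub:
#   from datasets import load_dataset as hf_load_dataset
#   rows = hf_load_dataset("openai/gsm8k", "main", split="train")
samples = rows_to_samples([{"question": "What is 2 + 3?", "answer": "5"}])
```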
Loading from local files

For JSON, JSONL, CSV, or Parquet files:
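A self-contained JSONL sketch (the file layout with one prompt/answer record per line is an assumption; the same idea applies to the other formats):

```python
import json
import os
import tempfile

def load_jsonl(path: str) -> list[dict]:
    # Each line is one JSON object carrying the Sample fields.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Tiny demo: write two records to a temp file, then load them back.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"prompt": "What is 2 + 2?", "answer": "4"}\n')
    f.write('{"prompt": "What is 3 + 3?", "answer": "6"}\n')
    path = f.name

samples = load_jsonl(path)
os.unlink(path)
```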
Procedural generation

For game and puzzle environments, generate samples programmatically:
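A sketch of a procedurally generated dataset; the task and field values here are illustrative. Seeding the generator keeps the dataset reproducible across runs:

```python
import random

def load_dataset(num_samples: int = 100, seed: int = 0) -> list[dict]:
    # Fixed seed -> the same dataset every time load_dataset() is called.
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        samples.append({
            "prompt": f"Compute {a} * {b}.",
            "answer": str(a * b),
            # metadata lets compute_reward reconstruct the instance exactly
            "metadata": {"a": a, "b": b},
        })
    return samples
```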
Prompt formatting

Single-turn prompts
SingleTurnEnvironment handles prompt formatting automatically via the tokenizer’s apply_chat_template(). You control the framing with two parameters:
The environment then calls apply_chat_template() to produce the final token sequence with the model’s expected special tokens.
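A toy ChatML-style stand-in showing what that call produces; real environments invoke the tokenizer method from transformers, and each model family uses its own template and special tokens:

```python
def apply_chat_template(messages: list[dict], add_generation_prompt: bool = True) -> str:
    # Toy template: wrap each message in role-tagged special tokens.
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so generation continues from here.
        text += "<|im_start|>assistant\n"
    return text

prompt = apply_chat_template([
    {"role": "system", "content": "Answer with just the final number."},
    {"role": "user", "content": "What is 2 + 2?"},
])
```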
Thinking mode
When add_thinking_prefix: true is set as an environment kwarg, the environment prepends a <think> prefix to the model’s generation. This teaches the model to reason in a thinking block before producing an answer. The instruction prompt should mention the expected format:
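A sketch of the behavior; the kwarg name comes from the text above, while the prompt wording and prefix handling here are illustrative:

```python
# Hypothetical environment kwarg (name from the docs; config shape assumed).
add_thinking_prefix = True

# The instruction prompt should describe the format the prefix sets up:
instruction = (
    "First reason inside <think>...</think> tags, then state your final answer."
)

# The environment seeds generation with the prefix, so the model's sampled
# tokens continue inside the thinking block rather than starting an answer.
generation_prefix = "<think>" if add_thinking_prefix else ""
completion = generation_prefix + "Let me work this out step by step..."
```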
Multi-turn prompts
Multi-turn environments use get_initial_prompt(), which returns a list of ChatMessage dicts:
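A sketch of such a method; the method name comes from the text, while the surrounding class interface is assumed and ChatMessage is treated as a plain role/content dict:

```python
def get_initial_prompt() -> list[dict]:
    # Opening messages for a hypothetical twenty-questions environment.
    return [
        {"role": "system", "content": "You are playing twenty questions."},
        {"role": "user", "content": "I am thinking of an object. Ask your first yes/no question."},
    ]
```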
Multi-environment datasets
When training on multiple environments, each environment loads its own dataset independently. Prompts are sampled proportionally to the weight of each environment:
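A sketch of weight-proportional sampling; the weight field matches the text, while the two-environment config shape is an assumption:

```python
import random

# Hypothetical setup: "math" should be drawn ~3x as often as "puzzles".
envs = [
    {"name": "math", "weight": 3.0},
    {"name": "puzzles", "weight": 1.0},
]

def sample_environment(rng: random.Random) -> str:
    # Draw one environment with probability proportional to its weight.
    return rng.choices(
        [e["name"] for e in envs],
        weights=[e["weight"] for e in envs],
    )[0]

rng = random.Random(0)
draws = [sample_environment(rng) for _ in range(1000)]
```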
Avoiding data leakage
If you use the same environment for both training and evaluation, enable separate_eval_samples to reserve a portion of the dataset exclusively for eval:
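A sketch of the train/eval split this implies; the flag name comes from the text, while the split logic below is illustrative:

```python
def split_dataset(samples: list, num_eval: int) -> tuple[list, list]:
    # Reserve the first num_eval samples exclusively for evaluation;
    # training never sees them, so eval scores are leakage-free.
    return samples[num_eval:], samples[:num_eval]

train, eval_set = split_dataset(list(range(10)), num_eval=2)
```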
Dataset sizing
The relationship between dataset size and training configuration:

- number_of_steps — how many training steps to run
- prompts_batch_size_for_trainer — number of prompt groups per training batch
- group_size — completions generated per prompt (each prompt produces group_size samples)
Each training step consumes prompts_batch_size_for_trainer prompt groups (each group is one prompt). Over a full run, you need at least number_of_steps * prompts_batch_size_for_trainer unique prompts. If the dataset is smaller, prompts will be recycled.
For most RL training, some recycling is fine — the model generates different completions each time, so the same prompt produces different training signal. However, excessive recycling on a very small dataset can lead to overfitting to specific prompts.
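The arithmetic above, worked through with illustrative config values:

```python
# Illustrative configuration values (not defaults from the framework).
number_of_steps = 100
prompts_batch_size_for_trainer = 8
group_size = 16

# Unique prompts needed to avoid any recycling over the full run.
unique_prompts_needed = number_of_steps * prompts_batch_size_for_trainer

# Completions produced per training step (one group per prompt).
samples_per_step = prompts_batch_size_for_trainer * group_size

# With a smaller dataset, prompts start recycling partway through the run.
dataset_size = 500
steps_before_recycling = dataset_size // prompts_batch_size_for_trainer
```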
