GRPO
Group Relative Policy Optimization. The default algorithm and a good starting point. Advantages are computed by normalizing rewards within each prompt group (mean-centered and divided by the standard deviation). The loss is a standard policy gradient: -log_probs * advantage.
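As an illustrative sketch (the function names below are mine, not this codebase's API), the per-group computation looks like:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize rewards within one prompt group: mean-center, then divide by std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(log_probs: np.ndarray, advantages: np.ndarray) -> float:
    """Standard policy gradient: -log_probs * advantage, averaged over the group."""
    return float(np.mean(-log_probs * advantages))

# One group of 4 sampled responses to the same prompt:
adv = grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0]))
```

Because advantages are z-scored per group, a group where every sample earns the same reward contributes (near-)zero gradient.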
RLOO
REINFORCE Leave-One-Out. Uses a leave-one-out baseline instead of the group mean, making it less sensitive to outliers in the reward distribution. For each sample, the advantage is scaled by n / (n - 1) to correct for the leave-one-out bias. Requires per-group normalization (batch normalization is not compatible).
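A sketch of the leave-one-out baseline (hypothetical helper, not this codebase's API). Subtracting the leave-one-out mean is algebraically identical to mean-centering and scaling by n / (n - 1):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Advantage = reward minus the mean reward of the *other* samples in the group."""
    n = len(rewards)
    loo_baseline = (rewards.sum() - rewards) / (n - 1)  # mean of the other n-1 rewards
    return rewards - loo_baseline

r = np.array([1.0, 0.0, 0.0, 1.0])
# Equivalent form: (r - r.mean()) * n / (n - 1)
```

The equivalence follows from r_i - (S - r_i)/(n - 1) = (n * r_i - S)/(n - 1) = (r_i - S/n) * n/(n - 1), with S the group reward sum.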
REINFORCE++
Uses two-stage advantage normalization: first subtracts the group mean, then re-normalizes across the full training batch. This gives more stable gradients when batch sizes are large. Requires batch-level normalization (per the paper).
DR-GRPO
Distributional Reward GRPO. A variant of GRPO that removes response-level length bias. Advantages are mean-centered but not divided by the standard deviation, and the loss can be aggregated in a way that normalizes by num_samples * seq_len instead of the total number of valid tokens.
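A sketch of the two differences from GRPO (illustrative names; the mask handling is a simplifying assumption):

```python
import numpy as np

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Mean-center only -- no division by the group standard deviation."""
    return rewards - rewards.mean()

def dr_grpo_aggregate(per_token_loss: np.ndarray, mask: np.ndarray) -> float:
    """Normalize by num_samples * seq_len (a constant) rather than by the count
    of valid tokens, so response length does not rescale the gradient."""
    num_samples, seq_len = per_token_loss.shape
    return float((per_token_loss * mask).sum() / (num_samples * seq_len))
```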
CISPO
Clipped Importance-Sampled Policy Optimization. Computes the importance sampling ratio between the current and reference policy logprobs, clamps it, and uses the clamped ratio as a per-token weighting factor on the policy gradient.
GSPO
Geometric Sequence Policy Optimization. Designed for packed sequences. Uses a sequence-level geometric mean of log ratios with per-token gradient scaling, so the gradient magnitude is independent of sequence length. Applies PPO-style clipping on the sequence-level ratio.
SAPO
Sigmoid-Gated Advantage Policy Optimization. Uses a sigmoid gate on the importance sampling ratio for smoother updates than hard clipping. The gate sharpness can be tuned separately for positive and negative advantages.
PPO clipping
PPO-style ratio clipping can be layered on top of GRPO, RLOO, REINFORCE++, or DR-GRPO for trust-region updates. It is not compatible with CISPO, GSPO, or SAPO, which have their own ratio handling.
Minibatches
By default, each training batch is used for a single gradient step. Setting number_of_minibatches > 1 splits the batch into minibatches and runs multiple gradient steps per rollout batch (PPO-style):
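For example (a sketch; the helper below is hypothetical, with number_of_minibatches being the setting named above):

```python
import numpy as np

def minibatch_steps(batch_size: int, number_of_minibatches: int):
    """Split one rollout batch into minibatches; each minibatch is one gradient step."""
    indices = np.arange(batch_size)
    return np.array_split(indices, number_of_minibatches)

steps = minibatch_steps(batch_size=8, number_of_minibatches=2)
# -> 2 gradient steps of 4 samples each per rollout batch
```

Note that after the first minibatch step the policy has moved, so later steps are slightly off-policy relative to the rollout; this is why PPO-style ratio clipping is typically enabled when number_of_minibatches > 1.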

