Configuration
checkpoint_every interval.
Checkpoint directory structure
Checkpoints are saved under checkpoints/ in your run directory (or the path set by checkpoint_dir):
Cleanup rules
By default, all checkpoints are kept. Use checkpoint_keep_last and checkpoint_keep_every to manage disk space:
step_50, step_100, step_110, step_120.
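The way the two cleanup settings might combine can be sketched in a few lines of Python. This is an illustration under assumed semantics, not the tool's actual implementation: it assumes checkpoint_keep_last keeps the N most recent checkpoints and checkpoint_keep_every keeps any checkpoint whose step is a multiple of M. The function name checkpoints_to_keep and the example values (a checkpoint every 10 steps up to step 120, keep_last of 2, keep_every of 50) are hypothetical.

```python
def checkpoints_to_keep(steps, keep_last=None, keep_every=None):
    """Return the sorted checkpoint steps that would survive cleanup.

    Assumed semantics (not confirmed by the documentation):
      - keep_last=N keeps the N most recent checkpoints.
      - keep_every=M keeps any checkpoint whose step is a multiple of M.
      - With neither option set, every checkpoint is kept (the documented default).
    """
    ordered = sorted(steps)
    if keep_last is None and keep_every is None:
        return ordered

    kept = set()
    if keep_last:
        # Guard against keep_last == 0: ordered[-0:] would select everything.
        kept.update(ordered[-keep_last:])
    if keep_every:
        kept.update(s for s in ordered if s % keep_every == 0)
    return sorted(kept)


# Hypothetical run: a checkpoint every 10 steps, currently at step 120.
saved_steps = range(10, 121, 10)
print(checkpoints_to_keep(saved_steps, keep_last=2, keep_every=50))
# -> [50, 100, 110, 120]
```

Under these assumptions a checkpoint is kept if either rule selects it, so long-interval milestones and the most recent checkpoints coexist.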
Resuming training
Set resume_from_checkpoint to continue from where training left off:
checkpoint_save_training_state: true (the default) on the original run.
Weights-only checkpoints
Set checkpoint_save_training_state: false to save only model weights, without optimizer or scheduler state. This produces smaller checkpoints that can be converted to HuggingFace format but cannot be used to resume training.
Converting to HuggingFace format
Checkpoints are saved in the training backend’s native format. To use a checkpoint with vLLM, HuggingFace, or other tools, convert it to standard HuggingFace format (safetensors + config.json):
Single checkpoint:
meta.json backend field. Already-converted checkpoints are skipped in batch mode.
