Training Configuration
gradient_accumulation_steps
gradient_accumulation_steps controls how many micro-batch forward and backward passes are accumulated before the model's weights are updated. It is useful when the desired batch size is too large to fit into GPU memory at once: gradients are accumulated over the specified number of steps before a single weight update is performed.
micro_batch_size
micro_batch_size specifies the number of samples in each batch sent to a single GPU per step. Together with gradient_accumulation_steps and the number of GPUs, it determines the effective training batch size.
eval_batch_size
eval_batch_size sets the batch size used during evaluation, i.e. how many samples are processed in each evaluation step.
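Putting these three fields together, the effective training batch size is micro_batch_size × gradient_accumulation_steps × number of GPUs. A minimal sketch with placeholder values (not recommendations):

```yaml
# Placeholder values for illustration only.
# Effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus
# Here: 2 * 4 = 8 samples per weight update on a single GPU.
gradient_accumulation_steps: 4
micro_batch_size: 2
eval_batch_size: 2
```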
num_epochs
num_epochs defines the number of training epochs, i.e. the number of times the entire dataset is processed during training.
warmup_steps
warmup_steps specifies the number of warm-up steps for the learning rate scheduler. During the warm-up phase, the learning rate gradually increases to its full value.
learning_rate
learning_rate sets the initial learning rate for training. It is a critical hyperparameter that determines the step size for weight updates during optimization.
lr_quadratic_warmup
lr_quadratic_warmup enables a quadratic (rather than linear) warm-up schedule for the learning rate, which can be beneficial in certain training scenarios.
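A minimal sketch combining the epoch, warm-up, and learning-rate fields; the values are placeholders, and lr_quadratic_warmup is assumed here to be a boolean flag:

```yaml
# Placeholder values for illustration only.
num_epochs: 3
warmup_steps: 100           # learning rate ramps up over the first 100 steps
learning_rate: 0.0002       # peak learning rate reached after warm-up
lr_quadratic_warmup: true   # quadratic instead of linear warm-up (assumed boolean flag)
```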
logging_steps
logging_steps sets the frequency at which training logs are generated, controlling how often training progress is reported.
save_strategy
save_strategy determines when model checkpoints are saved during training. Setting it to 'no' disables checkpoint saving entirely, while the other options control whether checkpoints are written on a step interval or per epoch.
save_steps
save_steps specifies the frequency at which model checkpoints are saved. You can leave it empty to save at the end of each epoch, or specify a number of steps.
eval_steps
eval_steps controls how often the model is evaluated during training. It can be an integer (evaluate every N steps) or a decimal (evaluate every given fraction of the total steps).
save_total_limit
save_total_limit limits the maximum number of checkpoints kept at any one time. Older checkpoints are deleted to keep the total number within this limit.
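The logging, saving, and evaluation cadence fields are often set together. An illustrative sketch (the intervals are arbitrary and should be tuned to the length of the run):

```yaml
# Illustrative cadence settings only.
logging_steps: 10        # report training metrics every 10 steps
save_strategy: steps     # save on a step interval instead of per epoch
save_steps: 500          # write a checkpoint every 500 steps
eval_steps: 0.05         # evaluate every 5% of the total training steps
save_total_limit: 3      # keep at most 3 checkpoints; older ones are deleted
```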
max_steps
max_steps defines the maximum number of training steps. It takes precedence over num_epochs: for example, if you set max_steps to 100, training stops after 100 steps regardless of the number of epochs.
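A sketch of the precedence described above, with arbitrary numbers:

```yaml
# Training stops after 100 steps even if 3 epochs would require more.
num_epochs: 3
max_steps: 100
```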
eval_table_size
eval_table_size specifies approximately how many predictions are sent to wandb (Weights & Biases), depending on the batch size. The table is enabled when the value is above 0 and is useful for inspecting model outputs during evaluation.
eval_table_max_new_tokens
eval_table_max_new_tokens limits the number of new tokens generated for the predictions sent to wandb, which keeps the amount of data sent for monitoring under control.
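An illustrative way to enable the wandb prediction table; both numbers are placeholders:

```yaml
eval_table_size: 5              # roughly 5 predictions per evaluation; 0 disables the table
eval_table_max_new_tokens: 128  # cap generation length for the logged predictions
```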
save_safetensors
When enabled, save_safetensors saves model checkpoints in the safetensors format, which requires the safetensors package to be installed.
train_on_inputs
train_on_inputs determines whether the human prompt is included in the training labels or masked out. Setting it to false masks the prompt, so the loss is computed only on the response tokens.
group_by_length
When set to true, group_by_length groups samples with similar sequence lengths together to minimize padding. This can improve training efficiency but may lead to an oscillating training loss.
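An illustrative combination of these flags; the boolean values are examples, not defaults:

```yaml
save_safetensors: true   # write checkpoints in the safetensors format
train_on_inputs: false   # mask the prompt so loss is computed only on the response
group_by_length: false   # set true to bucket similar lengths and reduce padding
```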
gradient_checkpointing
gradient_checkpointing controls whether to use gradient checkpointing, a technique that reduces memory consumption during training by recomputing activations in the backward pass instead of storing them, trading extra computation for memory.
early_stopping_patience
early_stopping_patience stops training when the evaluation loss worsens for the specified number of consecutive evaluations. It helps prevent overfitting.
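A minimal sketch of these two fields, assuming evaluation runs periodically so early stopping has losses to compare:

```yaml
gradient_checkpointing: true   # trade extra compute for lower activation memory
early_stopping_patience: 3     # stop if eval loss worsens for 3 consecutive evaluations
```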
lr_scheduler
lr_scheduler specifies the learning rate scheduler to use during training. Options include 'one_cycle', 'log_sweep', or leaving it empty for cosine scheduling.
lr_scheduler_kwargs
lr_scheduler_kwargs can be used to pass additional arguments to the learning rate scheduler, depending on the chosen scheduler type.
lr_div_factor
For the 'one_cycle' scheduler, lr_div_factor sets the learning rate division factor used by the one-cycle schedule.
log_sweep_min_lr
For the 'log_sweep' scheduler, log_sweep_min_lr sets the minimum learning rate for the logarithmic learning rate sweep.
log_sweep_max_lr
For the 'log_sweep' scheduler, log_sweep_max_lr sets the maximum learning rate for the logarithmic learning rate sweep.
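Two hedged scheduler sketches based on the options above; all numeric values are placeholders:

```yaml
# Option A: one-cycle schedule
lr_scheduler: one_cycle
lr_div_factor: 25             # learning rate division factor for the one-cycle schedule
# lr_scheduler_kwargs: {}     # extra scheduler arguments, if the chosen scheduler takes any

# Option B: logarithmic learning rate sweep
# lr_scheduler: log_sweep
# log_sweep_min_lr: 1.0e-6
# log_sweep_max_lr: 1.0e-3
```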
optimizer
optimizer specifies the optimizer to use for training. Various optimizer options are available, and the choice depends on the model and use case.
weight_decay
weight_decay determines the weight decay applied during optimization, a regularization term that helps prevent overfitting by penalizing large model weights.
adam_beta1
For the 'adamw' optimizer, adam_beta1 sets the beta1 hyperparameter, which controls the exponential moving average of past gradients.
adam_beta2
For the 'adamw' optimizer, adam_beta2 sets the beta2 hyperparameter, which controls the exponential moving average of past squared gradients.
adam_epsilon
For the 'adamw' optimizer, adam_epsilon sets the epsilon value added to the denominator to prevent division by zero.
max_grad_norm
max_grad_norm specifies the maximum gradient norm. Gradients are clipped to this value during training to prevent exploding gradients.
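Finally, an illustrative optimizer block; the optimizer name is one common choice rather than a recommendation, and the numeric values are typical AdamW settings shown only as an example:

```yaml
optimizer: adamw_torch   # one common choice; other optimizers are supported
weight_decay: 0.01       # mild regularization
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1.0e-8
max_grad_norm: 1.0       # clip gradient norm to 1.0
```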