# Training Configuration

<table data-full-width="false"><thead><tr><th width="270">Field Name</th><th>Explanation</th></tr></thead><tbody><tr><td>gradient_accumulation_steps</td><td><code>gradient_accumulation_steps</code> controls the number of forward and backward passes over which gradients are accumulated before the model's weights are updated. It is useful when the desired batch size is too large to fit into GPU memory at once: each weight update then reflects the accumulated gradients of several smaller batches.</td></tr><tr><td>micro_batch_size</td><td><code>micro_batch_size</code> specifies the number of samples in each batch sent to each GPU. Together with <code>gradient_accumulation_steps</code> and the number of GPUs, it determines the effective training batch size.</td></tr><tr><td>eval_batch_size</td><td><code>eval_batch_size</code> sets the per-device batch size for evaluation, i.e. how many samples are processed in each evaluation step.</td></tr><tr><td>num_epochs</td><td><code>num_epochs</code> defines the number of training epochs, i.e. the number of complete passes over the training dataset.</td></tr><tr><td>warmup_steps</td><td><code>warmup_steps</code> specifies the number of warm-up steps for the learning rate scheduler. During the warm-up phase, the learning rate gradually increases to its full value.</td></tr><tr><td>learning_rate</td><td><code>learning_rate</code> sets the initial learning rate for training. It is a critical hyperparameter that determines the step size for weight updates during optimization.</td></tr><tr><td>lr_quadratic_warmup</td><td><code>lr_quadratic_warmup</code> enables a quadratic (rather than linear) warm-up schedule for the learning rate, which can be beneficial in certain training scenarios.</td></tr><tr><td>logging_steps</td><td><code>logging_steps</code> sets the frequency at which training logs are generated. 
It controls how often training progress is reported.</td></tr><tr><td>save_strategy</td><td><code>save_strategy</code> determines when model checkpoints are saved during training. Setting it to 'no' skips checkpoint saves, while other options control the timing of saves.</td></tr><tr><td>save_steps</td><td><code>save_steps</code> specifies the frequency at which model checkpoints are saved. You can leave it empty to save at the end of each epoch or specify a number of steps.</td></tr><tr><td>eval_steps</td><td><code>eval_steps</code> controls the frequency of model evaluation during training. It can be specified as an integer to evaluate every N steps, or as a decimal to evaluate at that fraction of the total steps.</td></tr><tr><td>save_total_limit</td><td><code>save_total_limit</code> limits the number of checkpoints kept on disk at any time. Older checkpoints are deleted to keep the total within this limit.</td></tr><tr><td>max_steps</td><td><code>max_steps</code> defines the maximum number of iterations to train for. It takes precedence over <code>num_epochs</code>: for example, if you set <code>max_steps</code> to 100, training stops after 100 steps regardless of the number of epochs.</td></tr><tr><td>eval_table_size</td><td><code>eval_table_size</code> specifies the approximate number of predictions sent to wandb (Weights and Biases); the exact count depends on the batch size. Set it above 0 to enable this logging, which is useful for inspecting evaluation outputs.</td></tr><tr><td>eval_table_max_new_tokens</td><td><code>eval_table_max_new_tokens</code> sets the maximum number of new tokens generated for each prediction sent to wandb. 
It helps control the amount of data sent for monitoring.</td></tr><tr><td>save_safetensors</td><td>When enabled, <code>save_safetensors</code> saves model checkpoints in the safetensors format, which requires the safetensors package.</td></tr><tr><td>train_on_inputs</td><td><code>train_on_inputs</code> determines whether the human's prompt is included in the training labels or masked out. Setting it to 'false' masks the prompt so that only the response contributes to the loss.</td></tr><tr><td>group_by_length</td><td>When set to 'true', <code>group_by_length</code> groups samples of similar sequence length together to minimize padding. This can improve training efficiency but may produce an oscillating training loss.</td></tr><tr><td>gradient_checkpointing</td><td><code>gradient_checkpointing</code> controls whether to use gradient checkpointing, a technique that reduces memory consumption during training by trading extra computation for memory.</td></tr><tr><td>early_stopping_patience</td><td><code>early_stopping_patience</code> stops training once the evaluation loss has worsened for the specified number of consecutive evaluations. It helps prevent overfitting.</td></tr><tr><td>lr_scheduler</td><td><code>lr_scheduler</code> specifies the learning rate scheduler to use during training. 
Options include 'one_cycle' and 'log_sweep'; leaving it empty defaults to cosine scheduling.</td></tr><tr><td>lr_scheduler_kwargs</td><td><code>lr_scheduler_kwargs</code> can be used to provide additional arguments to the learning rate scheduler, depending on the chosen scheduler type.</td></tr><tr><td>lr_div_factor</td><td>For the 'one_cycle' scheduler, <code>lr_div_factor</code> determines the learning rate division factor used by the one-cycle schedule.</td></tr><tr><td>log_sweep_min_lr</td><td>For the 'log_sweep' scheduler, <code>log_sweep_min_lr</code> sets the minimum learning rate for the logarithmic learning rate sweep.</td></tr><tr><td>log_sweep_max_lr</td><td>For the 'log_sweep' scheduler, <code>log_sweep_max_lr</code> sets the maximum learning rate for the logarithmic learning rate sweep.</td></tr><tr><td>optimizer</td><td><code>optimizer</code> specifies the optimizer to use for training. Various optimizers are available, and the best choice depends on the model and use case.</td></tr><tr><td>weight_decay</td><td><code>weight_decay</code> determines the weight decay applied during optimization. It is a regularization term that discourages large model weights and helps prevent overfitting.</td></tr><tr><td>adam_beta1</td><td>For the 'adamw' optimizer, <code>adam_beta1</code> sets the beta1 hyperparameter, which controls the exponential moving average of past gradients.</td></tr><tr><td>adam_beta2</td><td>For the 'adamw' optimizer, <code>adam_beta2</code> sets the beta2 hyperparameter, which controls the exponential moving average of past squared gradients.</td></tr><tr><td>adam_epsilon</td><td>For the 'adamw' optimizer, <code>adam_epsilon</code> sets the epsilon value added to the denominator to prevent division by zero.</td></tr><tr><td>max_grad_norm</td><td><code>max_grad_norm</code> specifies the maximum gradient norm. 
Gradient norms exceeding this value are clipped during training to prevent exploding gradients.</td></tr></tbody></table>
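Taken together, the fields above might appear in a config roughly like the following sketch. The values shown are illustrative assumptions only, not recommendations; tune them for your model, dataset, and hardware.

```yaml
# Illustrative values only -- tune for your model and hardware.
micro_batch_size: 2              # samples per GPU per step
gradient_accumulation_steps: 4   # accumulate 4 micro-batches per weight update
eval_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
warmup_steps: 100
lr_scheduler: one_cycle
lr_div_factor: 25
weight_decay: 0.01
max_grad_norm: 1.0
logging_steps: 10
eval_steps: 50                   # evaluate every 50 steps
save_strategy: steps
save_steps: 500
save_total_limit: 3              # keep at most 3 checkpoints on disk
save_safetensors: true
train_on_inputs: false           # mask the prompt out of the loss
group_by_length: false
gradient_checkpointing: true
early_stopping_patience: 3
```

With these example values on a single GPU, the effective batch size would be micro_batch_size × gradient_accumulation_steps = 2 × 4 = 8, with one weight update every four forward/backward passes.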
