# Training Configuration

<table data-full-width="false"><thead><tr><th width="270">Field Name</th><th>Explanation</th></tr></thead><tbody><tr><td>gradient_accumulation_steps</td><td><code>gradient_accumulation_steps</code> controls the number of forward and backward passes over which gradients are accumulated before the model's weights are updated. It is useful when the desired batch size is too large to fit into GPU memory at once: each weight update then reflects the accumulated gradients of several smaller batches.</td></tr><tr><td>micro_batch_size</td><td><code>micro_batch_size</code> specifies the number of samples in each batch sent to each GPU. Together with <code>gradient_accumulation_steps</code> and the number of GPUs, it determines the effective training batch size.</td></tr><tr><td>eval_batch_size</td><td><code>eval_batch_size</code> sets the per-device batch size for evaluation, i.e. how many samples are processed in each evaluation step.</td></tr><tr><td>num_epochs</td><td><code>num_epochs</code> defines the number of training epochs, i.e. the number of complete passes over the training dataset.</td></tr><tr><td>warmup_steps</td><td><code>warmup_steps</code> specifies the number of warm-up steps for the learning rate scheduler. During the warm-up phase, the learning rate gradually increases to its full value.</td></tr><tr><td>learning_rate</td><td><code>learning_rate</code> sets the initial learning rate for training. It is a critical hyperparameter that determines the step size for weight updates during optimization.</td></tr><tr><td>lr_quadratic_warmup</td><td><code>lr_quadratic_warmup</code> enables a quadratic (rather than linear) warm-up schedule for the learning rate, which can be beneficial in certain training scenarios.</td></tr><tr><td>logging_steps</td><td><code>logging_steps</code> sets the frequency at which training logs are generated. 
It controls how often training progress is reported.</td></tr><tr><td>save_strategy</td><td><code>save_strategy</code> determines when model checkpoints are saved during training. Setting it to 'no' skips checkpoint saves, while other options control the timing of saves.</td></tr><tr><td>save_steps</td><td><code>save_steps</code> specifies the frequency at which model checkpoints are saved. You can leave it empty to save at the end of each epoch or specify a number of steps.</td></tr><tr><td>eval_steps</td><td><code>eval_steps</code> controls the frequency of model evaluation during training. It can be specified as an integer to evaluate every N steps, or as a decimal to evaluate at that fraction of the total steps.</td></tr><tr><td>save_total_limit</td><td><code>save_total_limit</code> limits the number of checkpoints kept on disk at any time. Older checkpoints are deleted to keep the total within this limit.</td></tr><tr><td>max_steps</td><td><code>max_steps</code> defines the maximum number of iterations to train for. It takes precedence over <code>num_epochs</code>: for example, if you set <code>max_steps</code> to 100, training stops after 100 steps regardless of the number of epochs.</td></tr><tr><td>eval_table_size</td><td><code>eval_table_size</code> specifies the approximate number of predictions sent to wandb (Weights and Biases); the exact count depends on the batch size. Set it above 0 to enable this logging, which is useful for inspecting evaluation outputs.</td></tr><tr><td>eval_table_max_new_tokens</td><td><code>eval_table_max_new_tokens</code> sets the maximum number of new tokens generated for each prediction sent to wandb. 
It helps control the amount of data sent for monitoring.</td></tr><tr><td>save_safetensors</td><td>When enabled, <code>save_safetensors</code> saves model checkpoints in the safetensors format, which requires the safetensors package.</td></tr><tr><td>train_on_inputs</td><td><code>train_on_inputs</code> determines whether the human's prompt is included in the training labels or masked out. Setting it to 'false' masks the prompt so that only the response contributes to the loss.</td></tr><tr><td>group_by_length</td><td>When set to 'true', <code>group_by_length</code> groups samples of similar sequence length together to minimize padding. This can improve training efficiency but may produce an oscillating training loss.</td></tr><tr><td>gradient_checkpointing</td><td><code>gradient_checkpointing</code> controls whether to use gradient checkpointing, a technique that reduces memory consumption during training by trading extra computation for memory.</td></tr><tr><td>early_stopping_patience</td><td><code>early_stopping_patience</code> stops training once the evaluation loss has worsened for the specified number of consecutive evaluations. It helps prevent overfitting.</td></tr><tr><td>lr_scheduler</td><td><code>lr_scheduler</code> specifies the learning rate scheduler to use during training. 
Options include 'one_cycle' and 'log_sweep'; leaving it empty defaults to cosine scheduling.</td></tr><tr><td>lr_scheduler_kwargs</td><td><code>lr_scheduler_kwargs</code> can be used to provide additional arguments to the learning rate scheduler, depending on the chosen scheduler type.</td></tr><tr><td>lr_div_factor</td><td>For the 'one_cycle' scheduler, <code>lr_div_factor</code> determines the learning rate division factor used by the one-cycle schedule.</td></tr><tr><td>log_sweep_min_lr</td><td>For the 'log_sweep' scheduler, <code>log_sweep_min_lr</code> sets the minimum learning rate for the logarithmic learning rate sweep.</td></tr><tr><td>log_sweep_max_lr</td><td>For the 'log_sweep' scheduler, <code>log_sweep_max_lr</code> sets the maximum learning rate for the logarithmic learning rate sweep.</td></tr><tr><td>optimizer</td><td><code>optimizer</code> specifies the optimizer to use for training. Various optimizers are available, and the best choice depends on the model and use case.</td></tr><tr><td>weight_decay</td><td><code>weight_decay</code> determines the weight decay applied during optimization. It is a regularization term that discourages large model weights and helps prevent overfitting.</td></tr><tr><td>adam_beta1</td><td>For the 'adamw' optimizer, <code>adam_beta1</code> sets the beta1 hyperparameter, which controls the exponential moving average of past gradients.</td></tr><tr><td>adam_beta2</td><td>For the 'adamw' optimizer, <code>adam_beta2</code> sets the beta2 hyperparameter, which controls the exponential moving average of past squared gradients.</td></tr><tr><td>adam_epsilon</td><td>For the 'adamw' optimizer, <code>adam_epsilon</code> sets the epsilon value added to the denominator to prevent division by zero.</td></tr><tr><td>max_grad_norm</td><td><code>max_grad_norm</code> specifies the maximum gradient norm. 
Gradient norms exceeding this value are clipped during training to prevent exploding gradients.</td></tr></tbody></table>
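Taken together, the fields above might appear in a config roughly like the following sketch. The values shown are illustrative assumptions only, not recommendations; tune them for your model, dataset, and hardware.

```yaml
# Illustrative values only -- tune for your model and hardware.
micro_batch_size: 2              # samples per GPU per step
gradient_accumulation_steps: 4   # accumulate 4 micro-batches per weight update
eval_batch_size: 2
num_epochs: 3
learning_rate: 0.0002
warmup_steps: 100
lr_scheduler: one_cycle
lr_div_factor: 25
weight_decay: 0.01
max_grad_norm: 1.0
logging_steps: 10
eval_steps: 50                   # evaluate every 50 steps
save_strategy: steps
save_steps: 500
save_total_limit: 3              # keep at most 3 checkpoints on disk
save_safetensors: true
train_on_inputs: false           # mask the prompt out of the loss
group_by_length: false
gradient_checkpointing: true
early_stopping_patience: 3
```

With these example values on a single GPU, the effective batch size would be micro_batch_size × gradient_accumulation_steps = 2 × 4 = 8, with one weight update every four forward/backward passes.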
