Phi 2.0 - Training Configuration
This is the default training configuration. We will leave it as is for the time being.
Explanation of each configuration setting
gradient_accumulation_steps: 1
Gradient accumulation is a technique used to simulate larger batch sizes without increasing memory usage.
It lets the model accumulate gradients over several forward and backward passes before performing a single optimizer step to update the weights.
Setting gradient_accumulation_steps to 1 means that the model updates its weights after every batch, effectively not using gradient accumulation. This is suitable when the available memory is sufficient to process the desired batch size without accumulation.
However, if memory is limited and you want to simulate a larger batch size, increasing this value can be beneficial.
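Here is a minimal PyTorch-style sketch of how gradient accumulation delays the optimizer step. The toy model, data, and loss function are placeholder assumptions for illustration, not part of the Phi 2.0 setup.

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs; the real model and dataset are assumed.
model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(2, 8), torch.randint(0, 2, (2,))) for _ in range(8)]

gradient_accumulation_steps = 1  # value from this config; set > 1 to accumulate

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches):
    loss = loss_fn(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches one larger batch.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()       # weights update only every N micro-batches
        optimizer.zero_grad()
```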
micro_batch_size: 2
The micro-batch size determines the number of samples processed in each forward pass of the model.
In this case, setting micro_batch_size to 2 means that the model processes 2 samples at a time. Smaller micro-batch sizes can be useful when dealing with limited memory resources, as they require less memory per forward pass.
However, smaller micro-batch sizes may result in slower training times and potentially noisier gradients.
The optimal micro-batch size depends on the available hardware and the specific requirements of the training task.
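As a quick back-of-the-envelope check, the effective batch size per weight update is the product of the micro-batch size, the accumulation steps, and the number of devices; the GPU count below is a hypothetical example, not something this config specifies.

```python
micro_batch_size = 2             # from this config
gradient_accumulation_steps = 1  # from this config
num_gpus = 1                     # assumption for illustration

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)      # 2 samples contribute to each weight update
```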
num_epochs: 4
The number of epochs represents the number of times the model will iterate over the entire training dataset.
Setting num_epochs to 4 means that the model will go through the training data 4 times. Increasing the number of epochs can allow the model to learn more from the data and potentially improve its performance.
However, training for too many epochs may lead to overfitting, where the model becomes too specialized to the training data and fails to generalize well to unseen data.
The optimal number of epochs depends on factors such as the size and complexity of the dataset, the model architecture, and the learning rate.
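One common way to catch overfitting as the epoch count grows is to track validation loss each epoch. The sketch below assumes hypothetical train_one_epoch and evaluate helpers plus train/validation loaders that are not part of this config.

```python
num_epochs = 4  # value from this config

best_val_loss = float("inf")
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper
    # Validation loss rising while training loss keeps falling suggests overfitting.
    best_val_loss = min(best_val_loss, val_loss)
    print(f"epoch {epoch + 1}: val_loss={val_loss:.4f}")
```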
optimizer: adamw_torch
The optimizer is responsible for updating the model's weights based on the computed gradients.
AdamW (Adam with decoupled Weight Decay) is an optimization algorithm that extends the original Adam optimizer by decoupling weight decay from the gradient-based update, rather than folding it into the loss as L2 regularization.
AdamW is well-suited for training deep learning models and has been shown to perform well in various tasks.
It adapts the learning rate for each parameter based on the historical gradients and incorporates weight decay to prevent overfitting.
Using AdamW as the optimizer is a reasonable choice for training LLMs.
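To make the decoupling concrete, here is a hand-rolled single AdamW-style update for one tensor. It is an illustrative sketch, not the actual torch implementation, and the beta1 and weight_decay values are assumed defaults rather than settings from this config.

```python
import torch

def adamw_step(param, grad, m, v, t, lr=3e-6, beta1=0.9, beta2=0.95,
               eps=1e-5, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied to the weights directly,
    # not folded into the gradient as L2 regularization would be.
    param = param - lr * weight_decay * param
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v

p, m, v = adamw_step(torch.ones(3), torch.tensor([0.1, -0.2, 0.3]),
                     m=torch.zeros(3), v=torch.zeros(3), t=1)
```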
adam_beta2: 0.95
adam_beta2 is a hyperparameter of the AdamW optimizer that controls the exponential decay rate for the second-moment estimates. It is typically set to a value close to 1 (e.g., 0.999), which averages the squared gradients over a long history.
Setting adam_beta2 to 0.95 shortens that history, so the optimizer reacts somewhat more quickly to recent gradients than it would with the default value. The impact of this change may vary depending on the specific task and dataset.
adam_epsilon: 0.00001
adam_epsilon is a small constant added to the denominator of the AdamW optimizer's update rule to prevent division by zero. It helps stabilize the optimization process and avoid numerical instability.
The default value of adam_epsilon is usually 1e-8 or 1e-7. Setting it to 0.00001 (1e-5) is slightly larger than the typical default, which may have a minor impact on the optimization process.
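Putting the optimizer settings together, the adamw_torch choice plausibly maps onto something like the following; the toy model and the beta1 value of 0.9 (the usual default) are assumptions not stated in this config.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # placeholder for the real model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-6,            # learning_rate
    betas=(0.9, 0.95),  # second value is adam_beta2; 0.9 is the usual beta1 default
    eps=1e-5,           # adam_epsilon
)
```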
max_grad_norm: 1.0
Gradient clipping is a technique used to prevent exploding gradients in deep neural networks.
max_grad_norm sets the maximum L2 norm of the gradients. If the norm exceeds this value, the gradients are scaled down to meet the constraint. Setting max_grad_norm to 1.0 means that the gradients will be clipped whenever their L2 norm exceeds 1.0. Gradient clipping helps stabilize the training process and prevents the gradients from becoming too large, which can lead to unstable or divergent behavior.
The optimal value of max_grad_norm may vary depending on the model architecture and the specific training task.
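In PyTorch this kind of clipping is typically done with clip_grad_norm_ right before the optimizer step; the toy model and loss below are placeholders for the real training setup.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # placeholder model
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Rescale gradients in place if their combined L2 norm exceeds max_grad_norm = 1.0;
# the returned value is the norm measured before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"pre-clip gradient norm: {total_norm.item():.4f}")
```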
lr_scheduler: cosine
The learning rate scheduler determines how the learning rate changes over the course of training.
The cosine scheduler, also known as cosine annealing, gradually decreases the learning rate following a cosine curve.
It starts with a relatively high learning rate and slowly decreases it until it reaches a minimum value at the end of training.
Cosine annealing can help the model converge more smoothly and potentially reach better local optima.
It is a popular choice for learning rate scheduling and has been shown to be effective in various deep learning tasks.
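A simplified stand-in using PyTorch's CosineAnnealingLR is sketched below; trainer implementations of a "cosine" scheduler often add a warmup phase as well, and total_steps here is a hypothetical value, not something taken from this config.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

total_steps = 1_000  # assumption: number of optimizer steps in the whole run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for _ in range(total_steps):
    optimizer.step()   # normally preceded by a forward/backward pass on a batch
    scheduler.step()   # learning rate decays along a cosine curve toward eta_min (0)
```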
learning_rate: 0.000003
The learning rate determines the step size at which the model's weights are updated during training.
Setting learning_rate to 0.000003 (3e-6) is a relatively small value. Smaller learning rates generally result in slower convergence but can lead to more stable and precise updates.
The optimal learning rate depends on factors such as the model architecture, dataset complexity, and optimizer choice.
It may be necessary to experiment with different learning rates to find the best value for a specific task.
The combination of AdamW optimizer with a cosine learning rate scheduler and gradient clipping can help stabilize the training process and potentially improve convergence.
The micro-batch size and number of epochs should be adjusted based on the available hardware resources and the specific requirements of the training task. It's important to monitor the model's performance during training and make adjustments as needed to achieve the desired results.
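As a closing illustration, here is a minimal end-to-end sketch that wires the discussed values together: AdamW with this config's betas and epsilon, a cosine schedule, clipping at 1.0, and an accumulation-aware loop. The toy model, data, and total-step calculation are assumptions standing in for the real Phi 2.0 training setup.

```python
import torch
from torch import nn

model = nn.Linear(8, 2)                     # placeholder for the real model
loss_fn = nn.CrossEntropyLoss()
batches = [(torch.randn(2, 8), torch.randint(0, 2, (2,))) for _ in range(16)]

gradient_accumulation_steps = 1
num_epochs = 4
max_grad_norm = 1.0

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6, betas=(0.9, 0.95), eps=1e-5)
total_steps = num_epochs * len(batches) // gradient_accumulation_steps
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for epoch in range(num_epochs):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches):
        loss = loss_fn(model(inputs), targets)
        (loss / gradient_accumulation_steps).backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```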