Llama3 - Extra Hyperparameters
This is the remainder of the configuration file used for training Llama 3. We provide a full explanation of each of these options below.
Training Hyperparameters
warmup_steps: 100
Warm-up steps are a crucial part of learning rate scheduling.
Over the first 100 training steps, the learning rate increases incrementally to its target value. This gradual ramp-up helps stabilize early training by preventing the model from making excessively large updates too quickly.
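For context, here is a minimal sketch of how warmup_steps typically sits alongside the learning rate settings in the same file; the learning_rate and lr_scheduler values shown are illustrative assumptions, not part of this config.

```yaml
learning_rate: 0.0002   # illustrative peak learning rate (assumption)
lr_scheduler: cosine    # decay schedule applied after the warm-up phase (assumption)
warmup_steps: 100       # ramp the learning rate up over the first 100 steps
```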
evals_per_epoch: 4
This setting determines the frequency of evaluations within each training epoch. With a value of 4, the model will be evaluated four times per epoch, providing regular feedback on its performance. Frequent evaluations help in monitoring the model's progress and ensuring it is learning as expected.
saves_per_epoch: 1
To safeguard your training progress, the model's state is saved once every epoch. This checkpointing allows you to resume training from the last saved state after an interruption, and also lets you keep intermediate checkpoints for later evaluation or deployment.
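Taken together, the two options above control the evaluation and checkpointing cadence. A minimal sketch; the val_set_size and output_dir values are assumptions added for illustration:

```yaml
val_set_size: 0.05     # hold out 5% of the data for evaluation (assumption)
evals_per_epoch: 4     # evaluate four times per epoch
saves_per_epoch: 1     # write one checkpoint per epoch
output_dir: ./outputs  # where checkpoints are saved (assumption)
```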
DeepSpeed and FSDP
deepspeed:
DeepSpeed integration offers advanced optimizations for accelerating training and reducing memory consumption. Configuring DeepSpeed improves training efficiency, particularly for large-scale models, by sharding optimizer states, gradients, and (optionally) parameters across devices.
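In practice, the deepspeed key typically points to a DeepSpeed JSON file specifying which ZeRO stage to use. A minimal sketch, assuming a zero2.json config is available under a deepspeed_configs/ directory (the path is an assumption):

```yaml
deepspeed: deepspeed_configs/zero2.json  # ZeRO stage 2: shard optimizer states and gradients
```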
fsdp_config:
Fully Sharded Data Parallel (FSDP) is a technique to reduce memory consumption and increase the scale of distributed training. The fsdp_config allows you to customize FSDP's behavior, optimizing memory usage and computational efficiency across multiple devices.
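A minimal sketch of what an FSDP setup can look like; the specific keys and values below (sharding strategy, state-dict type, and the transformer layer class to wrap) are assumptions chosen for a Llama-style model and should be adapted to your setup:

```yaml
fsdp:
  - full_shard                            # shard parameters, gradients, and optimizer states
  - auto_wrap                             # automatically wrap submodules for sharding
fsdp_config:
  fsdp_offload_params: false              # keep parameters on GPU (set true to offload to CPU)
  fsdp_state_dict_type: FULL_STATE_DICT   # save a full, unsharded checkpoint
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
```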
weight_decay: 0.1
Weight decay is a regularization technique that prevents overfitting by penalizing large weights. A weight decay factor of 0.1 moderates weight updates, encouraging the model to learn general features rather than overfitting to the training data.
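Weight decay is applied by the optimizer, so it is usually configured next to the optimizer choice. A minimal sketch; the optimizer value shown is an illustrative assumption:

```yaml
optimizer: adamw_torch  # AdamW applies decoupled weight decay (assumed optimizer)
weight_decay: 0.1       # decay factor applied to the weights at each update
```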
Special Tokens and Token Embeddings
resize_token_embeddings_to_32x:
This configuration resizes the token embedding matrix so that its size is a multiple of 32, which is useful when added tokens enlarge the vocabulary and can improve throughput on GPU hardware. For example, a vocabulary of 32,100 tokens would be padded up to 32,128.
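A minimal sketch of when this option matters: adding new tokens grows the vocabulary, and enabling the flag pads the embedding matrix up to the next multiple of 32. The token strings below are illustrative assumptions:

```yaml
tokens:                                 # new tokens added to the vocabulary (illustrative values)
  - "<|im_start|>"
  - "<|im_end|>"
resize_token_embeddings_to_32x: true    # round the embedding size up to the next multiple of 32
```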
pad_token
Special tokens play a pivotal role in how the model processes and understands text. The padding token (pad_token) is used to fill out sequences to a uniform length, ensuring a consistent input size for the model.
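A minimal sketch of setting the padding token in the special_tokens block; <|end_of_text|> is one common choice for Llama 3, but treat the literal value as an assumption and match it to your tokenizer:

```yaml
special_tokens:
  pad_token: "<|end_of_text|>"  # token used to pad sequences to a uniform length
```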