Phi 2.0 - Extra Hyperparameters
This is the remainder of the configuration file for training Phi 2.0. We provide a full explanation of each of these settings below.
Training Hyperparameters
warmup_steps: 100
Warm-up steps are a crucial part of learning rate scheduling.
Over the first 100 training steps, the learning rate increases incrementally to its target value. This gradual ramp-up helps stabilize early training by preventing the model from making overly large updates too soon.
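As a rough illustration, a linear warm-up can be sketched as follows; the exact shape depends on the lr_scheduler set elsewhere in the config, and the target learning rate used here is only an assumed example value:

```python
# Minimal sketch of linear warm-up, assuming a linear ramp and an example
# target learning rate of 2e-5 (not taken from the actual config).
def warmup_lr(step: int, target_lr: float, warmup_steps: int = 100) -> float:
    """Learning rate at a given step while warming up."""
    if step < warmup_steps:
        # Ramp linearly from near zero up to the target over the warm-up window.
        return target_lr * (step + 1) / warmup_steps
    return target_lr

print(warmup_lr(step=49, target_lr=2e-5))   # halfway through warm-up: ~1e-5
print(warmup_lr(step=100, target_lr=2e-5))  # warm-up finished: 2e-5
```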
evals_per_epoch: 4
This setting determines the frequency of evaluations within each training epoch.
With a value of 4, the model is evaluated four times per epoch, providing regular feedback on its performance. Frequent evaluations help monitor the model's progress and confirm it is learning as expected.
saves_per_epoch: 1
To safeguard your training progress, the model's state is saved once per epoch. This checkpointing lets you resume training from the last saved state after an interruption and also provides checkpoints for later fine-tuning or evaluation.
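For intuition, per-epoch frequencies like evals_per_epoch and saves_per_epoch are typically translated into step intervals once the number of optimizer steps per epoch is known. A minimal sketch, using a hypothetical step count that is not derived from this config:

```python
# Sketch of turning per-epoch frequencies into step intervals.
# steps_per_epoch is a placeholder value for illustration only.
def interval(steps_per_epoch: int, times_per_epoch: int) -> int:
    """How many training steps pass between two events."""
    return max(1, steps_per_epoch // times_per_epoch)

steps_per_epoch = 1000
print(interval(steps_per_epoch, times_per_epoch=4))  # evaluate every 250 steps
print(interval(steps_per_epoch, times_per_epoch=1))  # save a checkpoint once per 1000 steps
```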
DeepSpeed and FSDP
deepspeed:
DeepSpeed integration offers advanced optimizations for accelerating training and reducing memory consumption. Configuring DeepSpeed enhances training efficiency, particularly for large-scale models, by optimizing computational resources and parallelizing the workload.
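This field usually points to a DeepSpeed JSON configuration file and is left empty here. As a hedged sketch, a minimal ZeRO stage-2 config could be generated like this; the field names follow the public DeepSpeed JSON schema, but the values and filename are assumptions, not taken from this training run:

```python
import json

# Hypothetical minimal ZeRO stage-2 DeepSpeed config, for illustration only.
zero2_config = {
    "zero_optimization": {"stage": 2},        # shard optimizer state and gradients
    "bf16": {"enabled": "auto"},              # let the trainer decide the precision
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("deepspeed_zero2.json", "w") as f:
    json.dump(zero2_config, f, indent=2)
```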
fsdp_config:
Fully Sharded Data Parallel (FSDP) is a technique to reduce memory consumption and increase the scale of distributed training. The fsdp_config setting allows you to customize FSDP's behavior, optimizing memory usage and computational efficiency across multiple devices.
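Conceptually, FSDP shards parameters, gradients, and optimizer state across devices so that no single GPU has to hold the full model. A bare-bones PyTorch sketch of the wrapping step; in practice the training framework reads fsdp_config and performs this for you, with a properly initialized process group:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    # Wrapping with FSDP shards parameters, gradients, and optimizer state
    # across ranks; each GPU materializes full parameters only when needed.
    # Assumes torch.distributed has already been initialized.
    return FSDP(model)
```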
weight_decay: 0.1
Weight decay is a regularization technique that prevents overfitting by penalizing large weights.
A weight decay factor of 0.1 moderates the weight updates, encouraging the model to learn more general features rather than overfitting to the training data.
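To make the effect concrete, here is a toy sketch of decoupled (AdamW-style) weight decay, where each step shrinks a weight slightly toward zero before the gradient update; the learning rate is an assumed example value:

```python
# Toy illustration of decoupled weight decay; lr is an assumed example value.
def decay_weight(weight: float, lr: float = 2e-5, weight_decay: float = 0.1) -> float:
    # The weight is pulled slightly toward zero, independent of the gradient.
    return weight - lr * weight_decay * weight

print(decay_weight(1.0))  # 0.999998 -- a tiny shrink applied every step
```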
Special Tokens and Token Embeddings
resize_token_embeddings_to_32x:
This configuration enables resizing of the token embeddings, which is particularly useful for adapting the model to a changed vocabulary (for example, after adding special tokens).
Rather than multiplying the embedding table by 32, this option pads the vocabulary dimension up to the next multiple of 32, a size that aligns better with GPU hardware and can improve throughput.
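A quick sketch of the rounding involved; the vocabulary size shown is a placeholder, not Phi 2.0's actual value:

```python
# Pad a vocabulary size up to the next multiple of 32 so the embedding
# matrix dimensions align well with GPU tensor cores.
def round_up_to_32(vocab_size: int) -> int:
    return ((vocab_size + 31) // 32) * 32

print(round_up_to_32(50295))  # -> 50304
```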
pad_token
Special tokens play a pivotal role in how the model processes and understands text. The padding token (pad_token) is used to fill out sequences to a uniform length, ensuring a consistent input size for the model.
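A toy example of what padding does to a batch; the token ids and pad id are made up for illustration, and the actual pad_token string you set depends on the tokenizer:

```python
# Toy illustration: shorter sequences are filled with the pad id so every
# example in the batch ends up the same length.
def pad_batch(sequences: list[list[int]], pad_id: int) -> list[list[int]]:
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

print(pad_batch([[5, 9, 2], [7]], pad_id=0))  # [[5, 9, 2], [7, 0, 0]]
```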