Phi 2.0 - Sequence Configuration
Before training or fine-tuning begins, the input data must be correctly formatted and prepared.
We will configure three settings:
Sequence Length
Sample Packing
Padding to Sequence
sequence_len
This parameter sets the maximum length, in tokens, of an input sequence.
The limit matters because a transformer has a fixed maximum context length, and every sequence in a batch must fit within it.
Sequences longer than this length are either truncated or split.
Truncation means cutting off the part of the sequence that exceeds the limit, while splitting involves dividing a long sequence into smaller segments, each within the maximum length.
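The difference between truncation and splitting can be sketched in a few lines of Python. This is only an illustration on plain lists of token IDs, not Axolotl's own code; the function names truncate and split are hypothetical.

```python
from typing import List


def truncate(tokens: List[int], sequence_len: int) -> List[int]:
    """Keep the first sequence_len tokens and discard the rest."""
    return tokens[:sequence_len]


def split(tokens: List[int], sequence_len: int) -> List[List[int]]:
    """Divide tokens into consecutive segments of at most sequence_len tokens."""
    return [tokens[i:i + sequence_len] for i in range(0, len(tokens), sequence_len)]


tokens = list(range(10))  # stand-in for the token IDs of one long sample
truncated = truncate(tokens, 4)
segments = split(tokens, 4)
print(truncated)  # [0, 1, 2, 3]
print(segments)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Truncation loses everything past the limit, while splitting keeps all tokens at the cost of breaking one sample into several shorter ones.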
The choice of sequence length affects memory usage and computational requirements. Longer sequences can capture more context but require more computational resources.
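As a rough illustration of why longer sequences cost more, the sketch below counts the entries in a single attention score matrix, which in standard self-attention has one entry per pair of tokens. The helper name attention_score_entries is hypothetical, and the count ignores heads, layers, and batch size; memory-efficient attention kernels may avoid materialising the full matrix.

```python
def attention_score_entries(sequence_len: int) -> int:
    """Entries in one sequence_len x sequence_len attention score matrix."""
    return sequence_len * sequence_len


for n in (512, 1024, 2048):
    # Doubling the sequence length quadruples the number of entries.
    print(n, attention_score_entries(n))
```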
Axolotl recommends that the maximum length should typically be less than 2048, as most models have a token/context limit of 2048.
sample_packing
A flag that determines whether sample packing should be used.
This is a method to optimise the training process by packing multiple shorter sequences into a single training example. It can increase training efficiency by reducing the amount of padding needed and making better use of GPU memory. This technique is particularly useful when dealing with variable-length sequences.
Implementation: If set to true, sequences that are shorter than sequence_len are concatenated with others to form a packed sequence. This process continues until the maximum sequence length is reached or no more sequences are available for packing.
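The packing idea can be sketched as a simple greedy loop. This is a minimal illustration under stated assumptions, not Axolotl's actual packing algorithm: the function name pack_sequences is hypothetical, samples are packed in the order given, and the sketch does not track the boundaries between samples that a real implementation would need for attention masking.

```python
from typing import List


def pack_sequences(sequences: List[List[int]], sequence_len: int) -> List[List[int]]:
    """Greedily concatenate sequences, in order, into packs of at most sequence_len tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    for seq in sequences:
        seq = seq[:sequence_len]  # a sample longer than the limit is truncated first
        if current and len(current) + len(seq) > sequence_len:
            # The next sample does not fit, so close the current pack.
            packs.append(current)
            current = []
        current = current + seq
    if current:
        packs.append(current)
    return packs


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(samples, sequence_len=6)
print(packed)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Here four samples become two packed sequences, each with one unused position out of six. Without packing, padding each of the four samples to length 6 would leave 14 of 24 positions as padding.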
pad_to_sequence_len
This is a flag that controls whether sequences should be padded to match the specified sequence length.
This ensures that all sequences in a batch are of the same length, which is necessary for parallel processing by the model. Shorter sequences are extended (padded) with special tokens (usually [PAD]) to reach the defined maximum sequence length.
Padding is a standard practice in training neural networks on sequences of varying lengths, but it can introduce additional computational overhead, especially with longer sequence lengths.
If set to true, every input sequence is padded with special tokens until it reaches the length defined by sequence_len.
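Padding to a fixed length can be sketched as follows. This is an illustration only, not Axolotl's implementation: the function name pad_to_length and the pad ID of 0 are assumptions, and the attention mask (1 for real tokens, 0 for padding) is included because it is the common way to tell a model which positions to ignore.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical pad token ID; the real value depends on the tokenizer


def pad_to_length(
    tokens: List[int], sequence_len: int, pad_id: int = PAD_ID
) -> Tuple[List[int], List[int]]:
    """Right-pad tokens to sequence_len and build a matching attention mask."""
    tokens = tokens[:sequence_len]  # never exceed the limit
    n_pad = sequence_len - len(tokens)
    input_ids = tokens + [pad_id] * n_pad
    attention_mask = [1] * len(tokens) + [0] * n_pad
    return input_ids, attention_mask


ids, mask = pad_to_length([11, 12, 13], sequence_len=5)
print(ids)   # [11, 12, 13, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

Every padded position still occupies memory and compute, which is the overhead that grows with longer sequence lengths.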