Llama3 - Sequence Configuration

Before training or fine-tuning begins, the input data must be correctly formatted and prepared.

We will be configuring the following:

  1. Sequence Length

  2. Sample Packing

  3. Padding to Sequence Length

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

sequence_len

This parameter sets the maximum allowable length, in tokens, for an input sequence.

This limit is essential since transformers process input data in fixed-size blocks.

Sequences longer than this limit are either truncated or split. Truncation means cutting off the part of the sequence that exceeds the limit, while splitting means dividing a long sequence into smaller segments, each within the maximum length.

The choice of sequence length affects memory usage and computational requirements. Longer sequences can capture more context but require more computational resources.
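As a rough illustration, the Python sketch below shows the difference between the two strategies. This is illustrative only, not Axolotl's internal code; the helper names and the 10,000-token example are hypothetical.

SEQUENCE_LEN = 4096

def truncate(token_ids):
    # Keep only the first SEQUENCE_LEN tokens; the remainder is discarded.
    return token_ids[:SEQUENCE_LEN]

def split(token_ids):
    # Divide the sequence into consecutive chunks, each at most SEQUENCE_LEN long.
    return [token_ids[i:i + SEQUENCE_LEN]
            for i in range(0, len(token_ids), SEQUENCE_LEN)]

example = list(range(10000))             # stand-in for 10,000 token IDs
print(len(truncate(example)))            # 4096 -> tokens beyond the limit are lost
print([len(c) for c in split(example)])  # [4096, 4096, 1808] -> nothing is lost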

sample_packing

A flag that determines whether sample packing should be used.

This is a method to optimize the training process by packing multiple shorter sequences into a single training example. It can increase training efficiency by reducing the amount of padding needed and by better utilizing GPU memory. This technique is particularly useful when dealing with variable-length sequences.

Implementation: if set to true, sequences shorter than sequence_len are concatenated with others to form a packed example. This process continues until the maximum sequence length is reached or no more sequences are available for packing.
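A minimal greedy sketch of the idea in Python, under simplifying assumptions: real packing implementations, including Axolotl's, also adjust attention masks and position IDs so that packed samples cannot attend to each other, which this toy version omits.

SEQUENCE_LEN = 4096

def pack(sequences):
    # First-fit packing: keep appending sequences to the current pack
    # until adding the next one would exceed SEQUENCE_LEN.
    packed, current = [], []
    for seq in sequences:
        if len(current) + len(seq) <= SEQUENCE_LEN:
            current += seq          # still fits: append to the current pack
        else:
            packed.append(current)  # pack is full: start a new one
            current = list(seq)
    if current:
        packed.append(current)
    return packed

samples = [[1] * 1500, [2] * 2000, [3] * 3000, [4] * 500]
print([len(p) for p in pack(samples)])  # [3500, 3500] -> far less padding needed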

pad_to_sequence_len

This is a flag that controls whether sequences should be padded to match the specified sequence length.

This ensures that all sequences in a batch are of the same length, which is necessary for parallel processing by the model.

If set to true, shorter sequences are extended (padded) with special tokens (usually [PAD]) until they reach the defined maximum sequence length.

Padding is standard practice when training neural networks on sequences of varying lengths, but it does introduce computational overhead, especially at longer sequence lengths, since the model still processes the padding tokens.
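As a rough sketch of what padding produces, assuming a placeholder pad token ID of 0 (the real ID is model-specific and comes from the tokenizer):

SEQUENCE_LEN = 4096
PAD_TOKEN_ID = 0  # placeholder; the actual ID comes from the tokenizer

def pad(token_ids):
    # Extend the sequence with pad tokens and build the matching
    # attention mask (1 = real token, 0 = padding).
    n_pad = SEQUENCE_LEN - len(token_ids)
    input_ids = token_ids + [PAD_TOKEN_ID] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

ids, mask = pad([101, 2023, 2003, 102])
print(len(ids), sum(mask))  # 4096 4 -> fixed length, only 4 real tokens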

This documentation is for the Axolotl community