Llama3 - Data and Precision

These configurations allow you to control various aspects of the training process, such as data handling, precision settings, and hardware utilisation.

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

Let's go through each configuration and explain its purpose and implications:

train_on_inputs: false

  • This configuration determines whether to include or mask out the human's prompt from the training labels.

  • When set to false, the model will not train on the human's prompt, meaning that the prompt will be excluded from the training labels.

  • In other words, the loss is computed only on the desired output or response; the prompt is still fed to the model as context but does not contribute to the loss.

  • This is useful when you want the model to generate responses based on the given prompts without explicitly learning to reproduce the prompts themselves.

  • By masking out the human's prompt in the labels, the model can focus on learning to produce the desired output for a given prompt, as illustrated in the sketch below.
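
To make the masking concrete, here is a minimal, hedged sketch (not Axolotl's actual code) using illustrative token ids: the prompt portion of the labels is set to -100, the ignore index of PyTorch's cross-entropy loss, so only the response tokens contribute to the loss.

```python
import torch

# Toy token ids for a prompt and its response (values are illustrative).
prompt_ids = [101, 2054, 2003, 1038, 2546, 16048, 1029]
response_ids = [1038, 2546, 16048, 2003, 1037, 2321, 1011, 2978, 4289, 102]

input_ids = torch.tensor(prompt_ids + response_ids)
labels = input_ids.clone()

# train_on_inputs: false  ->  mask the prompt portion with -100, the
# ignore_index of torch.nn.CrossEntropyLoss, so the loss (and therefore
# the gradient) is computed only on the response tokens.
labels[: len(prompt_ids)] = -100

print(labels)
```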

group_by_length: false

  • This configuration controls whether to group similarly sized data together to minimise padding during training.

  • When set to false, the data will not be grouped by length and will be processed in the order it appears in the dataset.

  • Grouping data by length can be beneficial when working with variable-length sequences, as it helps to reduce the amount of padding needed.

  • Padding is the process of adding dummy tokens to shorter sequences to match the length of the longest sequence in a batch.

  • By grouping similarly sized data together, you can minimise the amount of unnecessary padding, which can lead to more efficient memory usage and faster training.

  • However, enabling group_by_length may result in slower data loading and preprocessing, as it requires measuring the length of, and sorting, the entire dataset before training begins.

  • It's also worth noting that when group_by_length is enabled, the training loss may exhibit an oscillating pattern due to the reordering of the data. The padding savings it targets are illustrated in the sketch below.
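
The effect on padding is easy to see with a toy comparison. The sketch below is a simplified illustration, not the sampler the training framework actually uses: it batches a synthetic dataset once in its original order and once sorted by length, and counts the dummy tokens each strategy would need.

```python
import random

random.seed(0)

# Toy dataset: 1,000 sequences of token ids with widely varying lengths.
dataset = [[0] * random.randint(5, 200) for _ in range(1000)]
batch_size = 8

def wasted_padding(order):
    """Dummy tokens needed to pad each batch up to its longest sequence."""
    total = 0
    for i in range(0, len(order), batch_size):
        lengths = [len(dataset[j]) for j in order[i : i + batch_size]]
        total += max(lengths) * len(lengths) - sum(lengths)
    return total

original = list(range(len(dataset)))                       # group_by_length: false
grouped = sorted(original, key=lambda j: len(dataset[j]))  # group_by_length: true

print("padding tokens, original order:", wasted_padding(original))
print("padding tokens, grouped       :", wasted_padding(grouped))
```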

bf16: auto

  • This configuration relates to the use of BFloat16 (BF16) precision during training.

  • BFloat16 is a 16-bit floating-point format that offers a wider dynamic range compared to the more common FP16 (Half-precision) format.

  • When set to auto, the framework will automatically determine whether to use BF16 based on the available hardware and software support.

  • If the hardware (e.g., GPU) and software (e.g., PyTorch version) support BF16, it will be used for training.

  • BF16 can provide a good balance between computational efficiency and numeric precision, potentially leading to faster training times while maintaining model accuracy.

  • However, the actual performance gains may vary depending on the specific hardware and model architecture; a sketch of this kind of automatic check in plain PyTorch follows below.
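
In plain PyTorch, an "auto" style decision might look like the hedged sketch below: check whether the GPU and build support BF16, and enable autocast only if they do. This illustrates the idea rather than the framework's internal logic, and it assumes a CUDA GPU is available.

```python
import torch

# bf16: auto  ->  use BF16 only if the hardware and PyTorch build support it.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

# Autocast runs eligible ops in bfloat16 while keeping master weights in FP32.
# If BF16 is unsupported, autocast is simply disabled and FP32 is used.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=use_bf16):
    y = model(x)

print(y.dtype)  # torch.bfloat16 on supported hardware, torch.float32 otherwise
```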

fp16:

  • This configuration is related to the use of FP16 (Half-precision) during training, but in the provided configuration, it is left empty.

  • FP16 is a 16-bit floating-point format that offers reduced precision compared to the standard FP32 (Single-precision) format.

  • Using FP16 can help to reduce memory usage and accelerate training on certain hardware (e.g., NVIDIA GPUs with Tensor Cores).

  • The empty value means that FP16 mixed precision is not enabled or configured here; a typical FP16 setup is sketched below for reference.
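
For reference, enabling FP16 in plain PyTorch usually pairs float16 autocast with a gradient scaler, because FP16's narrower dynamic range makes small gradients prone to underflow. The hedged sketch below illustrates that common pattern; it is not what this particular config does, since fp16 is left unset, and it assumes a CUDA GPU.

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling guards against FP16 underflow

x = torch.randn(8, 4096, device="cuda")
target = torch.randn(8, 4096, device="cuda")

# What fp16: true would roughly correspond to (not enabled in this config):
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale the loss before backward
scaler.step(optimizer)          # unscale gradients, skip the step on inf/nan
scaler.update()                 # adjust the scale factor for the next step
```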

tf32: false

  • This configuration is specific to NVIDIA GPUs and relates to the use of TensorFloat-32 (TF32) precision.

  • TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits (19 bits of information stored in a 32-bit container). It is used by default on NVIDIA Ampere architecture GPUs (e.g., the NVIDIA A100) for certain operations, such as matrix multiplications and convolutions.

  • In this configuration it is set to false, so those operations run in full FP32; when set to true, TF32 is used for supported operations on compatible hardware.

  • TF32 offers a balance between performance and precision, providing faster computation compared to FP32 while maintaining similar accuracy.

  • Enabling TF32 can lead to improved training speeds on NVIDIA Ampere GPUs without a significant impact on model quality; the PyTorch flags it maps to are shown in the sketch below.
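
In PyTorch, this setting maps onto a pair of backend flags. The following sketch shows the switches involved (an illustration of the mechanism, not the framework's own code); with tf32: false, as in the config above, matrix multiplications stay in full FP32.

```python
import torch

# tf32: false in the config above corresponds to leaving TF32 disabled;
# flipping this to True allows TF32 math on Ampere-or-newer GPUs.
enable_tf32 = False

torch.backends.cuda.matmul.allow_tf32 = enable_tf32  # matrix multiplications
torch.backends.cudnn.allow_tf32 = enable_tf32        # cuDNN convolutions
```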
