# Phi 2.0 - Data and Precision

These configurations allow you to control various aspects of the training process, such as data handling, precision settings, and hardware utilisation.

```yaml
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true
```

Let's go through each configuration and explain its purpose and implications:

#### <mark style="color:blue;">`train_on_inputs: false`</mark>

* This configuration determines whether to include or mask out the human's prompt from the training labels.
* When set to `false`, the model will not train on the human's prompt, meaning that the prompt will be excluded from the training labels.
* In other words, the model will only learn from the desired output or response and not from the input prompt.
* This is useful when you want the model to generate responses based on the given prompts without explicitly learning to reproduce the prompts themselves.
* By masking out the human's prompt, the model can focus on learning the mapping between the prompt and the desired output.
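A minimal sketch of how this masking is commonly implemented: prompt positions in the label sequence are replaced with an ignore index (PyTorch's cross-entropy loss skips label `-100` by default), so only response tokens contribute to the loss. The token ids and function name below are made up for illustration.

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss ignores this label by default

def build_labels(prompt_ids, response_ids, train_on_inputs=False):
    """Build training labels for a prompt/response pair.

    With train_on_inputs=False, every prompt token is replaced by
    IGNORE_INDEX, so the loss is computed only over the response.
    """
    if train_on_inputs:
        return prompt_ids + response_ids
    return [IGNORE_INDEX] * len(prompt_ids) + response_ids

# Hypothetical token ids for illustration
prompt = [101, 2054, 2003]
response = [2009, 2003, 102]
print(build_labels(prompt, response))
# → [-100, -100, -100, 2009, 2003, 102]
```

The model still *sees* the prompt tokens as input; they are only excluded from the loss.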

#### <mark style="color:blue;">`group_by_length: false`</mark>

* This configuration controls whether to group similarly sized data together to minimise padding during training.
* When set to `false`, the data will not be grouped by length and will be processed in the order it appears in the dataset.
* Grouping data by length can be beneficial when working with variable-length sequences, as it helps to reduce the amount of padding needed.
* Padding is the process of adding dummy tokens to shorter sequences to match the length of the longest sequence in a batch.
* By grouping similarly sized data together, you can minimise the amount of unnecessary padding, which can lead to more efficient memory usage and faster training.
* However, enabling `group_by_length` may slow down data loading and preprocessing, since the length of every example must be computed and the dataset sorted before training can begin.
* It's also worth noting that when `group_by_length` is enabled, the training loss may exhibit an oscillating pattern due to the reordering of the data.
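The idea behind length grouping can be sketched in a few lines: sort example indices by length, then slice off consecutive batches, so each batch spans a narrow range of lengths and needs little padding. This is a simplification; real length-grouped samplers (e.g. in Hugging Face Transformers) also shuffle within and between groups.

```python
def length_grouped_batches(lengths, batch_size):
    """Return batches of example indices grouped by sequence length.

    Sorting first means each batch holds similarly sized sequences,
    so little padding is needed to equalise them.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# Six examples with these token counts:
lengths = [12, 3, 7, 9, 2, 11]
print(length_grouped_batches(lengths, batch_size=2))
# → [[4, 1], [2, 3], [5, 0]]  (pairs of near-equal lengths)
```

Without grouping, a batch could pair the 2-token and 12-token examples, forcing ten padding tokens; here the worst gap within a batch is one token.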

#### <mark style="color:blue;">`bf16: auto`</mark>

* This configuration relates to the use of BFloat16 (BF16) precision during training.
* BFloat16 is a 16-bit floating-point format that offers a wider dynamic range compared to the more common FP16 (Half-precision) format.
* When set to `auto`, the framework will automatically determine whether to use BF16 based on the available hardware and software support.
* If the hardware (e.g., GPU) and software (e.g., PyTorch version) support BF16, it will be used for training.
* BF16 can provide a good balance between computational efficiency and numeric precision, potentially leading to faster training times while maintaining model accuracy.
* However, the actual performance gains may vary depending on the specific hardware and model architecture.
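BF16's trade-off — it keeps FP32's 8-bit exponent but only 7 mantissa bits — can be illustrated in pure Python by rounding a value to the nearest BF16 number. This is a simplified sketch (round-to-nearest without ties-to-even, no special-value handling), not how any framework actually converts tensors.

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest BFloat16 value.

    BF16 keeps FP32's 8-bit exponent (same dynamic range) but only
    7 mantissa bits, trading precision for range relative to FP16.
    Simplified rounding, for illustration only.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000  # round, then drop the low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bf16(3.141592653589793))  # → 3.140625 (only ~3 significant decimal digits)
print(to_bf16(3.0e38))             # stays finite; FP16 overflows past ~65504
```

The second line shows the "wider dynamic range" point: `3.0e38` is representable in BF16 but would overflow FP16.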

#### <mark style="color:blue;">`fp16:`</mark>

* This configuration is related to the use of FP16 (Half-precision) during training, but in the provided configuration, it is left empty.
* FP16 is a 16-bit floating-point format that offers reduced precision compared to the standard FP32 (Single-precision) format.
* Using FP16 can help to reduce memory usage and accelerate training on certain hardware (e.g., NVIDIA GPUs with Tensor Cores).
* Here the value is left empty, meaning FP16 is not explicitly enabled; mixed precision is instead governed by the `bf16: auto` setting above.
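If you did want half precision instead of BF16 (for example on older GPUs without BF16 support), the key takes a boolean; a plausible configuration would be:

```yaml
bf16: false
fp16: true
```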

#### <mark style="color:blue;">`tf32: true`</mark>

* This configuration is specific to NVIDIA GPUs and relates to the use of TensorFloat-32 (TF32) precision.
* TF32 is a 19-bit floating-point format (1 sign bit, FP32's 8-bit exponent, and a 10-bit mantissa) that is used by default on NVIDIA Ampere architecture GPUs (e.g., NVIDIA A100) for certain operations, such as matrix multiplications and convolutions.
* When set to `true`, TF32 will be used for supported operations on compatible hardware.
* TF32 offers a balance between performance and precision, providing faster computation compared to FP32 while maintaining similar accuracy.
* Enabling TF32 can lead to improved training speeds on NVIDIA Ampere GPUs without significant impact on model quality.
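The 19-bit layout (1 sign + 8 exponent + 10 mantissa bits) can be mimicked in pure Python by cutting an FP32 value's mantissa down to 10 bits. As with the BF16 sketch above, the rounding is simplified and for illustration only; actual tensor-core rounding may differ.

```python
import struct

def to_tf32(x):
    """Round a float to TF32 precision: FP32's 8-bit exponent is kept,
    but the 23-bit mantissa is reduced to 10 bits.
    Simplified round-to-nearest, for illustration only.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + 0x1000) & 0xFFFFE000  # round, then clear the low 13 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_tf32(1 / 3))  # → 0.333251953125 (vs FP32's ≈ 0.33333334)
```

In PyTorch, the corresponding switches are `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`; because TF32 keeps FP32's exponent, enabling it changes only the low mantissa bits of results, which is why accuracy is usually unaffected.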
