Phi 2.0 - Model Quantization
With the model configured, we next have to determine whether we will be using quantization in the training process.
Given that Phi 2.0 is such a small model, in this case we will not be using quantization during training.
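Because Phi 2.0 fits comfortably in memory at 16-bit precision, both quantization flags described below can simply be left at false. As a rough, illustrative sketch of what that decision amounts to when the model is loaded (the Hugging Face model id and dtype here are assumptions for the example, not values taken from this guide):

```python
# Minimal sketch: loading Phi 2.0 without any quantization, in 16-bit precision.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed Hugging Face model id for Phi 2.0

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # default half precision; no load_in_8bit / load_in_4bit
    device_map="auto",
)
```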
Model Quantization
load_in_8bit: true or false
This is a configuration flag that determines whether the model should be loaded in 8-bit precision; if set to true, the model's weights are quantized to 8 bits as they are loaded.
Memory Efficiency
8-bit precision reduces the memory footprint of the model compared to higher-precision formats (such as the default 16-bit), because each weight is stored in a single byte rather than two.
Loading a model in 8-bit precision can also shorten model loading and inference times, due to the reduced computational and memory-bandwidth load compared to higher-precision formats.
While 8-bit precision is more efficient, it can slightly reduce the accuracy of the model compared to full precision (32-bit). This happens because of the reduced resolution in representing the weights and activations.
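To make the memory and speed trade-offs above concrete, here is a minimal sketch of 8-bit loading using the Hugging Face transformers integration with bitsandbytes, which is the kind of machinery a load_in_8bit flag typically drives (the model id is an assumption for the example):

```python
# Illustrative sketch of 8-bit loading via transformers + bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # store linear-layer weights as int8

model_8bit = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",               # assumed model id
    quantization_config=bnb_config,
    device_map="auto",               # place quantized layers on the available GPU(s)
)

# Each quantized weight now occupies 1 byte instead of 2 (fp16),
# roughly halving the memory footprint of the loaded model.
print(model_8bit.get_memory_footprint())
```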
The supporting academic paper
This November 2022 paper presents a novel quantization method for large language models (LLMs) that enables efficient inference without performance degradation. The key points of the paper are:
The authors develop a two-part quantization procedure called LLM.int8() that allows for the use of 8-bit matrix multiplication in feed-forward and attention projection layers of transformers, reducing memory requirements by half while maintaining full precision performance.
The authors demonstrate that LLM.int8() enables inference in LLMs with up to 175B parameters without any performance degradation, making such models more accessible to researchers and practitioners.
load_in_4bit: true or false
This is a configuration flag that determines whether the model should be loaded in 4-bit precision; if set to true, the model's weights are quantized to 4 bits as they are loaded.
4-bit precision takes the concept of memory efficiency further, halving the memory requirements compared to 8-bit. This can be crucial for deploying large models on limited hardware.
Similar to 8-bit, 4-bit precision can lead to even faster loading and inference times due to the further reduced computational requirements.
The trade-off in accuracy might be more pronounced in 4-bit precision. The reduced bit-depth means that the model's ability to represent nuanced information in weights and activations is more limited. This might affect tasks that require high precision or are sensitive to small changes in weights.
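Below is a comparable sketch for 4-bit loading via transformers and bitsandbytes. The NF4 quantization type and bfloat16 compute dtype are common choices shown here as assumptions, not settings prescribed by this guide:

```python
# Illustrative sketch of 4-bit loading via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 to limit accuracy loss
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",  # assumed model id
    quantization_config=bnb_config,
    device_map="auto",
)
```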
strict: true or false
If set to false, default weights will be used where they are missing from the adapters, rather than raising an error. This is a component of the 'bitsandbytes' library.
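Whichever library exposes it, a strict flag like this typically maps onto PyTorch's non-strict state-dict loading, where keys absent from the checkpoint simply keep their current default values. A self-contained sketch with a toy model (purely illustrative):

```python
import torch.nn as nn

# A tiny stand-in model, purely for illustration.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# Pretend the checkpoint only contains weights for the first layer.
partial_state = {k: v for k, v in model.state_dict().items() if k.startswith("0.")}

# With strict=False, keys missing from the checkpoint keep the model's default
# (freshly initialized) weights; strict=True would raise a RuntimeError instead.
result = model.load_state_dict(partial_state, strict=False)
print("keys kept at their default values:", result.missing_keys)
```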