# Phi 2.0 - Model Quantization

With the model configured, we next have to determine whether we will use quantization during the training process.

Given Phi 2.0 is such a small model, in this case we will not be using quantization during training.

### <mark style="color:blue;">Model Quantization</mark>

<pre class="language-yaml"><code class="lang-yaml"># These are the default values

llm_int8_has_fp16_weight: false
bnb_4bit_quant_type: nf4  # 4-bit NormalFloat data type
bnb_4bit_use_double_quant: true

# You can override the default values as per below

load_in_8bit: true
load_in_4bit: false
strict: false
</code></pre>

### <mark style="color:blue;">load\_in\_8bit: true or false</mark>

This is a configuration flag that determines <mark style="color:yellow;">whether the model should be loaded in</mark> <mark style="color:yellow;">**8-bit precision.**</mark> If it is set to `true`, the model's weights are loaded as 8-bit values.

#### <mark style="color:green;">**Memory Efficiency**</mark>

8-bit precision reduces the memory footprint of the model compared to higher precision formats (like the default 16-bit). This is because it requires less memory to store each weight in the model.

#### <mark style="color:green;">**Speed**</mark>

Loading a model in 8-bit precision can accelerate model loading and inference times, due to the reduced computational and memory-bandwidth load compared to higher precision formats.

#### <mark style="color:green;">**Accuracy Trade-off**</mark>

While 8-bit precision is more efficient, it can slightly reduce the accuracy of the model compared to full precision (32-bit). This happens because of the reduced resolution available for representing the weights and activations.
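To make these trade-offs concrete, here is a minimal sketch of symmetric "absmax" int8 quantization in plain Python. The function names are illustrative only, not the actual `bitsandbytes` API (which runs fused CUDA kernels):

```python
import random

def quantize_int8(xs):
    """Absmax quantization: scale values into [-127, 127] and round."""
    scale = max(abs(x) for x in xs) / 127  # one scale for the whole tensor
    return [round(x / scale) for x in xs], scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]

random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(1000)]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each quantized weight fits in 1 byte instead of 2 (fp16) or 4 (fp32),
# and the rounding error stays within half a quantization step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The error bound shrinks as the value range being quantized shrinks, which is why per-block scaling (as used by `bitsandbytes`) outperforms a single per-tensor scale.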

<details>

<summary>Reference: <mark style="color:yellow;">BitsAndBytesConfig Class</mark> <mark style="color:green;">from the Transformers Library</mark></summary>

This class is a wrapper for configuring and managing the quantization settings when loading a model using the <mark style="color:yellow;">`bitsandbytes`</mark> library.

Quantization is a technique used to reduce the memory footprint and computational cost of deep learning models by representing weights and activations with lower-precision data types, such as int8 or 4-bit floating-point numbers.&#x20;

The `bitsandbytes` library provides methods for quantizing models, and the <mark style="color:yellow;">`BitsAndBytesConfig`</mark> class *<mark style="color:yellow;">**acts as a configuration object to control the quantization settings.**</mark>*

Let's go through the main aspects of the `BitsAndBytesConfig` class:

<mark style="color:green;">Initialization</mark>

* The class takes several arguments in its constructor to configure the quantization settings.
* The main arguments are <mark style="color:yellow;">`load_in_8bit`</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">and</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`load_in_4bit`</mark>, which are mutually exclusive and determine whether to use 8-bit or 4-bit quantization.
* Other arguments include threshold values, module exclusion lists, and settings specific to the <mark style="color:yellow;">`bitsandbytes`</mark> library.

<mark style="color:green;">Properties and Setters</mark>

* The class provides properties and setters for the <mark style="color:yellow;">`load_in_4bit`</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">and</mark> <mark style="color:yellow;"></mark><mark style="color:yellow;">`load_in_8bit`</mark> attributes.
* The setters enforce the mutual exclusivity of these attributes and validate the input values.

<mark style="color:green;">Post-initialization</mark>

* The <mark style="color:yellow;">`post_init()`</mark> method is called after initialization to perform safety checks on the provided arguments.
* It ensures that the arguments have the correct data types and raises a <mark style="color:yellow;">`ValueError`</mark> if any inconsistencies are found.
* It also checks the version of the <mark style="color:yellow;">`bitsandbytes`</mark> library to ensure compatibility with 4-bit quantization.

<mark style="color:green;">Quantization Methods</mark>

* The `is_quantizable()` method returns `True` if the model is quantizable based on the `load_in_8bit` or `load_in_4bit` flags.
* The `quantization_method()` method returns the specific quantization method used, such as "llm\_int8", "fp4", or "nf4", based on the configuration.

<mark style="color:green;">Serialization</mark>

* The `to_dict()` method serializes the configuration instance to a Python dictionary, converting the PyTorch data types to strings for serialization.
* The `to_diff_dict()` method serializes only the attributes that differ from the default configuration, providing a more concise representation.

<mark style="color:green;">Representation</mark>

* The `__repr__()` method provides a string representation of the configuration instance, displaying the class name and the serialized dictionary.

The <mark style="color:yellow;">`BitsAndBytesConfig`</mark> class is designed to work seamlessly with the Transformers library and the `bitsandbytes` library for quantizing models. It provides a convenient way to configure and manage the quantization settings when loading a model.

Here are a few key points to note:

* The class supports both 8-bit and 4-bit quantization, controlled by the `load_in_8bit` and `load_in_4bit` flags.
* It allows specifying threshold values for outlier detection in 8-bit quantization, which can help maintain performance for large models.
* It provides options to exclude certain modules from quantization and to enable offloading of non-quantized parts to CPU.
* The class performs validation checks to ensure the consistency and compatibility of the provided arguments.

Overall, the `BitsAndBytesConfig` class is an important component in the Transformers library for enabling quantization of models using the `bitsandbytes` library. It provides a flexible and configurable interface to control the quantization settings and optimize the performance and memory usage of deep learning models.
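As a rough illustration of the validation pattern described above, the following simplified stand-in mirrors the mutual-exclusivity check and the quantization-method lookup. It is not the real `BitsAndBytesConfig` class, just a plain-Python sketch:

```python
class QuantConfigSketch:
    """Simplified stand-in for the checks described above (NOT the real class)."""

    def __init__(self, load_in_8bit=False, load_in_4bit=False,
                 bnb_4bit_quant_type="nf4"):
        if load_in_8bit and load_in_4bit:
            # The real class enforces the same mutual exclusivity.
            raise ValueError("load_in_8bit and load_in_4bit are mutually exclusive")
        self.load_in_8bit = load_in_8bit
        self.load_in_4bit = load_in_4bit
        self.bnb_4bit_quant_type = bnb_4bit_quant_type

    def is_quantizable(self):
        return self.load_in_8bit or self.load_in_4bit

    def quantization_method(self):
        if self.load_in_8bit:
            return "llm_int8"
        if self.load_in_4bit:
            return self.bnb_4bit_quant_type  # "nf4" or "fp4"
        return None

cfg = QuantConfigSketch(load_in_8bit=True)
# cfg.quantization_method() returns "llm_int8"
```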

</details>

### <mark style="color:blue;">The supporting academic paper</mark>

This <mark style="color:blue;">November 2022</mark> paper presents a novel quantization method for large language models (LLMs) that enables efficient inference without performance degradation. The key points of the paper are:

The authors develop a two-part quantization procedure called LLM.int8() that allows for the use of 8-bit matrix multiplication in feed-forward and attention projection layers of transformers, reducing memory requirements by half while maintaining full precision performance.

The authors demonstrate that LLM.int8() enables inference in LLMs with up to 175B parameters without any performance degradation, making such models more accessible to researchers and practitioners.

{% embed url="https://arxiv.org/abs/2208.07339" %}
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
{% endembed %}

### <mark style="color:blue;">load\_in\_4bit: true or false</mark>

This is a configuration flag that determines whether the model should be loaded in **4-bit precision**. If it is set to `true`, the model's weights are loaded as 4-bit values.

4-bit precision takes the concept of memory efficiency further, halving the memory requirements compared to 8-bit. This can be crucial for deploying large models on limited hardware.

Similar to 8-bit, 4-bit precision can lead to even faster loading and inference times due to the further reduced computational requirements.

The trade-off in accuracy might be more pronounced in 4-bit precision. The reduced bit-depth means that the model's ability to represent nuanced information in weights and activations is more limited. This might affect tasks that require high precision or are sensitive to small changes in weights.
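The difference in resolution is easy to quantify. The sketch below uses simple signed-integer grids for both bit-widths (the nf4 type used in practice is a nonlinear 4-bit code, but the bin-count intuition is the same):

```python
def quantize(xs, bits):
    """Round-trip through a symmetric signed-integer grid of the given width."""
    levels = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(x) for x in xs) / levels
    return [round(x / scale) * scale for x in xs]

weights = [i / 500 - 1 for i in range(1001)]  # evenly spread across [-1, 1]

err8 = max(abs(w - q) for w, q in zip(weights, quantize(weights, 8)))
err4 = max(abs(w - q) for w, q in zip(weights, quantize(weights, 4)))
# err4 is far larger than err8: with only 15 levels instead of 255,
# each bin is much wider, so rounding moves values further.
```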

### <mark style="color:blue;">strict: true or false</mark>

If set to `false`, default weights will be used wherever weights are missing in the adapters. The quantization settings above are implemented by the `bitsandbytes` library.

<details>

<summary>Reference: <mark style="color:green;">What is 'bits and bytes'?</mark></summary>

<mark style="color:green;">**Overview**</mark>

The <mark style="color:yellow;">**`bitsandbytes`**</mark> repository provides a lightweight wrapper around CUDA custom functions, primarily focusing on 8-bit optimizers, matrix multiplication <mark style="color:yellow;">**(**</mark><mark style="color:yellow;">**`LLM.int8()`**</mark><mark style="color:yellow;">**),**</mark> and quantization functions.&#x20;

This tool is designed to enhance the performance and efficiency of machine learning models, particularly in the context of CUDA-enabled computing environments.

<mark style="color:green;">**Key Features**</mark>

* <mark style="color:blue;">**8-bit Optimizers**</mark><mark style="color:blue;">:</mark> Specialised for reducing memory usage and improving computational efficiency.
* <mark style="color:blue;">**Matrix Multiplication (LLM.int8())**</mark><mark style="color:blue;">:</mark> Offers optimized matrix multiplication capabilities.
* <mark style="color:blue;">**Quantization Functions**</mark><mark style="color:blue;">:</mark> Includes various methods for quantizing models, contributing to reduced model sizes and potentially faster inference times.

<mark style="color:green;">**Requirements**</mark>

* Python version 3.8 or higher.
* Linux distribution (e.g. Ubuntu) with a CUDA version greater than 10.0.
* Note: CUDA 10.0 is deprecated, and future support is focused on CUDA >= 11.0 with release 0.39.0.

<mark style="color:green;">**Installation**</mark>

* Installable via pip (<mark style="color:yellow;">**`pip install bitsandbytes`**</mark>).
* In cases where compilation from source is necessary, users are encouraged to submit a bug report and follow the provided compilation instructions.

<mark style="color:green;">**Usage Highlights**</mark>

* <mark style="color:blue;">**Int8 Inference with HuggingFace Transformers**</mark><mark style="color:blue;">:</mark> Allows models to load in 8-bit for reduced memory usage.
* <mark style="color:blue;">**8-bit Optimizer Usage**</mark><mark style="color:blue;">:</mark> Users can easily switch to 8-bit optimizers by replacing their existing optimizers with the corresponding 8-bit version from `bitsandbytes`.
* <mark style="color:blue;">**Mixed 8-bit Training and Int8 Inference**</mark><mark style="color:blue;">:</mark> The library supports both mixed 8-bit training with 16-bit main weights and full 8-bit inference.

<mark style="color:green;">**Features**</mark>

* Advanced techniques for 8-bit matrix multiplication and LLM.int8() inference.
* A range of 8-bit optimizers including Adam, AdamW, RMSProp, LARS, LAMB, and Lion.
* A stable embedding layer feature for improved stability in NLP models.
* Fast and efficient algorithms for quantile estimation.

<mark style="color:green;">**Requirements & Hardware Compatibility**</mark>

* Requires Anaconda, cudatoolkit, and PyTorch.
* Compatible with NVIDIA GPUs, specifically Turing or newer for LLM.int8(), and Kepler or newer for 8-bit optimizers and quantization.
* Supports CUDA versions from 10.2 to 12.0.
* Note: The library is currently supported only on Linux distributions.

</details>

<details>

<summary><mark style="color:green;">Summary of Tim Dettmers' Presentation on 8-Bit Methods for Efficient Deep Learning</mark></summary>

His main thesis is that computationally efficient methods will accelerate progress in understanding deep learning.

<mark style="color:green;">**Key Points from Tim's Presentation**</mark>

**8-Bit Methods for Large Models**: Tim highlights the importance of making large models more accessible through quantization, which <mark style="color:yellow;">reduces the memory footprint.</mark>

**Quantization Explained**: He explains quantization as a process of <mark style="color:yellow;">converting floating-point or real representations into discrete buckets</mark>, akin to histogram binning.

**Linear vs. Nonlinear Quantization**: <mark style="color:yellow;">Linear (integer) quantization involves equally wide bins, while nonlinear quantization allows varying bin widths</mark>.
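A small experiment makes this distinction concrete. Below, equal-width (linear) bins are compared against quantile-placed (nonlinear) bins on skewed data; the helper names and numbers are illustrative only:

```python
import statistics

def linear_codebook(xs, n_bins):
    """Equal-width bin centers spanning the data range (integer-style quantization)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    return [lo + (i + 0.5) * width for i in range(n_bins)]

def quantile_codebook(xs, n_bins):
    """Bin centers placed at data quantiles: narrow bins where data is dense."""
    s = sorted(xs)
    return [s[int((i + 0.5) / n_bins * (len(s) - 1))] for i in range(n_bins)]

def mean_abs_error(xs, codebook):
    """Average distance from each value to its nearest bin center."""
    return statistics.fmean(min(abs(x - c) for c in codebook) for x in xs)

# Skewed data: most values near zero, a few large ones.
data = [0.001 * i for i in range(900)] + [5 + 0.5 * i for i in range(100)]

lin = mean_abs_error(data, linear_codebook(data, 16))
non = mean_abs_error(data, quantile_codebook(data, 16))
```

Because most values sit near zero, the quantile codebook spends most of its 16 codes there and achieves a lower mean error; this is the motivation behind nonlinear data types such as NF4.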

**Error Reduction in Quantization**: Tim illustrates how the <mark style="color:yellow;">choice of bins impacts precision and error distribution in quantized values.</mark>

**4-Bit Inference**: His recent work shows that <mark style="color:yellow;">4-bit inference is highly effective for large transformers</mark>.

**Floating Point Data Types**: The presentation delves into the structure of floating point data types, explaining the roles of exponent bits and fraction bits.

**Dynamic Exponent Data Type**: Tim introduces a <mark style="color:yellow;">unique data type he developed with a dynamic exponent</mark>, which offers flexibility in approximating large and small values with varying precision.

**8-Bit Optimizers**: The focus shifts to 8-bit optimizers, crucial for memory efficiency in training large models, particularly in language modeling.

Tim discusses <mark style="color:yellow;">reducing memory usage by approximately 40% by converting 32-bit Adam optimizer buffers to 8-bit.</mark>

This reduction is significant as it helps make large models more memory-efficient.
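One way to arrive at a figure of roughly 40%, assuming a standard mixed-precision training setup; the per-parameter byte counts below are a common accounting, not numbers taken from the presentation:

```python
# Bytes per parameter in mixed-precision training with Adam (illustrative).
fp16_weight, fp16_grad, fp32_master = 2, 2, 4
adam_m_fp32, adam_v_fp32 = 4, 4   # 32-bit first and second moments
adam_m_int8, adam_v_int8 = 1, 1   # 8-bit quantized moments

baseline = fp16_weight + fp16_grad + fp32_master + adam_m_fp32 + adam_v_fp32
with_8bit = fp16_weight + fp16_grad + fp32_master + adam_m_int8 + adam_v_int8

saving = 1 - with_8bit / baseline  # 6 of 16 bytes saved = 37.5%
```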

<mark style="color:yellow;">Outliers in Adam optimizer buffers cause issues in quantization</mark>, leading to increased error and ineffective 8-bit quantization.

Tim presents an example showing how <mark style="color:yellow;">outliers can skew the data</mark>, leading to a waste of bits and loss of effective representation.

To address the problem of outliers, Tim <mark style="color:yellow;">proposes chunking Adam states into blocks and quantizing each block independently</mark>.

This method isolates the impact of outliers to specific blocks, enhancing the stability of 8-bit optimizers.

The process involves <mark style="color:yellow;">chunking state into blocks</mark>, finding the maximum value for normalization, and storing the index for 8-bit representation.

This method ensures compact yet effective optimization, <mark style="color:yellow;">comparable to 32-bit optimizers</mark>.

This achievement indicates significant memory savings without compromising performance.
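The block-wise idea can be sketched in a few lines. With one scale for the entire state, a single outlier inflates the scale so much that typical values round to zero; with independent per-block scales, the damage is confined to the outlier's block (toy code, not the real kernels):

```python
def quantize_dequantize(xs):
    """Absmax round-trip through int8 for one chunk of values."""
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) * scale for x in xs]

def blockwise(xs, block_size):
    """Quantize each fixed-size block independently, with its own scale."""
    out = []
    for i in range(0, len(xs), block_size):
        out.extend(quantize_dequantize(xs[i:i + block_size]))
    return out

# Adam-like state: many small values plus one large outlier at the end.
state = [0.01] * 255 + [50.0]

whole  = quantize_dequantize(state)   # one scale for everything
blocks = blockwise(state, 64)         # independent scale per 64-value block

# Single scale: the outlier forces scale = 50/127, so every 0.01 entry
# rounds to 0. Block-wise: the first block's scale fits its own values.
err_whole  = max(abs(a - b) for a, b in zip(state[:64], whole[:64]))
err_blocks = max(abs(a - b) for a, b in zip(state[:64], blocks[:64]))
```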

8-bit optimizers are efficient in mapping onto hardware, with the <mark style="color:yellow;">main overhead being the dequantization process.</mark>

<mark style="color:yellow;">Outliers become a significant problem in models larger than 6.7 billion parameters</mark>, causing performance drops.

Tim's research identifies <mark style="color:yellow;">systematic outliers that emerge with scale</mark> and become problematic at specific model sizes.

Outliers in large models exhibit systematic and emergent properties, affecting the same dimensions across layers.

These outliers impact all layers in a transformer model once a certain scale is reached.

The emergence of outliers follows an exponential trend, leading to a phase shift-like effect at a certain scale.

Understanding and addressing this exponential trend is key to managing outliers in large models.

A novel approach was developed to identify and process these outliers in 16-bit while handling the rest in 8-bit, effectively maintaining efficiency while addressing the problem.

<mark style="color:green;">**Efficiency of 8-Bit Matrix Multiplication**</mark>

* By applying this method, 99.9% of weights are computed in 8-bit, with a small portion in 16-bit for outliers. This approach achieves performance equivalent to 16-bit computations while halving memory size.
* This makes large models like Llama 65B accessible on consumer hardware, significantly lowering the barrier to entry for working with such models.
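A toy version of this decomposition for a single dot product (not the real CUDA kernels; in LLM.int8() the outlier dimensions are found by thresholding, whereas here the outlier index is supplied by hand):

```python
def absmax_int8(xs):
    """Quantize a list of floats to int8 codes plus a scale."""
    scale = max(abs(x) for x in xs) / 127 or 1.0
    return [round(x / scale) for x in xs], scale

def mixed_dot(x, w, outlier_idx):
    """Dot product: outlier dimensions in full precision, the rest via int8."""
    hi = sum(x[i] * w[i] for i in outlier_idx)              # high-precision path
    rest = [i for i in range(len(x)) if i not in outlier_idx]
    qx, sx = absmax_int8([x[i] for i in rest])
    qw, sw = absmax_int8([w[i] for i in rest])
    lo = sum(a * b for a, b in zip(qx, qw)) * sx * sw       # int8 path, rescaled
    return hi + lo

x = [0.1, -0.2, 8.0, 0.05]   # dimension 2 is a large "outlier" feature
w = [0.3, 0.4, 0.2, -0.1]
exact = sum(a * b for a, b in zip(x, w))
approx = mixed_dot(x, w, outlier_idx=[2])
```

The outlier dimension carries most of the magnitude, so keeping it in full precision preserves accuracy while the remaining dimensions travel through cheap int8 arithmetic.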

<mark style="color:green;">**Few-Shot and Zero-Shot Performance**</mark>

* The few-shot performance of models using 8-bit methods is comparable to 16-bit models. Tim highlighted a strong correlation between zero-shot performance and perplexity in language models, indicating that perplexity evaluations can reliably predict zero-shot performance.

<mark style="color:green;">**Understanding Outliers in Transformer Models**</mark>

* Outliers tend to be concentrated in specific columns of the input batch and are more prevalent in larger models. These outliers are crucial for attention mechanisms in transformers.
* They are context-independent, aiding the attention mechanism in focusing on specific values by providing predictable patterns for the model to cancel out unnecessary information.

<mark style="color:green;">**Trade-Offs in Activation Functions**</mark>

* Replacing traditional activation functions like softmax with more stable alternatives can increase stability but may lead to a drop in performance.
* This presents a research challenge in balancing stability with maintaining or enhancing model performance.

<mark style="color:green;">**Impact of Precision and Parameter Count on Model Efficiency**</mark>

* An interesting finding is that models with the same number of bits but different distributions of precision and parameter count (e.g., 8-bit with more parameters vs. 4-bit with fewer parameters) exhibit the same inference latency.
* This equivalence in performance is due to the nature of GPU computations, where memory loading is significantly more costly than computation. Therefore, the memory used during inference, rather than the computational complexity, often dictates the performance.
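The arithmetic behind this equivalence is straightforward; the model sizes below are illustrative:

```python
# If inference time is dominated by streaming weights from memory, then
# total weight bytes (not parameter count alone) predicts latency.
params_a, bits_a = 14_000_000_000, 4   # 14B-parameter model at 4-bit
params_b, bits_b = 7_000_000_000, 8    # 7B-parameter model at 8-bit

bytes_a = params_a * bits_a // 8
bytes_b = params_b * bits_b // 8
# Same total bytes, so roughly the same memory-bound inference latency.
```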

</details>
