To introduce you to the process of fine-tuning a language model, we will begin with Phi 2.0.
Phi 2.0 is a small but capable model, which makes it a great starting point for learning how to fine-tune a language model.
Phi 2.0 Review
Phi 2 is a relatively small model with 2.7 billion parameters, yet on published benchmarks it outperforms models of comparable size, such as Mamba and Google's Gemini Nano, as well as models 20-25 times its size.
Phi 2 was trained on high-quality synthetic data, including textbook-quality code, common sense reasoning, logic, science, and theory of mind exercises generated by GPT-3.5 and filtered by GPT-4. This synthetic data approach allowed for more training epochs.
Training on carefully curated synthetic data also tends to produce less toxic models: Phi 2 scores lower on toxicity benchmarks even though it has not been through reinforcement learning from human feedback (RLHF).
The Phi 2 researchers believe that enormous amounts of compute have been wasted on ineffective training data, and that carefully curated synthetic data can lead to more efficient and higher-quality models.
Phi 2's performance suggests that achieving ChatGPT-level capabilities with a 1 billion parameter model may be possible. Extrapolating further, a 1.5 trillion parameter model trained this way could potentially imitate a 1.5 quadrillion parameter model.
However, Phi models are sensitive to prompt variations, and longer prompts may cause the model to forget, ignore, or misinterpret parts of the prompt.
The Phi 2 model weights are openly released, although the full training dataset has not been published.
The key takeaway is that Phi 2 demonstrates the potential of high-quality synthetic data for training smaller, more efficient models that can rival much larger ones, an approach that could drive significant advances in AI capabilities in the near future. Its sensitivity to prompt wording, however, is a limitation to keep in mind.
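Because of this prompt sensitivity, it helps to keep prompts short and to follow the question-and-answer layout shown on the model card. The snippet below is a minimal sketch of generating text with Phi 2 in the Instruct:/Output: format using the Hugging Face transformers library; the prompt text itself is just an illustration.

```python
# Minimal sketch: prompting Phi 2 in the "Instruct: ... / Output:" QA format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches torch_dtype in config.json; use float32 on CPU
    device_map="auto",          # requires the accelerate package
)

# Keep the prompt short and explicit; Phi 2 can lose track of long, rambling prompts.
prompt = "Instruct: Explain what a tokenizer does in one sentence.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```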
Click the link below to review Phi 2.0 in the Hugging Face model repository.
The expandable sections below give you some insight into the files that ship with the Phi 2.0 model and what each one is for.
Files and Versions - Explanation of Contents
Here's an explanation of each file in the Hugging Face model card:
.gitattributes: This file is used to define attributes for different file types in the Git repository. It can specify how certain files should be treated, such as whether they should be normalized or how line endings should be handled.
LICENSE: This file contains the license under which the model is distributed. It specifies the terms and conditions for using, modifying, and distributing the model.
README.md: This file provides an overview of the model, including its purpose, usage instructions, and any other relevant information. It serves as the main documentation for the model.
added_tokens.json: This file contains information about any additional tokens that have been added to the model's vocabulary beyond the standard tokens.
config.json: This file holds the configuration settings for the model, such as the model architecture, hyperparameters, and other model-specific details.
configuration_phi.py: This Python file contains the implementation of the model's configuration class (PhiConfig), which is used to load and manage the model's configuration.
generation_config.json: This file specifies the configuration settings for text generation using the model, such as the maximum sequence length, temperature, and other generation-related parameters.
merges.txt: This file is part of the tokenizer and contains the byte pair encoding (BPE) merges used for tokenization.
model-00001-of-00002.safetensors and model-00002-of-00002.safetensors: These files contain the serialized model weights, split into two parts.
The .safetensors format is a safe, fast serialization format for tensors that supports memory-mapped loading.
model.safetensors.index.json: This file contains an index that maps each parameter name to the .safetensors shard it is stored in, along with metadata such as the total model size.
modeling_phi.py: This Python file contains the implementation of the model architecture and its forward pass.
special_tokens_map.json: This file maps special token roles (such as the beginning-of-sequence, end-of-sequence, and unknown tokens) to the token strings used for them in the model's vocabulary.
tokenizer.json: This file contains the serialized tokenizer object, which is used to tokenize input text into a format suitable for the model.
tokenizer_config.json: This file holds the configuration settings for the tokenizer, such as the vocabulary size and any special tokens.
vocab.json: This file contains the vocabulary of the model, mapping tokens to their corresponding IDs.
These files collectively define the model architecture, weights, configuration, tokenizer, and other necessary components for using the model in downstream tasks or applications.
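If you want to see this file list for yourself without opening the web page, the huggingface_hub library can enumerate the repository contents. A minimal sketch, assuming huggingface_hub is installed:

```python
# Sketch: list the files in the Phi 2 repository without downloading the weights.
from huggingface_hub import list_repo_files

for name in sorted(list_repo_files("microsoft/phi-2")):
    print(name)
# The output includes config.json, tokenizer.json, vocab.json, merges.txt,
# and the two model-*.safetensors shards described above.
```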
config.json contents and explanation
The config.json file contains the configuration settings for the Transformer-based language model called "Phi".
"_name_or_path": "microsoft/phi-2": Specifies the name or path of the pre-trained model.
"architectures": ["PhiForCausalLM"]: Indicates the architecture class used for the model, which is PhiForCausalLM (Phi model for causal language modeling).
"auto_map": { ... }: Defines the mapping between the auto classes (AutoConfig and AutoModelForCausalLM) and their corresponding implementation classes (PhiConfig and PhiForCausalLM).
"attention_dropout": 0.0: Sets the dropout probability for the attention layers. A value of 0.0 means no dropout is applied.
"bos_token_id": 50256 and "eos_token_id": 50256: Specifies the token IDs for the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens.
"embd_pdrop": 0.0: Sets the dropout probability for the embedding layers.
"hidden_act": "gelu_new": Specifies the activation function used in the hidden layers, which is the "gelu_new" variant of the Gaussian Error Linear Unit (GELU) activation.
"hidden_size": 2560: Defines the dimensionality of the model's hidden states.
"initializer_range": 0.02: Sets the range for initializing the model's weights.
"intermediate_size": 10240: Specifies the dimensionality of the intermediate (feed-forward) layers.
"layer_norm_eps": 1e-05: Sets the epsilon value for layer normalization to provide numerical stability.
"max_position_embeddings": 2048: Defines the maximum sequence length that the model can handle.
"model_type": "phi": Indicates the type of the model, which is "phi".
"num_attention_heads": 32: Specifies the number of attention heads in each attention layer.
"num_hidden_layers": 32: Defines the number of hidden layers (Transformer blocks) in the model.
"num_key_value_heads": 32: Specifies the number of key-value pairs in each attention head.
"partial_rotary_factor": 0.4: Defines the partial rotary factor used in rotary position embedding.
"qk_layernorm": false: Indicates whether layer normalization is applied to the query and key vectors in the attention mechanism.
"resid_pdrop": 0.1: Sets the dropout probability for the residual connections.
"rope_scaling": null and "rope_theta": 10000.0: Specify the scaling and theta values for RoPE (Rotary Position Embedding).
"tie_word_embeddings": false: Indicates whether the input and output word embeddings are tied (shared).
"torch_dtype": "float16": Specifies the data type used for the model's parameters (float16 for half-precision).
"transformers_version": "4.37.0": Indicates the version of the Transformers library used.
"use_cache": true: Enables caching of the model's key-value pairs during inference for faster generation.
"vocab_size": 51200: Defines the size of the model's vocabulary.
These configuration settings determine the architecture, hyperparameters, and behavior of the Phi model. They are used to initialise and configure the model during training and inference.
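You can load and inspect these settings directly with the transformers library rather than reading the raw JSON. A minimal sketch, assuming transformers is installed:

```python
# Sketch: load config.json via AutoConfig and check a few of the values described above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/phi-2")
print(config.model_type)               # phi
print(config.hidden_size)              # 2560
print(config.num_hidden_layers)        # 32
print(config.max_position_embeddings)  # 2048
print(config.vocab_size)               # 51200
```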
model.safetensors.index.json
The model.safetensors.index.json file is an index file that maps the names of the model's parameters to their corresponding locations within the .safetensors files.
In this case, the model's parameters are stored in two separate .safetensors files:
model-00001-of-00002.safetensors and model-00002-of-00002.safetensors.
The index file helps the model loading process identify where each parameter is located.
Let's break it down:
The metadata field contains information about the total size of the model parameters in bytes (5,559,367,680 bytes, which is approximately 5.6 GB).
The weight_map field is a dictionary where each key represents the name of a model parameter, and the corresponding value indicates the file in which that parameter is stored.
For example, the entry "model.embed_tokens.weight": "model-00001-of-00002.safetensors" means that the model.embed_tokens.weight parameter is stored in the model-00001-of-00002.safetensors file.
The parameter names provide information about the model architecture:
model.embed_tokens.weight represents the embedding layer weights.
model.layers.0.input_layernorm.bias and model.layers.0.input_layernorm.weight represent the layer normalization parameters for the input of the first layer.
model.layers.0.self_attn.dense.bias and model.layers.0.self_attn.dense.weight represent the parameters of the dense (output projection) layer in the self-attention mechanism of the first layer.
... and so on for each layer of the model.
The .safetensors format is a way to store the model parameters efficiently and safely. It allows for fast, memory-mapped loading and, unlike pickle-based checkpoint formats, cannot execute arbitrary code when a file is loaded.
In summary, the model.safetensors.index.json file acts as a map that tells the model loading process where to find each parameter within the .safetensors files. This enables the model to be loaded correctly and efficiently.
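Once the repository has been downloaded (which we do in the next step), you can read the index file with nothing more than the standard library. A minimal sketch, assuming the file is in the current directory:

```python
# Sketch: inspect model.safetensors.index.json to see how the parameters are sharded.
import json
from collections import Counter

with open("model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"])  # total size of the weights in bytes (~5.56 GB)

# Count how many tensors are stored in each shard file.
for shard, count in Counter(index["weight_map"].values()).items():
    print(f"{shard}: {count} tensors")

print(index["weight_map"]["model.embed_tokens.weight"])  # model-00001-of-00002.safetensors
```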
With an understanding of the model's characteristics, we will now download it to a local directory.
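One straightforward way to do this is with snapshot_download from the huggingface_hub library. A minimal sketch, assuming huggingface_hub is installed; the ./phi-2 target directory is just an example:

```python
# Sketch: download the full Phi 2 repository (weights, tokenizer, configs) to a local folder.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="microsoft/phi-2",
    local_dir="./phi-2",  # example destination; change to suit your setup
)
print(f"Model files downloaded to: {local_path}")
```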