Model Analysis - Configuration Parameters
This is an analysis of the config.json file.
These configuration parameters define the architecture and hyperparameters of the Llama 3 language model.
The model uses a decoder-only transformer architecture with 32 hidden layers, each having 32 attention heads and a hidden size of 4096.
The model can process sequences up to 8192 tokens long and uses SiLU activation in the hidden layers.
The weights are initialised with a standard deviation of 0.02, and the model uses bfloat16 precision for its weights and activations.
The vocabulary size is set to 128,256 tokens, with specific IDs assigned for the beginning-of-sequence and end-of-sequence tokens.
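As a quick sanity check, these values can be read back with the Transformers library. The checkpoint name below is an assumption for illustration; substitute whichever Llama 3 checkpoint this config.json belongs to.

```python
from transformers import AutoConfig

# Assumed checkpoint name for illustration; replace with the model this config.json belongs to.
config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")

print(config.model_type)               # "llama"
print(config.num_hidden_layers)        # 32
print(config.num_attention_heads)      # 32
print(config.num_key_value_heads)      # 8
print(config.hidden_size)              # 4096
print(config.max_position_embeddings)  # 8192
```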
architectures
LlamaForCausalLM
Specifies the architecture class used for the language model. This class is designed for causal language modeling tasks, where the model predicts the next token based on the previous tokens.
attention_bias
false
Indicates whether the query, key, value, and output projections in the attention layers use bias terms. Here it is false, so those projections have no bias.
attention_dropout
0.0
The dropout probability for the attention layers. Dropout is a regularization technique that randomly sets some input units to 0 during training. In this case, dropout is not applied to the attention layers.
bos_token_id
128,000
The ID of the special token used to indicate the beginning of a sequence. This token is added to the start of the input sequence.
eos_token_id
128,001
The ID of the special token used to indicate the end of a sequence. This token is added to the end of the input sequence.
hidden_act
"silu"
The activation function used in the hidden layers of the model. In this case, the Sigmoid Linear Unit (SiLU) function is used, which is defined as x * sigmoid(x).
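A minimal sketch of this activation, assuming PyTorch; the built-in silu matches the x * sigmoid(x) definition above.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)
manual = x * torch.sigmoid(x)   # SiLU as defined: x * sigmoid(x)
builtin = F.silu(x)             # PyTorch's built-in SiLU

print(torch.allclose(manual, builtin))  # True
```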
hidden_size
4,096
The dimensionality of the hidden states in the model. This determines the size of the vectors that the model uses to represent the input and output at each layer.
initializer_range
0.02
The standard deviation of the normal distribution used to initialise the model's weight matrices. Random initialisation breaks symmetry and encourages diverse representations.
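A rough sketch of how this value is applied, assuming PyTorch and that it serves as the standard deviation of a normal initialiser, as in the Transformers Llama implementation:

```python
import torch

initializer_range = 0.02
weight = torch.empty(4096, 4096)

# Draw weights from a zero-mean normal distribution with std = initializer_range.
torch.nn.init.normal_(weight, mean=0.0, std=initializer_range)
print(round(weight.std().item(), 4))  # approximately 0.02
```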
intermediate_size
14,336
The dimensionality of the intermediate (feed-forward) layer in the model. This layer is applied after the attention layer and expands the hidden state size before projecting it back to the original hidden_size.
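A minimal sketch of this block, assuming PyTorch; Llama-style models use a gated (SwiGLU-style) feed-forward with separate gate and up projections of width intermediate_size, rather than a single expansion layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMLP(nn.Module):
    """Llama-style gated feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size=4096, intermediate_size=14336):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(1, 8, 4096)   # (batch, sequence length, hidden_size)
print(GatedMLP()(x).shape)    # torch.Size([1, 8, 4096])
```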
max_position_embeddings
8,192
The maximum sequence length that the model can process, in tokens. This limits the context size that the model can attend to.
model_type
"llama"
Specifies the type of the language model. In this case, it is the "llama" model, which is a large language model developed by Meta.
num_attention_heads
32
The number of attention heads in each attention layer. Attention heads allow the model to attend to different parts of the input sequence simultaneously, capturing different relationships and patterns.
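The per-head dimension follows from hidden_size and the head count; a small arithmetic check:

```python
hidden_size = 4096
num_attention_heads = 32

head_dim = hidden_size // num_attention_heads
print(head_dim)  # 128: each head works with 128-dimensional query, key, and value vectors
```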
num_hidden_layers
32
The number of hidden layers (transformer blocks) in the model. Each layer consists of an attention mechanism followed by a feed-forward network.
num_key_value_heads
8
The number of key-value heads in each attention layer. Because this is smaller than num_attention_heads (8 versus 32), the model uses grouped-query attention: several query heads share the same key and value projections, which shrinks the key-value cache with little loss in quality.
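A small sketch of the grouping implied by these values: with 32 query heads and 8 key-value heads, every 4 query heads share one key-value head, reducing the key and value projections (and the cache) by a factor of 4 relative to full multi-head attention.

```python
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 4096 // num_attention_heads             # 128

print(num_attention_heads // num_key_value_heads)  # 4 query heads per shared KV head

# Width of the key (or value) projection with grouped-query vs. full multi-head attention
print(num_key_value_heads * head_dim)              # 1024
print(num_attention_heads * head_dim)              # 4096
```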
pretraining_tp
1
The degree of tensor parallelism used during pretraining. Tensor parallelism splits computation across multiple devices; a value of 1 means no such splitting needs to be reproduced at inference time.
rms_norm_eps
1e-05
The epsilon value used for RMSNorm (Root Mean Square Normalization) layers. RMSNorm is a normalization technique that normalizes the activations based on their root mean square values. The epsilon is added for numerical stability.
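A minimal RMSNorm sketch, assuming PyTorch; the epsilon sits inside the square root so that near-zero activations do not cause a division by zero.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size=4096, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # Scale by the inverse root mean square of the last dimension, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

x = torch.randn(2, 5, 4096)
print(RMSNorm()(x).shape)  # torch.Size([2, 5, 4096])
```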
rope_scaling
null
Indicates that RoPE (Rotary Position Embedding) scaling is not used. RoPE is a technique for encoding positional information in the attention mechanism.
rope_theta
500,000
The base frequency (theta) for RoPE. This value sets the rotation frequencies of the rotary position embeddings; a larger base gives slower rotations, which helps the model distinguish positions over longer contexts.
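A sketch of how this base enters the rotary embeddings, assuming the standard RoPE formulation: each pair of head dimensions rotates at a frequency of theta^(-2i/d), so a larger theta means slower rotations and better separation of distant positions.

```python
import torch

head_dim = 128            # hidden_size / num_attention_heads
rope_theta = 500_000.0

# Per-pair inverse frequencies: theta^(-2i/d) for i = 0, 1, ..., d/2 - 1
inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

positions = torch.arange(8192).float()     # up to max_position_embeddings
angles = torch.outer(positions, inv_freq)  # rotation angle for each position and frequency
print(angles.shape)                        # torch.Size([8192, 64])
```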
tie_word_embeddings
false
Indicates that word embeddings are not tied with the output layer. Tying word embeddings means sharing the weights between the input embedding layer and the output softmax layer.
torch_dtype
"bfloat16"
The data type used for the model's weights and activations. In this case, Brain Floating Point 16-bit (bfloat16) is used, a 16-bit floating-point format that keeps the same exponent range as 32-bit float while halving memory use, at the cost of reduced mantissa precision.
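A quick illustration, assuming PyTorch: bfloat16 halves the storage per value while keeping roughly the same representable range as float32.

```python
import torch

x32 = torch.randn(1024, 1024, dtype=torch.float32)
x16 = x32.to(torch.bfloat16)

print(x32.element_size())               # 4 bytes per value
print(x16.element_size())               # 2 bytes per value
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, the same order of magnitude as float32's max
```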
transformers_version
4.40.0.dev0
The version of the Transformers library used. This specifies the version of the library that the model configuration is compatible with.
use_cache
true
Enables caching of key-value pairs during inference for faster generation. Caching stores the computed key-value pairs for each layer, avoiding redundant computations when generating sequences.
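A back-of-the-envelope sketch of the cache size, assuming the values above and 2 bytes per bfloat16 element; this is an estimate, not an exact memory profile.

```python
num_hidden_layers = 32
num_key_value_heads = 8
head_dim = 128                 # hidden_size / num_attention_heads
bytes_per_value = 2            # bfloat16

# Keys and values are each cached once per layer for every token in the context.
kv_bytes_per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_value
print(kv_bytes_per_token)                 # 131072 bytes, i.e. 128 KiB per token
print(kv_bytes_per_token * 8192 / 2**30)  # ~1.0 GiB at the full 8192-token context
```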
vocab_size
128,256
The size of the model's vocabulary, in tokens. This represents the number of unique tokens that the model can recognize and generate, including regular words, subwords, and special tokens.
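A quick parameter count for the embedding matrices, assuming the values above; because tie_word_embeddings is false, the input embedding and the output projection are separate matrices of the same shape.

```python
vocab_size = 128_256
hidden_size = 4_096

embedding_params = vocab_size * hidden_size
print(embedding_params)      # 525336576 parameters in the input embedding
print(2 * embedding_params)  # roughly 1.05 billion including the untied output head
```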