Model Analysis - Configuration Parameters
This is an analysis of the model's config.json file.
These configuration parameters define the architecture and hyperparameters of the Llama 3 language model.
The model uses a decoder-only transformer architecture with 32 hidden layers, 32 attention heads per layer, and a hidden size of 4096.
The model can process sequences up to 8192 tokens long and uses the SiLU activation function in its feed-forward layers.
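For reference, the fields below show how these values would typically appear in a Hugging Face-style config.json; the key names follow the standard LlamaConfig schema and are an assumption about this particular file, not a quote from it.

```python
# Minimal sketch of the architecture-related fields described above.
# Key names assume the Hugging Face LlamaConfig naming convention.
architecture_fields = {
    "num_hidden_layers": 32,          # decoder blocks
    "num_attention_heads": 32,        # attention heads per block
    "hidden_size": 4096,              # model (embedding) dimension
    "max_position_embeddings": 8192,  # maximum sequence length in tokens
    "hidden_act": "silu",             # activation used in the MLP layers
}

# Per-head dimension implied by these values: 4096 / 32 = 128
head_dim = architecture_fields["hidden_size"] // architecture_fields["num_attention_heads"]
print(head_dim)  # 128
```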
The weights are initialised with an initializer range of 0.02, and the model uses bfloat16 precision for its weights and activations.
The vocabulary size is set to 128,256 tokens, with specific IDs assigned for the beginning-of-sequence and end-of-sequence tokens.
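If the file comes from a Hugging Face-style checkpoint, these values can be read back programmatically with the transformers library. The sketch below assumes that layout; the model path is a placeholder for the actual checkpoint or repository containing this config.json.

```python
# Hedged sketch: inspect the configuration values with transformers.
from transformers import AutoConfig

# "path/to/model" is a placeholder; substitute the real checkpoint path or repo ID.
config = AutoConfig.from_pretrained("path/to/model")

print(config.vocab_size)         # expected: 128256
print(config.bos_token_id)       # beginning-of-sequence token ID
print(config.eos_token_id)       # end-of-sequence token ID
print(config.initializer_range)  # expected: 0.02
print(config.torch_dtype)        # expected: torch.bfloat16
```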