Llama 3

Introduction

Meta Llama 3 is the next generation of Meta's open-source large language model (LLM) series.

The models perform strongly on a range of industry benchmarks and add capabilities such as improved reasoning and code generation.

Llama 3 models are available in 8B and 70B parameter sizes, and Meta plans to release larger models with over 400B parameters in the coming months.

Model Architecture

Decoder-only Transformer

Llama 3 uses a standard decoder-only transformer architecture, with improvements over its predecessor, Llama 2.

The model uses a tokenizer with a vocabulary of 128K tokens, which encodes language more efficiently and substantially improves model performance.

Grouped Query Attention (GQA)

To enhance inference efficiency, Llama 3 models adopt Grouped Query Attention (GQA) across both the 8B and 70B sizes. The models are trained on sequences of 8,192 tokens, using a mask to prevent self-attention from crossing document boundaries.
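The two ideas in this paragraph can be sketched together in NumPy. This is a minimal illustration, not Meta's implementation: the function names, head counts, and tensor shapes are assumptions chosen for the example. Several query heads share one key/value head (GQA), and the mask lets a position attend only to earlier positions in the same document.

```python
import numpy as np

def document_causal_mask(doc_ids):
    """Boolean [T, T] mask: True where query position i may attend to key position j."""
    doc_ids = np.asarray(doc_ids)
    T = doc_ids.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))        # only positions j <= i
    same_doc = doc_ids[:, None] == doc_ids[None, :]      # only within one document
    return causal & same_doc

def grouped_query_attention(q, k, v, mask):
    """q: [n_q_heads, T, d]; k, v: [n_kv_heads, T, d]; mask: [T, T] bool."""
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into KV groups"
    group = n_q_heads // n_kv_heads
    # Each group of consecutive query heads reuses the same K/V head.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)       # [n_q_heads, T, T]
    scores = np.where(mask[None, :, :], scores, -np.inf) # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example: two documents packed into one sequence of 5 tokens.
rng = np.random.default_rng(0)
mask = document_causal_mask([0, 0, 0, 1, 1])
q = rng.normal(size=(8, 5, 16))   # 8 query heads (hypothetical sizes)
k = rng.normal(size=(2, 5, 16))   # 2 shared key/value heads
v = rng.normal(size=(2, 5, 16))
out, weights = grouped_query_attention(q, k, v, mask)
print(out.shape)
```

Only the two key/value heads would need to be cached at inference time, which is where the efficiency gain comes from; the eight query heads are still computed in full.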

Training Data

Llama 3 is pretrained on over 15T tokens collected from publicly available sources.

The training dataset is seven times larger than that used for Llama 2, with four times more code.

Over 5% of the pretraining dataset consists of high-quality non-English data covering more than 30 languages, preparing for upcoming multilingual use cases.
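As a quick sanity check on these figures, the arithmetic can be worked through directly. The values below are the lower bounds quoted above ("over 15T", "over 5%"), so the results are rough floors rather than exact counts.

```python
LLAMA3_TOKENS = 15e12         # "over 15T tokens"
SIZE_RATIO_VS_LLAMA2 = 7      # "seven times larger"
NON_ENGLISH_FRACTION = 0.05   # "over 5%"

implied_llama2_tokens = LLAMA3_TOKENS / SIZE_RATIO_VS_LLAMA2
non_english_tokens = LLAMA3_TOKENS * NON_ENGLISH_FRACTION

print(f"Implied Llama 2 dataset: ~{implied_llama2_tokens / 1e12:.1f}T tokens")
print(f"Non-English data: at least ~{non_english_tokens / 1e9:.0f}B tokens")
```

The implied Llama 2 figure of roughly 2.1T tokens is consistent with the approximately 2T tokens reported for Llama 2.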

Data Filtering

To ensure the highest quality of training data, Meta developed data-filtering pipelines that include heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality.
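The shape of such a pipeline can be sketched as a chain of filters. Everything below is a hypothetical stand-in: the thresholds are invented, exact-hash deduplication replaces the semantic deduplication described above, a unique-word ratio replaces the learned quality classifier, and the NSFW filter is omitted.

```python
import hashlib
import re
from typing import Callable, Iterable, Iterator

def passes_heuristics(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Cheap rule-based checks: enough words, not dominated by symbols."""
    words = text.split()
    if not words or len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def normalized_hash(text: str) -> str:
    """Hash of the lowercased, whitespace-collapsed text (stand-in for semantic dedup)."""
    canonical = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def toy_quality_score(text: str) -> float:
    """Hypothetical placeholder for a learned quality classifier: unique-word ratio."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def filter_corpus(
    docs: Iterable[str],
    quality_fn: Callable[[str], float] = toy_quality_score,
    min_quality: float = 0.5,
) -> Iterator[str]:
    seen = set()
    for doc in docs:
        if not passes_heuristics(doc):
            continue                      # stage 1: heuristic filter
        digest = normalized_hash(doc)
        if digest in seen:
            continue                      # stage 2: deduplication
        seen.add(digest)
        if quality_fn(doc) < min_quality:
            continue                      # stage 3: quality classifier
        yield doc

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",   # near-duplicate
    "buy buy buy buy buy buy now",                     # repetitive
    "too short",
]
kept = list(filter_corpus(corpus))
print(kept)
```

Ordering the cheap checks first means the more expensive classifier only runs on documents that survive the earlier stages.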

Llama 2 was used to generate training data for the text-quality classifiers powering Llama 3.

Scaling and Performance

Scaling Laws

Meta developed detailed scaling laws for downstream benchmark evaluations, enabling optimal data mix selection and informed decisions on training compute allocation.

These scaling laws let Meta predict how the largest models will perform on key tasks before training them, which helps ensure strong performance across a range of use cases and capabilities.
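The general idea of predicting a large model's behavior from smaller runs can be sketched with a simple power-law fit. This is an illustration only: Meta's scaling laws target downstream benchmark scores and are not described in detail here, whereas this sketch fits a plain power law to loss. The data points are synthetic, generated from a known law so that the fit can be checked.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by linear least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(compute, a, b):
    """Evaluate the fitted power law at a (possibly much larger) compute budget."""
    return a * np.asarray(compute, dtype=float) ** (-b)

# Synthetic "small run" measurements (made-up numbers, not real results).
small_compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs
small_loss = 20.0 * small_compute ** (-0.05)

a, b = fit_power_law(small_compute, small_loss)
print(f"fitted a={a:.3f}, b={b:.4f}")
print(f"extrapolated loss at 1e24 FLOPs: {float(predict_loss(1e24, a, b)):.3f}")
```

Real measurements are noisy and usually need an additional irreducible-loss term, so an extrapolation like this is only as trustworthy as the functional form behind it.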

Parallelization

Llama 3's largest models combine data parallelization, model parallelization, and pipeline parallelization.
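One way to picture combining the three schemes is to give every GPU a coordinate along each parallelism axis. The sketch below assumes that "model parallelization" means tensor parallelism, and the degrees used (128 x 16 x 8) are hypothetical; the actual layout used for Llama 3 is not given here.

```python
from typing import NamedTuple

class ParallelCoords(NamedTuple):
    data: int       # which model replica this GPU belongs to
    pipeline: int   # which pipeline stage (slice of layers) it holds
    tensor: int     # which shard of each layer's weights it holds

def rank_to_coords(rank: int, dp: int, pp: int, tp: int) -> ParallelCoords:
    """Map a global GPU rank to (data, pipeline, tensor) coordinates.

    The tensor index varies fastest, so tensor-parallel peers are adjacent ranks.
    """
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return ParallelCoords(data, pipeline, tensor)

# Hypothetical degrees: 128 replicas x 16 pipeline stages x 8 tensor shards.
DP, PP, TP = 128, 16, 8
print(DP * PP * TP)                      # total GPUs in this layout
print(rank_to_coords(8, DP, PP, TP))     # first GPU of the second pipeline stage
```

Keeping tensor-parallel peers on adjacent ranks is a common choice because they exchange data most often and benefit from sitting on the same node.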

The most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when trained on 16K GPUs simultaneously.
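To put these numbers in context, a back-of-the-envelope calculation helps. The assumptions are labeled in the code: "16K" is taken as 16,000 GPUs, the peak figure is the dense BF16 rating of an NVIDIA H100 SXM (the hardware is not named in this section), and training cost uses the common approximation of 6 FLOPs per parameter per token. The 70B example is hypothetical, since the quoted throughput refers to the largest models.

```python
PER_GPU_FLOPS = 400e12        # "over 400 TFLOPS per GPU" (from the text)
NUM_GPUS = 16_000             # "16K GPUs" (assumed to mean 16,000)
ASSUMED_PEAK_FLOPS = 989e12   # assumption: H100 SXM dense BF16 peak

aggregate_flops = PER_GPU_FLOPS * NUM_GPUS
utilization = PER_GPU_FLOPS / ASSUMED_PEAK_FLOPS

def ideal_training_days(num_params: float, num_tokens: float,
                        cluster_flops: float = aggregate_flops) -> float:
    """Idealized wall-clock days, using total FLOPs ~ 6 * params * tokens."""
    total_flops = 6 * num_params * num_tokens
    return total_flops / cluster_flops / 86_400

print(f"Aggregate throughput: {aggregate_flops / 1e18:.1f} EFLOP/s")
print(f"Utilization vs assumed peak: {utilization:.0%}")
print(f"Idealized 70B on 15T tokens: {ideal_training_days(70e9, 15e12):.1f} days")
```

The result ignores restarts, evaluation, and any drop in efficiency at smaller model sizes, so it is a lower bound on real training time rather than an estimate of it.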

Instruction Fine-Tuning

Llama 3's post-training approach combines supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).
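Of these techniques, rejection sampling is the simplest to illustrate. A minimal sketch, assuming the usual best-of-n formulation: sample several candidate responses, score each with a reward function, and keep the highest-scoring one as a fine-tuning target. The generator and reward function here are toy placeholders, not real models, and the details of Meta's procedure are not given in this section.

```python
import itertools
from typing import Callable

def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> str:
    """Draw several candidates and keep the one the reward function rates highest."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    return max(candidates, key=lambda response: score(prompt, response))

def make_toy_generator(answers):
    """Hypothetical stand-in for sampling from a language model."""
    cycle = itertools.cycle(answers)
    return lambda prompt: next(cycle)

def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: checks the arithmetic answer."""
    return 1.0 if response == "4" else 0.0

best = rejection_sample("2 + 2 =", make_toy_generator(["3", "4", "5"]), toy_reward, num_candidates=3)
print(best)
```

The selected (prompt, response) pairs can then be fed back into supervised fine-tuning.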

The quality of prompts used in SFT and preference rankings used in PPO and DPO significantly influences the performance of aligned models. Training on preference rankings enables the model to learn how to select the right answer in reasoning and coding tasks.
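How a preference ranking turns into a training signal can be shown with the DPO objective as published for the general method. This is a sketch for a single preference pair, not Meta's training code; the `beta` value and the log-probabilities in the example are made up.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,   # hypothetical strength of the pull toward the reference model
) -> float:
    """DPO loss for one (chosen, rejected) pair of response log-probabilities."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    return softplus(-margin)   # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference learned yet, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen answer: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls only when the policy raises the chosen response relative to the rejected one by more than the reference model does, which is how ranking data teaches the model to pick the right answer among candidates it can already produce.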

Deployment and Availability

Llama 3 will soon be available on major platforms, including cloud providers, model API providers, and more.

The improved tokenizer efficiency and GQA contribute to maintaining the inference efficiency of the 8B model on par with Llama 2 7B, despite having 1B more parameters.
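A rough calculation shows how these two effects can offset the extra parameters. All numbers below are assumptions for illustration: a 15% reduction in token count (the upper end of what Meta has quoted for the new tokenizer), compute per token proportional to parameter count, and key/value cache sizes based on the published model configurations (32 layers and a head dimension of 128 for both models, 8 KV heads for Llama 3 8B versus 32 for Llama 2 7B, 16-bit cache entries).

```python
def relative_compute_per_text(params_new: float, params_old: float,
                              token_reduction: float) -> float:
    """Compute to process the same text, new model relative to old (1.0 = equal)."""
    return (params_new / params_old) * (1.0 - token_reduction)

def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache stored per token (keys and values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

ratio = relative_compute_per_text(8e9, 7e9, token_reduction=0.15)
llama3_8b_kv = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
llama2_7b_kv = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

print(f"Relative compute for the same text: {ratio:.2f}x")
print(f"KV cache per token: {llama3_8b_kv // 1024} KiB vs {llama2_7b_kv // 1024} KiB")
```

Under these assumptions the larger model does slightly less work per unit of text, and GQA cuts the per-token cache to a quarter, which is in line with the claim that efficiency stays on par.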

Future Developments

Meta plans to release larger Llama 3 models with over 400B parameters in the coming months.

These models are expected to add multimodality, multilingual conversation, and longer context windows, along with stronger overall capabilities. Meta also plans to publish a detailed research paper once the training of Llama 3 is complete.

This documentation is for the Axolotl community