# Llama3

### <mark style="color:blue;">Introduction</mark>

Meta Llama 3 is the next generation of Meta's open-source large language model (LLM) series.

The models perform strongly on a range of industry benchmarks and offer new capabilities, including improved reasoning and code generation.

Llama 3 models are available in 8B and 70B parameter sizes, with plans to release larger models of over 400B parameters in the coming months.

#### <mark style="color:green;">Model Architecture</mark>

Llama 3 uses a standard <mark style="color:yellow;">decoder-only transformer architecture</mark>, with several improvements over its predecessor, Llama 2.

The model employs a tokenizer with a <mark style="color:yellow;">vocabulary of 128K tokens</mark>, enabling more efficient language encoding and substantially improved model performance.

#### <mark style="color:green;">Grouped Query Attention (GQA)</mark>

To enhance inference efficiency, Llama 3 models adopt Grouped Query Attention (GQA) across both the 8B and 70B sizes. The models are trained on sequences of 8,192 tokens, using a mask to prevent self-attention from crossing document boundaries.
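The two mechanisms in this paragraph can be sketched in a few lines of NumPy. This is a minimal illustration, not Meta's implementation: the function names, tensor shapes, and the toy sizes (8 query heads sharing 2 key/value heads, a 6-token sequence packed from two documents) are all assumptions chosen for readability.

```python
import numpy as np

def document_causal_mask(doc_ids):
    """Boolean [T, T] mask: position i may attend to position j only if
    j <= i (causal) and both tokens come from the same document."""
    doc_ids = np.asarray(doc_ids)
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((len(doc_ids), len(doc_ids)), dtype=bool))
    return same_doc & causal

def grouped_query_attention(q, k, v, mask):
    """q: [n_q_heads, T, d]; k, v: [n_kv_heads, T, d]; mask: [T, T] bool."""
    n_q_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each key/value head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)        # [n_q_heads, T, T]
    scores = np.where(mask[None, :, :], scores, -np.inf)  # block masked positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 6, 4
q = rng.normal(size=(8, T, d))   # 8 query heads
k = rng.normal(size=(2, T, d))   # 2 shared key heads
v = rng.normal(size=(2, T, d))   # 2 shared value heads
mask = document_causal_mask([0, 0, 0, 1, 1, 1])  # two packed documents
out, weights = grouped_query_attention(q, k, v, mask)
```

Only the key and value tensors shrink under GQA; the output keeps one slice per query head. Tokens in the second document receive zero attention weight on tokens from the first.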

#### <mark style="color:green;">Training Data</mark>

Llama 3 is pretrained on over <mark style="color:yellow;">15T tokens</mark> collected from publicly available sources.

The training dataset is *<mark style="color:yellow;">**seven times larger than that used for Llama 2**</mark>*, with four times more code.

Over 5% of the pretraining dataset consists of high-quality non-English data covering more than 30 languages, in preparation for upcoming multilingual use cases.

#### <mark style="color:green;">Data Filtering</mark>

To ensure the highest quality of training data, Meta developed data-filtering pipelines that include heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality.

Llama 2 was used to generate training data for the text-quality classifiers powering Llama 3.
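The general shape of such a pipeline can be sketched as a chain of cheap filters followed by a quality score. Everything here is a hypothetical stand-in: the thresholds are arbitrary, exact deduplication on a normalized hash replaces semantic deduplication, and `toy_quality_score` is a placeholder for a trained text-quality classifier. The NSFW filter is omitted.

```python
import hashlib

def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Cheap rule-based checks: enough words, not dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def normalized_hash(text):
    """Hash of the lowercased, whitespace-collapsed text (exact dedup only)."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def filter_corpus(docs, quality_score, threshold=0.5):
    """Keep documents that pass heuristics, are not duplicates, and score well."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        digest = normalized_hash(doc)
        if digest in seen:
            continue
        seen.add(digest)
        if quality_score(doc) >= threshold:
            kept.append(doc)
    return kept

def toy_quality_score(text):
    """Placeholder for a learned classifier: fraction of purely alphabetic words."""
    words = text.split()
    return sum(w.isalpha() for w in words) / len(words)

docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the LAZY dog",   # duplicate after normalization
    "$$$ !!! ### @@@ %%% ^^^",                        # mostly symbols
    "too short",                                      # fails the word-count check
]
kept = filter_corpus(docs, toy_quality_score)
```

Running the cheap checks first means the comparatively expensive classifier only scores documents that survive them.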

### <mark style="color:blue;">Scaling and Performance</mark>

#### <mark style="color:green;">Scaling Laws</mark>

Meta developed detailed scaling laws for downstream benchmark evaluations, enabling optimal data mix selection and informed decisions on training compute allocation.

These scaling laws allow performance prediction for the largest models on key tasks before training, ensuring strong performance across various use cases and capabilities.
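The page does not give the form of Meta's scaling laws. As a generic illustration of the idea of predicting before training, the sketch below fits a power law `loss = a * compute ** (-b)` to small runs and extrapolates to a larger budget. The functional form, the constants, and the data points are all synthetic assumptions, generated from a known curve so the fit can be checked.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # (a, b)

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** (-b)

# Synthetic "small run" measurements drawn from loss = 50 * C^-0.05.
small_run_compute = [1e18, 1e19, 1e20, 1e21]
small_run_loss = [50.0 * c ** -0.05 for c in small_run_compute]

a, b = fit_power_law(small_run_compute, small_run_loss)
predicted_large_run_loss = predict_loss(a, b, 1e24)
```

With real measurements the points are noisy and the fit is only as trustworthy as the range it was fitted over, so an extrapolation like this is a planning aid rather than a guarantee.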

#### <mark style="color:green;">Parallelization</mark>

Llama 3's largest models combine data parallelization, model parallelization, and pipeline parallelization.

The most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when trained on 16K GPUs simultaneously.
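To put that throughput in perspective, here is a rough back-of-envelope calculation. It relies on assumptions that are not from this page: the common approximation of about 6 FLOPs per parameter per training token, "16K" read as 16,000 GPUs, and the quoted 400 TFLOPS applied to a 70B-parameter model trained on 15T tokens. Real runs include restarts and other overhead, so the result is an order-of-magnitude illustration, not a reported figure.

```python
def training_days(params, tokens, flops_per_gpu, num_gpus):
    """Idealized training time using the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    cluster_flops_per_second = flops_per_gpu * num_gpus
    return total_flops / cluster_flops_per_second / 86_400  # seconds per day

FLOPS_PER_GPU = 400e12   # "over 400 TFLOPS per GPU", taken as a lower bound
NUM_GPUS = 16_000        # assumed reading of "16K"

aggregate_flops = FLOPS_PER_GPU * NUM_GPUS
days_70b = training_days(70e9, 15e12, FLOPS_PER_GPU, NUM_GPUS)
print(f"aggregate: {aggregate_flops:.2e} FLOP/s, 70B on 15T tokens: {days_70b:.1f} days")
```

Under these assumptions the cluster sustains about 6.4 exaFLOP/s and the idealized run takes roughly 11 days.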

#### <mark style="color:green;">Instruction Fine-Tuning</mark>

Llama 3's post-training approach combines supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO).

The quality of prompts used in SFT and preference rankings used in PPO and DPO significantly influences the performance of aligned models. Training on preference rankings enables the model to learn how to select the right answer in reasoning and coding tasks.
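The preference-ranking idea can be made concrete with the standard DPO objective for a single (chosen, rejected) pair. This is a minimal sketch of the published loss, not Meta's training code: the argument names, the default `beta=0.1`, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. This is how training on rankings teaches the model to select the better answer.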

#### <mark style="color:green;">Deployment and Availability</mark>

Llama 3 will soon be available on major platforms, including cloud providers, model API providers, and more.

The improved tokenizer efficiency and GQA contribute to maintaining the inference efficiency of the 8B model on par with Llama 2 7B, despite having 1B more parameters.
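One way to see GQA's contribution is the size of the key/value cache kept per generated token. The sketch below compares the two models using layer and head counts from their publicly released configurations. Those values are not stated on this page and are treated here as assumptions, as is 16-bit storage.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector
    per KV head per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed configs: both models use 32 layers and a head dimension of 128.
# Llama 2 7B keeps 32 KV heads (multi-head attention); Llama 3 8B keeps 8 (GQA).
llama2_7b = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
llama3_8b = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(f"Llama 2 7B: {llama2_7b // 1024} KiB per token")
print(f"Llama 3 8B: {llama3_8b // 1024} KiB per token")
print(f"reduction:  {llama2_7b // llama3_8b}x")
```

Under these assumptions the cache shrinks from 512 KiB to 128 KiB per token. The larger vocabulary also tends to encode the same text in fewer tokens, so fewer decoding steps are needed.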

#### <mark style="color:green;">Future Developments</mark>

Meta plans to release larger Llama 3 models with over 400B parameters in the coming months.

These models will offer new capabilities, including multimodality, multilingual conversation, longer context windows, and stronger overall capabilities. A detailed research paper will be published once the training of Llama 3 is complete.

