# Tokenizer Configuration Files

The <mark style="color:blue;">**tokenizer\_config.json**</mark> and <mark style="color:blue;">**tokenizer.json**</mark> files serve different purposes in the tokenization process of the Llama3 language model.

Let's clarify the difference between the two and how they interact:

### <mark style="color:blue;">tokenizer\_config.json</mark>

* This file contains the <mark style="color:yellow;">configuration settings for the tokenizer</mark>.
* It defines the <mark style="color:yellow;">behavior and properties of the tokenizer</mark>, such as the special tokens, maximum sequence length, and input tensor names.
* The tokenizer\_config.json file <mark style="color:yellow;">specifies how the tokenizer should handle and interpret the input text during the tokenization process.</mark>
* It includes settings like the beginning-of-sequence (BOS) token, end-of-sequence (EOS) token, and whether to clean up extra spaces during tokenization.
* The tokenizer\_config.json file also defines the mapping between special token IDs and their corresponding token content in the "added\_tokens\_decoder" section.
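To make the settings above concrete, here is a heavily simplified sketch of what such a file looks like and how a program reads it. The field names follow the Hugging Face convention used by Llama3's file, but the values (token IDs, max length) are abbreviated, illustrative examples rather than the real file's contents:

```python
import json

# Illustrative, abbreviated tokenizer_config.json contents.
# Field names follow the Hugging Face convention; the specific
# values here are examples, not Llama3's actual file.
config_text = """
{
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|end_of_text|>",
  "clean_up_tokenization_spaces": true,
  "model_max_length": 8192,
  "added_tokens_decoder": {
    "128000": {"content": "<|begin_of_text|>", "special": true},
    "128001": {"content": "<|end_of_text|>", "special": true}
  }
}
"""

config = json.loads(config_text)

# The config tells the tokenizer which strings mark sequence boundaries...
print(config["bos_token"])   # <|begin_of_text|>
# ...and maps special token IDs back to their token content.
print(config["added_tokens_decoder"]["128001"]["content"])
```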

### <mark style="color:blue;">tokenizer.json</mark>

* This file <mark style="color:yellow;">contains the actual vocabulary and mappings used by the tokenizer to convert input text into token IDs.</mark>
* It defines the mapping between each word, subword, or character in the vocabulary and its corresponding unique token ID.
* The tokenizer.json file is used during the tokenization process to look up the token IDs for each word or subword in the input text.
* It is a crucial component of the tokenizer and is loaded by the tokenizer implementation to perform the actual tokenization.
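The lookup described above can be sketched with a toy vocabulary. A real tokenizer.json for Llama3 holds a vocabulary of roughly 128K entries plus BPE merge rules; the handful of made-up entries below only illustrate the word-to-ID mapping:

```python
# Toy stand-in for the "vocab" section of tokenizer.json.
# The entries are invented for illustration; a real file has ~128K.
vocab = {
    "Hello": 0,
    "Ġworld": 1,   # "Ġ" marks a leading space in byte-level BPE vocabularies
    "!": 2,
}

def encode(pieces):
    """Look up the unique token ID for each pre-split word or subword."""
    return [vocab[p] for p in pieces]

print(encode(["Hello", "Ġworld", "!"]))  # [0, 1, 2]
```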

### <mark style="color:blue;">Interaction between the two files</mark>

* The <mark style="color:blue;">**tokenizer\_config.json**</mark> file provides the configuration settings for the tokenizer, specifying how it should behave and handle special tokens.
* The <mark style="color:blue;">**tokenizer.json**</mark> file contains the actual vocabulary and mappings used by the tokenizer to convert input text into token IDs.
* During the tokenization process, the tokenizer implementation loads both files:
  * It uses the tokenizer\_config.json file to configure its behavior and special token handling.
  * It uses the tokenizer.json file to look up the token IDs for each word or subword in the input text.
* The tokenizer applies the configuration settings from tokenizer\_config.json while utilizing the vocabulary and mappings from tokenizer.json to perform the tokenization.
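The interaction can be sketched as a minimal, hypothetical tokenizer that draws special-token behavior from one file and ID lookups from the other. Real implementations (e.g. the Hugging Face `tokenizers` library) do far more, including applying BPE merge rules, but the division of labor is the same:

```python
import json

# Hypothetical minimal data standing in for the two files.
# Special-token settings come from tokenizer_config.json's role;
# the vocabulary comes from tokenizer.json's role. Values are
# illustrative, not Llama3's actual contents.
config = json.loads(
    '{"bos_token": "<|begin_of_text|>", "eos_token": "<|end_of_text|>"}'
)
vocab = {
    "<|begin_of_text|>": 128000,
    "<|end_of_text|>": 128001,
    "Hi": 0,
    "Ġthere": 1,
}

def tokenize(pieces, add_special_tokens=True):
    """Look up IDs via the vocabulary, then apply the config's
    special-token handling around the result."""
    ids = [vocab[p] for p in pieces]
    if add_special_tokens:
        ids = [vocab[config["bos_token"]]] + ids + [vocab[config["eos_token"]]]
    return ids

print(tokenize(["Hi", "Ġthere"]))  # [128000, 0, 1, 128001]
```

Note how neither file alone is sufficient: the vocabulary supplies the IDs, while the configuration decides which special tokens wrap the sequence.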
