# Special Tokens

It is important to have special tokens, such as delimiters, an end-of-sequence (EOS) token, and a beginning-of-sequence (BOS) token, in your tokenizer’s vocabulary.

This helps you avoid tokenization issues and helps your model train better.

You can do this in axolotl like this:

```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
```

When you include these tokens in your axolotl config, axolotl adds them to the tokenizer’s vocabulary.
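
Conceptually, adding tokens just extends the vocabulary with fresh ids. The following is a minimal pure-Python sketch of that idea with a toy two-word vocabulary, not axolotl's actual implementation:

```python
# Toy illustration of adding special tokens to a vocabulary.
# This is a simplified sketch, not axolotl's actual implementation.
vocab = {"hello": 0, "world": 1}

def add_tokens(vocab, tokens):
    """Append each unseen token to the vocabulary with the next free id."""
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

add_tokens(vocab, ["<s>", "</s>", "<unk>", "<|im_start|>", "<|im_end|>"])
print(vocab["<s>"])   # first newly assigned id
print(len(vocab))     # vocabulary grew by the number of new tokens
```

Note that when the vocabulary grows, the model’s token-embedding matrix must also be resized to match the new vocabulary size; training frameworks like axolotl handle this resize for you.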

## <mark style="color:blue;">Explanation of Special Tokens</mark>

The JSON structure below, found in a Hugging Face model’s tokenizer configuration files (for example, `special_tokens_map.json` or `tokenizer_config.json`), describes the special tokens used by the tokenizer.&#x20;

These special tokens have specific roles in language models and their processing. Let's break down what each part means:

### <mark style="color:blue;">**bos\_token**</mark><mark style="color:blue;">:</mark>

* **content**: `"<s>"` - This is the 'beginning of sequence' token. It's used to indicate the start of a text sequence.
* **lstrip**: `false` - Indicates that spaces to the left (beginning) of this token should not be stripped.
* **normalized**: `false` - This token is not subject to normalization during tokenization.
* **rstrip**: `false` - Indicates that spaces to the right (end) of this token should not be stripped.
* **single\_word**: `false` - This token does not represent a single word.

### <mark style="color:blue;">**eos\_token**</mark><mark style="color:blue;">:</mark>

* **content**: `"</s>"` - This is the 'end of sequence' token, used to mark the end of a text sequence.
* **lstrip**: `false` - Spaces to the left of this token should not be stripped.
* **normalized**: `false` - This token is not normalized.
* **rstrip**: `false` - Spaces to the right of this token should not be stripped.
* **single\_word**: `false` - It's not considered a single word.
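
To make the roles of BOS and EOS concrete, here is a minimal sketch of how a tokenized sequence is typically framed with these markers (the ids here are hypothetical, for illustration only):

```python
BOS_ID, EOS_ID = 1, 2  # hypothetical ids for "<s>" and "</s>"

def frame_sequence(token_ids):
    """Wrap a tokenized sequence with beginning- and end-of-sequence markers."""
    return [BOS_ID] + token_ids + [EOS_ID]

print(frame_sequence([10, 11, 12]))  # [1, 10, 11, 12, 2]
```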

### <mark style="color:blue;">**pad\_token**</mark><mark style="color:blue;">:</mark>

* `"</s>"` - This is the padding token, used to fill in the sequence to a uniform length in batch processing. Interestingly, it's the same as the 'end of sequence' token here, which is an unusual but not unheard-of configuration.
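
Padding can be illustrated with a short sketch. As in the configuration above, the EOS id doubles as the pad id (ids are hypothetical):

```python
PAD_ID = 2  # reusing the EOS id for "</s>" as the pad token, as in this config

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

print(pad_batch([[1, 10, 2], [1, 10, 11, 12, 2]]))
# [[1, 10, 2, 2, 2], [1, 10, 11, 12, 2]]
```

In real training, an attention mask tells the model which positions are padding so they are ignored.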

### <mark style="color:blue;">**unk\_token**</mark><mark style="color:blue;">:</mark>

* **content**: `"<unk>"` - This is the 'unknown' token, used to represent words or characters not found in the model's vocabulary.
* **lstrip**: `false` - Spaces to the left of this token are not removed.
* **normalized**: `false` - The token is not normalized.
* **rstrip**: `false` - Spaces to the right of this token are not removed.
* **single\_word**: `false` - It's not considered a single word.
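
The unknown token acts as a fallback id for anything not in the vocabulary; a minimal sketch with a toy vocabulary (hypothetical ids):

```python
vocab = {"<unk>": 0, "hello": 1, "world": 2}  # toy vocabulary

def encode(words, vocab):
    """Map each word to its id, falling back to <unk> for out-of-vocabulary words."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]

print(encode(["hello", "universe"], vocab))  # [1, 0]
```

Modern subword tokenizers (BPE, SentencePiece) rarely emit `<unk>` in practice, since they can decompose unseen words into smaller known pieces, but the fallback still exists for truly unrepresentable input.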

This configuration is part of the tokenizer setup and dictates how the tokenizer handles these special tokens during the processing of text.

Each special token has a role in helping the model understand and generate text, from marking the start and end of a text sequence to dealing with unknown words and padding sequences for consistent length.
