# Special Tokens

It is important to have special tokens such as delimiters, end-of-sequence, and beginning-of-sequence tokens in your tokenizer’s vocabulary.

This helps you avoid tokenization issues and helps your model train better.

You can do this in axolotl like this:

```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
```

When you include these tokens in your axolotl config, axolotl adds them to the tokenizer’s vocabulary.
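To make "adding tokens to the vocabulary" concrete, here is a minimal sketch using a toy dictionary vocabulary and a toy embedding table. The `add_tokens` helper, the token ids, and the two-dimensional embeddings are all hypothetical and purely illustrative. In Hugging Face `transformers`, the comparable operations are `tokenizer.add_special_tokens`, `tokenizer.add_tokens`, and `model.resize_token_embeddings`; exactly how axolotl invokes them is not covered here.

```python
def add_tokens(vocab, new_tokens):
    """Append tokens missing from the vocabulary; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # assign the next free id
            added += 1
    return added


# Toy vocabulary standing in for a pretrained tokenizer's vocabulary (hypothetical ids).
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
# Toy embedding table: one row per token id, two dimensions each.
embeddings = [[0.0, 0.0] for _ in vocab]

# Tokens from the config: the first three already exist, the delimiters are new.
num_added = add_tokens(vocab, ["<s>", "</s>", "<unk>", "<|im_start|>", "<|im_end|>"])

# Every new id needs an embedding row, otherwise lookups would go out of range.
embeddings.extend([0.0, 0.0] for _ in range(num_added))

print(num_added)                      # 2
print(vocab["<|im_start|>"])          # 5
print(len(embeddings) == len(vocab))  # True
```

The point of the sketch is that tokens already present keep their ids, while new tokens get fresh ids at the end of the vocabulary, so the model's embedding table has to grow to match.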

## <mark style="color:blue;">Explanation of Special Tokens</mark>

A Hugging Face tokenizer describes its special tokens in a JSON structure, typically stored in the `special_tokens_map.json` file that ships with the model.

These special tokens have specific roles in language models and their processing. Let's break down what each part means:

### <mark style="color:blue;">**bos\_token**</mark><mark style="color:blue;">:</mark>

* **content**: `"<s>"` - This is the 'beginning of sequence' token. It's used to indicate the start of a text sequence.
* **lstrip**: `false` - Indicates that spaces to the left (beginning) of this token should not be stripped.
* **normalized**: `false` - This token is not subject to normalization during tokenization.
* **rstrip**: `false` - Indicates that spaces to the right (end) of this token should not be stripped.
* **single\_word**: `false` - The token is not restricted to whole-word matches, so it is recognized even when it appears directly next to other characters.

### <mark style="color:blue;">**eos\_token**</mark><mark style="color:blue;">:</mark>

* **content**: `"</s>"` - This is the 'end of sequence' token, used to mark the end of a text sequence.
* **lstrip**: `false` - Spaces to the left of this token should not be stripped.
* **normalized**: `false` - This token is not normalized.
* **rstrip**: `false` - Spaces to the right of this token should not be stripped.
* **single\_word**: `false` - The token is not restricted to whole-word matches.

### <mark style="color:blue;">**pad\_token**</mark><mark style="color:blue;">:</mark>

* `"</s>"` - This is the padding token, used to fill sequences to a uniform length in batch processing. Here it is the same as the 'end of sequence' token, a configuration often used when a model ships without a dedicated padding token.
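The following is a minimal sketch of what padding to a uniform length looks like when the padding token shares an id with the end-of-sequence token. The token ids and the `pad_batch` helper are hypothetical and only illustrate the idea; they are not taken from any particular tokenizer.

```python
EOS_ID = 2  # assumption: "</s>" has id 2
PAD_ID = 2  # pad_token is the same token as eos_token


def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two toy sequences of different lengths, each ending with a real end-of-sequence token.
batch = [[1, 10, 11, EOS_ID], [1, 12, EOS_ID]]
input_ids, attention_mask = pad_batch(batch, PAD_ID)

print(input_ids)       # [[1, 10, 11, 2], [1, 12, 2, 2]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In the second row, the real end-of-sequence token and the padding share the same id, so the attention mask, not the token id, is what tells them apart.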

### <mark style="color:blue;">**unk\_token**</mark><mark style="color:blue;">:</mark>

* **content**: `"<unk>"` - This is the 'unknown' token, used to represent words or characters not found in the model's vocabulary.
* **lstrip**: `false` - Spaces to the left of this token are not removed.
* **normalized**: `false` - The token is not normalized.
* **rstrip**: `false` - Spaces to the right of this token are not removed.
* **single\_word**: `false` - The token is not restricted to whole-word matches.
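The effect of the `lstrip`, `rstrip`, and `single_word` flags can be approximated with regular expressions. This is a rough sketch of the matching behaviour only, not the Hugging Face implementation; the helper names `special_token_pattern` and `split_on_token` are hypothetical.

```python
import re


def special_token_pattern(content, lstrip=False, rstrip=False, single_word=False):
    """Build a regex that approximates how the flags change token matching."""
    pattern = re.escape(content)
    if single_word:
        # Only match when the token is not glued to word characters.
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    if lstrip:
        pattern = r"\s*" + pattern  # absorb whitespace to the left
    if rstrip:
        pattern = pattern + r"\s*"  # absorb whitespace to the right
    return re.compile(pattern)


def split_on_token(text, content, **flags):
    """Split text into pieces, emitting `content` wherever the token matches."""
    pattern = special_token_pattern(content, **flags)
    pieces, pos = [], 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(content)
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


# All flags false: surrounding spaces stay with the neighbouring text.
print(split_on_token("hi </s> there", "</s>"))
# ['hi ', '</s>', ' there']

# lstrip/rstrip true: whitespace around the token is absorbed by the match.
print(split_on_token("hi </s> there", "</s>", lstrip=True, rstrip=True))
# ['hi', '</s>', 'there']

# single_word false matches inside a word; single_word true does not.
print(split_on_token("word</s>", "</s>"))
# ['word', '</s>']
print(split_on_token("word</s>", "</s>", single_word=True))
# ['word</s>']
```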

This configuration is part of the tokenizer setup and dictates how the tokenizer handles these special tokens during the processing of text.

Each special token has a role in helping the model understand and generate text, from marking the start and end of a text sequence to dealing with unknown words and padding sequences for consistent length.
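For reference, the structure described in this section can be reconstructed from the field values listed above. The sketch below builds it as a Python dictionary and prints it as JSON. The `added_token` helper is hypothetical, and the exact layout of the file in any given model repository is an assumption.

```python
import json


def added_token(content):
    """One special-token entry with every flag set to false, as described above."""
    return {
        "content": content,
        "lstrip": False,
        "normalized": False,
        "rstrip": False,
        "single_word": False,
    }


special_tokens_map = {
    "bos_token": added_token("<s>"),
    "eos_token": added_token("</s>"),
    "pad_token": "</s>",  # described above as a plain string, same as eos
    "unk_token": added_token("<unk>"),
}

print(json.dumps(special_tokens_map, indent=2))
```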

