Special Tokens
It is important to have special tokens, such as delimiters and the beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens, in your tokenizer's vocabulary. This helps you avoid tokenization issues and helps your model train better.
You can do this in axolotl like this:
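A minimal sketch of the relevant config section (the token values shown are typical Llama-style defaults and are assumptions; check your base model's tokenizer for the correct values):

```yaml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
  pad_token: "</s>"
# Extra tokens (e.g. chat-template delimiters) can also be added to the vocabulary:
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
```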
When you include these tokens in your axolotl config, axolotl adds them to the tokenizer's vocabulary.
Explanation of Special Tokens
A Hugging Face tokenizer describes its special tokens in a JSON configuration. These special tokens have specific roles in language models and their processing. Let's break down what each field means:
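As a concrete illustration, a special-token map with the fields broken down below might look like this (a representative example, not taken from any particular model):

```json
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "</s>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
```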
bos_token:
- content: "<s>". The 'beginning of sequence' token, used to indicate the start of a text sequence.
- lstrip: false. Spaces to the left (beginning) of this token are not stripped.
- normalized: false. This token is not subject to normalization during tokenization.
- rstrip: false. Spaces to the right (end) of this token are not stripped.
- single_word: false. This token is not treated as a single word.
eos_token:
- content: "</s>". The 'end of sequence' token, used to mark the end of a text sequence.
- lstrip: false. Spaces to the left of this token are not stripped.
- normalized: false. This token is not normalized.
- rstrip: false. Spaces to the right of this token are not stripped.
- single_word: false. It is not treated as a single word.
pad_token:
- "</s>". The padding token, used to fill sequences to a uniform length in batch processing. Interestingly, it is the same as the end-of-sequence token here, which is an unusual but not unheard-of configuration.
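To make the pad token's role concrete, here is a small Python sketch (the token ids are made up for illustration) showing how a batch is right-padded to a uniform length, reusing the EOS id as the pad id as in this configuration:

```python
# Illustrative only: the token ids below are assumptions, not from a real tokenizer.
EOS_ID = 2           # assumed id for "</s>"
PAD_ID = EOS_ID      # pad_token is the same as eos_token in this config

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence of token ids to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

print(pad_batch([[1, 5, 9, 2], [1, 7, 2]]))  # [[1, 5, 9, 2], [1, 7, 2, 2]]
```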
unk_token:
- content: "<unk>". The 'unknown' token, used to represent words or characters not found in the model's vocabulary.
- lstrip: false. Spaces to the left of this token are not removed.
- normalized: false. The token is not normalized.
- rstrip: false. Spaces to the right of this token are not removed.
- single_word: false. It is not treated as a single word.
This configuration is part of the tokenizer setup and dictates how the tokenizer handles these special tokens when processing text. Each special token plays a role in helping the model understand and generate text, from marking the start and end of a sequence to handling unknown words and padding sequences to a consistent length.