Special Tokens

It is important to have special tokens such as delimiters, an end-of-sequence token, and a beginning-of-sequence token in your tokenizer's vocabulary.

This will help you avoid tokenization issues and help your model train better.

You can add them in your axolotl config like this:

special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

When you include these entries in your axolotl config, axolotl adds the corresponding tokens to the tokenizer's vocabulary.
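
Under the hood, this is roughly equivalent to the standard Hugging Face calls sketched below. This is a minimal illustration rather than axolotl's exact implementation, and the model name is only a placeholder:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id; substitute the base model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Map the special-token roles to their string values
tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
})

# Register the delimiter tokens so they are never split into sub-word pieces
tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])

# Grow the embedding matrix so the newly added token ids have embeddings to train
model.resize_token_embeddings(len(tokenizer))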

Explanation of Special Tokens

The JSON structure below, of the kind found in a Hugging Face model's tokenizer configuration files, describes the special tokens used by the tokenizer.

These special tokens have specific roles in language models and their processing. Let's break down what each part means:
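
A representative example is shown here for reference; the exact values for a given model come from its own tokenizer_config.json or special_tokens_map.json:

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "</s>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}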

bos_token:

  • content: "<s>" - This is the 'beginning of sequence' token. It's used to indicate the start of a text sequence.

  • lstrip: false - Indicates that spaces to the left (beginning) of this token should not be stripped.

  • normalized: false - This token is not subject to normalization during tokenization.

  • rstrip: false - Indicates that spaces to the right (end) of this token should not be stripped.

  • single_word: false - The token is not restricted to matching only whole, standalone words; it can also match inside a longer word.

eos_token:

  • content: "</s>" - This is the 'end of sequence' token, used to mark the end of a text sequence.

  • lstrip: false - Spaces to the left of this token should not be stripped.

  • normalized: false - This token is not normalized.

  • rstrip: false - Spaces to the right of this token should not be stripped.

  • single_word: false - The token is not restricted to matching only standalone words.

pad_token:

  • "</s>" - This is the padding token, used to fill in the sequence to a uniform length in batch processing. Interestingly, it's the same as the 'end of sequence' token here, which is an unusual but not unheard-of configuration.

unk_token:

  • content: "<unk>" - This is the 'unknown' token, used to represent words or characters not found in the model's vocabulary.

  • lstrip: false - Spaces to the left of this token are not removed.

  • normalized: false - The token is not normalized.

  • rstrip: false - Spaces to the right of this token are not removed.

  • single_word: false - The token is not restricted to matching only standalone words.

This configuration is part of the tokenizer setup and dictates how the tokenizer handles these special tokens during the processing of text.

Each special token has a role in helping the model understand and generate text, from marking the start and end of a text sequence to dealing with unknown words and padding sequences for consistent length.
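
To see how a particular model defines these tokens, you can load its tokenizer and inspect the corresponding attributes. A small illustrative snippet (the model id is a placeholder):

from transformers import AutoTokenizer

# Placeholder model id; use the model you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")

# Each role exposes the token string and its id
print(tokenizer.bos_token, tokenizer.bos_token_id)
print(tokenizer.eos_token, tokenizer.eos_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
print(tokenizer.unk_token, tokenizer.unk_token_id)

# Full mapping of special-token roles to strings
print(tokenizer.special_tokens_map)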
