Tokenizer Configuration Files

The tokenizer_config.json and tokenizer.json files serve different purposes in the tokenization process of the Llama3 language model.

Let's clarify the difference between the two and how they interact:

tokenizer_config.json

  • This file contains the configuration settings for the tokenizer.

  • It defines the behaviour and properties of the tokenizer, such as the special tokens, maximum sequence length, and input tensor names.

  • The tokenizer_config.json file specifies how the tokenizer should handle and interpret the input text during the tokenization process.

  • It includes settings like the beginning-of-sequence (BOS) token, end-of-sequence (EOS) token, and whether to clean up extra spaces during tokenization.

  • The tokenizer_config.json file also defines the mapping between special token IDs and their corresponding token content in the "added_tokens_decoder" section, as illustrated in the excerpt below.
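
For reference, here is an abbreviated, illustrative excerpt shaped like a Llama3 tokenizer_config.json. The field names and the Llama3 special tokens shown are real, but the entries are heavily trimmed and the values are representative rather than copied from a specific release; consult the file shipped with the model you are using.

```json
{
  "added_tokens_decoder": {
    "128000": { "content": "<|begin_of_text|>", "special": true },
    "128001": { "content": "<|end_of_text|>", "special": true }
  },
  "bos_token": "<|begin_of_text|>",
  "eos_token": "<|end_of_text|>",
  "clean_up_tokenization_spaces": true,
  "model_input_names": ["input_ids", "attention_mask"],
  "tokenizer_class": "PreTrainedTokenizerFast"
}
```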

tokenizer.json

  • This file contains the actual vocabulary and mappings used by the tokenizer to convert input text into token IDs.

  • It defines the mapping between each word, subword, or character in the vocabulary and its corresponding unique token ID.

  • The tokenizer.json file is used during the tokenization process to look up the token IDs for each word or subword in the input text.

  • It is a crucial component of the tokenizer and is loaded by the tokenizer implementation to perform the actual tokenization (a representative excerpt follows this list).
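
The excerpt below sketches the overall shape of a tokenizer.json file as serialized by the Hugging Face tokenizers library. The structure (added_tokens, a BPE model with vocab and merges) matches what Llama3 ships, but the vocabulary entries and token IDs here are placeholders; the real file contains the full vocabulary of roughly 128K tokens plus its merge rules.

```json
{
  "version": "1.0",
  "added_tokens": [
    { "id": 128000, "content": "<|begin_of_text|>", "special": true }
  ],
  "model": {
    "type": "BPE",
    "vocab": { "!": 0, "\"": 1, "token": 42 },
    "merges": []
  }
}
```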

Interaction between the two files

  • The tokenizer_config.json file provides the configuration settings, specifying how the tokenizer should behave and how it should handle special tokens.

  • The tokenizer.json file supplies the vocabulary and mappings used to convert input text into token IDs.

  • During tokenization, the tokenizer implementation loads both files:

    • tokenizer_config.json configures its behaviour and special-token handling.

    • tokenizer.json provides the lookup from each word or subword in the input text to its token ID.

  • The tokenizer therefore applies the settings from tokenizer_config.json while using the vocabulary and mappings from tokenizer.json to perform the tokenization, as the sketch below illustrates.
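
As a minimal sketch of how the two files are consumed together, assuming the Hugging Face transformers library and a local directory containing both files (the path ./llama3-model is a placeholder for wherever you downloaded the model):

```python
from transformers import AutoTokenizer

# from_pretrained() reads both files from the same directory:
#   - tokenizer.json        -> vocabulary and merge rules
#   - tokenizer_config.json -> special tokens and behaviour flags
tokenizer = AutoTokenizer.from_pretrained("./llama3-model")

# Special-token handling is driven by tokenizer_config.json.
print(tokenizer.bos_token, tokenizer.bos_token_id)

# The text -> token ID conversion uses the vocabulary in tokenizer.json.
ids = tokenizer("The tokenizer splits this sentence into subwords.")["input_ids"]
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```

Printing the tokens alongside their IDs is a quick way to confirm that the special tokens declared in tokenizer_config.json resolve to the IDs listed in the added_tokens_decoder section.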
