Datasets

Dataset preprocessing in Axolotl involves several steps

  1. Parsing the dataset based on the specified dataset format.

  2. Transforming the dataset according to the chosen prompt strategy.

  3. Tokenizing the dataset using the configured model and tokenizer.

  4. Shuffling and merging multiple datasets together if more than one is used.
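Each of these steps is driven by keys in the training YAML. As a rough sketch (the model name and prompt type below are illustrative placeholders, not a tested configuration):

datasets:
  - path: /path/to/dataset1.jsonl      # step 1: the raw file that gets parsed
    type: alpaca                       # step 2: prompt strategy used to transform each record
base_model: microsoft/phi-2            # step 3: tokenization uses this model's tokenizer
tokenizer_type: AutoTokenizer
dataset_prepared_path: ./last_run_prepared   # where the tokenized result is cached

Listing more than one entry under datasets: is what triggers step 4, shuffling and merging the tokenized sets into a single training set.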

There are two ways to perform dataset preprocessing in Axolotl:

  1. Before starting the training process, by running the command:

python -m axolotl.cli.preprocess /path/to/your.yaml --debug

This approach allows you to preprocess the datasets separately from the training process.

  2. During the training process itself, in which case preprocessing happens automatically when you start training.

The benefits of preprocessing datasets include

  • Faster training iterations: When training interactively or performing sweeps (restarting the trainer frequently), preprocessing the datasets beforehand can save time and avoid the frustration of waiting for preprocessing to complete each time.

  • Caching: Axolotl caches the tokenized/formatted datasets based on a hash of dependent training parameters. This means that if the same preprocessing configuration is used, Axolotl can intelligently retrieve the preprocessed data from the cache, saving time and resources.

The path of the cache is controlled by the dataset_prepared_path: parameter in the configuration YAML file.

If left empty, the processed dataset will be cached in the default path ./last_run_prepared/ during training, but any existing cached data there will be ignored.

If you explicitly set dataset_prepared_path: ./last_run_prepared, the trainer will reuse the pre-processed data cached at that path.
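For example, the relevant line in the YAML might look like this (the path is simply the default mentioned above):

dataset_prepared_path: ./last_run_prepared
# leave the key unset to rebuild into ./last_run_prepared/ on every run, ignoring any cache already there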

Edge cases to consider

  • Custom prompt strategies or user-defined prompt templates: if you are writing a custom prompt strategy or using a user-defined prompt template, the trainer may not automatically detect changes to the prompt templating logic. If dataset_prepared_path: ... is set, the trainer may not pick up your changes and will keep using the old, cached prompts.

Here's an example of how you can implement dataset preprocessing in Axolotl:

Define your dataset configuration in the YAML file:

datasets:
  - path: /path/to/dataset1.json
    ds_type: json        # file format of the raw data
    type: alpaca         # prompt strategy applied to each record
  - path: /path/to/dataset2.csv
    ds_type: csv
    type: alpaca

dataset_prepared_path: ./preprocessed_data

Run the preprocessing command:

python -m axolotl.cli.preprocess /path/to/your.yaml --debug

This command will preprocess the datasets specified in the YAML file and cache the preprocessed data in the ./preprocessed_data directory.

Start the training process:

python -m axolotl.cli.train /path/to/your.yaml

The trainer will automatically use the preprocessed data from the cache, saving time and resources during training.
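Depending on your hardware setup, the same training step is commonly launched through Hugging Face Accelerate rather than by calling the module directly, for example on a multi-GPU machine:

accelerate launch -m axolotl.cli.train /path/to/your.yaml

Either way, the trainer reads dataset_prepared_path: from the YAML and reuses the cached data when it is present.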

Remember to be cautious when making changes to custom prompt strategies or user-defined prompt templates. If you have dataset_prepared_path: ... set, make sure to either update the cache by running the preprocessing command again or remove the dataset_prepared_path: parameter to force the trainer to preprocess the data with the updated prompt logic.
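Concretely, one way to force a clean rebuild after editing a custom prompt strategy is to delete the cached directory before re-running the preprocessing step (the path below matches the example configuration above):

rm -rf ./preprocessed_data
python -m axolotl.cli.preprocess /path/to/your.yaml --debug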

By understanding and properly implementing dataset preprocessing in Axolotl, you can optimize your training workflow, save time, and ensure that your datasets are properly prepared for training your machine learning models.
