Types of Dataset Structures

Formats and Customisations

Axolotl Dataset Formats and Customisation

Axolotl is versatile in handling various dataset formats.

Below are some of the dataset structures you can use. JSONL is the recommended file format: each line of a JSONL file is a single JSON object that follows one of the schemas below.

Alpaca Format

  • Structure: {"instruction": "your_instruction", "input": "optional_input", "output": "expected_output"}

  • Ideal for scenarios where you need to provide a specific instruction along with optional input data; the output field holds the expected result. This format is particularly useful for guided, task-oriented learning (sample rows are shown below).
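
As an illustration, here are a couple of hypothetical JSONL rows in the Alpaca format (the content is invented purely for demonstration):

{"instruction": "Summarise the following paragraph in one sentence.", "input": "Axolotl supports several dataset formats, including Alpaca, ShareGPT and plain completion text.", "output": "Axolotl can train on multiple dataset formats such as Alpaca, ShareGPT and raw completions."}
{"instruction": "List three primary colours.", "input": "", "output": "Red, blue and yellow."}

Each line is a complete JSON object; the input field can be left empty when no additional context is required.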

ShareGPT Format

  • Structure: {"conversations": [{"from": "human/gpt", "value": "dialogue_text"}]}

  • This format suits conversational models where the interaction is between a human and a GPT-like assistant. It helps train models to understand and respond in a dialogue setting, reflecting real-world conversational flows (a sample row is shown below).
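
A hypothetical ShareGPT-style JSONL row (the dialogue is invented for illustration) could look like this:

{"conversations": [{"from": "human", "value": "What is LoRA fine-tuning?"}, {"from": "gpt", "value": "LoRA adds small trainable low-rank matrices to a frozen base model, so only a small fraction of the parameters are updated during fine-tuning."}]}

The from field alternates between human and gpt turns, which lets the trainer distinguish user prompts from model responses.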

Completion Format

  • Structure: {"text": "your_text_data"}

  • The completion format is straightforward and best suited to training models on raw text corpora. It's ideal for scenarios where the model needs to learn from unstructured text without specific instructions or dialogue context (sample rows and a matching configuration sketch are shown below).
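
A completion-style JSONL file simply contains one text field per line, for example (the text is invented for illustration):

{"text": "Axolotl is a tool for fine-tuning large language models using YAML-driven configuration files."}
{"text": "Raw corpora, documentation dumps and long-form articles all fit naturally into this format."}

To point Axolotl at files in any of these three formats, the datasets section of the YAML configuration names a parser for each file. The sketch below assumes local JSONL files and the built-in type names alpaca, sharegpt and completion; check the exact type names against your Axolotl version:

datasets:
  - path: data/alpaca_rows.jsonl      # hypothetical file of Alpaca-style rows
    type: alpaca
  - path: data/sharegpt_rows.jsonl    # hypothetical file of ShareGPT conversations
    type: sharegpt
  - path: data/completion_rows.jsonl  # hypothetical file of raw text rows
    type: completion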

Adding Custom Prompts

For datasets preprocessed with instruction-focused tasks:

  • Structure: {"instruction": "your_instruction", "output": "expected_output"}

  • This format supports a direct instructional approach, where the model is trained to follow specific commands or requests. It's effective for task-oriented models.

Incorporate this structure into your Axolotl YAML configuration:

datasets:
  - path: repo
    type:
      system_prompt: ""
      field_system: system
      format: "[INST] {instruction} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"

This YAML config allows for a flexible setup, enabling the model to interpret and learn from the structured instructional format.
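
To make the template concrete, a hypothetical dataset row such as {"instruction": "Translate 'good morning' into French.", "output": "Bonjour."} would be rendered by the format string above roughly as:

[INST] Translate 'good morning' into French. [/INST] Bonjour.

The text up to and including [/INST] comes from the format template, while the output field supplies the completion the model is trained to generate.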

Custom Pre-tokenized Dataset Usage

To use a custom pre-tokenized dataset:

  • Do not specify a type in your configuration.

  • Ensure your dataset columns are precisely named as input_ids, attention_mask, and labels.

This approach is beneficial when you have a dataset that is already tokenized and ready for model consumption.

It skips additional preprocessing steps, streamlining the training process for efficiency.
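
A minimal configuration sketch for this case (the path is a placeholder) simply omits the type key:

datasets:
  - path: data/pretokenized        # hypothetical dataset with input_ids, attention_mask and labels columns
    # no type entry, so the rows are treated as already tokenized

Because no prompt template is applied, the labels column should already use -100 (the standard ignore index) for any token positions that should be excluded from the loss.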

Interesting Points Regarding Datasets

  • Format Flexibility: Axolotl’s support for multiple formats allows for training models on diverse data types - from structured instructional data to informal conversational dialogues.

  • Customisability: The ability to customise datasets and their integration into the system via YAML configurations provides a high degree of control over the training process, allowing for fine-tuning specific to the desired output of the model.

  • Efficiency in Pre-tokenized Data: The support for pre-tokenized datasets is a significant time-saver, particularly in scenarios where datasets are vast and tokenization can become a computationally expensive step.

This variety and customisability make Axolotl a robust tool for training language models across different scenarios and requirements, enhancing its versatility in AI model development.
