Download cleaned Alpaca dataset

The last instruction entered was to git clone the alpaca-cleaned dataset to the local directory:

git clone https://huggingface.co/datasets/yahma/alpaca-cleaned

This command downloaded the 42 MB Hugging Face JSON dataset into the datasets directory you created earlier.

Within datasets, the cloned repository sits in a subdirectory called alpaca-cleaned. The full path is:

your primary directory/axolotl/datasets/alpaca-cleaned
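
To confirm the download, you can load the file with a few lines of Python. This is just a quick sanity check; it assumes the JSON file inside the clone is named alpaca_data_cleaned.json, so adjust the path if your copy differs:

import json

# Path follows the directory layout described above; replace the leading part with your primary directory
dataset_path = "axolotl/datasets/alpaca-cleaned/alpaca_data_cleaned.json"

with open(dataset_path, "r", encoding="utf-8") as f:
    records = json.load(f)  # the file is a single JSON array of dictionaries

print(f"Number of records: {len(records)}")
print(records[0])  # each record contains 'instruction', 'input' and 'output' keys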

The screenshot below shows the contents of the alpaca-cleaned dataset. Note that it is in JSON format and that the training set is in Alpaca format:

What is Alpaca format?

When using instruction fine-tuning, there are various formats for the training set. The Alpaca format has become one of the 'standards' for structuring such datasets.

Data Structure in alpaca_data.json

This dataset is formatted as a JSON file, where each entry is represented as a dictionary with the following key-value pairs:

Instruction (instruction):

  • Type: String (str)

  • Description: Specifies the task to be performed by the model.

Input (input):

  • Type: String (str), optional.

  • Description: Provides additional context or information needed to perform the task described in the instruction.

  • Example: If the instruction is "Summarize the following article", the input would be the text of the article.

  • Prevalence: In the original 52k Alpaca dataset, approximately 40% of entries include an input field.

Output (output):

  • Type: String (str)

  • Description: The response generated by the text-davinci-003 model, which represents the answer or completion of the task defined in the instruction.
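
Putting these fields together, a couple of illustrative entries in this format (the text here is invented for demonstration, not taken from the dataset) would look like this:

[
  {
    "instruction": "Summarize the following article.",
    "input": "The article text goes here...",
    "output": "A short summary of the article."
  },
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
  }
]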

Fine-Tuning Prompts for Alpaca Model

Two distinct prompt structures were used in the fine-tuning process, depending on whether the input field is empty or not.

For Entries with Non-Empty Input Field:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:

For Entries with Empty Input Field:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
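
To see how these two templates are applied, here is a minimal Python sketch that selects the appropriate template for a record based on whether its input field is empty. Axolotl applies this formatting for you when the dataset type is set to alpaca; the helper below is purely illustrative:

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(record: dict) -> str:
    # Use the longer template only when the optional 'input' field is non-empty
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=record["input"])
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])

example = {"instruction": "Give three tips for staying healthy.", "input": "", "output": "..."}
print(build_prompt(example))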

For a full review of the different dataset techniques and structures used in Axolotl, please see the Datasets section.

A screenshot from VS Code demonstrating the contents of the alpaca-cleaned dataset