Prompt Construction for Fine-Tuning Large Language Models

Fine-tuning large language models on custom datasets is a powerful technique to adapt foundation models to specific domains and tasks.

A critical aspect of fine-tuning is constructing effective prompts that present the input data to the model in an optimal format for learning.

This guide covers strategies and best practices for prompt construction when fine-tuning models using tools like Axolotl. It assumes a basic familiarity with language model fine-tuning.

Prompt Templates

A common approach is to use prompt templates that provide a standard structure for the model inputs and outputs. Popular prompt templates include:

Alpaca format

{
  "instruction": "...",
  "input": "...", 
  "output": "..."
}

ShareGPT format

{
  "conversations": [
    {
      "from": "human",
      "value": "..."
    },
    {
      "from": "assistant", 
      "value": "..."
    }
  ]
}
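
To make concrete how a templated record becomes model input, the sketch below shows one way an Alpaca-style record might be rendered into a single prompt string before tokenization. The wording approximates the commonly used Alpaca template; Axolotl's built-in prompters handle this rendering for you, so this is purely illustrative.

def render_alpaca(example: dict) -> str:
    """Render an {instruction, input, output} record as one training string."""
    if example.get("input"):
        prompt = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            "### Response:\n"
        )
    else:
        prompt = (
            "Below is an instruction that describes a task. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            "### Response:\n"
        )
    return prompt + example["output"]

print(render_alpaca({
    "instruction": "Summarize the text.",
    "input": "Large language models can be fine-tuned on custom datasets.",
    "output": "LLMs can be adapted to custom data through fine-tuning.",
}))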

Prompt templates offer several benefits:

  • Provide a clear demarcation between inputs and outputs

  • Enforce a consistent structure across examples

  • Allow specifying roles like "human" and "assistant"

However, there are also some drawbacks to templated prompts:

  • Can add unnecessary boilerplate and special tokens

  • May enforce a conversational structure when a direct mapping of inputs to outputs is sufficient

  • Limit flexibility to only the roles and format dictated by the template

Template-Free Prompts with input_output Format

For situations where a prompt template is overly restrictive, Axolotl supports the input_output format for constructing template-free prompts.

With input_output, you provide a list of text segments, each with a boolean label indicating whether that segment's tokens should be trained on (label: true) or masked out of the loss calculation (label: false).

input_output format example

{
  "segments": [
    {
      "text": "<s>Hello\n",
      "label": true
    },
    {
      "text": "Hi there! ",
      "label": true
    },
    {
      "text": "Goodbye ",
      "label": false
    },
    {
      "text": "Farewell</s>",
      "label": true
    }
  ]
}

Configuring dataset for input_output format:

datasets:
  - path: data.jsonl
    type: input_output

train_on_inputs: false   # mask the label:false segments so they are excluded from the loss
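
The data.jsonl file referenced above is expected to contain one JSON record per line. A minimal sketch of producing such a file, reusing the illustrative segments from the example, might look like this:

import json

records = [
    {
        "segments": [
            {"text": "<s>Hello\n", "label": True},
            {"text": "Hi there! ", "label": True},
            {"text": "Goodbye ", "label": False},
            {"text": "Farewell</s>", "label": True},
        ]
    }
]

# One JSON object per line, matching the path given in the config above
with open("data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")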

Some other prompt construction ideas

Here are three ideas for alternative formats you could use when fine-tuning a large language model:

Tree-structured format

{
  "root": {
    "text": "Once upon a time...",
    "children": [
      {
        "text": "There was a young girl named Lily.",
        "children": [...]
      },
      {
        "text": "She lived in a small village.",
        "children": [...]
      }
    ]
  }
}

This format could be useful for fine-tuning on hierarchical data like stories, articles, or dialogue trees. The model would learn to generate text that follows a coherent structure.
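
Axolotl has no built-in loader for a tree structure like this, so you would flatten it into linear text (or into input_output segments) during preprocessing. A hypothetical depth-first flattening might look like:

def flatten_tree(node: dict) -> str:
    """Depth-first traversal that joins node texts into a single passage."""
    text = node["text"]
    for child in node.get("children", []):
        text += "\n" + flatten_tree(child)
    return text

story = {
    "text": "Once upon a time...",
    "children": [
        {"text": "There was a young girl named Lily.", "children": []},
        {"text": "She lived in a small village.", "children": []},
    ],
}
print(flatten_tree(story))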

Linked-segments format

{
  "segments": [
    {
      "id": 1,
      "text": "Paris is the capital of France.",
      "links": [2, 3]
    },
    {
      "id": 2, 
      "text": "It is known for its art, fashion, and cuisine.",
      "links": [4]
    },
    {
      "id": 3,
      "text": "The Eiffel Tower is a famous landmark in Paris.",
      "links": [4]  
    },
    {
      "id": 4,
      "text": "Paris attracts millions of tourists every year."
    }
  ]
}

The linked-segments format allows specifying relationships between different parts of the input.

The model could learn to generate coherent text that follows the linked structure. This could be useful for tasks involving reasoning or inferring relationships between facts.
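
As a hypothetical preprocessing step, the linked segments could be linearised by following the links fields to produce an ordered passage. The breadth-first traversal below is one simple option and is not part of Axolotl itself.

def linearise(segments: list) -> str:
    """Order segments by following their links breadth-first from the first one."""
    by_id = {seg["id"]: seg for seg in segments}
    queue, seen, ordered = [segments[0]["id"]], set(), []
    while queue:
        seg_id = queue.pop(0)
        if seg_id in seen:
            continue
        seen.add(seg_id)
        ordered.append(by_id[seg_id]["text"])
        queue.extend(by_id[seg_id].get("links", []))
    return " ".join(ordered)

Applied to the example above, this yields the four sentences in the order 1, 2, 3, 4 as a single passage.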

Multi-field format

{
  "title": "Apple Pie Recipe",
  "ingredients": [
    "3 cups all-purpose flour",
    "1 teaspoon salt",
    "1 cup unsalted butter", 
    "2/3 cup ice water",
    "8 cups sliced apples",
    "2 tablespoons lemon juice",
    "3/4 cup white sugar",
    "1/2 teaspoon ground cinnamon"
  ],
  "instructions": [
    "In a large bowl, mix flour and salt...",
    "Stir in butter until mixture is crumbly...",
    "Knead the dough, adding ice water as needed...",
    "Preheat the oven to 375°F...",
    "Mix apples with lemon juice, sugar and cinnamon...",
    "Place filling in the pie crust...", 
    "Cover with top crust, seal edges, and cut slits to vent...",
    "Bake for 45 minutes until crust is golden brown..."
  ]
}

This multi-field format separates different aspects of the input into designated fields. The model learns to interpret the fields and generate text based on their roles (title, ingredients, instructions). This could enable fine-tuning for domain-specific applications like generating recipes, product descriptions, or other structured content.
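
A minimal sketch of turning such a record into a single training text, assuming the field names title, ingredients, and instructions from the example above, could be:

def render_recipe(example: dict) -> str:
    """Join the title, ingredients, and instructions into one labelled text."""
    parts = [f"Title: {example['title']}", "", "Ingredients:"]
    parts += [f"- {item}" for item in example["ingredients"]]
    parts += ["", "Instructions:"]
    parts += [f"{i}. {step}" for i, step in enumerate(example["instructions"], start=1)]
    return "\n".join(parts)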

Returning to the input_output format, the key aspects are:

  • Each segment is a snippet of raw text

  • All segments are concatenated, in order, to form the model input

  • Tokens in label:false segments are masked during training: their labels are set to the ignore index (-100), so they do not contribute to the loss

  • You are responsible for including any special tokens, whitespace, etc.; the segments are concatenated exactly as written
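
To illustrate the masking mechanism, the simplified sketch below tokenizes each segment, concatenates the token IDs, and assigns the ignore index -100 to tokens from label:false segments. Axolotl's actual implementation differs in detail, and the tokenizer name is only an assumption.

from transformers import AutoTokenizer

IGNORE_INDEX = -100
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")  # assumed model

def build_example(segments: list) -> dict:
    """Concatenate segment tokens; mask label:false tokens with -100."""
    input_ids, labels = [], []
    for seg in segments:
        ids = tokenizer(seg["text"], add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        labels.extend(ids if seg["label"] else [IGNORE_INDEX] * len(ids))
    return {"input_ids": input_ids, "labels": labels}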

Validating Prompts

When using template-free prompts, it's important to validate that the prompts are being constructed as intended.

Some tips:

  • Use the --debug flag with the preprocessing command to print out the tokenized prompts with labels

  • Load the tokenized dataset and spot check decoded examples (see the sketch after this list)

  • Verify that the correct segments have a label of -100 indicating they will be ignored
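
As a hedged sketch of the second and third tips, the snippet below loads a prepared dataset, decodes one example, and then decodes only the tokens that will actually be trained on. The dataset path and tokenizer name are assumptions; point them at whatever Axolotl wrote for your run.

from datasets import load_from_disk
from transformers import AutoTokenizer

# Both values below are assumptions: use the prepared-dataset folder Axolotl
# created (a hashed subdirectory under dataset_prepared_path, commonly
# "last_run_prepared") and the tokenizer you are training with.
ds = load_from_disk("last_run_prepared/your-run-subfolder")
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

example = ds[0]
print(tokenizer.decode(example["input_ids"]))

# Tokens whose label is -100 are excluded from the loss; decode only the
# trained tokens and check they match the segments you expected to keep.
trained_ids = [t for t, l in zip(example["input_ids"], example["labels"]) if l != -100]
print(tokenizer.decode(trained_ids))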

Best Practices

Some general best practices for prompt construction:

  • Keep prompts concise; avoid extraneous text that may distract the model

  • Put key information like instructions or questions towards the beginning

  • For most cases, avoid multi-turn conversation unless essential for the task

  • Use a clear separator like a newline between input and output

  • Include an end-of-sequence token (e.g. </s>) to help the model recognize completion (illustrated in the example below)

  • Aim for consistency across examples

  • Experiment with different formulations and validate what works best
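
As a small illustration of several of these practices together (instruction first, newline separators, explicit end-of-sequence token), a single training example might read as follows; the exact EOS token depends on your tokenizer.

# Illustrative only: instruction first, newline between input and output,
# and an explicit EOS token (shown as </s>, which varies by model).
example_text = (
    "Summarize the following text.\n"
    "Large language models can be adapted to new domains by fine-tuning.\n"
    "Fine-tuning adapts LLMs to new domains.</s>"
)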

Conclusion

Effective prompt construction is essential for fine-tuning performance. Template-free prompts using the input_output format in Axolotl provide flexibility to optimize prompts for your specific task. Validate prompts carefully, aim for consistency and clarity, and iterate to find the optimal approach.
