Prompt Construction for Fine-Tuning Large Language Models
Fine-tuning large language models on custom datasets is a powerful technique to adapt foundation models to specific domains and tasks.
A critical aspect of fine-tuning is constructing effective prompts that present the input data to the model in an optimal format for learning.
This guide covers strategies and best practices for prompt construction when fine-tuning models using tools like Axolotl. It assumes a basic familiarity with language model fine-tuning.
Prompt Templates
A common approach is to use prompt templates that provide a standard structure for the model inputs and outputs. Popular prompt templates include:
Alpaca format
ShareGPT format
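As an example, the widely used Alpaca template wraps each training example in a fixed instruction scaffold, where the `{instruction}` and `{response}` placeholders are filled from the dataset:

```text
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}
```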
Prompt templates offer several benefits:
Provide a clear demarcation between inputs and outputs
Enforce a consistent structure across examples
Allow specifying roles like "human" and "assistant"
However, there are also some drawbacks to templated prompts:
Can add unnecessary boilerplate and special tokens
May enforce a conversational structure when a direct mapping of inputs to outputs is sufficient
Limit flexibility to only the roles and format dictated by the template
Template-Free Prompts with input_output Format
For situations where a prompt template is overly restrictive, Axolotl supports the input_output format for constructing template-free prompts.
With input_output, you provide a list of text segments, each with a boolean label indicating whether the model should be trained on that segment. Segments labeled false still appear in the model's input, but their tokens are masked out of the loss.
input_output format example
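A dataset row in this format is a JSON object containing a list of segments. The example below is a sketch based on the segments schema Axolotl documents for the input_output type; the text itself is illustrative:

```json
{
  "segments": [
    {"label": false, "text": "<s>Translate to French: Hello\n"},
    {"label": true, "text": "Bonjour</s>"}
  ]
}
```

Here the instruction segment is labeled false (present in the input but masked from the loss), while the response segment is labeled true (trained on).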
Configuring dataset for input_output format:
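A minimal sketch of the corresponding dataset entry in an Axolotl config; `data.jsonl` is a placeholder path:

```yaml
train_on_inputs: false  # rely on per-segment labels for loss masking
datasets:
  - path: data.jsonl
    type: input_output
```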
Other Prompt Construction Ideas
Here are three creative ideas for alternative prompt formats when fine-tuning a large language model:
Tree-structured format
This format could be useful for fine-tuning on hierarchical data like stories, articles, or dialogue trees. The model would learn to generate text that follows a coherent structure.
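One way such a tree-structured example could be represented (purely hypothetical; there is no standard schema for this):

```json
{
  "node": "story",
  "text": "Once upon a time...",
  "children": [
    {"node": "chapter", "text": "The hero sets out...", "children": []},
    {"node": "chapter", "text": "The hero returns...", "children": []}
  ]
}
```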
Linked-segments format
The linked-segments format allows specifying relationships between different parts of the input.
The model could learn to generate coherent text that follows the linked structure. This could be useful for tasks involving reasoning or inferring relationships between facts.
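A hypothetical linked-segments example, with `id` and `links` fields invented here for illustration:

```json
{
  "segments": [
    {"id": "fact1", "text": "All birds have feathers."},
    {"id": "fact2", "text": "A penguin is a bird."},
    {"id": "conclusion", "text": "Penguins have feathers.", "links": ["fact1", "fact2"]}
  ]
}
```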
Multi-field format
This multi-field format separates different aspects of the input into designated fields. The model learns to interpret the fields and generate text based on their roles (title, ingredients, instructions). This could enable fine-tuning for domain-specific applications like generating recipes, product descriptions, or other structured content.
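A hypothetical multi-field example for the recipe use case; the field names are illustrative, not a fixed schema:

```json
{
  "title": "Simple Pancakes",
  "ingredients": "flour, milk, eggs, butter",
  "instructions": "Whisk the batter, then fry on a hot griddle."
}
```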
The key aspects of the input_output format are:
Each segment is a snippet of raw text
All of the segments are concatenated in order to form the model input
The label: false segments are excluded from the loss during training (a standard technique of setting their token labels to the ignore index, -100)
You are responsible for including any special tokens, whitespace, etc.; the segments are concatenated exactly as-is.
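The concatenation and masking described above can be sketched as follows. This is a toy illustration, not Axolotl's implementation; `tokenize` stands in for a real tokenizer's encode method:

```python
# Toy sketch of input_output-style prompt assembly with loss masking.
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy loss

def build_example(segments, tokenize):
    """Concatenate segments into input ids, masking label=False segments."""
    input_ids, labels = [], []
    for seg in segments:
        ids = tokenize(seg["text"])
        input_ids.extend(ids)  # every segment becomes part of the model input
        if seg["label"]:
            labels.extend(ids)  # loss is computed on these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked from the loss
    return input_ids, labels
```

With a character-level stand-in tokenizer, an instruction segment labeled false produces a run of -100 labels, while a response segment labeled true keeps its token ids.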
Validating Prompts
When using template-free prompts, it's important to validate that the prompts are being constructed as intended.
Some tips:
Use the --debug flag with the preprocessing command to print out the tokenized prompts with labels
Load the tokenized dataset and spot check decoded examples
Verify that the masked segments have labels of -100, indicating they will be ignored when computing the loss
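One way to spot check is to group each example's tokens into runs of trained vs. masked text. This is a toy sketch, not an Axolotl utility; `decode` stands in for decoding a single token id, and the example data in the usage below is hypothetical:

```python
# Group (token, label) pairs into runs of trained vs. masked text for a
# quick eyeball check of how an example will be trained on.
IGNORE_INDEX = -100

def label_runs(input_ids, labels, decode):
    """Return [(is_trained, decoded_text), ...] runs for inspection."""
    runs = []
    for tok, lab in zip(input_ids, labels):
        trained = lab != IGNORE_INDEX
        if runs and runs[-1][0] == trained:
            runs[-1] = (trained, runs[-1][1] + decode(tok))  # extend current run
        else:
            runs.append((trained, decode(tok)))  # start a new run
    return runs
```

For example, a prompt whose first tokens carry -100 labels decodes into an initial masked run followed by the trained completion, making it easy to see whether the intended segments are masked.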
Best Practices
Some general best practices for prompt construction:
Keep prompts concise; avoid extraneous text that may distract the model
Put key information like instructions or questions towards the beginning
Avoid multi-turn conversation structure unless it is essential for the task
Use a clear separator like a newline between input and output
Include an end-of-sequence token (e.g. </s>) so the model learns to recognize completion
Aim for consistency across examples
Experiment with different formulations and validate what works best
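Applied to an input_output-style example, these practices might look like the following (illustrative; the `</s>` token depends on your tokenizer):

```json
{
  "segments": [
    {"label": false, "text": "Summarize: Solar panels convert sunlight into electricity.\n"},
    {"label": true, "text": "Solar panels turn sunlight into power.</s>"}
  ]
}
```

The instruction comes first, a newline separates input from output, and the trained segment ends with an end-of-sequence token.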
Conclusion
Effective prompt construction is essential for fine-tuning performance. Template-free prompts using the input_output format in Axolotl provide flexibility to optimize prompts for your specific task. Validate prompts carefully, aim for consistency and clarity, and iterate to find the optimal approach.