# Prompt Construction for Fine-Tuning Large Language Models

Fine-tuning large language models on custom datasets is a powerful technique to adapt foundation models to specific domains and tasks.

A critical aspect of fine-tuning is constructing effective prompts that present the input data to the model in an optimal format for learning.

This guide covers strategies and best practices for prompt construction when fine-tuning models using tools like Axolotl. It assumes a basic familiarity with language model fine-tuning.

### <mark style="color:blue;">Prompt Templates</mark>

A common approach is to use prompt templates that provide a standard structure for the model inputs and outputs. Popular prompt templates include:

#### <mark style="color:green;">Alpaca format</mark>

```json
{
  "instruction": "...",
  "input": "...", 
  "output": "..."
}
```
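As a rough illustration of how an Alpaca-style record becomes a single training string, the sketch below fills in the widely used Stanford Alpaca wording. The helper name `render_alpaca` is hypothetical, and the exact template text that Axolotl applies may differ from this sketch.

```python
from typing import Dict, Tuple

# Widely used Stanford Alpaca wording; treat the exact text as an assumption.
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def render_alpaca(record: Dict[str, str]) -> Tuple[str, str]:
    """Hypothetical helper: return (prompt, response) for one record."""
    if record.get("input"):
        prompt = ALPACA_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        # The input field is optional; fall back to the shorter template.
        prompt = ALPACA_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


prompt, response = render_alpaca(
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
)
print(prompt + response)
```

The prompt part is what the model conditions on, and the response part is what it is trained to produce.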

#### <mark style="color:green;">ShareGPT format</mark>

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "..."
    },
    {
      "from": "assistant", 
      "value": "..."
    }
  ]
}
```
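To make the structure concrete, the following sketch walks a ShareGPT-style record and joins its turns into one string. The `ROLE_TAGS` mapping and the `USER`/`ASSISTANT` tags are assumptions chosen for illustration; real chat templates define their own role markers and special tokens.

```python
from typing import Dict, List

# Assumed role tags, for illustration only; real chat templates differ.
ROLE_TAGS = {"human": "USER", "assistant": "ASSISTANT"}


def flatten_conversation(record: Dict[str, List[Dict[str, str]]]) -> str:
    """Hypothetical helper: join the turns of one record into a single string."""
    lines = []
    for turn in record["conversations"]:
        role = turn["from"]
        if role not in ROLE_TAGS:
            raise ValueError(f"unexpected role: {role!r}")
        lines.append(f"{ROLE_TAGS[role]}: {turn['value']}")
    return "\n".join(lines)


example = {
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "assistant", "value": "Paris."},
    ]
}
print(flatten_conversation(example))
```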

Prompt templates offer several benefits:

* Provide a clear demarcation between inputs and outputs
* Enforce a consistent structure across examples
* Allow specifying roles like "human" and "assistant"

However, there are also some <mark style="color:yellow;">drawbacks to templated prompts</mark>:

* Can add unnecessary boilerplate and special tokens
* May enforce a conversational structure when a direct mapping of inputs to outputs is sufficient
* Limit flexibility to only the roles and format dictated by the template

#### <mark style="color:green;">Template-Free Prompts with input\_output Format</mark>

For situations where a prompt template is overly restrictive, <mark style="color:yellow;">Axolotl supports the input\_output format</mark> for constructing template-free prompts.

With input\_output, you provide a list of text segments, each with a boolean label. A segment with label:true is trained on (it contributes to the loss), while a segment with label:false is still fed to the model as context but is masked out of the loss.

<mark style="color:green;">input\_output format example</mark>

```json
{
  "segments": [
    {
      "text": "<s>Hello\n",
      "label": true
    },
    {
      "text": "Hi there! ",
      "label": true
    },
    {
      "text": "Goodbye ",
      "label": false
    },
    {
      "text": "Farewell</s>",
      "label": true
    }
  ]
}
```

Configuring dataset for input\_output format:

```yaml
datasets:
  - path: data.jsonl
    type: input_output

train_on_inputs: false   # Masks the label:false segments so they do not contribute to the loss
```
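A dataset file in this shape can be produced with a few lines of Python. The sketch below shows one convenient workflow under stated assumptions: `make_record` is a hypothetical helper that marks the prompt as label:false and the completion as label:true, and the `<s>` and `</s>` strings are placeholders for whatever special tokens your tokenizer uses.

```python
import json
from typing import Dict, List


def make_record(prompt: str, completion: str, bos: str = "<s>", eos: str = "</s>") -> Dict[str, List[dict]]:
    """Hypothetical helper: prompt is masked, completion is trained on."""
    return {
        "segments": [
            {"text": f"{bos}{prompt}", "label": False},
            {"text": f"{completion}{eos}", "label": True},
        ]
    }


pairs = [
    ("Question: 2 + 2?\nAnswer: ", "4"),
    ("Question: Capital of France?\nAnswer: ", "Paris"),
]

# JSON Lines: one JSON object per line. Writing this string to data.jsonl
# would match the `path: data.jsonl` entry in the config.
jsonl_text = "".join(json.dumps(make_record(p, c)) + "\n" for p, c in pairs)
print(jsonl_text)
```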

### <mark style="color:blue;">Some other prompt construction ideas</mark>

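Here are three more speculative ideas for other data formats. They are not among the formats described above, so each would need a custom preprocessing step that flattens its structure into training text: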

<mark style="color:green;">Tree-structured format</mark>

```json
{
  "root": {
    "text": "Once upon a time...",
    "children": [
      {
        "text": "There was a young girl named Lily.",
        "children": [...]
      },
      {
        "text": "She lived in a small village.",
        "children": [...]
      }
    ]
  }
}
```

This format could be useful for fine-tuning on hierarchical data like stories, articles, or dialogue trees. The model would learn to generate text that follows a coherent structure.
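Because a language model ultimately trains on linear text, a tree like this has to be flattened first. One possible approach, sketched here as an assumption rather than an established recipe, is to emit one training text per root-to-leaf path. The function name `iter_paths` is hypothetical, and the child lists that the example elides are left empty in the demo data.

```python
from typing import Dict, Iterator, List, Optional


def iter_paths(node: Dict, prefix: Optional[List[str]] = None) -> Iterator[List[str]]:
    """Yield the texts along every root-to-leaf path, depth first."""
    path = (prefix or []) + [node["text"]]
    children = node.get("children") or []
    if not children:
        yield path  # reached a leaf: the path is complete
        return
    for child in children:
        yield from iter_paths(child, path)


tree = {
    "root": {
        "text": "Once upon a time...",
        "children": [
            {"text": "There was a young girl named Lily.", "children": []},
            {"text": "She lived in a small village.", "children": []},
        ],
    }
}

for path in iter_paths(tree["root"]):
    print(" ".join(path))
```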

<mark style="color:green;">Linked-segments format</mark>

```json
{
  "segments": [
    {
      "id": 1,
      "text": "Paris is the capital of France.",
      "links": [2, 3]
    },
    {
      "id": 2, 
      "text": "It is known for its art, fashion, and cuisine.",
      "links": [4]
    },
    {
      "id": 3,
      "text": "The Eiffel Tower is a famous landmark in Paris.",
      "links": [4]  
    },
    {
      "id": 4,
      "text": "Paris attracts millions of tourists every year."
    }
  ]
}
```

The linked-segments format allows specifying relationships between different parts of the input.

The model could learn to generate coherent text that follows the linked structure. This could be useful for tasks involving reasoning or inferring relationships between facts.
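One way to turn linked segments into a linear training text is to order them so that every segment appears before the segments it links to. The sketch below does this with a topological sort (Kahn's algorithm). Treating `links` as "comes before" is an assumption about what the links mean, and `linearize` is a hypothetical helper name.

```python
from collections import deque
from typing import Dict, List


def linearize(segments: List[Dict]) -> List[str]:
    """Order segment texts so each segment precedes the ones it links to."""
    by_id = {seg["id"]: seg for seg in segments}
    indegree = {seg_id: 0 for seg_id in by_id}
    for seg in segments:
        for target in seg.get("links", []):
            indegree[target] += 1

    # Start from segments that nothing links to.
    queue = deque(seg_id for seg_id in by_id if indegree[seg_id] == 0)
    ordered: List[str] = []
    while queue:
        seg_id = queue.popleft()
        ordered.append(by_id[seg_id]["text"])
        for target in by_id[seg_id].get("links", []):
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)

    if len(ordered) != len(segments):
        raise ValueError("links contain a cycle")
    return ordered


segments = [
    {"id": 1, "text": "Paris is the capital of France.", "links": [2, 3]},
    {"id": 2, "text": "It is known for its art, fashion, and cuisine.", "links": [4]},
    {"id": 3, "text": "The Eiffel Tower is a famous landmark in Paris.", "links": [4]},
    {"id": 4, "text": "Paris attracts millions of tourists every year."},
]
print(" ".join(linearize(segments)))
```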

<mark style="color:green;">Multi-field format</mark>

```json
{
  "title": "Apple Pie Recipe",
  "ingredients": [
    "3 cups all-purpose flour",
    "1 teaspoon salt",
    "1 cup unsalted butter", 
    "2/3 cup ice water",
    "8 cups sliced apples",
    "2 tablespoons lemon juice",
    "3/4 cup white sugar",
    "1/2 teaspoon ground cinnamon"
  ],
  "instructions": [
    "In a large bowl, mix flour and salt...",
    "Stir in butter until mixture is crumbly...",
    "Knead the dough, adding ice water as needed...",
    "Preheat the oven to 375°F...",
    "Mix apples with lemon juice, sugar and cinnamon...",
    "Place filling in the pie crust...", 
    "Cover with top crust, seal edges, and cut slits to vent...",
    "Bake for 45 minutes until crust is golden brown..."
  ]
}
```

This multi-field format separates different aspects of the input into designated fields.

The model learns to interpret the fields and generate text based on their roles (title, ingredients, instructions). This could enable fine-tuning for domain-specific applications like generating recipes, product descriptions, or other structured content.
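A multi-field record can be mapped onto the input\_output format by rendering some fields as context and others as the target. The sketch below assumes the task is to generate the instructions from the title and ingredients; the function name `recipe_to_segments` and the `Title:`, `Ingredients:`, and `Instructions:` labels are hypothetical choices, and special tokens are omitted for brevity.

```python
from typing import Dict, List


def recipe_to_segments(recipe: Dict) -> Dict[str, List[Dict]]:
    """Turn a multi-field recipe into an input_output-style record.

    Assumed task: given the title and ingredients, generate the instructions.
    """
    ingredient_lines = "".join(f"- {item}\n" for item in recipe["ingredients"])
    prompt = f"Title: {recipe['title']}\nIngredients:\n{ingredient_lines}Instructions:\n"
    completion = "".join(
        f"{number}. {step}\n"
        for number, step in enumerate(recipe["instructions"], start=1)
    )
    return {
        "segments": [
            {"text": prompt, "label": False},     # context, masked from the loss
            {"text": completion, "label": True},  # target text to train on
        ]
    }


recipe = {
    "title": "Apple Pie Recipe",
    "ingredients": ["3 cups all-purpose flour", "1 teaspoon salt"],
    "instructions": [
        "In a large bowl, mix flour and salt...",
        "Stir in butter until mixture is crumbly...",
    ],
}
record = recipe_to_segments(recipe)
print(record["segments"][0]["text"] + record["segments"][1]["text"])
```

In a real dataset you would also add the beginning-of-sequence and end-of-sequence tokens your tokenizer expects, since template-free segments are used as-is.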

The key aspects of the input\_output format are:

* Each segment is a snippet of raw text
* All segments are concatenated in order to form the training sequence
* The label:true segments are trained on, while the label:false segments are masked out of the loss when train\_on\_inputs is false (using the standard technique of setting their labels to the ignore index, -100)
* You are responsible for including any special tokens, whitespace, etc. The segments are simply concatenated as-is.
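The masking of input\_output segments can be sketched without any library. This is an illustrative approximation, not Axolotl's implementation. It assumes that every segment is concatenated in order and that only the label:false segments are masked. `toy_tokenize` is a hypothetical stand-in that maps each character to its code point, whereas a real run uses the model's tokenizer. The value -100 is the ignore index that PyTorch's cross-entropy loss skips by default.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # label value skipped by the loss


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer: one token per character (its code point)."""
    return [ord(ch) for ch in text]


def build_example(segments: List[Dict], train_on_inputs: bool = False) -> Tuple[List[int], List[int]]:
    """Concatenate all segments as-is and mask the label:false ones."""
    input_ids: List[int] = []
    labels: List[int] = []
    for seg in segments:
        ids = toy_tokenize(seg["text"])
        input_ids.extend(ids)
        if seg["label"] or train_on_inputs:
            labels.extend(ids)  # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # seen by the model, not trained on
    return input_ids, labels


segments = [
    {"text": "Hi ", "label": False},
    {"text": "yo", "label": True},
]
input_ids, labels = build_example(segments)
print(input_ids)  # [72, 105, 32, 121, 111]
print(labels)     # [-100, -100, -100, 121, 111]
```

Note that `input_ids` and `labels` always have the same length: masking changes which positions count toward the loss, not what the model sees.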

#### <mark style="color:green;">Validating Prompts</mark>

When using template-free prompts, it's important to validate that the prompts are being constructed as intended.

Some tips:

* Use the `--debug` flag with the preprocessing command to print out the tokenized prompts with labels
* Load the tokenized dataset and spot check decoded examples
* Verify that the tokens of the segments you meant to mask have a label of -100, indicating they will be ignored by the loss
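A spot check of this kind can be sketched as a small function that groups consecutive tokens by whether their label is -100 and decodes each run. The names `label_runs` and `toy_decode` are hypothetical, and the toy decoder treats each token as a character code; with a Hugging Face tokenizer, its `decode` method could be passed instead.

```python
from itertools import groupby
from typing import Callable, List, Tuple

IGNORE_INDEX = -100


def label_runs(
    input_ids: List[int], labels: List[int], decode: Callable[[List[int]], str]
) -> List[Tuple[str, str]]:
    """Group consecutive tokens into ("masked" or "trained", decoded text) runs."""
    if len(input_ids) != len(labels):
        raise ValueError("input_ids and labels must have the same length")
    runs = []
    pairs = zip(input_ids, labels)
    for masked, group in groupby(pairs, key=lambda pair: pair[1] == IGNORE_INDEX):
        ids = [token for token, _ in group]
        runs.append(("masked" if masked else "trained", decode(ids)))
    return runs


def toy_decode(ids: List[int]) -> str:
    """Toy decoder for the demo: one character per token."""
    return "".join(chr(i) for i in ids)


prompt, completion = "Q: 2+2? A: ", "4"
input_ids = [ord(ch) for ch in prompt + completion]
labels = [IGNORE_INDEX] * len(prompt) + [ord(ch) for ch in completion]

for kind, text in label_runs(input_ids, labels, toy_decode):
    print(f"{kind:>7}: {text!r}")
```

Reading the runs side by side makes it easy to see whether the boundary between masked and trained text falls where you intended, including any whitespace or special tokens.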

#### <mark style="color:green;">Best Practices</mark>

Some general best practices for prompt construction:

* Keep prompts concise; avoid extraneous text that may distract the model
* Put key information like instructions or questions towards the beginning
* Avoid a multi-turn conversation structure unless the task requires it
* Use a clear separator like a newline between input and output
* Include an end-of-sequence token (such as `</s>`, depending on the tokenizer) to help the model recognize completion
* Aim for consistency across examples
* Experiment with different formulations and validate what works best
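Some of these practices can be checked automatically before training. The sketch below is a hypothetical linter for input\_output records; the function name `lint_record`, the specific checks, and the default `</s>` token are assumptions to adapt to your own data and tokenizer.

```python
from typing import Dict, List


def lint_record(record: Dict, eos: str = "</s>") -> List[str]:
    """Return a list of human-readable problems found in one record."""
    segments = record.get("segments", [])
    if not segments:
        return ["record has no segments"]
    problems = []
    if not any(seg["label"] for seg in segments):
        problems.append("no label:true segment, so nothing would be trained on")
    if not segments[-1]["text"].endswith(eos):
        problems.append(f"final segment does not end with {eos}")
    if any(seg["text"] == "" for seg in segments):
        problems.append("empty segment text")
    return problems


good = {"segments": [{"text": "<s>Q: 2+2?\n", "label": False}, {"text": "4</s>", "label": True}]}
bad = {"segments": [{"text": "<s>Q: 2+2?\n", "label": False}, {"text": "4", "label": False}]}
print(lint_record(good))
print(lint_record(bad))
```

Running a check like this over every line of the dataset is a cheap way to catch inconsistent examples early.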

#### <mark style="color:green;">Conclusion</mark>

Effective prompt construction is essential for fine-tuning performance. Template-free prompts using the input\_output format in Axolotl provide flexibility to optimize prompts for your specific task. Validate prompts carefully, aim for consistency and clarity, and iterate to find the optimal approach.

