# Structuring Datasets for Fine-Tuning Large Language Models

### <mark style="color:blue;">Importance of Dataset Structure</mark>

The structure of a dataset plays a crucial role in how a large language model (LLM) learns during the fine-tuning process.

Properly structured datasets enable LLMs to understand the relationship between input and output data, as well as the specific task or behavior they are being trained for.

Different dataset formats are suitable for different scenarios, such as guided learning tasks, conversational models, or learning from unstructured text.

### <mark style="color:blue;">Impact of Labelling and Syntax</mark>

* Labelling and syntax play a significant role in guiding the LLM's learning process during fine-tuning.
* Curly braces <mark style="color:yellow;">**`{}`**</mark> are used to denote JSON objects, which provide a structured way to represent key-value pairs in the dataset.
* Quotation marks <mark style="color:yellow;">**`""`**</mark> are used to enclose string values, distinguishing them from keys or other data types.
* Labels like "instruction," "input," "output," "conversations," "from," and "value" serve as keys in the JSON object, providing context and meaning to the associated values.
* By using consistent labelling and syntax, the neural language model can learn to associate specific labels with their corresponding roles in the task or conversation.

### <mark style="color:blue;">Labels</mark>

In the input/output format, the "label" property is used to indicate whether a particular segment of text should be used for training the model or not.&#x20;

The labels are typically boolean values (true or false), *<mark style="color:yellow;">**but you can use other values as well,**</mark>* depending on your specific requirements.

The choice of label values depends on the task and the desired behavior of the model during training. Here are a few common scenarios:

#### <mark style="color:green;">Binary Labels (true/false)</mark>

* When using binary labels, a segment with "label": true is considered as part of the input or output that the model should learn from during training.
* A segment with "label": false is masked or ignored during training, meaning the model is not trained on that particular segment.
* Binary labels are commonly used when you want to selectively train the model on specific parts of the input or output while excluding others.

#### <mark style="color:green;">Multiple Labels</mark>

* In some cases, you might want to assign different labels to different segments to indicate their roles or importance in the training process.
* For example, you could use labels like "input", "output", "context", or "meta" to differentiate between different types of segments.
* By assigning different labels, you can control how the model processes and learns from each segment during training.

#### <mark style="color:green;">Numerical Labels</mark>

* Numerical labels can be used to assign weights or priorities to different segments.
* For instance, you could use labels like 1, 2, or 3 to indicate the importance or relevance of each segment in the training process.
* Higher numerical labels could be used for segments that are more critical for the model to learn from, while lower labels could be used for less important segments.

The choice of label values ultimately depends on how you want to control the model's learning process and what segments you consider important for training.

### <mark style="color:blue;">How LLMs Learn from Structured Datasets</mark>

* During fine-tuning, the model <mark style="color:yellow;">**processes the structured dataset**</mark> and <mark style="color:yellow;">**adjusts its internal parameters**</mark> to better <mark style="color:yellow;">**understand the relationships between input and output data.**</mark>
* The transformer architecture of the model, with its attention mechanism, allows the model to focus on relevant parts of the input and generate appropriate outputs based on the learned patterns.
* By encountering multiple examples in the dataset with consistent labelling and structure, the model gradually learns to generate outputs that match the expected format and content.
* The model learns to associate specific labels (e.g., "instruction," "input," "output") with their respective roles in the task, enabling it to generate coherent and relevant responses.

### <mark style="color:blue;">Benefits of Structured Datasets</mark>

* Structured datasets provide a <mark style="color:yellow;">**clear and consistent format**</mark> for the model to learn from, reducing ambiguity and improving the model's understanding of the task.
* <mark style="color:yellow;">**Consistent labelling and syntax**</mark> enable the model to generate outputs that adhere to the expected format, making it easier to integrate the model into downstream applications.
* Well-structured datasets facilitate the fine-tuning process by providing the model with <mark style="color:yellow;">**clear examples of input-output relationships**</mark>, leading to better performance and generalisation.

### <mark style="color:blue;">Considerations for Dataset Structure</mark>

* Choose a dataset format that aligns with the specific task or behavior you want the model to learn.
* Ensure consistency in labelling and syntax throughout the dataset to avoid confusion during the learning process.
* Use clear and descriptive labels that accurately represent the role of each data point in the task or conversation.
* Consider the balance between providing enough context and keeping the dataset concise to optimize the learning process.

By understanding the importance of dataset structure, common formats, labelling, and syntax, you can create well-structured datasets that enable LLMs to effectively learn and generate desired outputs during the fine-tuning process.&#x20;

The transformer architecture of LLMs, with its attention mechanism, leverages the structured nature of the dataset to learn patterns, associations, and relationships between input and output data, ultimately improving the model's performance on the target task.

### <mark style="color:blue;">Prompt Construction</mark>

The <mark style="color:yellow;">**`input_output`**</mark> format is described as an alternative to using predefined templates (like 'alpaca' or 'chatml') which can add unnecessary complexity or limit flexibility.&#x20;

With <mark style="color:yellow;">**`input_output`**</mark>, you have more control over the exact structure of your prompts.

The key feature of <mark style="color:yellow;">**`input_output`**</mark> is the ability to mask certain segments of your prompts so that the model doesn't train on them. This is done by setting <mark style="color:yellow;">**`train_on_inputs: false`**</mark> in your configuration.

To use <mark style="color:yellow;">**`input_output`**</mark>, you prepare your data in a <mark style="color:yellow;">specific JSON Lines (JSONL) format</mark>.&#x20;

Each line in the JSONL file represents a single prompt and consists of a series of "segments".

Each segment has two properties:

* "text": The actual text content of this segment.
* "label": A boolean indicating whether the model should train on this segment (true) or mask it (false).

Here are five different types of dataset structures using the input/output format for various domain use cases:

#### <mark style="color:green;">Sentiment Analysis</mark>

In this example, we'll create a dataset for sentiment analysis of movie reviews.&#x20;

The input will be a movie review, and the output will be the sentiment label (positive, negative, or neutral).

The segment with the movie review text has "label": true because it is the input that the model should learn from.

The segment with the sentiment label (e.g., "positive") has "label": false because it is the expected output that the model should generate, not learn from.

```json
{
  "segments": [
    {
      "label": true,
      "text": "<s>I absolutely loved this movie! The acting was superb, and the plot kept me engaged from start to finish. "
    },
    {
      "label": false,
      "text": "positive"
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}
```

#### <mark style="color:green;">Named Entity Recognition (NER)</mark>

This dataset structure is designed for named entity recognition tasks, where the goal is to identify and classify named entities in a given text. The input will be a sentence, and the output will be the same sentence with named entities marked.

```json
{
  "segments": [
    {
      "label": true,
      "text": "<s>Apple Inc. is planning to launch the iPhone 15 in September 2023 in Cupertino, California. "
    },
    {
      "label": false,
      "text": "<ORG>Apple Inc.</ORG> is planning to launch the <PROD>iPhone 15</PROD> in <DATE>September 2023</DATE> in <LOC>Cupertino, California</LOC>."
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}
```

#### <mark style="color:green;">Text Summarization</mark>

For text summarization tasks, the input will be a longer piece of text (e.g., a news article), and the output will be a concise summary.

```json
{
  "segments": [
    {
      "label": true,
      "text": "<s>In a groundbreaking discovery, scientists have found a new species of dinosaur in the Gobi Desert. The dinosaur, named Mongolraptor, is believed to be a close relative of the Velociraptor. The discovery sheds new light on the diversity of dinosaur species during the Late Cretaceous period. The findings were published in the journal Nature on Tuesday.\n"
    },
    {
      "label": false,
      "text": "Scientists discovered a new dinosaur species, Mongolraptor, in the Gobi Desert. The finding, published in Nature, reveals new information about dinosaur diversity in the Late Cretaceous period."
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}
```

#### <mark style="color:green;">Machine Translation</mark>

This dataset structure is suitable for machine translation tasks, where the input is a sentence in one language, and the output is the translated sentence in another language.

```json
{
  "segments": [
    {
      "label": true,
      "text": "<s>Der schnelle braune Fuchs springt über den faulen Hund.\n"
    },
    {
      "label": false,
      "text": "The quick brown fox jumps over the lazy dog."
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}
```

#### <mark style="color:green;">Dialogue Act Classification</mark>

In this example, we'll create a dataset for dialogue act classification, where the goal is to classify the intent of each utterance in a conversation. The input will be a conversational utterance, and the output will be the corresponding dialogue act label.

```json
{
  "segments": [
    {
      "label": true,
      "text": "<s>User: Hey, can you help me find a good Italian restaurant nearby?\n"
    },
    {
      "label": false,
      "text": "request_recommendation"
    },
    {
      "label": true,
      "text": "\nAssistant: Sure, I'd be happy to help! What's your price range and preferred location?\n"
    },
    {
      "label": false,
      "text": "request_information"
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}
```

These examples demonstrate how the input/output format can be adapted for various tasks and domains.&#x20;

By carefully designing the structure of your dataset and deciding which segments to label for training, you can create custom datasets tailored to your specific use case.

### <mark style="color:blue;">How to configure the YAML file for input output</mark>

To use the <mark style="color:yellow;">**`input_output`**</mark> format, you specify it in your Axolotl configuration file:

```yaml
datasets:
  - path: output.jsonl
    type: input_output
```

When you run the preprocessing step with the <mark style="color:yellow;">**`--debug`**</mark> flag, Axolotl will print out the tokens along with their labels so you can verify that the correct segments are being masked.

* A <mark style="color:yellow;">**label of 1**</mark> means the token will be trained on.
* A <mark style="color:yellow;">**label of -100**</mark> means the token will be masked.

You can also inspect the materialised data after preprocessing to ensure your prompts are being assembled correctly.  This involves loading the tokenized data and decoding it back into text to see the final prompt structure.

The <mark style="color:yellow;">**`input_output`**</mark> format provides a flexible, template-free way to construct prompts for fine-tuning.&#x20;

By allowing you to mask specific segments, it gives you fine-grained control over what parts of your prompts the model actually learns from.&#x20;

The JSONL structure with "segments", "text", and "label" properties is a clear and machine-readable way to define these prompts.

### <mark style="color:blue;">A medical dataset</mark>

This dataset is a collection of medical question-answering data, intended for fine-tuning a language model to perform medical question answering tasks.&#x20;

Here's the table format of the dataset:

<table><thead><tr><th width="156">Column Name</th><th width="128">Data Type</th><th>Description</th></tr></thead><tbody><tr><td>id</td><td>int64</td><td>Unique identifier for each entry</td></tr><tr><td>ending0</td><td>string</td><td>Possible ending or option for the scenario</td></tr><tr><td>ending1</td><td>string</td><td>Possible ending or option for the scenario</td></tr><tr><td>ending2</td><td>string</td><td>Possible ending or option for the scenario</td></tr><tr><td>ending3</td><td>string</td><td>Possible ending or option for the scenario</td></tr><tr><td>ending4</td><td>string</td><td>Possible ending or option for the scenario</td></tr><tr><td>label</td><td>int64</td><td>Correct answer or label for the scenario</td></tr><tr><td>sent1</td><td>string</td><td>Sentence or statement part of the scenario</td></tr><tr><td>sent2</td><td>string</td><td>Continuation or second part of the scenario</td></tr><tr><td>startphrase</td><td>string</td><td>Initial phrase or context for the scenario</td></tr></tbody></table>

#### <mark style="color:green;">Let's break down the structure</mark>

The dataset is structured in a tabular format, with each row representing a single question-answer pair. The key columns are:

1. <mark style="color:yellow;">**`sent1`**</mark><mark style="color:yellow;">**:**</mark> This column contains the medical question or scenario. It provides the context for the question being asked.
2. <mark style="color:yellow;">**`sent2`**</mark><mark style="color:yellow;">**:**</mark> This column contains the specific question related to the medical scenario described in `sent1`.
3. <mark style="color:yellow;">**`ending0`**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**to**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`ending4`**</mark><mark style="color:yellow;">**:**</mark> These five columns <mark style="color:yellow;">**represent the possible answer choices**</mark> for the question. Each column contains a different answer option.
4. <mark style="color:yellow;">**`label`**</mark><mark style="color:yellow;">**:**</mark> This column <mark style="color:yellow;">**indicates the correct answer choice**</mark> for the given question. It is an integer value ranging from 0 to 4, corresponding to the `ending0` to `ending4` columns.

<mark style="color:green;">**Example Scenario**</mark>

* **Id:** 1,754
* **Startphrase:** "An 8-year-old boy is brought to the paediatrician because his mother is concerned about recent behavioural changes."
* **Sent2:** "Which of the following trinucleotide repeats is this child most likely to possess?"
* **Options:**
  * **Ending0:** CGG
  * **Ending1:** GAA
  * **Ending2:** CAG
  * **Ending3:** CTG
  * **Ending4:** GCC
* **Correct Answer:** Label 1 (Indicates the most likely trinucleotide repeat)

These entries illustrate how the dataset is structured to <mark style="color:yellow;">**provide multiple-choice questions**</mark> related to medical scenarios, which can be used for training or testing in medical education and artificial intelligence applications.

The structure of this dataset, with questions, answer choices, and labels, is well-suited for fine-tuning a language model for multiple-choice question answering tasks.&#x20;

By providing the model with a variety of medical scenarios and questions during fine-tuning, it learns to understand the context and select the most appropriate answer from the given choices.

The fine-tuning process leverages the pre-existing knowledge of the language model and adapts it specifically for the medical domain and the question answering task.&#x20;

The model learns to associate patterns in the input text with the correct answer choices, enabling it to make accurate predictions on new, unseen medical questions.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://axolotl.continuumlabs.pro/download-the-dataset/structuring-datasets-for-fine-tuning-large-language-models.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
