Structuring Datasets for Fine-Tuning Large Language Models

Importance of Dataset Structure

The structure of a dataset plays a crucial role in how a large language model (LLM) learns during the fine-tuning process.

Properly structured datasets enable LLMs to understand the relationship between input and output data, as well as the specific task or behavior they are being trained for.

Different dataset formats are suitable for different scenarios, such as guided learning tasks, conversational models, or learning from unstructured text.

Impact of Labelling and Syntax

  • Labelling and syntax play a significant role in guiding the LLM's learning process during fine-tuning.

  • Curly braces {} are used to denote JSON objects, which provide a structured way to represent key-value pairs in the dataset.

  • Quotation marks "" are used to enclose string values, distinguishing them from keys or other data types.

  • Labels like "instruction," "input," "output," "conversations," "from," and "value" serve as keys in the JSON object, providing context and meaning to the associated values.

  • By using consistent labelling and syntax, the neural language model can learn to associate specific labels with their corresponding roles in the task or conversation.

Labels

In the input/output format, the "label" property is used to indicate whether a particular segment of text should be used for training the model or not.

The labels are typically boolean values (true or false), but you can use other values as well, depending on your specific requirements.

The choice of label values depends on the task and the desired behavior of the model during training. Here are a few common scenarios:

Binary Labels (true/false)

  • When using binary labels, a segment with "label": true is considered as part of the input or output that the model should learn from during training.

  • A segment with "label": false is masked or ignored during training, meaning the model is not trained on that particular segment.

  • Binary labels are commonly used when you want to selectively train the model on specific parts of the input or output while excluding others.

Multiple Labels

  • In some cases, you might want to assign different labels to different segments to indicate their roles or importance in the training process.

  • For example, you could use labels like "input", "output", "context", or "meta" to differentiate between different types of segments.

  • By assigning different labels, you can control how the model processes and learns from each segment during training.

Numerical Labels

  • Numerical labels can be used to assign weights or priorities to different segments.

  • For instance, you could use labels like 1, 2, or 3 to indicate the importance or relevance of each segment in the training process.

  • Higher numerical labels could be used for segments that are more critical for the model to learn from, while lower labels could be used for less important segments.

The choice of label values ultimately depends on how you want to control the model's learning process and what segments you consider important for training.

How LLMs Learn from Structured Datasets

  • During fine-tuning, the model processes the structured dataset and adjusts its internal parameters to better understand the relationships between input and output data.

  • The transformer architecture of the model, with its attention mechanism, allows the model to focus on relevant parts of the input and generate appropriate outputs based on the learned patterns.

  • By encountering multiple examples in the dataset with consistent labelling and structure, the model gradually learns to generate outputs that match the expected format and content.

  • The model learns to associate specific labels (e.g., "instruction," "input," "output") with their respective roles in the task, enabling it to generate coherent and relevant responses.

Benefits of Structured Datasets

  • Structured datasets provide a clear and consistent format for the model to learn from, reducing ambiguity and improving the model's understanding of the task.

  • Consistent labelling and syntax enable the model to generate outputs that adhere to the expected format, making it easier to integrate the model into downstream applications.

  • Well-structured datasets facilitate the fine-tuning process by providing the model with clear examples of input-output relationships, leading to better performance and generalisation.

Considerations for Dataset Structure

  • Choose a dataset format that aligns with the specific task or behavior you want the model to learn.

  • Ensure consistency in labelling and syntax throughout the dataset to avoid confusion during the learning process.

  • Use clear and descriptive labels that accurately represent the role of each data point in the task or conversation.

  • Consider the balance between providing enough context and keeping the dataset concise to optimize the learning process.

By understanding the importance of dataset structure, common formats, labelling, and syntax, you can create well-structured datasets that enable LLMs to effectively learn and generate desired outputs during the fine-tuning process.

The transformer architecture of LLMs, with its attention mechanism, leverages the structured nature of the dataset to learn patterns, associations, and relationships between input and output data, ultimately improving the model's performance on the target task.

Prompt Construction

The input_output format is described as an alternative to using predefined templates (like 'alpaca' or 'chatml') which can add unnecessary complexity or limit flexibility.

With input_output, you have more control over the exact structure of your prompts.

The key feature of input_output is the ability to mask certain segments of your prompts so that the model doesn't train on them. This is done by setting train_on_inputs: false in your configuration.

To use input_output, you prepare your data in a specific JSON Lines (JSONL) format.

Each line in the JSONL file represents a single prompt and consists of a series of "segments".

Each segment has two properties:

  • "text": The actual text content of this segment.

  • "label": A boolean indicating whether the model should train on this segment (true) or mask it (false).

Here are five different types of dataset structures using the input/output format for various domain use cases:

Sentiment Analysis

In this example, we'll create a dataset for sentiment analysis of movie reviews.

The input will be a movie review, and the output will be the sentiment label (positive, negative, or neutral).

The segment with the movie review text has "label": true because it is the input that the model should learn from.

The segment with the sentiment label (e.g., "positive") has "label": false because it is the expected output that the model should generate, not learn from.

{
  "segments": [
    {
      "label": true,
      "text": "<s>I absolutely loved this movie! The acting was superb, and the plot kept me engaged from start to finish. "
    },
    {
      "label": false,
      "text": "positive"
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}

Named Entity Recognition (NER)

This dataset structure is designed for named entity recognition tasks, where the goal is to identify and classify named entities in a given text. The input will be a sentence, and the output will be the same sentence with named entities marked.

{
  "segments": [
    {
      "label": true,
      "text": "<s>Apple Inc. is planning to launch the iPhone 15 in September 2023 in Cupertino, California. "
    },
    {
      "label": false,
      "text": "<ORG>Apple Inc.</ORG> is planning to launch the <PROD>iPhone 15</PROD> in <DATE>September 2023</DATE> in <LOC>Cupertino, California</LOC>."
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}

Text Summarization

For text summarization tasks, the input will be a longer piece of text (e.g., a news article), and the output will be a concise summary.

{
  "segments": [
    {
      "label": true,
      "text": "<s>In a groundbreaking discovery, scientists have found a new species of dinosaur in the Gobi Desert. The dinosaur, named Mongolraptor, is believed to be a close relative of the Velociraptor. The discovery sheds new light on the diversity of dinosaur species during the Late Cretaceous period. The findings were published in the journal Nature on Tuesday.\n"
    },
    {
      "label": false,
      "text": "Scientists discovered a new dinosaur species, Mongolraptor, in the Gobi Desert. The finding, published in Nature, reveals new information about dinosaur diversity in the Late Cretaceous period."
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}

Machine Translation

This dataset structure is suitable for machine translation tasks, where the input is a sentence in one language, and the output is the translated sentence in another language.

{
  "segments": [
    {
      "label": true,
      "text": "<s>Der schnelle braune Fuchs springt über den faulen Hund.\n"
    },
    {
      "label": false,
      "text": "The quick brown fox jumps over the lazy dog."
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}

Dialogue Act Classification

In this example, we'll create a dataset for dialogue act classification, where the goal is to classify the intent of each utterance in a conversation. The input will be a conversational utterance, and the output will be the corresponding dialogue act label.

{
  "segments": [
    {
      "label": true,
      "text": "<s>User: Hey, can you help me find a good Italian restaurant nearby?\n"
    },
    {
      "label": false,
      "text": "request_recommendation"
    },
    {
      "label": true,
      "text": "\nAssistant: Sure, I'd be happy to help! What's your price range and preferred location?\n"
    },
    {
      "label": false,
      "text": "request_information"
    },
    {
      "label": true,
      "text": "</s>"
    }
  ]
}

These examples demonstrate how the input/output format can be adapted for various tasks and domains.

By carefully designing the structure of your dataset and deciding which segments to label for training, you can create custom datasets tailored to your specific use case.

How to configure the YAML file for input output

To use the input_output format, you specify it in your Axolotl configuration file:

datasets:
  - path: output.jsonl
    type: input_output

When you run the preprocessing step with the --debug flag, Axolotl will print out the tokens along with their labels so you can verify that the correct segments are being masked.

  • A label of 1 means the token will be trained on.

  • A label of -100 means the token will be masked.

You can also inspect the materialised data after preprocessing to ensure your prompts are being assembled correctly. This involves loading the tokenized data and decoding it back into text to see the final prompt structure.

The input_output format provides a flexible, template-free way to construct prompts for fine-tuning.

By allowing you to mask specific segments, it gives you fine-grained control over what parts of your prompts the model actually learns from.

The JSONL structure with "segments", "text", and "label" properties is a clear and machine-readable way to define these prompts.

A medical dataset

This dataset is a collection of medical question-answering data, intended for fine-tuning a language model to perform medical question answering tasks.

Here's the table format of the dataset:

Column Name
Data Type
Description

id

int64

Unique identifier for each entry

ending0

string

Possible ending or option for the scenario

ending1

string

Possible ending or option for the scenario

ending2

string

Possible ending or option for the scenario

ending3

string

Possible ending or option for the scenario

ending4

string

Possible ending or option for the scenario

label

int64

Correct answer or label for the scenario

sent1

string

Sentence or statement part of the scenario

sent2

string

Continuation or second part of the scenario

startphrase

string

Initial phrase or context for the scenario

Let's break down the structure

The dataset is structured in a tabular format, with each row representing a single question-answer pair. The key columns are:

  1. sent1: This column contains the medical question or scenario. It provides the context for the question being asked.

  2. sent2: This column contains the specific question related to the medical scenario described in sent1.

  3. ending0 to ending4: These five columns represent the possible answer choices for the question. Each column contains a different answer option.

  4. label: This column indicates the correct answer choice for the given question. It is an integer value ranging from 0 to 4, corresponding to the ending0 to ending4 columns.

Example Scenario

  • Id: 1,754

  • Startphrase: "An 8-year-old boy is brought to the paediatrician because his mother is concerned about recent behavioural changes."

  • Sent2: "Which of the following trinucleotide repeats is this child most likely to possess?"

  • Options:

    • Ending0: CGG

    • Ending1: GAA

    • Ending2: CAG

    • Ending3: CTG

    • Ending4: GCC

  • Correct Answer: Label 1 (Indicates the most likely trinucleotide repeat)

These entries illustrate how the dataset is structured to provide multiple-choice questions related to medical scenarios, which can be used for training or testing in medical education and artificial intelligence applications.

The structure of this dataset, with questions, answer choices, and labels, is well-suited for fine-tuning a language model for multiple-choice question answering tasks.

By providing the model with a variety of medical scenarios and questions during fine-tuning, it learns to understand the context and select the most appropriate answer from the given choices.

The fine-tuning process leverages the pre-existing knowledge of the language model and adapts it specifically for the medical domain and the question answering task.

The model learns to associate patterns in the input text with the correct answer choices, enabling it to make accurate predictions on new, unseen medical questions.

Last updated