Structuring Datasets for Fine-Tuning Large Language Models
Importance of Dataset Structure
The structure of a dataset plays a crucial role in how a large language model (LLM) learns during the fine-tuning process.
Properly structured datasets enable LLMs to understand the relationship between input and output data, as well as the specific task or behavior they are being trained for.
Different dataset formats are suitable for different scenarios, such as guided learning tasks, conversational models, or learning from unstructured text.
Impact of Labelling and Syntax
Labelling and syntax play a significant role in guiding the LLM's learning process during fine-tuning.
Curly braces
{}
are used to denote JSON objects, which provide a structured way to represent key-value pairs in the dataset.Quotation marks
""
are used to enclose string values, distinguishing them from keys or other data types.Labels like "instruction," "input," "output," "conversations," "from," and "value" serve as keys in the JSON object, providing context and meaning to the associated values.
By using consistent labelling and syntax, the neural language model can learn to associate specific labels with their corresponding roles in the task or conversation.
Labels
In the input/output format, the "label" property is used to indicate whether a particular segment of text should be used for training the model or not.
The labels are typically boolean values (true or false), but you can use other values as well, depending on your specific requirements.
The choice of label values depends on the task and the desired behavior of the model during training. Here are a few common scenarios:
Binary Labels (true/false)
When using binary labels, a segment with "label": true is considered as part of the input or output that the model should learn from during training.
A segment with "label": false is masked or ignored during training, meaning the model is not trained on that particular segment.
Binary labels are commonly used when you want to selectively train the model on specific parts of the input or output while excluding others.
Multiple Labels
In some cases, you might want to assign different labels to different segments to indicate their roles or importance in the training process.
For example, you could use labels like "input", "output", "context", or "meta" to differentiate between different types of segments.
By assigning different labels, you can control how the model processes and learns from each segment during training.
Numerical Labels
Numerical labels can be used to assign weights or priorities to different segments.
For instance, you could use labels like 1, 2, or 3 to indicate the importance or relevance of each segment in the training process.
Higher numerical labels could be used for segments that are more critical for the model to learn from, while lower labels could be used for less important segments.
The choice of label values ultimately depends on how you want to control the model's learning process and what segments you consider important for training.
How LLMs Learn from Structured Datasets
During fine-tuning, the model processes the structured dataset and adjusts its internal parameters to better understand the relationships between input and output data.
The transformer architecture of the model, with its attention mechanism, allows the model to focus on relevant parts of the input and generate appropriate outputs based on the learned patterns.
By encountering multiple examples in the dataset with consistent labelling and structure, the model gradually learns to generate outputs that match the expected format and content.
The model learns to associate specific labels (e.g., "instruction," "input," "output") with their respective roles in the task, enabling it to generate coherent and relevant responses.
Benefits of Structured Datasets
Structured datasets provide a clear and consistent format for the model to learn from, reducing ambiguity and improving the model's understanding of the task.
Consistent labelling and syntax enable the model to generate outputs that adhere to the expected format, making it easier to integrate the model into downstream applications.
Well-structured datasets facilitate the fine-tuning process by providing the model with clear examples of input-output relationships, leading to better performance and generalisation.
Considerations for Dataset Structure
Choose a dataset format that aligns with the specific task or behavior you want the model to learn.
Ensure consistency in labelling and syntax throughout the dataset to avoid confusion during the learning process.
Use clear and descriptive labels that accurately represent the role of each data point in the task or conversation.
Consider the balance between providing enough context and keeping the dataset concise to optimize the learning process.
By understanding the importance of dataset structure, common formats, labelling, and syntax, you can create well-structured datasets that enable LLMs to effectively learn and generate desired outputs during the fine-tuning process.
The transformer architecture of LLMs, with its attention mechanism, leverages the structured nature of the dataset to learn patterns, associations, and relationships between input and output data, ultimately improving the model's performance on the target task.
Prompt Construction
The input_output
format is described as an alternative to using predefined templates (like 'alpaca' or 'chatml') which can add unnecessary complexity or limit flexibility.
With input_output
, you have more control over the exact structure of your prompts.
The key feature of input_output
is the ability to mask certain segments of your prompts so that the model doesn't train on them. This is done by setting train_on_inputs: false
in your configuration.
To use input_output
, you prepare your data in a specific JSON Lines (JSONL) format.
Each line in the JSONL file represents a single prompt and consists of a series of "segments".
Each segment has two properties:
"text": The actual text content of this segment.
"label": A boolean indicating whether the model should train on this segment (true) or mask it (false).
Here are five different types of dataset structures using the input/output format for various domain use cases:
Sentiment Analysis
In this example, we'll create a dataset for sentiment analysis of movie reviews.
The input will be a movie review, and the output will be the sentiment label (positive, negative, or neutral).
The segment with the movie review text has "label": true because it is the input that the model should learn from.
The segment with the sentiment label (e.g., "positive") has "label": false because it is the expected output that the model should generate, not learn from.
Named Entity Recognition (NER)
This dataset structure is designed for named entity recognition tasks, where the goal is to identify and classify named entities in a given text. The input will be a sentence, and the output will be the same sentence with named entities marked.
Text Summarization
For text summarization tasks, the input will be a longer piece of text (e.g., a news article), and the output will be a concise summary.
Machine Translation
This dataset structure is suitable for machine translation tasks, where the input is a sentence in one language, and the output is the translated sentence in another language.
Dialogue Act Classification
In this example, we'll create a dataset for dialogue act classification, where the goal is to classify the intent of each utterance in a conversation. The input will be a conversational utterance, and the output will be the corresponding dialogue act label.
These examples demonstrate how the input/output format can be adapted for various tasks and domains.
By carefully designing the structure of your dataset and deciding which segments to label for training, you can create custom datasets tailored to your specific use case.
How to configure the YAML file for input output
To use the input_output
format, you specify it in your Axolotl configuration file:
When you run the preprocessing step with the --debug
flag, Axolotl will print out the tokens along with their labels so you can verify that the correct segments are being masked.
A label of 1 means the token will be trained on.
A label of -100 means the token will be masked.
You can also inspect the materialised data after preprocessing to ensure your prompts are being assembled correctly. This involves loading the tokenized data and decoding it back into text to see the final prompt structure.
The input_output
format provides a flexible, template-free way to construct prompts for fine-tuning.
By allowing you to mask specific segments, it gives you fine-grained control over what parts of your prompts the model actually learns from.
The JSONL structure with "segments", "text", and "label" properties is a clear and machine-readable way to define these prompts.
A medical dataset
This dataset is a collection of medical question-answering data, intended for fine-tuning a language model to perform medical question answering tasks.
Here's the table format of the dataset:
id
int64
Unique identifier for each entry
ending0
string
Possible ending or option for the scenario
ending1
string
Possible ending or option for the scenario
ending2
string
Possible ending or option for the scenario
ending3
string
Possible ending or option for the scenario
ending4
string
Possible ending or option for the scenario
label
int64
Correct answer or label for the scenario
sent1
string
Sentence or statement part of the scenario
sent2
string
Continuation or second part of the scenario
startphrase
string
Initial phrase or context for the scenario
Let's break down the structure
The dataset is structured in a tabular format, with each row representing a single question-answer pair. The key columns are:
sent1
: This column contains the medical question or scenario. It provides the context for the question being asked.sent2
: This column contains the specific question related to the medical scenario described insent1
.ending0
toending4
: These five columns represent the possible answer choices for the question. Each column contains a different answer option.label
: This column indicates the correct answer choice for the given question. It is an integer value ranging from 0 to 4, corresponding to theending0
toending4
columns.
Example Scenario
Id: 1,754
Startphrase: "An 8-year-old boy is brought to the paediatrician because his mother is concerned about recent behavioural changes."
Sent2: "Which of the following trinucleotide repeats is this child most likely to possess?"
Options:
Ending0: CGG
Ending1: GAA
Ending2: CAG
Ending3: CTG
Ending4: GCC
Correct Answer: Label 1 (Indicates the most likely trinucleotide repeat)
These entries illustrate how the dataset is structured to provide multiple-choice questions related to medical scenarios, which can be used for training or testing in medical education and artificial intelligence applications.
The structure of this dataset, with questions, answer choices, and labels, is well-suited for fine-tuning a language model for multiple-choice question answering tasks.
By providing the model with a variety of medical scenarios and questions during fine-tuning, it learns to understand the context and select the most appropriate answer from the given choices.
The fine-tuning process leverages the pre-existing knowledge of the language model and adapts it specifically for the medical domain and the question answering task.
The model learns to associate patterns in the input text with the correct answer choices, enabling it to make accurate predictions on new, unseen medical questions.
Last updated