Axolotl Dataset Formats and Customisation
Axolotl is versatile in handling various dataset formats.
Below are some of the formats you can use, with JSONL being the recommended format:
Alpaca Format
Structure:
{
  "instruction": "your_instruction",
  "input": "optional_input",
  "output": "expected_output"
}
Ideal for scenarios where you need to provide specific instructions along with optional input data. The output field holds the expected result. This format is particularly useful for guided learning tasks.
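As a minimal sketch (the file path below is a placeholder, not taken from this page), an Alpaca-style JSONL file is typically referenced in the Axolotl config with the alpaca dataset type:

datasets:
  - path: data/alpaca_data.jsonl   # hypothetical local JSONL file in Alpaca format
    type: alpaca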
ShareGPT Format
Structure:
{
  "conversations": [
    {"from": "human/gpt", "value": "dialogue_text"}
  ]
}
This format suits conversational models where interactions are between a human and a GPT-like model. It helps in training models to understand and respond in a dialogue setting, reflecting real-world conversational flows.
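A corresponding dataset entry, sketched here with a placeholder path and the sharegpt dataset type, might look like:

datasets:
  - path: data/sharegpt_conversations.jsonl   # hypothetical local JSONL file of conversations
    type: sharegpt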
Completion Format
Structure:
{
  "text": "your_text_data"
}
The completion format is straightforward and best for training models on raw text corpora. It's ideal for scenarios where the model needs to learn from unstructured text without specific instructions or dialogue contexts.
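For raw text corpora, the dataset is usually declared with the completion type; the path here is illustrative:

datasets:
  - path: data/raw_corpus.jsonl   # hypothetical JSONL file with a "text" column
    type: completion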
Adding Custom Prompts
For datasets that have been preprocessed for instruction-focused tasks:
Structure:
{
  "instruction": "your_instruction",
  "output": "expected_output"
}
This format supports a direct instructional approach, where the model is trained to follow specific commands or requests. It's effective for task-oriented models.
Incorporating this into your Axolotl YAML configuration
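A sketch of such a configuration follows; the path, prompt template, and field names are illustrative assumptions rather than values from this page:

datasets:
  - path: data/custom_instructions.jsonl   # hypothetical local JSONL file
    type:
      field_instruction: instruction                   # column holding the instruction text
      field_output: output                             # column holding the expected output
      format: "[INST] {instruction} [/INST]"           # illustrative prompt template
      no_input_format: "[INST] {instruction} [/INST]"  # template used when no input field is present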
This YAML config allows for a flexible setup, enabling the model to interpret and learn from the structured instructional format.
Custom Pre-tokenized Dataset Usage
To use a custom pre-tokenized dataset:
Do not specify a type in your configuration.
Ensure your dataset columns are precisely named input_ids, attention_mask, and labels.
This approach is beneficial when you have a dataset that is already tokenized and ready for model consumption.
It skips additional preprocessing steps, streamlining the training process for efficiency.
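A minimal sketch of such an entry, assuming the tokenized data sits in a hypothetical local file, simply omits the type key:

datasets:
  - path: data/pretokenized_dataset.jsonl   # hypothetical file with input_ids, attention_mask, labels columns
    # no "type" specified, per the note above, since the data is already tokenized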
Interesting Points Regarding Datasets
Format Flexibility: Axolotl’s support for multiple formats allows for training models on diverse data types - from structured instructional data to informal conversational dialogues.
Customisability: The ability to customise datasets and their integration into the system via YAML configurations provides a high degree of control over the training process, allowing for fine-tuning specific to the desired output of the model.
Efficiency in Pre-tokenized Data: The support for pre-tokenized datasets is a significant time-saver, particularly in scenarios where datasets are vast and tokenization can become a computationally expensive step.
This variety and customisability make Axolotl a robust tool for training language models across different scenarios and requirements, enhancing its versatility in AI model development.