Axolotl Dataset Formats and Customisation
Axolotl is versatile in handling various dataset formats.
Below are some of the formats you can use, with JSONL being the recommended format:
Alpaca Format
Structure:
{
  "instruction": "your_instruction",
  "input": "optional_input",
  "output": "expected_output"
}
Ideal for scenarios where you need to provide specific instructions along with optional input data. The output field holds the expected result. This format is particularly useful for guided learning tasks.
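As a minimal sketch (the file path below is a placeholder, not taken from this page), an Alpaca-style JSONL file is typically referenced in the Axolotl config with the alpaca dataset type:

datasets:
  - path: data/alpaca_data.jsonl   # hypothetical local JSONL file in Alpaca format
    type: alpaca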
ShareGPT Format
Structure:
{
  "conversations": [
    {"from": "human/gpt", "value": "dialogue_text"}
  ]
}
This format suits conversational models where interactions are between a human and a GPT-like model. It helps in training models to understand and respond in a dialogue setting, reflecting real-world conversational flows.
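A corresponding dataset entry, sketched here with a placeholder path and the sharegpt dataset type, might look like:

datasets:
  - path: data/sharegpt_conversations.jsonl   # hypothetical local JSONL file of conversations
    type: sharegpt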
Completion Format
Structure:
{
  "text": "your_text_data"
}
The completion format is straightforward and best for training models on raw text corpora. It's ideal for scenarios where the model needs to learn from unstructured text without specific instructions or dialogue contexts.
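For raw text corpora, the dataset is usually declared with the completion type; the path here is illustrative:

datasets:
  - path: data/raw_corpus.jsonl   # hypothetical JSONL file with a "text" column
    type: completion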
Adding Custom Prompts
For datasets that have been preprocessed for instruction-focused tasks:
Structure:
{
  "instruction": "your_instruction",
  "output": "expected_output"
}
This format supports a direct instructional approach, where the model is trained to follow specific commands or requests. It's effective for task-oriented models.
Incorporating this into your Axolotl YAML configuration
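A sketch of such a configuration follows; the path, prompt template, and field names are illustrative assumptions rather than values from this page:

datasets:
  - path: data/custom_instructions.jsonl   # hypothetical local JSONL file
    type:
      field_instruction: instruction                   # column holding the instruction text
      field_output: output                             # column holding the expected output
      format: "[INST] {instruction} [/INST]"           # illustrative prompt template
      no_input_format: "[INST] {instruction} [/INST]"  # template used when no input field is present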
This YAML config allows for a flexible setup, enabling the model to interpret and learn from the structured instructional format.
Custom Pre-tokenized Dataset Usage
To use a custom pre-tokenized dataset:
Do not specify a type in your configuration.
Ensure your dataset columns are precisely named input_ids, attention_mask, and labels.
This approach is beneficial when you have a dataset that is already tokenized and ready for model consumption.
It skips additional preprocessing steps, streamlining the training process for efficiency.
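A minimal sketch of such an entry, assuming the tokenized data sits in a hypothetical local file, simply omits the type key:

datasets:
  - path: data/pretokenized_dataset.jsonl   # hypothetical file with input_ids, attention_mask, labels columns
    # no "type" specified, per the note above, since the data is already tokenized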
Interesting Points Regarding Datasets
Format Flexibility: Axolotl’s support for multiple formats allows for training models on diverse data types - from structured instructional data to informal conversational dialogues.
Customisability: The ability to customise datasets and their integration into the system via YAML configurations provides a high degree of control over the training process, allowing for fine-tuning specific to the desired output of the model.
Efficiency in Pre-tokenized Data: The support for pre-tokenized datasets is a significant time-saver, particularly in scenarios where datasets are vast and tokenization can become a computationally expensive step.
This variety and customisability make Axolotl a robust tool for training language models across different scenarios and requirements, enhancing its versatility in AI model development.