# Types of Dataset Structures

### <mark style="color:blue;">Axolotl Dataset Formats and Customisation</mark>

Axolotl supports several dataset formats.

The most common structures are described below, with <mark style="color:blue;">**JSONL**</mark> (one JSON object per line) being the recommended file format:

### <mark style="color:green;">**Alpaca Format**</mark>

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"instruction"`**</mark><mark style="color:yellow;">`: "your_instruction",`</mark>` `<mark style="color:purple;">**`"input"`**</mark><mark style="color:yellow;">`: "optional_input",`</mark>` `<mark style="color:purple;">**`"output"`**</mark><mark style="color:yellow;">`: "expected_output"}`</mark>
* Ideal for scenarios where you need to *<mark style="color:yellow;">**provide specific instructions along with optional input data**</mark>*. The `output` field holds the expected result. This format is particularly useful for guided learning tasks.
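
As a minimal sketch of what Alpaca-style records look like on disk, the snippet below writes two records to a JSONL file and reads them back. The file name `alpaca_sample.jsonl` and the record contents are illustrative assumptions, not something Axolotl requires.

```python
import json
import os
import tempfile

# Illustrative records; the contents are made up for this example.
records = [
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
    # "input" is optional, so it is left empty here.
    {"instruction": "Name a primary colour.", "input": "", "output": "Red"},
]

# Hypothetical file name, written to the system temp directory.
path = os.path.join(tempfile.gettempdir(), "alpaca_sample.jsonl")

# JSONL: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back, skipping any blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```

Keeping every record on its own line means the file can be streamed or appended to without parsing the whole dataset.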

### <mark style="color:green;">**ShareGPT Format**</mark>

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"conversations"`**</mark><mark style="color:yellow;">`: [{"from": "human/gpt", "value":`</mark>` `<mark style="color:purple;">**`"dialogue_text"`**</mark><mark style="color:yellow;">`}]}`</mark>
* This format suits conversational models where *<mark style="color:yellow;">**interactions are between a human and a GPT-like model.**</mark>* Each turn's `from` field is either `human` or `gpt`, and `value` holds that turn's text. It helps train models to understand and respond in a dialogue setting that reflects real-world conversational flow.
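
The structure above can be sketched in a few lines of Python. The helper `to_sharegpt` is hypothetical (it is not part of Axolotl) and assumes only the two speaker values named in the structure, `human` and `gpt`.

```python
import json

ALLOWED_SPEAKERS = {"human", "gpt"}  # assumption: only the two roles shown above


def to_sharegpt(turns):
    """Build a ShareGPT-style record from (speaker, text) pairs."""
    conversations = []
    for speaker, text in turns:
        if speaker not in ALLOWED_SPEAKERS:
            raise ValueError(f"unexpected speaker: {speaker!r}")
        conversations.append({"from": speaker, "value": text})
    return {"conversations": conversations}


record = to_sharegpt([
    ("human", "What is JSONL?"),
    ("gpt", "A text file with one JSON object per line."),
])

# One serialized record corresponds to one line of a JSONL file.
line = json.dumps(record)
```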

### <mark style="color:green;">**Completion Format**</mark>

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"text"`**</mark><mark style="color:yellow;">`: "your_text_data"}`</mark>
* The completion format is straightforward and best for training models on raw text corpora. It's ideal for scenarios where the model *<mark style="color:yellow;">**needs to learn from unstructured text**</mark>* without specific instructions or dialogue contexts.
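
One way to produce completion records from raw text is sketched below. Splitting on blank lines is an assumption made for illustration; how a corpus should be chunked depends on the data.

```python
import json

# Illustrative raw text with paragraphs separated by blank lines.
raw_text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# Split into paragraphs and drop empty chunks.
paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]

# Each paragraph becomes one {"text": ...} record, i.e. one JSONL line.
jsonl_lines = [json.dumps({"text": p}) for p in paragraphs]
```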

### <mark style="color:blue;">**Adding Custom Prompts**</mark>

For datasets preprocessed with instruction-focused tasks:

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"instruction"`**</mark><mark style="color:yellow;">`: "your_instruction",`</mark>` `<mark style="color:purple;">**`"output"`**</mark><mark style="color:yellow;">`: "expected_output"}`</mark>
* This format supports a direct instructional approach, where the model is trained to follow specific commands or requests. It's effective for task-oriented models.

#### <mark style="color:green;">**Incorporating this into your Axolotl YAML configuration**</mark>

```yaml
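datasets:
  - path: repo  # placeholder: dataset path or Hugging Face repo
    type:
      system_prompt: ""  # default system prompt
      field_system: system  # column holding a per-record system prompt
      format: "[INST] {instruction} [/INST]"  # template for records with an input
      no_input_format: "[INST] {instruction} [/INST]"  # template for records without an input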
```

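In this config, `format` is the prompt template applied to each record and `no_input_format` is the template used when a record has no input. The `{instruction}` placeholder is replaced with the record's instruction, and `field_system` names the column that holds an optional system prompt.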
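
To see how a template such as the `format` value above turns a record into a prompt, here is a minimal sketch using plain Python string formatting. It illustrates the substitution only and is not Axolotl's actual implementation; `render_example` is a hypothetical helper.

```python
# Same template string as the `format` value in the YAML above.
PROMPT_FORMAT = "[INST] {instruction} [/INST]"


def render_example(record, template=PROMPT_FORMAT):
    """Return (prompt, target) for one instruction/output record."""
    prompt = template.format(instruction=record["instruction"])
    return prompt, record["output"]


prompt, target = render_example(
    {"instruction": "Summarise the text.", "output": "A short summary."}
)
```

Here `prompt` is the text the model is conditioned on and `target` is the text it is expected to produce.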

### <mark style="color:blue;">**Custom Pre-tokenized Dataset Usage**</mark>

To use a custom pre-tokenized dataset:

* <mark style="color:yellow;">**Do not**</mark> specify a <mark style="color:yellow;">**`type`**</mark> in your configuration.
* Ensure your dataset columns are named exactly <mark style="color:yellow;">**`input_ids`**</mark>, <mark style="color:yellow;">**`attention_mask`**</mark>, and <mark style="color:yellow;">**`labels`**</mark>.

This approach is beneficial <mark style="color:blue;">**when you have a dataset that is already tokenized**</mark> and ready for model consumption.

Because the data is already tokenized, Axolotl can skip its own prompt formatting and tokenization steps, which shortens preprocessing.
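
The sketch below shows the shape of one pre-tokenized row with the three required columns. It uses a toy whitespace vocabulary purely for illustration; in practice the ids would come from the tokenizer of the model being trained. Masking prompt tokens with `-100` is a common choice and an assumption here (it is the label value ignored by PyTorch's cross-entropy loss), not something this page prescribes.

```python
def build_vocab(texts):
    """Toy vocabulary: assign an id to each whitespace-separated token."""
    vocab = {}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_example(prompt, completion, vocab):
    """Return one row with the columns input_ids, attention_mask, and labels."""
    prompt_ids = [vocab[t] for t in prompt.split()]
    completion_ids = [vocab[t] for t in completion.split()]
    input_ids = prompt_ids + completion_ids
    return {
        "input_ids": input_ids,
        # No padding in this example, so every position is attended to.
        "attention_mask": [1] * len(input_ids),
        # Assumption: prompt positions are masked with -100 so that only
        # completion tokens contribute to the loss.
        "labels": [-100] * len(prompt_ids) + completion_ids,
    }


prompt, completion = "say hello", "hello there"
vocab = build_vocab([prompt, completion])
row = encode_example(prompt, completion, vocab)
```

All three lists in a row must have the same length, since each position in `attention_mask` and `labels` corresponds to the token at the same position in `input_ids`.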

### <mark style="color:blue;">Interesting Points Regarding Datasets</mark>

* <mark style="color:purple;">**Format Flexibility**</mark><mark style="color:purple;">:</mark> Axolotl’s support for multiple formats allows for training models on diverse data types - from structured instructional data to informal conversational dialogues.
* <mark style="color:purple;">**Customisability**</mark><mark style="color:purple;">:</mark> Defining dataset formats and prompt templates in the YAML configuration gives you fine-grained control over how training examples are built, so you can tailor them to the output you want from the model.
* <mark style="color:purple;">**Efficiency in Pre-tokenized Data**</mark><mark style="color:purple;">:</mark> The support for pre-tokenized datasets is a significant time-saver, particularly in scenarios where datasets are vast and tokenization can become a computationally expensive step.

Together, these options let Axolotl train language models across a wide range of scenarios and requirements.

