# Types of Dataset Structures

### <mark style="color:blue;">Axolotl Dataset Formats and Customisation</mark>

Axolotl is versatile in handling various dataset formats. Below are some of the formats you can use, with <mark style="color:blue;">**JSONL**</mark> being the recommended format:

### <mark style="color:green;">**Alpaca Format**</mark>

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"instruction"`**</mark><mark style="color:yellow;">`: "your_instruction",`</mark>` `<mark style="color:purple;">**`"input"`**</mark><mark style="color:yellow;">`: "optional_input",`</mark>` `<mark style="color:purple;">**`"output"`**</mark><mark style="color:yellow;">`: "expected_output"}`</mark>
* Ideal for scenarios where you need to *<mark style="color:yellow;">**provide specific instructions along with optional input data.**</mark>* The output field holds the expected result. This format is particularly useful for guided learning tasks; an example record is shown below.

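A minimal, illustrative pair of Alpaca-format JSONL records might look like this (the instruction, input and output values are hypothetical; the second record shows the optional input left empty):

```jsonl
{"instruction": "Translate the sentence to French.", "input": "The weather is lovely today.", "output": "Il fait très beau aujourd'hui."}
{"instruction": "Name three primary colours.", "input": "", "output": "Red, yellow and blue."}
```
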
### <mark style="color:green;">**ShareGPT Format**</mark>

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"conversations"`**</mark><mark style="color:yellow;">`: [{"from": "human/gpt", "value":`</mark>` `<mark style="color:purple;">**`"dialogue_text"`**</mark><mark style="color:yellow;">`}]}`</mark>
* This format suits conversational models where *<mark style="color:yellow;">**interactions are between a human and a GPT-like model.**</mark>* It helps in training models to understand and respond in a dialogue setting, reflecting real-world conversational flows. An example record is shown below.

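Each record holds an entire conversation, with `from` set to either `human` or `gpt` for every turn. A minimal, illustrative JSONL record (the dialogue text is hypothetical):

```jsonl
{"conversations": [{"from": "human", "value": "What is fine-tuning?"}, {"from": "gpt", "value": "Fine-tuning adapts a pre-trained model to a specific task by training it further on task-specific data."}, {"from": "human", "value": "Do I need a large dataset?"}, {"from": "gpt", "value": "Not necessarily; even a few thousand high-quality examples can be enough."}]}
```
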
### <mark style="color:green;">**Completion Format**</mark>

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"text"`**</mark><mark style="color:yellow;">`: "your_text_data"}`</mark>
* The completion format is straightforward and best for training models on raw text corpora. It's ideal for scenarios where the model *<mark style="color:yellow;">**needs to learn from unstructured text**</mark>* without specific instructions or dialogue contexts, as in the example below.

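Each JSONL record simply carries a chunk of raw text; the content below is purely illustrative:

```jsonl
{"text": "Axolotl is an open-source tool that streamlines the fine-tuning of large language models."}
{"text": "It supports a range of dataset formats, model architectures and training techniques."}
```
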
### <mark style="color:blue;">**Adding Custom Prompts**</mark>

For datasets that are preprocessed for instruction-focused tasks:

* Structure: <mark style="color:yellow;">`{`</mark><mark style="color:purple;">**`"instruction"`**</mark><mark style="color:yellow;">`: "your_instruction",`</mark>` `<mark style="color:purple;">**`"output"`**</mark><mark style="color:yellow;">`: "expected_output"}`</mark>
* This format supports a direct instructional approach, where the model is trained to follow specific commands or requests. It's effective for task-oriented models; an example record is shown below.

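A minimal, illustrative JSONL record for this format (both values are hypothetical):

```jsonl
{"instruction": "Explain what a learning rate is in one sentence.", "output": "The learning rate controls how much the model's weights change at each optimisation step."}
```
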
#### <mark style="color:green;">**Incorporating this into your Axolotl YAML configuration**</mark>

```yaml
datasets:
  - path: repo  # Hugging Face dataset repo or local path
    type:
      system_prompt: ""                                 # fixed system prompt string (empty here)
      field_system: system                              # dataset column holding the system message
      format: "[INST] {instruction} [/INST]"            # prompt template filled from the dataset fields
      no_input_format: "[INST] {instruction} [/INST]"   # template used when the example has no input
```

This YAML config tells Axolotl how to render each record into a prompt, enabling the model to interpret and learn from the structured instructional format.

### <mark style="color:blue;">**Custom Pre-tokenized Dataset Usage**</mark>

To use a custom pre-tokenized dataset:

* <mark style="color:yellow;">**Do not**</mark> specify a <mark style="color:yellow;">**`type`**</mark> in your configuration.
* Ensure your dataset columns are named precisely <mark style="color:yellow;">**`input_ids`**</mark>, <mark style="color:yellow;">**`attention_mask`**</mark>, and <mark style="color:yellow;">**`labels`**</mark>, as in the sketch below.

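As a minimal sketch, a pre-tokenized JSONL record could look like the following. The token IDs are illustrative, and `labels` uses `-100` for positions that should be ignored when computing the loss (the usual Hugging Face convention):

```jsonl
{"input_ids": [1, 15043, 3186, 2], "attention_mask": [1, 1, 1, 1], "labels": [-100, -100, 3186, 2]}
```
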
This approach is beneficial <mark style="color:blue;">**when you have a dataset that is already tokenized**</mark> and ready for model consumption. It skips the prompt-templating and tokenization steps, streamlining the training process for efficiency.

### <mark style="color:blue;">Interesting Points Regarding Datasets</mark>

* <mark style="color:purple;">**Format Flexibility**</mark><mark style="color:purple;">:</mark> Axolotl’s support for multiple formats allows for training models on diverse data types - from structured instructional data to informal conversational dialogues.
* <mark style="color:purple;">**Customisability**</mark><mark style="color:purple;">:</mark> The ability to customise datasets and their integration into the system via YAML configurations provides a high degree of control over the training process, allowing for fine-tuning specific to the desired output of the model.
* <mark style="color:purple;">**Efficiency in Pre-tokenized Data**</mark><mark style="color:purple;">:</mark> The support for pre-tokenized datasets is a significant time-saver, particularly in scenarios where datasets are vast and tokenization can become a computationally expensive step.

This variety and customisability make Axolotl a robust tool for training language models across different scenarios and requirements, enhancing its versatility in AI model development.
