# Phi 2.0 - Data Loading and Paths

With the model defined and fully configured, we will now configure the datasets section.

First off, this configuration file is all about <mark style="color:yellow;">specifying the datasets you want to use for fine-tuning your model and where you would like the fine-tuned model stored once training is complete.</mark>

You can provide one or more datasets under the <mark style="color:yellow;">**`datasets`**</mark> section. &#x20;

Each dataset can come from different sources, like a HuggingFace dataset repo, an S3 or GS path, or even a local JSON file.

For each dataset, you can choose the <mark style="color:yellow;">type of prompt</mark> you want to use during training.&#x20;

Axolotl supports a bunch of prompt types, like "alpaca", "sharegpt", "gpteacher", "oasst", and "reflection". You can specify the prompt type using the `type` parameter.

If you're using a file-based dataset, you can specify the data type using the <mark style="color:yellow;">**`ds_type`**</mark> parameter. It supports formats like JSON, Arrow, Parquet, text, and CSV. You can also provide the path to the source data files using the <mark style="color:yellow;">**`data_files`**</mark> parameter.

### <mark style="color:blue;">Basic datasets configuration</mark>

<pre class="language-yaml"><code class="lang-yaml"><strong>datasets:
</strong>  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json

dataset_prepared_path:
val_set_size: 0.20

output_dir: ./phi-out
</code></pre>

This component of the configuration file tells the platform where the dataset is located and how the dataset should be split into <mark style="color:yellow;">training and validation sets.</mark>

We will only be using a handful of these parameters for our fine-tuning. Each configuration option is analysed in full below, and a summary can be found here: [<mark style="color:orange;">**data loading and processing configurations**</mark>](https://axolotl.continuumlabs.pro/axolotl-configuration-files/data-loading-and-processing)<mark style="color:orange;">**.**</mark>

***

### <mark style="color:green;">Full assessment of all of the datasets configuration options</mark>

### <mark style="color:purple;">`path:`</mark>

This parameter indicates the <mark style="color:yellow;">path to a dataset that has been prepared for training</mark>.&#x20;

As highlighted, this can be a HuggingFace dataset repository, an AWS S3 path (`s3://`), a Google Cloud Storage path (`gs://`), or "json" for a local dataset.

Here's an example of a Google Cloud Storage path:

```yaml
gs://my-bucket/datasets/my_dataset.json
```

In this example, "my-bucket" is the name of the Google Cloud Storage bucket, and "datasets/my\_dataset.json" is the path to the specific file within that bucket.

To determine the path to your data files, right-click the folder containing them in VS Code and select 'Copy Path' (or 'Copy Relative Path').&#x20;

```yaml
path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
```

### <mark style="color:purple;">`type:`</mark>

Indicates the type of prompt to use for training, such as <mark style="color:yellow;">"alpaca", "sharegpt", "gpteacher", "oasst", or "reflection".</mark>

We will be using <mark style="color:yellow;">Alpaca in this example</mark>. For a full explanation of each data 'type', please refer to this documentation: data type

```yaml
type: alpaca
```

### <mark style="color:purple;">`ds_type:`</mark>

(Optional) Specifies the data type when the <mark style="color:yellow;">`path`</mark> is a file, such as "json", "arrow", "parquet", "text", or "csv".

HuggingFace datasets can come in a number of formats. You can configure the data loading process to take the data type into account - csv, json, parquet. <mark style="color:yellow;">We will be using JSON format.</mark>

```yaml
ds_type: json
```

### <mark style="color:purple;">`data_files:`</mark>

(Optional) Specifies the path to the source data files.
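For instance, if the dataset spans several files alongside a local `path`, you might list them explicitly like this (the file names are illustrative):

```yaml
datasets:
  - path: datasets/alpaca-cleaned
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json
      - alpaca_data_extra.json
```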

### <mark style="color:purple;">`dataset_prepared_path`</mark><mark style="color:purple;">:</mark>

Specifies the relative path where Axolotl attempts to save the prepared dataset as an arrow file after packing the data together. This allows subsequent training attempts to load faster.

```yaml
dataset_prepared_path: last_run_prepared
```

### <mark style="color:purple;">`output_dir:`</mark>&#x20;

This is the path where the full fine tuned model will be saved to:

```yaml
output_dir: ./completed-model
```

### <mark style="color:blue;">Advanced: Explanation of all other data related configuration options</mark>

### <mark style="color:purple;">`shards`</mark><mark style="color:purple;">:</mark>&#x20;

Axolotl allows you to split your data into shards, which can be handy for large datasets. You can specify the number of shards using the <mark style="color:yellow;">**`shards`**</mark> parameter.
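For example, splitting a dataset into 10 shards might look like this (the path and shard count are illustrative):

```yaml
datasets:
  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    shards: 10
```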

### <mark style="color:purple;">`name`</mark><mark style="color:purple;">:</mark>&#x20;

If you're using a dataset from a repository, you can give it a name using the <mark style="color:yellow;">**`name`**</mark> parameter. This can help you keep track of which dataset configuration you're using.
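A sketch of how this might look, where the repository and configuration name are hypothetical:

```yaml
datasets:
  - path: my-org/my-dataset   # hypothetical HuggingFace repository
    name: subset-a            # hypothetical dataset configuration name
    type: alpaca
```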

### <mark style="color:purple;">`train_on_split`</mark><mark style="color:purple;">:</mark>&#x20;

By default, Axolotl uses the "train" split for training, but you can specify a different split using the `train_on_split` parameter.
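For example, to train on a dataset's "validation" split instead of the default "train" split (the repository name is hypothetical):

```yaml
datasets:
  - path: my-org/my-dataset   # hypothetical HuggingFace repository
    type: alpaca
    train_on_split: validation
```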

### <mark style="color:purple;">`conversation`</mark><mark style="color:purple;">:</mark>&#x20;

If you're using the "sharegpt" prompt type, you can specify the conversation type using the <mark style="color:yellow;">**`conversation`**</mark> parameter. Axolotl uses the FastChat library for conversations, and you can find the available conversation types in the FastChat documentation.

{% embed url="https://github.com/lm-sys/FastChat/tree/main" %}
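As a sketch, a ShareGPT-style dataset using the `chatml` conversation template might be configured like this (the repository name is hypothetical):

```yaml
datasets:
  - path: my-org/sharegpt-style-data   # hypothetical repository
    type: sharegpt
    conversation: chatml               # a conversation template defined in FastChat
```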

### <mark style="color:purple;">`field_human`</mark> <mark style="color:purple;"></mark><mark style="color:purple;">and</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_model`</mark><mark style="color:purple;">: (optional)</mark>

You can also specify the keys to use for the human and assistant roles in the conversation using the <mark style="color:yellow;">`field_human`</mark> and <mark style="color:yellow;">`field_model`</mark> parameters.
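For example, if your conversations store turns under `user` and `assistant` keys rather than the defaults (the repository and key names are hypothetical):

```yaml
datasets:
  - path: my-org/chat-data   # hypothetical repository
    type: sharegpt
    field_human: user        # key holding the human turns
    field_model: assistant   # key holding the model turns
```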

### <mark style="color:purple;">`roles`</mark><mark style="color:purple;">:</mark>&#x20;

If your dataset has additional keys that you want to use as input or output roles, you can specify them under the `roles` section. The `input` role is used for masking, and the `output` role is used for generation.

### <mark style="color:purple;">`input`</mark><mark style="color:purple;">:</mark>

&#x20;(Optional) A list of keys to be masked based on <mark style="color:yellow;">**`train_on_input`**</mark><mark style="color:yellow;">**.**</mark>

### <mark style="color:purple;">`output`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) A list of keys to be used as output.
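Putting `roles`, `input`, and `output` together, a configuration might look like the following sketch (the repository and role names are illustrative):

```yaml
datasets:
  - path: my-org/multi-role-data   # hypothetical repository
    type: sharegpt
    roles:
      input:          # these keys are masked based on train_on_input
        - system
        - human
      output:         # these keys are used for generation
        - gpt
```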

### <mark style="color:blue;">Custom user instruction prompt</mark>

Allows defining a custom prompt for each dataset.

### <mark style="color:purple;">`system_prompt`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) Specifies the system prompt.

### <mark style="color:purple;">`system_format`</mark><mark style="color:blue;">:</mark>

&#x20;(Optional) Specifies the format of the system prompt, with "{system}" as a placeholder.

### <mark style="color:purple;">`field_system`</mark><mark style="color:purple;">,</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_instruction`</mark><mark style="color:purple;">,</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_input`</mark><mark style="color:purple;">,</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_output`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) Specify the column names for the respective fields in the dataset.

### <mark style="color:purple;">`format`</mark><mark style="color:purple;">:</mark>&#x20;

Specifies the format of the prompt, which can include placeholders for instruction and input.

### <mark style="color:purple;">`no_input_format`</mark><mark style="color:blue;">:</mark>

&#x20;Specifies the format of the prompt when there is no input, excluding the "{input}" placeholder.
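Putting the custom prompt options together, a configuration might look like the following sketch (the repository name, field names, and prompt templates are all illustrative):

```yaml
datasets:
  - path: my-org/instruct-data   # hypothetical repository
    type:
      system_prompt: "You are a helpful assistant."
      system_format: "{system}\n"
      field_system: system
      field_instruction: instruction
      field_input: input
      field_output: output
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
```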

### <mark style="color:purple;">`field`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) For "completion" datasets only, specifies the field to use instead of the "text" column.

### <mark style="color:purple;">`shuffle_merged_datasets`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) By default, Axolotl shuffles the merged datasets, but you can disable this behavior by setting <mark style="color:yellow;">**`shuffle_merged_datasets`**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**to**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`false`**</mark>.
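As a sketch, the two options above might be used like this (the repository and column names are hypothetical):

```yaml
datasets:
  - path: my-org/raw-text-corpus   # hypothetical repository
    type: completion
    field: content                 # read from the "content" column instead of "text"

shuffle_merged_datasets: false     # keep merged datasets in their original order
```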

### <mark style="color:blue;">Datasets for Evaluation</mark>

### <mark style="color:purple;">`test_datasets`</mark>

A list of one or more datasets used for evaluating the model.

* <mark style="color:blue;">**`path`**</mark><mark style="color:blue;">**:**</mark> Specifies the path to the test dataset file.
* <mark style="color:blue;">**`ds_type`**</mark><mark style="color:blue;">**:**</mark> Specifies the data type of the test dataset, such as "json".
* <mark style="color:blue;">**`split`**</mark><mark style="color:blue;">**:**</mark> Specifies the split to use for the test dataset. For "json" datasets, the default split is called "train".
* <mark style="color:blue;">**`type`**</mark><mark style="color:blue;">**:**</mark> Specifies the type of the test dataset, such as "completion".
* <mark style="color:blue;">**`data_files`**</mark><mark style="color:blue;">**:**</mark> Specifies the list of data files for the test dataset.

Note: You can use either <mark style="color:blue;">**`test_datasets`**</mark> or <mark style="color:blue;">**`val_set_size`**</mark>, but not both.
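A minimal sketch of an evaluation dataset configuration (the file path is illustrative):

```yaml
test_datasets:
  - path: data/eval.jsonl   # hypothetical local file
    ds_type: json
    split: train            # "json" datasets default to a split named "train"
    type: completion
    data_files:
      - data/eval.jsonl
```

Remember to leave `val_set_size` unset when using `test_datasets`.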


### <mark style="color:purple;">`push_dataset_to_hub`</mark>

If you want to push the prepared dataset to the Hugging Face Hub, you can specify the repository path using the <mark style="color:yellow;">**`push_dataset_to_hub`**</mark> parameter.

### <mark style="color:purple;">`hub_model_id`</mark>

You can also push the fine-tuned model checkpoints to the Hugging Face Hub by specifying the private repository path using the <mark style="color:yellow;">**`hub_model_id`**</mark> parameter.

### <mark style="color:purple;">`hub_strategy`</mark>

The <mark style="color:yellow;">**`hub_strategy`**</mark> parameter allows you to control how checkpoints are pushed to the Hub. Refer to the Hugging Face Trainer documentation for more information.
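Putting the three Hub-related options together, a configuration might look like this sketch (the namespace and repository names are hypothetical; `every_save` is one of the strategies accepted by the Hugging Face Trainer):

```yaml
push_dataset_to_hub: my-org           # hypothetical Hub namespace for the prepared dataset
hub_model_id: my-org/phi-2-finetune   # hypothetical private model repository
hub_strategy: every_save              # push a checkpoint on every save
```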
