# Data Loading and Processing

<table data-full-width="false"><thead><tr><th width="288">Field Name</th><th>Explanation</th></tr></thead><tbody><tr><td>datasets</td><td>The <code>datasets</code> list defines the datasets that provide the training data for the model.</td></tr><tr><td><mark style="color:blue;">path</mark></td><td><mark style="color:blue;">The</mark> <mark style="color:yellow;"><code>path</code></mark> <mark style="color:blue;">field specifies the location of the dataset. It can be a</mark> <mark style="color:yellow;">HuggingFace dataset repository</mark> <mark style="color:blue;">path, a</mark> <mark style="color:yellow;">cloud storage path</mark> <mark style="color:blue;">(e.g., s3:// or gs://), or a</mark> <mark style="color:yellow;">local "json" file.</mark></td></tr><tr><td>type</td><td>The <code>type</code> field defines the <mark style="color:blue;">type of prompt strategy used for training the model</mark>. For example, it can be set to <mark style="color:yellow;">"alpaca," "sharegpt," "gpteacher," "oasst," or "reflection."</mark> Each type expects its own dataset layout and builds training prompts from it accordingly.</td></tr><tr><td>ds_type</td><td>The optional <code>ds_type</code> field <mark style="color:blue;">specifies the datatype when the <code>path</code> points to a local file.</mark> It can be set to <mark style="color:yellow;">"json," "arrow," "parquet," "text," or "csv,"</mark> depending on the format of the dataset file.</td></tr><tr><td>data_files</td><td>The optional <mark style="color:yellow;"><code>data_files</code></mark> field specifies the path to the source data files associated with the dataset.</td></tr><tr><td>shards</td><td>The optional <mark style="color:yellow;"><code>shards</code></mark> field allows you to specify the number of <mark style="color:blue;">data shards into which the dataset should be divided</mark>. 
Sharding can help distribute the data efficiently for parallel processing during training.</td></tr><tr><td>name</td><td>The optional <mark style="color:yellow;"><code>name</code></mark> field specifies the name of the dataset configuration to load. This is useful when a dataset repository offers more than one configuration.</td></tr><tr><td>train_on_split</td><td>The optional <mark style="color:yellow;"><code>train_on_split</code></mark> field lets you <mark style="color:blue;">specify the name of the dataset split to load from</mark>. For instance, you might use "train" to load the training split of the dataset.</td></tr><tr><td>conversation</td><td>This optional field defines the fastchat conversation type. It is used together with the "sharegpt" type and lets you customize the conversation style.</td></tr></tbody></table>
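
As a rough illustration of how these fields fit together, a `datasets` section in an Axolotl YAML config might look like the sketch below. The repository name, file path, shard count, and conversation type are hypothetical placeholders chosen for the example, not recommended values.

```yaml
datasets:
  # Entry 1: a dataset pulled from a HuggingFace repository (hypothetical repo name)
  - path: example-org/example-instructions
    type: alpaca              # prompt strategy
    train_on_split: train     # name of the split to load from
    shards: 10                # number of shards to divide the dataset into

  # Entry 2: a local file (hypothetical path)
  - path: data/my_conversations.jsonl
    ds_type: json             # datatype of the local file
    type: sharegpt
    conversation: vicuna_v1.1 # fastchat conversation type (assumed template name)
```

Only `path` and `type` are set in most entries; the remaining fields are optional and can be omitted when the defaults are suitable.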

<table data-full-width="false"><thead><tr><th width="306">Custom User Prompt</th><th>Explanation</th></tr></thead><tbody><tr><td>system_prompt</td><td>The <mark style="color:yellow;"><code>system_prompt</code></mark> field is part of the <mark style="color:blue;">custom user prompt configuration.</mark> It defines the system prompt supplied to the model as part of each interaction.</td></tr><tr><td>system_format</td><td><mark style="color:yellow;"><code>system_format</code></mark> specifies the format in which the system prompt is presented. The placeholder "{system}" stands for the system prompt text.</td></tr><tr><td>field_system</td><td><mark style="color:yellow;"><code>field_system</code></mark> specifies the <mark style="color:blue;">name of the field where the system prompt is located in the dataset</mark>. It lets the training process find the system prompt for each record.</td></tr><tr><td>field_instruction</td><td><mark style="color:yellow;"><code>field_instruction</code></mark> specifies the field name for the instruction or query provided to the model. It's used to extract user instructions or queries from the dataset.</td></tr><tr><td>field_input</td><td><mark style="color:yellow;"><code>field_input</code></mark> defines the field name where <mark style="color:blue;">user inputs are stored in the dataset</mark>.</td></tr><tr><td>field_output</td><td><mark style="color:yellow;"><code>field_output</code></mark> represents the field name where the <mark style="color:blue;">assistant's outputs or responses are stored in the dataset.</mark> These are the target responses the model is trained to generate.</td></tr><tr><td>format</td><td>The <mark style="color:yellow;"><code>format</code></mark> field allows customization of the prompt format. 
It can be single-line or multi-line and can include placeholders for the instruction and the input, making it flexible for various dialogue styles.</td></tr><tr><td>no_input_format</td><td><mark style="color:yellow;"><code>no_input_format</code></mark> defines the prompt format used when a record has no input. It should therefore not include the input placeholder.</td></tr><tr><td>field</td><td>For "completion" datasets, this field specifies a custom field in the dataset to use instead of the default "text" column.</td></tr></tbody></table>
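
A custom user prompt built from the fields above could be sketched as follows. This assumes the custom prompt fields are nested under the dataset's `type` key and that the placeholders are written as `{system}`, `{instruction}`, and `{input}`; the repository name, system prompt text, and templates are hypothetical.

```yaml
datasets:
  - path: example-org/example-instructions   # hypothetical repository
    type:
      system_prompt: "You are a helpful assistant."
      system_format: "{system}"              # how the system prompt is rendered
      field_system: system                   # dataset column holding the system prompt
      field_instruction: instruction         # dataset column holding the instruction
      field_input: input                     # dataset column holding the user input
      field_output: output                   # dataset column holding the response
      # Multi-line template used when a record has an input
      format: |-
        User: {instruction} {input}
        Assistant:
      # Template used when a record has no input, so {input} is left out
      no_input_format: |-
        User: {instruction}
        Assistant:
```

Only the fields that differ from the defaults need to be set, so a shorter block listing just the changed fields is usually enough.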

<table data-full-width="false"><thead><tr><th width="307">Dataset Prepared Path</th><th>Explanation</th></tr></thead><tbody><tr><td>dataset_prepared_path</td><td>The <mark style="color:yellow;"><code>dataset_prepared_path</code></mark> field specifies the relative path where the prepared dataset is saved as an Arrow file. The dataset is saved after the data has been packed together, so subsequent training attempts can load it directly and start faster.</td></tr></tbody></table>

<table data-full-width="false"><thead><tr><th width="321">Push Dataset to Hub</th><th>Explanation</th></tr></thead><tbody><tr><td>push_dataset_to_hub</td><td>The <mark style="color:yellow;"><code>push_dataset_to_hub</code></mark> field specifies the <mark style="color:blue;">repository path to which the prepared dataset should be pushed</mark>. This feature is useful for sharing the prepared dataset with others through the HuggingFace dataset hub.</td></tr></tbody></table>

<table data-full-width="false"><thead><tr><th width="327">Dataset Processing</th><th>Explanation</th></tr></thead><tbody><tr><td>dataset_processes</td><td>The <mark style="color:yellow;"><code>dataset_processes</code></mark> field allows you to define the <mark style="color:blue;">maximum number of processes to use during preprocessing of the input dataset.</mark> If not set, it defaults to the number of CPU cores available, which can optimize data preparation for training.</td></tr></tbody></table>
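
The two preprocessing options, `dataset_prepared_path` and `dataset_processes`, are top-level settings. A minimal sketch, with a placeholder directory name and an arbitrary process count:

```yaml
# Relative path where the prepared (packed) dataset is saved as Arrow
dataset_prepared_path: last_run_prepared   # hypothetical directory name

# Upper bound on preprocessing worker processes;
# leave unset to default to the number of available CPU cores
dataset_processes: 8                       # illustrative value
```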

<table data-full-width="false"><thead><tr><th width="336">Push Checkpoints to Hub</th><th>Explanation</th></tr></thead><tbody><tr><td>hub_model_id</td><td>The <mark style="color:yellow;"><code>hub_model_id</code></mark> field specifies the <mark style="color:blue;">repository path to which the finetuned model checkpoints should be pushed</mark>. It facilitates the sharing of finetuned models through the HuggingFace model hub, making them accessible to others.</td></tr><tr><td>hub_strategy</td><td>The optional <mark style="color:yellow;"><code>hub_strategy</code></mark> field defines the strategy for pushing checkpoints to the hub. It lets you customize when and how checkpoints are pushed.</td></tr></tbody></table>
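
A sketch of the checkpoint-pushing options follows. The repository path is a placeholder, and the `hub_strategy` value assumes the setting accepts the same strategy names as the `hub_strategy` option of HuggingFace Transformers `TrainingArguments`, which this page does not confirm.

```yaml
# Repository that receives the finetuned model checkpoints (hypothetical path)
hub_model_id: example-org/my-finetuned-model

# Assumed value, borrowed from Transformers' TrainingArguments.hub_strategy
hub_strategy: every_save
```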

<table data-full-width="false"><thead><tr><th width="339">Authentication Token</th><th>Explanation</th></tr></thead><tbody><tr><td>hf_use_auth_token</td><td>The <mark style="color:yellow;"><code>hf_use_auth_token</code></mark> field is a boolean value that determines whether to use HuggingFace's <mark style="color:yellow;"><code>use_auth_token</code></mark> for loading datasets. This is particularly useful for fetching private datasets and is required to be set to "true" when used with <mark style="color:yellow;"><code>push_dataset_to_hub</code>.</mark></td></tr></tbody></table>
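
Because `push_dataset_to_hub` requires `hf_use_auth_token` to be enabled, the two settings would typically appear together. A minimal sketch with a placeholder repository path:

```yaml
# Repository that receives the prepared dataset (hypothetical path)
push_dataset_to_hub: example-org/my-prepared-dataset

# Must be true when push_dataset_to_hub is set; also needed for private datasets
hf_use_auth_token: true
```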

<table data-full-width="false"><thead><tr><th width="342">Validation Set Size</th><th>Explanation</th></tr></thead><tbody><tr><td>val_set_size</td><td><mark style="color:yellow;"><code>val_set_size</code></mark> <mark style="color:blue;">specifies the fraction of the dataset that should be set aside for evaluation purposes.</mark> For example, a value of 0.04 means that 4% of the dataset will be reserved for evaluation, helping assess model performance.</td></tr></tbody></table>
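
Using the 0.04 value from the description above, the setting is a single top-level line. The dataset size in the comment is a made-up figure used only to work through the arithmetic.

```yaml
# Reserve 4% of the dataset for evaluation.
# For a hypothetical dataset of 10,000 examples: 10,000 * 0.04 = 400 evaluation examples,
# leaving 9,600 for training.
val_set_size: 0.04
```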

<table data-full-width="false"><thead><tr><th width="344">Dataset Sharding</th><th>Explanation</th></tr></thead><tbody><tr><td>dataset_shard_num</td><td>The optional <mark style="color:yellow;"><code>dataset_shard_num</code></mark> and <mark style="color:yellow;"><code>dataset_shard_idx</code></mark> fields configure sharding of the entire dataset. <mark style="color:yellow;"><code>dataset_shard_num</code></mark> defines the number of shards to split the dataset into, and <mark style="color:yellow;"><code>dataset_shard_idx</code></mark> specifies the index of the shard to use. These options can be helpful for efficient data processing and training with large datasets.</td></tr></tbody></table>
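
The two sharding fields are meant to be set together. The sketch below uses illustrative values and assumes shard indices are zero-based, which this page does not state.

```yaml
# Split the whole dataset into 4 shards and train on one of them
dataset_shard_num: 4
dataset_shard_idx: 0   # assumed zero-based, so this selects the first shard
```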

