# Phi 2.0 - Data Loading and Paths

With the model defined and fully configured, we will now configure the datasets section.

First off, this configuration file is all about <mark style="color:yellow;">specifying the datasets you want to use for fine-tuning your model and where you would like the fine-tuned model stored once training is complete.</mark>

You can provide one or more datasets under the <mark style="color:yellow;">**`datasets`**</mark> section. &#x20;

Each dataset can come from different sources, like a HuggingFace dataset repo, an S3 or GS path, or even a local JSON file.

For each dataset, you can choose the <mark style="color:yellow;">type of prompt</mark> you want to use during training.&#x20;

Axolotl supports a bunch of prompt types, like "alpaca", "sharegpt", "gpteacher", "oasst", and "reflection". You can specify the prompt type using the `type` parameter.

If you're using a file-based dataset, you can specify the data type using the <mark style="color:yellow;">**`ds_type`**</mark> parameter. It supports formats like JSON, Arrow, Parquet, text, and CSV. You can also provide the path to the source data files using the <mark style="color:yellow;">**`data_files`**</mark> parameter.

### <mark style="color:blue;">Basic datasets configuration</mark>

<pre class="language-yaml"><code class="lang-yaml"><strong>datasets:
</strong>  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json

dataset_prepared_path:
val_set_size: 0.20

output_dir: ./phi-out
</code></pre>

This component of the configuration file tells the platform where the dataset is located and how the dataset should be split into <mark style="color:yellow;">training and validation sets.</mark>

We will only be using a handful of these parameters for our fine-tuning. Each configuration option is analysed in full below, and a summary can be found here: [<mark style="color:orange;">**data loading and processing configurations**</mark>](https://axolotl.continuumlabs.pro/axolotl-configuration-files/data-loading-and-processing)<mark style="color:orange;">**.**</mark>

***

### <mark style="color:green;">Full assessment of all of the datasets configuration options</mark>

### <mark style="color:purple;">`path:`</mark>

This parameter indicates the <mark style="color:yellow;">path to a dataset that has been prepared for training</mark>.&#x20;

As highlighted, this can be a HuggingFace dataset repository, an AWS S3 path (`s3://`), a Google Cloud Storage path (`gs://`), or "json" for a local dataset.

Here's an example of a Google Cloud Storage path:

```yaml
gs://my-bucket/datasets/my_dataset.json
```

In this example, "my-bucket" is the name of the Google Cloud Storage bucket, and "datasets/my\_dataset.json" is the path to the specific file within that bucket.

To determine the path to your data files, right-click the folder containing them in VS Code and select 'Copy Path' (or 'Copy Relative Path').&#x20;

```yaml
path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
```

### <mark style="color:purple;">`type:`</mark>

Indicates the type of prompt to use for training, such as <mark style="color:yellow;">"alpaca", "sharegpt", "gpteacher", "oasst", or "reflection".</mark>

We will be using <mark style="color:yellow;">Alpaca in this example</mark>. For a full explanation of each data 'type', please refer to this documentation: data type

```yaml
type: alpaca
```

### <mark style="color:purple;">`ds_type:`</mark>

(Optional) Specifies the data type when the <mark style="color:yellow;">`path`</mark> is a file, such as "json", "arrow", "parquet", "text", or "csv".

HuggingFace datasets can come in a number of formats. You can configure the data loading process to take the data type into account - csv, json, parquet. <mark style="color:yellow;">We will be using JSON format.</mark>

```yaml
ds_type: json
```

### <mark style="color:purple;">`data_files:`</mark>

(Optional) Specifies the path to the source data files.
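For instance, if the dataset spans several files alongside a local `path`, you might list them explicitly like this (the file names are illustrative):

```yaml
datasets:
  - path: datasets/alpaca-cleaned
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json
      - alpaca_data_extra.json
```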

### <mark style="color:purple;">`dataset_prepared_path`</mark><mark style="color:purple;">:</mark>

Specifies the relative path where Axolotl attempts to save the prepared dataset as an arrow file after packing the data together. This allows subsequent training attempts to load faster.

```yaml
dataset_prepared_path: last_run_prepared
```

### <mark style="color:purple;">`output_dir:`</mark>&#x20;

This is the path where the full fine tuned model will be saved to:

```yaml
output_dir: ./completed-model
```

### <mark style="color:blue;">Advanced: Explanation of all other data related configuration options</mark>

### <mark style="color:purple;">`shards`</mark><mark style="color:purple;">:</mark>&#x20;

Axolotl allows you to split your data into shards, which can be handy for large datasets. You can specify the number of shards using the <mark style="color:yellow;">**`shards`**</mark> parameter.
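For example, splitting a dataset into 10 shards might look like this (the path and shard count are illustrative):

```yaml
datasets:
  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    shards: 10
```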

### <mark style="color:purple;">`name`</mark><mark style="color:purple;">:</mark>&#x20;

If you're using a dataset from a repository, you can give it a name using the <mark style="color:yellow;">**`name`**</mark> parameter. This can help you keep track of which dataset configuration you're using.
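A sketch of how this might look, where the repository and configuration name are hypothetical:

```yaml
datasets:
  - path: my-org/my-dataset   # hypothetical HuggingFace repository
    name: subset-a            # hypothetical dataset configuration name
    type: alpaca
```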

### <mark style="color:purple;">`train_on_split`</mark><mark style="color:purple;">:</mark>&#x20;

By default, Axolotl uses the "train" split for training, but you can specify a different split using the `train_on_split` parameter.
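For example, to train on a dataset's "validation" split instead of the default "train" split (the repository name is hypothetical):

```yaml
datasets:
  - path: my-org/my-dataset   # hypothetical HuggingFace repository
    type: alpaca
    train_on_split: validation
```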

### <mark style="color:purple;">`conversation`</mark><mark style="color:purple;">:</mark>&#x20;

If you're using the "sharegpt" prompt type, you can specify the conversation type using the <mark style="color:yellow;">**`conversation`**</mark> parameter. Axolotl uses the FastChat library for conversations, and you can find the available conversation types in the FastChat documentation.

{% embed url="https://github.com/lm-sys/FastChat/tree/main" %}
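As a sketch, a ShareGPT-style dataset using the `chatml` conversation template might be configured like this (the repository name is hypothetical):

```yaml
datasets:
  - path: my-org/sharegpt-style-data   # hypothetical repository
    type: sharegpt
    conversation: chatml               # a conversation template defined in FastChat
```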

### <mark style="color:purple;">`field_human`</mark> <mark style="color:purple;"></mark><mark style="color:purple;">and</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_model`</mark><mark style="color:purple;">: (optional)</mark>

You can also specify the keys to use for the human and assistant roles in the conversation using the <mark style="color:yellow;">`field_human`</mark> and <mark style="color:yellow;">`field_model`</mark> parameters.
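For example, if your conversations store turns under `user` and `assistant` keys rather than the defaults (the repository and key names are hypothetical):

```yaml
datasets:
  - path: my-org/chat-data   # hypothetical repository
    type: sharegpt
    field_human: user        # key holding the human turns
    field_model: assistant   # key holding the model turns
```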

### <mark style="color:purple;">`roles`</mark><mark style="color:purple;">:</mark>&#x20;

If your dataset has additional keys that you want to use as input or output roles, you can specify them under the `roles` section. The `input` role is used for masking, and the `output` role is used for generation.

### <mark style="color:purple;">`input`</mark><mark style="color:purple;">:</mark>

&#x20;(Optional) A list of keys to be masked based on <mark style="color:yellow;">**`train_on_input`**</mark><mark style="color:yellow;">**.**</mark>

### <mark style="color:purple;">`output`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) A list of keys to be used as output.
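Putting `roles`, `input`, and `output` together, a configuration might look like the following sketch (the repository and role names are illustrative):

```yaml
datasets:
  - path: my-org/multi-role-data   # hypothetical repository
    type: sharegpt
    roles:
      input:          # these keys are masked based on train_on_input
        - system
        - human
      output:         # these keys are used for generation
        - gpt
```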

### <mark style="color:blue;">Custom user instruction prompt</mark>

Allows defining a custom prompt for each dataset.

### <mark style="color:purple;">`system_prompt`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) Specifies the system prompt.

### <mark style="color:purple;">`system_format`</mark><mark style="color:blue;">:</mark>

&#x20;(Optional) Specifies the format of the system prompt, with "{system}" as a placeholder.

### <mark style="color:purple;">`field_system`</mark><mark style="color:purple;">,</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_instruction`</mark><mark style="color:purple;">,</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_input`</mark><mark style="color:purple;">,</mark> <mark style="color:purple;"></mark><mark style="color:purple;">`field_output`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) Specify the column names for the respective fields in the dataset.

### <mark style="color:purple;">`format`</mark><mark style="color:purple;">:</mark>&#x20;

Specifies the format of the prompt, which can include placeholders for instruction and input.

### <mark style="color:purple;">`no_input_format`</mark><mark style="color:blue;">:</mark>

&#x20;Specifies the format of the prompt when there is no input, excluding the "{input}" placeholder.
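Putting the custom prompt options together, a configuration might look like the following sketch (the repository name, field names, and prompt templates are all illustrative):

```yaml
datasets:
  - path: my-org/instruct-data   # hypothetical repository
    type:
      system_prompt: "You are a helpful assistant."
      system_format: "{system}\n"
      field_system: system
      field_instruction: instruction
      field_input: input
      field_output: output
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
```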

### <mark style="color:purple;">`field`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) For "completion" datasets only, specifies the field to use instead of the "text" column.

### <mark style="color:purple;">`shuffle_merged_datasets`</mark><mark style="color:purple;">:</mark>&#x20;

(Optional) By default, Axolotl shuffles the merged datasets, but you can disable this behavior by setting <mark style="color:yellow;">**`shuffle_merged_datasets`**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**to**</mark><mark style="color:yellow;">**&#x20;**</mark><mark style="color:yellow;">**`false`**</mark>.
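As a sketch, the two options above might be used like this (the repository and column names are hypothetical):

```yaml
datasets:
  - path: my-org/raw-text-corpus   # hypothetical repository
    type: completion
    field: content                 # read from the "content" column instead of "text"

shuffle_merged_datasets: false     # keep merged datasets in their original order
```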

### <mark style="color:blue;">Datasets for Evaluation</mark>

### <mark style="color:purple;">`test_datasets`</mark>

A list of one or more datasets used for evaluating the model.

* <mark style="color:blue;">**`path`**</mark><mark style="color:blue;">**:**</mark> Specifies the path to the test dataset file.
* <mark style="color:blue;">**`ds_type`**</mark><mark style="color:blue;">**:**</mark> Specifies the data type of the test dataset, such as "json".
* <mark style="color:blue;">**`split`**</mark><mark style="color:blue;">**:**</mark> Specifies the split to use for the test dataset. For "json" datasets, the default split is called "train".
* <mark style="color:blue;">**`type`**</mark><mark style="color:blue;">**:**</mark> Specifies the type of the test dataset, such as "completion".
* <mark style="color:blue;">**`data_files`**</mark><mark style="color:blue;">**:**</mark> Specifies the list of data files for the test dataset.

Note: You can use either <mark style="color:blue;">**`test_datasets`**</mark> or <mark style="color:blue;">**`val_set_size`**</mark>, but not both.
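A minimal sketch of an evaluation dataset configuration (the file path is illustrative):

```yaml
test_datasets:
  - path: data/eval.jsonl   # hypothetical local file
    ds_type: json
    split: train            # "json" datasets default to a split named "train"
    type: completion
    data_files:
      - data/eval.jsonl
```

Remember to leave `val_set_size` unset when using `test_datasets`.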


### <mark style="color:purple;">`push_dataset_to_hub`</mark>

If you want to push the prepared dataset to the Hugging Face Hub, you can specify the repository path using the <mark style="color:yellow;">**`push_dataset_to_hub`**</mark> parameter.

### <mark style="color:purple;">`hub_model_id`</mark>

You can also push the fine-tuned model checkpoints to the Hugging Face Hub by specifying the private repository path using the <mark style="color:yellow;">**`hub_model_id`**</mark> parameter.

### <mark style="color:purple;">`hub_strategy`</mark>

The <mark style="color:yellow;">**`hub_strategy`**</mark> parameter allows you to control how checkpoints are pushed to the Hub. Refer to the Hugging Face Trainer documentation for more information.
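Putting the three Hub-related options together, a configuration might look like this sketch (the namespace and repository names are hypothetical; `every_save` is one of the strategies accepted by the Hugging Face Trainer):

```yaml
push_dataset_to_hub: my-org           # hypothetical Hub namespace for the prepared dataset
hub_model_id: my-org/phi-2-finetune   # hypothetical private model repository
hub_strategy: every_save              # push a checkpoint on every save
```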
