Llama2 - Data Loading and Paths

With the model defined and fully configured, we now move on to the datasets section.

First off, this part of the configuration file is all about specifying the datasets you want to use for fine-tuning your model and where you would like the fine-tuned model stored once training is complete.

You can provide one or more datasets under the datasets section.

Each dataset can come from different sources, like a HuggingFace dataset repo, an S3 or GS path, or even a local JSON file.

For each dataset, you can choose the type of prompt you want to use during training.

Axolotl supports a bunch of prompt types, like "alpaca", "sharegpt", "gpteacher", "oasst", and "reflection". You can specify the prompt type using the type parameter.

If you're using a file-based dataset, you can specify the data type using the ds_type parameter. It supports formats like JSON, Arrow, Parquet, text, and CSV. You can also provide the path to the source data files using the data_files parameter.

Basic datasets configuration

datasets:
  - path: datasets/alpaca-cleaned/alpaca_data_cleaned.json
    type: alpaca
    ds_type: json
    data_files:
      - alpaca_data_cleaned.json
dataset_prepared_path:
val_set_size: 0.20
output_dir: ./llama-out

This component of the configuration file tells the platform where the dataset is located and how it should be split into training and validation sets.

We will only be using a handful of these parameters for our fine-tuning. The full list of configurations can be found here: data loading and processing configurations

All configurations

path:

This parameter indicates the path to a dataset that has been prepared for training.

As highlighted, this can be a HuggingFace dataset repository, an AWS S3 path (s3://), a Google Cloud Storage path (gs://), or "json" for a local dataset (with the file itself supplied via data_files).

Here's an example of a Google Cloud Storage path:

gs://my-bucket/datasets/my_dataset.json

In this example, "my-bucket" is the name of the Google Cloud Storage bucket, and "datasets/my_dataset.json" is the path to the specific file within that bucket.

To determine the source of your data files, right-click the folder containing the data files in VS Code and select 'Copy URL'.

path: datasets/alpaca-cleaned/alpaca_data_cleaned.json

type:

Indicates the type of prompt to use for training, such as "alpaca", "sharegpt", "gpteacher", "oasst", or "reflection".

We will be using Alpaca in this example. For a full explanation of each data 'type', please refer to this documentation: data type

type: alpaca
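
For reference, each record in an alpaca-format dataset is a JSON object with "instruction", "input", and "output" fields ("input" may be empty). The values below are purely illustrative:

{
  "instruction": "Summarize the following text.",
  "input": "Axolotl is a tool that streamlines fine-tuning of language models.",
  "output": "Axolotl simplifies fine-tuning language models."
}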

ds_type:

(Optional) Specifies the data type when the path is a file, such as "json", "arrow", "parquet", "text", or "csv".

Hugging Face datasets come in a number of formats, and you can configure the data-loading process to account for the data type - csv, json, parquet, and so on. We will be using the JSON format.

ds_type: json

data_files:

(Optional) Specifies the path to the source data files.
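
For example, pointing the loader at the cleaned Alpaca file from the basic configuration above:

data_files:
  - alpaca_data_cleaned.json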

dataset_prepared_path:

Specifies the relative path where Axolotl attempts to save the prepared dataset as an arrow file after packing the data together. This allows subsequent training attempts to load faster.

dataset_prepared_path: last_run_prepared

output_dir:

This is the path where the final fine-tuned model will be saved:

output_dir: ./completed-model

shards:

Axolotl allows you to split your data into shards, which can be handy for large datasets. You can specify the number of shards using the shards parameter.
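
For example, a hypothetical setting that splits the data into ten shards:

shards: 10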

name:

If you're using a dataset from a repository, you can give it a name using the name parameter. This can help you keep track of which dataset configuration you're using.
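
As a sketch, with a hypothetical HuggingFace repository that ships more than one configuration:

datasets:
  - path: some-org/some-dataset # hypothetical repository
    name: default # the dataset configuration to load
    type: alpaca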

train_on_split:

By default, Axolotl uses the "train" split for training, but you can specify a different split using the train_on_split parameter.
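
For example, to train on a split other than the default (the split name here is hypothetical):

train_on_split: validation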

conversation:

If you're using the "sharegpt" prompt type, you can specify the conversation type using the conversation parameter. Axolotl uses the FastChat library for conversations, and you can find the available conversation types in the FastChat documentation.

field_human and field_model:

(Optional) Specify the keys to use for the human and assistant roles in the conversation.
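
As a sketch, a sharegpt dataset entry using the "chatml" conversation template from FastChat, with custom role keys (the file path and key names are assumptions):

datasets:
  - path: datasets/chat_data.json # hypothetical file
    type: sharegpt
    conversation: chatml # a FastChat conversation template
    field_human: user # key holding the human turns
    field_model: assistant # key holding the assistant turns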

roles:

If your dataset has additional keys that you want to use as input or output roles, you can specify them under the roles section. The input role is used for masking, and the output role is used for generation.

input:

(Optional) A list of keys to be masked based on train_on_input.

output:

(Optional) A list of keys to be used as output.
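
Putting these together, a sketch of a roles block (the role keys are assumptions about your dataset):

datasets:
  - path: datasets/multi_role.json # hypothetical file
    type: sharegpt
    roles:
      input: # keys masked according to train_on_input
        - system
        - human
      output: # keys the model learns to generate
        - gpt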

Custom user instruction prompt

Allows defining a custom prompt for each dataset.

system_prompt:

(Optional) Specifies the system prompt.

system_format:

(Optional) Specifies the format of the system prompt, with "{system}" as a placeholder.

field_system, field_instruction, field_input, field_output:

(Optional) Specify the column names for the respective fields in the dataset.

format:

Specifies the format of the prompt, which can include placeholders for instruction and input.

no_input_format:

Specifies the format of the prompt when there is no input, excluding the "{input}" placeholder.
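
As a sketch, these fields combine into a custom prompt definition under a dataset's type. The file name, column names, and templates below are assumptions, not a prescribed format:

datasets:
  - path: datasets/custom_instruct.jsonl # hypothetical file
    ds_type: json
    type:
      system_prompt: "You are a helpful assistant."
      system_format: "<|system|>{system}"
      field_system: system
      field_instruction: instruction
      field_input: input
      field_output: output
      format: "<|user|>{instruction}\n{input}<|assistant|>"
      no_input_format: "<|user|>{instruction}<|assistant|>"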

field:

(Optional) For "completion" datasets only, specifies the field to use instead of the "text" column.

shuffle_merged_datasets:

(Optional) By default, Axolotl shuffles the merged datasets; you can disable this behavior by setting shuffle_merged_datasets to false, as shown in the sketch below.
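
For example, a completion dataset that reads from a hypothetical "content" column instead of "text", with shuffling disabled:

datasets:
  - path: datasets/corpus.jsonl # hypothetical file
    type: completion
    field: content
shuffle_merged_datasets: false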

Datasets for Evaluation

test_datasets

A list of one or more datasets used for evaluating the model.

  • path: Specifies the path to the test dataset file.

  • ds_type: Specifies the data type of the test dataset, such as "json".

  • split: Specifies the split to use for the test dataset. For "json" datasets, the default split is called "train".

  • type: Specifies the type of the test dataset, such as "completion".

  • data_files: Specifies the list of data files for the test dataset.

Note: You can use either test_datasets or val_set_size, but not both.
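
Putting these parameters together, a sketch of an evaluation dataset configuration (the file path is an assumption); val_set_size is left out because the two options are mutually exclusive:

test_datasets:
  - path: datasets/eval_data.jsonl # hypothetical file
    ds_type: json
    split: train # the default split name for json datasets
    type: completion
    data_files:
      - datasets/eval_data.jsonl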

push_dataset_to_hub

If you want to push the prepared dataset to the Hugging Face Hub, you can specify the repository path using the push_dataset_to_hub parameter.

hub_model_id

You can also push the fine-tuned model checkpoints to the Hugging Face Hub by specifying the private repository path using the hub_model_id parameter.

hub_strategy

The hub_strategy parameter allows you to control how the checkpoints are pushed to the hub. Refer to the Hugging Face Trainer documentation for more information.
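
As a sketch, the three hub-related options together; the repository names are placeholders, and "every_save" is one of the strategies accepted by the Hugging Face Trainer (alongside "end", "checkpoint", and "all_checkpoints"):

push_dataset_to_hub: your-username/prepared-dataset # placeholder repository
hub_model_id: your-username/llama2-fine-tuned # placeholder repository
hub_strategy: every_save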
