Llama2 - Data Loading and Paths
With the model defined and fully configured, we will now configure the datasets section.
First off, this part of the configuration file is all about specifying the datasets you want to use for fine-tuning your model and where you would like the fine-tuned model to be stored once training is complete.
You can provide one or more datasets under the datasets section.
Each dataset can come from different sources, like a HuggingFace dataset repo, an S3 or GS path, or even a local JSON file.
For each dataset, you can choose the type of prompt you want to use during training. Axolotl supports a bunch of prompt types, like "alpaca", "sharegpt", "gpteacher", "oasst", and "reflection". You can specify the prompt type using the type parameter.
If you're using a file-based dataset, you can specify the data type using the ds_type parameter. It supports formats like JSON, Arrow, Parquet, text, and CSV. You can also provide the path to the source data files using the data_files parameter.
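As a quick illustrative sketch only (the repository name below is a placeholder, not a real dataset), a minimal datasets entry might look like this:

```yaml
datasets:
  - path: your-username/your-dataset   # placeholder HuggingFace dataset repo
    type: alpaca                       # prompt format used during training
```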
Basic datasets configuration
This component of the configuration file tells the platform where the dataset is located, and how the dataset should be split into training versus validation.
We will only be using a handful of these parameters for our fine-tuning. The full list of configuration options can be found here: data loading and processing configurations
All configurations
path:
This parameter indicates the path to a dataset that has been prepared for training.
As highlighted, this can be a HuggingFace dataset repository, an AWS S3 storage path (s3://), a Google Cloud Storage path (gs://), or "json" for a local dataset.
Here's an example of a Google Cloud Storage path:
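```yaml
datasets:
  - path: gs://my-bucket/datasets/my_dataset.json
```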
In this example, "my-bucket" is the name of the Google Cloud Storage bucket, and "datasets/my_dataset.json" is the path to the specific file within that bucket.
To determine the source of your data files, right-click on the folder containing the data files in VS Code and then select 'copy URL'.
type:
Indicates the type of prompt to use for training, such as "alpaca", "sharegpt", "gpteacher", "oasst", or "reflection".
We will be using Alpaca in this example. For a full explanation of each data 'type' please refer to this documentation: data type
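For reference, an alpaca-style record typically contains instruction, input, and output fields; the record below is a minimal illustrative example, not part of any real dataset:

```json
{
  "instruction": "Summarize the following text.",
  "input": "Axolotl is a tool for fine-tuning language models.",
  "output": "Axolotl helps you fine-tune language models."
}
```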
ds_type:
(Optional) Specifies the data type when the path is a file, such as "json", "arrow", "parquet", "text", or "csv".
Hugging Face datasets can come in a number of formats, and you can configure the data loading process to account for the data type (CSV, JSON, Parquet). We will be using the JSON format.
data_files:
(Optional) Specifies the path to the source data files.
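For a local JSON dataset, these options might be combined roughly as in the sketch below; the file paths are placeholders, and the second entry shows the alternative of pointing path at "json" and listing the files under data_files:

```yaml
datasets:
  - path: data/train.json          # placeholder path to a local file
    ds_type: json                  # format of the file being loaded
    type: alpaca
  - path: json                     # load from local file(s) rather than a repo
    data_files:
      - data/extra_train.json      # placeholder source file
    type: alpaca
```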
dataset_prepared_path:
Specifies the relative path where Axolotl attempts to save the prepared dataset as an Arrow file after packing the data together. This allows subsequent training attempts to load faster.
output_dir:
This is the path where the full fine-tuned model will be saved:
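As a rough sketch (both values below are illustrative placeholders, not required names):

```yaml
dataset_prepared_path: last_run_prepared    # where the packed Arrow files are cached
output_dir: ./outputs/llama2-finetune       # where the fine-tuned model is written
```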
Advanced: Explanation of all other data-related configuration options
shards:
Axolotl allows you to split your data into shards, which can be handy for large datasets. You can specify the number of shards using the shards parameter.
name:
If you're using a dataset from a repository, you can give it a name using the name parameter. This can help you keep track of which dataset configuration you're using.
train_on_split:
By default, Axolotl uses the "train" split for training, but you can specify a different split using the train_on_split parameter.
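A sketch of how these options might sit inside a dataset entry (the repository name and values are placeholders):

```yaml
datasets:
  - path: your-username/your-dataset   # placeholder HuggingFace repo
    type: alpaca
    name: default                      # label for this dataset configuration
    shards: 10                         # split the data into 10 shards
    train_on_split: train              # which split to train on
```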
conversation:
If you're using the "sharegpt" prompt type, you can specify the conversation type using the conversation parameter. Axolotl uses the FastChat library for conversations, and you can find the available conversation types in the FastChat documentation.
field_human and field_model:
(Optional) You can also specify the keys to use for the human and assistant roles in the conversation using the field_human and field_model parameters.
roles:
If your dataset has additional keys that you want to use as input or output roles, you can specify them under the roles section. The input role is used for masking, and the output role is used for generation.
input:
(Optional) A list of keys to be masked based on train_on_input.
output:
(Optional) A list of keys to be used as output.
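For a sharegpt-style dataset, these conversation options might be combined like the sketch below; the dataset path, template name, and keys are illustrative assumptions, so check the FastChat documentation for valid conversation names and your own data for the actual keys:

```yaml
datasets:
  - path: your-username/your-chat-dataset   # placeholder repo
    type: sharegpt
    conversation: chatml                    # FastChat conversation template (assumed name)
    field_human: human                      # key holding the user turns
    field_model: gpt                        # key holding the assistant turns
    roles:
      input:                                # keys masked according to train_on_input
        - human
      output:                               # keys used for generation
        - gpt
```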
Custom user instruction prompt
Allows defining a custom prompt for each dataset.
system_prompt:
(Optional) Specifies the system prompt.
system_format:
(Optional) Specifies the format of the system prompt, with "{system}" as a placeholder.
field_system, field_instruction, field_input, field_output:
(Optional) Specify the column names for the respective fields in the dataset.
format:
Specifies the format of the prompt, which can include placeholders for instruction and input.
no_input_format:
Specifies the format of the prompt when there is no input, excluding the "{input}" placeholder.
field:
(Optional) For "completion" datasets only, specifies the field to use instead of the "text" column.
shuffle_merged_datasets:
(Optional) By default, Axolotl shuffles the merged datasets, but you can disable this behavior by setting shuffle_merged_datasets to false.
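Putting the custom prompt options together, a user-defined prompt might look roughly like the sketch below; in Axolotl example configs these fields are typically nested under the dataset's type key, and the repository name and format strings here are placeholders:

```yaml
datasets:
  - path: your-username/your-dataset            # placeholder repo
    type:
      system_prompt: ""
      field_system: system
      field_instruction: instruction
      field_input: input
      field_output: output
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
```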
Datasets for Evaluation
test_datasets
A list of one or more datasets used for evaluating the model.
path: Specifies the path to the test dataset file.
ds_type: Specifies the data type of the test dataset, such as "json".
split: Specifies the split to use for the test dataset. For "json" datasets, the default split is called "train".
type: Specifies the type of the test dataset, such as "completion".
data_files: Specifies the list of data files for the test dataset.
Note: You can use either test_datasets or val_set_size, but not both.
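A minimal sketch of a test_datasets entry using the fields above (the file path is a placeholder):

```yaml
test_datasets:
  - path: data/eval.jsonl     # placeholder local file
    ds_type: json
    split: train              # default split name for json datasets
    type: completion
    data_files:
      - data/eval.jsonl
```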
dataset_prepared_path:
Specifies the relative path where Axolotl attempts to save the prepared dataset as an Arrow file after packing the data together. This allows subsequent training attempts to load faster.
push_dataset_to_hub
If you want to push the prepared dataset to the Hugging Face Hub, you can specify the repository path using the push_dataset_to_hub parameter.
hub_model_id
You can also push the fine-tuned model checkpoints to the Hugging Face Hub by specifying the private repository path using the hub_model_id parameter.
hub_strategy
The hub_strategy parameter allows you to control how the checkpoints are pushed to the hub. Refer to the Hugging Face Trainer documentation for more information.
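A sketch combining the Hub-related options; the namespace and repository names are placeholders, and the valid hub_strategy values are listed in the Hugging Face Trainer documentation:

```yaml
push_dataset_to_hub: your-username                 # placeholder Hub namespace for the prepared dataset
hub_model_id: your-username/llama2-finetune        # placeholder private repo for model checkpoints
hub_strategy: checkpoint                           # e.g. "end", "every_save", "checkpoint"
```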