Data Loading and Processing
datasets
Datasets provide the training data for the model.
path
The path field specifies the location of the dataset. It can be a HuggingFace dataset repository path, a cloud storage path (e.g., s3:// or gs://), or a path to a local file (e.g., a JSON file).
type
The type field defines the type of prompt strategy used for training the model. For example, it can be set to "alpaca," "sharegpt," "gpteacher," "oasst," or "reflection." Each type may have specific characteristics tailored to different training approaches.
ds_type
The optional ds_type field specifies the datatype when the path points to a local file. It can be set to "json," "arrow," "parquet," "text," or "csv," depending on the format of the dataset file.
data_files
If necessary, the data_files field can be used to specify the path to the source data files associated with the dataset. This field helps the training process locate the relevant data.
shards
The optional shards field allows you to specify the number of data shards into which the dataset should be divided. Sharding can help distribute the data efficiently for parallel processing during training.
name
You can provide an optional name for the dataset configuration. This name can be useful for reference when working with multiple datasets during fine-tuning.
train_on_split
The optional train_on_split field lets you specify the name of the dataset split to load from. For instance, you might use "train" to load the training split of the dataset.
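Taken together, the fields above might appear in a config along these lines (a sketch only; the repository path, file path, and values are illustrative):

```yaml
datasets:
  # HuggingFace dataset repository, using the alpaca prompt strategy
  - path: your-username/example-dataset   # illustrative repo path
    type: alpaca

  # local JSON file with an explicit datatype, loaded from the train split
  - path: data/my_dataset.jsonl           # illustrative local path
    ds_type: json
    type: alpaca
    name: my-dataset
    shards: 4
    train_on_split: train
```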
conversation
For specific types of prompts like "sharegpt," this optional field defines the FastChat conversation type. It's typically used in conjunction with the "sharegpt" type and allows customization of conversation style.
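For instance, a sharegpt dataset entry might pin a FastChat conversation template like so (the path and template name are illustrative):

```yaml
datasets:
  - path: data/chats.json    # illustrative local path
    type: sharegpt
    conversation: chatml     # FastChat conversation template name
```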
system_prompt
The system_prompt field is part of the custom user prompt configuration. It defines the system prompt text provided to the model as part of each training example.
system_format
system_format specifies how the system prompt is rendered. It uses the placeholder "{system}", which is replaced with the system prompt text.
field_system
field_system specifies the name of the dataset field that contains the system prompt. It helps the training process locate the system prompt in each record.
field_instruction
field_instruction specifies the field name for the instruction or query provided to the model. It's used to extract user instructions or queries from the dataset.
field_input
field_input defines the field name where user inputs are stored in the dataset. It's essential for the model to understand and respond to user inputs effectively.
field_output
field_output represents the field name where the assistant's outputs or responses are stored in the dataset. It helps in training the model to generate appropriate responses.
format
The format field allows customization of the conversation format. It can be configured to be single-line or multi-line and includes placeholders for instruction and input, making it flexible for various dialogue styles.
no_input_format
no_input_format defines the format of the conversation used when an example has no input field. It keeps the generated prompts consistent and readable.
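The custom prompt fields above might be combined in a dataset entry like the following sketch (the path, field names, and prompt strings are all illustrative):

```yaml
datasets:
  - path: data/custom.jsonl              # illustrative local path
    type:
      system_prompt: "You are a helpful assistant."
      system_format: "{system}\n"        # "{system}" is replaced by the prompt text
      field_system: system               # dataset column holding the system prompt
      field_instruction: instruction     # column holding the user instruction
      field_input: input                 # column holding additional user input
      field_output: output               # column holding the assistant response
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
```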
field
For "completion" datasets, this field can be used to specify a custom field in the dataset to be used instead of the default "text" column. This customization can be beneficial for specific use cases.
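For example, a completion dataset that stores its raw text in a column other than "text" might be configured like this (path and column name are illustrative):

```yaml
datasets:
  - path: data/corpus.jsonl   # illustrative local path
    type: completion
    field: content            # read from "content" instead of the default "text"
```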
dataset_prepared_path
The dataset_prepared_path specifies the relative path where the prepared dataset is saved as an Arrow file. This prepared dataset is packed together for more efficient loading during subsequent training attempts, enhancing training performance.
push_dataset_to_hub
The push_dataset_to_hub field specifies the repository path to which the prepared dataset should be pushed. This feature is useful for sharing datasets with others, making it accessible through the HuggingFace dataset hub.
dataset_processes
The dataset_processes field allows you to define the maximum number of processes to use during preprocessing of the input dataset. If not set, it defaults to the number of CPU cores available, which can optimize data preparation for training.
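These three options might appear together as follows (the prepared path and repository name are illustrative):

```yaml
dataset_prepared_path: last_run_prepared              # illustrative relative path
push_dataset_to_hub: your-username/prepared-dataset   # illustrative hub repo
dataset_processes: 8                                  # cap preprocessing workers
```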
hub_model_id
The hub_model_id field specifies the repository path to which the finetuned model checkpoints should be pushed. It facilitates the sharing of finetuned models through the HuggingFace model hub, making them accessible to others.
hub_strategy
The hub_strategy field defines the strategy for pushing checkpoints to the hub, letting you customize when and how checkpoints are uploaded.
hf_use_auth_token
The hf_use_auth_token field is a boolean value that determines whether to use HuggingFace's use_auth_token when loading datasets. This is particularly useful for fetching private datasets and must be set to "true" when used with push_dataset_to_hub.
val_set_size
val_set_size specifies the fraction of the dataset that should be set aside for evaluation purposes. For example, a value of 0.04 means that 4% of the dataset will be reserved for evaluation, helping assess model performance.
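The hub and evaluation settings above might look like this (the repository name and the strategy value are illustrative; hub_strategy follows the HuggingFace Trainer convention, e.g. "end", "every_save", "checkpoint", or "all_checkpoints"):

```yaml
hub_model_id: your-username/finetuned-model   # illustrative hub repo
hub_strategy: checkpoint                      # illustrative strategy value
hf_use_auth_token: true                       # required when pushing private data
val_set_size: 0.04                            # hold out 4% for evaluation
```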
dataset_shard_num
The dataset_shard_num and dataset_shard_idx fields configure dataset sharding: dataset_shard_num defines the number of shards to split the dataset into, and dataset_shard_idx specifies the index of the shard to use. These options can be helpful for efficient data processing and training with large datasets.
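A sketch of how these two fields might be set, assuming you want to train on one tenth of the data:

```yaml
dataset_shard_num: 10   # split the dataset into 10 shards
dataset_shard_idx: 0    # use only the first shard for this run
```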