Data Loading and Processing
datasets
Datasets provide the training data for the model.
path
The path field specifies the location of the dataset. It can be a HuggingFace dataset repository path, a cloud storage path (e.g., s3:// or gs://), or a path to a local file (e.g., a JSON file).
type
The type field defines the type of prompt strategy used for training the model. For example, it can be set to "alpaca," "sharegpt," "gpteacher," "oasst," or "reflection." Each type may have specific characteristics tailored to different training approaches.
ds_type
The optional ds_type field specifies the datatype when the path points to a local file. It can be set to "json," "arrow," "parquet," "text," or "csv," depending on the format of the dataset file.
data_files
If necessary, the data_files field can be used to specify the path to the source data files associated with the dataset. This field helps the training process locate the relevant data.
shards
The optional shards field allows you to specify the number of data shards into which the dataset should be divided. Sharding can help distribute the data efficiently for parallel processing during training.
name
You can provide an optional name for the dataset configuration. This name can be useful for reference when working with multiple datasets during fine-tuning.
train_on_split
The optional train_on_split field lets you specify the name of the dataset split to load from. For instance, you might use "train" to load the training split of the dataset.
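Taken together, the fields above might appear in a config along these lines (a sketch only; the repository path, file path, and values are illustrative):

```yaml
datasets:
  # HuggingFace dataset repository, using the alpaca prompt strategy
  - path: your-username/example-dataset   # illustrative repo path
    type: alpaca

  # local JSON file with an explicit datatype, loaded from the train split
  - path: data/my_dataset.jsonl           # illustrative local path
    ds_type: json
    type: alpaca
    name: my-dataset
    shards: 4
    train_on_split: train
```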
conversation
For specific types of prompts like "sharegpt," this optional field defines the FastChat conversation type. It's typically used in conjunction with the "sharegpt" type and allows customization of conversation style.
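For instance, a sharegpt dataset entry might pin a FastChat conversation template like so (the path and template name are illustrative):

```yaml
datasets:
  - path: data/chats.json    # illustrative local path
    type: sharegpt
    conversation: chatml     # FastChat conversation template name
```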
system_prompt
The system_prompt field is part of the custom user prompt configuration. It defines the system prompt text provided to the model as part of each training example.
system_format
system_format specifies how the system prompt is rendered. It uses the placeholder "{system}", which is replaced with the system prompt text.
field_system
field_system specifies the name of the dataset field that contains the system prompt. It helps the training process locate the system prompt in each record.
field_instruction
field_instruction specifies the field name for the instruction or query provided to the model. It's used to extract user instructions or queries from the dataset.
field_input
field_input defines the field name where user inputs are stored in the dataset. It's essential for the model to understand and respond to user inputs effectively.
field_output
field_output represents the field name where the assistant's outputs or responses are stored in the dataset. It helps in training the model to generate appropriate responses.
format
The format field allows customization of the conversation format. It can be configured to be single-line or multi-line and includes placeholders for instruction and input, making it flexible for various dialogue styles.
no_input_format
no_input_format defines the format of the conversation used when an example has no input field. It keeps the generated prompts consistent and readable.
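The custom prompt fields above might be combined in a dataset entry like the following sketch (the path, field names, and prompt strings are all illustrative):

```yaml
datasets:
  - path: data/custom.jsonl              # illustrative local path
    type:
      system_prompt: "You are a helpful assistant."
      system_format: "{system}\n"        # "{system}" is replaced by the prompt text
      field_system: system               # dataset column holding the system prompt
      field_instruction: instruction     # column holding the user instruction
      field_input: input                 # column holding additional user input
      field_output: output               # column holding the assistant response
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
```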
field
For "completion" datasets, this field can be used to specify a custom field in the dataset to be used instead of the default "text" column. This customization can be beneficial for specific use cases.
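For example, a completion dataset that stores its raw text in a column other than "text" might be configured like this (path and column name are illustrative):

```yaml
datasets:
  - path: data/corpus.jsonl   # illustrative local path
    type: completion
    field: content            # read from "content" instead of the default "text"
```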
dataset_prepared_path
The dataset_prepared_path specifies the relative path where the prepared dataset is saved as an Arrow file. This prepared dataset is packed together for more efficient loading during subsequent training attempts, enhancing training performance.
push_dataset_to_hub
The push_dataset_to_hub field specifies the repository path to which the prepared dataset should be pushed. This feature is useful for sharing datasets with others, making it accessible through the HuggingFace dataset hub.
dataset_processes
The dataset_processes field allows you to define the maximum number of processes to use during preprocessing of the input dataset. If not set, it defaults to the number of CPU cores available, which can optimize data preparation for training.
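These three options might appear together as follows (the prepared path and repository name are illustrative):

```yaml
dataset_prepared_path: last_run_prepared              # illustrative relative path
push_dataset_to_hub: your-username/prepared-dataset   # illustrative hub repo
dataset_processes: 8                                  # cap preprocessing workers
```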
hub_model_id
The hub_model_id field specifies the repository path to which the finetuned model checkpoints should be pushed. It facilitates the sharing of finetuned models through the HuggingFace model hub, making them accessible to others.
hub_strategy
The hub_strategy field defines the strategy for pushing checkpoints to the hub, letting you customize when and how checkpoints are uploaded.
hf_use_auth_token
The hf_use_auth_token field is a boolean value that determines whether to use HuggingFace's use_auth_token when loading datasets. This is particularly useful for fetching private datasets and must be set to "true" when used with push_dataset_to_hub.
val_set_size
val_set_size specifies the fraction of the dataset that should be set aside for evaluation purposes. For example, a value of 0.04 means that 4% of the dataset will be reserved for evaluation, helping assess model performance.
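The hub and evaluation settings above might look like this (the repository name and the strategy value are illustrative; hub_strategy follows the HuggingFace Trainer convention, e.g. "end", "every_save", "checkpoint", or "all_checkpoints"):

```yaml
hub_model_id: your-username/finetuned-model   # illustrative hub repo
hub_strategy: checkpoint                      # illustrative strategy value
hf_use_auth_token: true                       # required when pushing private data
val_set_size: 0.04                            # hold out 4% for evaluation
```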
dataset_shard_num
The dataset_shard_num and dataset_shard_idx fields configure dataset sharding: dataset_shard_num defines the number of shards to split the dataset into, and dataset_shard_idx specifies the index of the shard to use. These options can be helpful for efficient data processing and training with large datasets.
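A sketch of how these two fields might be set, assuming you want to train on one tenth of the data:

```yaml
dataset_shard_num: 10   # split the dataset into 10 shards
dataset_shard_idx: 0    # use only the first shard for this run
```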