Llama3 - Data Loading and Paths
datasets
This component of the configuration file tells the platform where the dataset is located and how the dataset should be split into training and validation sets.
We will only be using a few of these parameters for our fine-tuning. The full list of configuration options can be found here: data loading and processing configurations
We will be using the AlpaGasus dataset.
Tips on Parquet files
Here are some interesting insights about using Parquet files when training large language models:
Axolotl supports loading datasets in various formats, including Parquet files.
This allows for flexibility in how the training data is stored and accessed.
Parquet files are a popular choice for storing training datasets due to their efficiency and ability to handle large amounts of data. Parquet is a columnar storage format that provides fast querying and retrieval of specific columns.
When specifying the dataset in the Axolotl configuration file (YAML), you can provide the path to the Parquet files or to directories containing the Parquet files (see the example sketch at the end of these tips).
Parquet files can be stored locally or in remote storage systems like Amazon S3 or Google Cloud Storage (GCS).
Axolotl supports loading datasets from these remote storage systems by specifying the appropriate path format, such as s3://path/to/data.parquet or gs://path/to/parquet/dir/.
If the dataset is separated into multiple Parquet files on Hugging Face or other platforms, you can specify the number of shards to split the data into using the shards parameter in the dataset configuration.
Preprocessing the data and converting it into Parquet or Arrow format can be beneficial for efficient storage and faster loading during training. It allows you to filter out unnecessary data and optimise the dataset for your specific training needs.
The size of the Parquet files can vary depending on the dataset and the specific requirements of the training task.
When loading Parquet files, you can specify the ds_type as "parquet" in the dataset configuration to explicitly indicate the data type.
Overall, using Parquet files provides a scalable and efficient way to store and load training datasets for large language models. Axolotl's support for Parquet files, along with other formats like JSON, Arrow, and CSV, offers flexibility in dataset management and allows for seamless integration with various storage systems.
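To make the tips above concrete, here is a minimal sketch of a datasets entry that loads Parquet data. The file paths, the alpaca prompt type, and the shard count are placeholder assumptions for illustration, not values from our actual run:

```yaml
datasets:
  # Hypothetical local Parquet file
  - path: data/train.parquet
    ds_type: parquet        # explicitly mark the file format
    type: alpaca            # prompt format used to build training examples (assumed)
    shards: 4               # optionally split the data into this many shards

  # Remote storage works the same way, e.g.:
  # - path: s3://my-bucket/data.parquet
  # - path: gs://my-bucket/parquet/dir/
```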
fine-tuning dataset location
dataset_prepared_path:
This parameter sets the path where the prepared (preprocessed and tokenized) dataset is saved and loaded from, so preprocessing does not have to be repeated on subsequent runs.
ds_type:
Hugging Face datasets can come in a number of formats. You can configure the data loading process to take the file type into account: csv, json, or parquet. We will be using the Parquet format.
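As a rough sketch of how these two options fit together in the YAML file (the file and directory names here are examples only, not the paths used in our run):

```yaml
datasets:
  - path: data/train.parquet        # hypothetical Parquet file
    ds_type: parquet                # file format of the dataset
dataset_prepared_path: ./prepared   # example cache directory for the prepared dataset
```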
validation set
val_set_size: 0.10
Description: Specifies the size of the validation set as a fraction of the total dataset.
Meaning: This parameter determines the proportion of the dataset that will be held out for validation during training. Here it is set to 0.10, i.e. 10% of the total dataset.
output files
output_dir: ./output
Description: Specifies the directory where output files should be saved.
Meaning: This parameter designates the directory where various output files, such as trained model checkpoints or logs, should be stored.
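Putting the section together, a data loading and paths block along these lines might look like the sketch below. The dataset path and prompt type are illustrative assumptions; the other values match the ones discussed above:

```yaml
datasets:
  - path: data/alpagasus.parquet   # hypothetical local copy of the dataset
    type: alpaca                   # assumed prompt format
    ds_type: parquet               # the file format we are using
dataset_prepared_path: ./prepared  # example cache location for the prepared dataset
val_set_size: 0.10                 # hold out 10% of the data for validation
output_dir: ./output               # checkpoints and logs are written here
```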