# Llama3 - Data Loading and Paths

## <mark style="color:green;">datasets</mark>

<pre class="language-yaml"><code class="lang-yaml"><strong>datasets:
</strong>  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files:
      - train-00000-of-00001-0c59455170918204.parquet
dataset_prepared_path:
val_set_size: 0.10
output_dir: ./llama3-out
</code></pre>

This component of the configuration file tells the platform where the dataset is located and how the dataset should be split into <mark style="color:yellow;">training and validation sets</mark>.

We will only be using a subset of these parameters for our fine-tuning.  The full set of configuration options can be found here:  [<mark style="color:orange;">**data loading and processing configurations**</mark>](https://axolotl.continuumlabs.pro/axolotl-configuration-files/data-loading-and-processing)

### <mark style="color:blue;">**We will be using the AlpaGasus dataset**</mark>

{% embed url="https://arxiv.org/abs/2307.08701" %}

<details>

<summary><mark style="color:green;"><strong>AlpaGasus README file</strong></mark></summary>

This README.md file describes a dataset hosted on Hugging Face, which is based on the AlpaGasus paper.&#x20;

This dataset is an unofficial implementation of the AlpaGasus approach, which aims to improve upon the original Alpaca dataset by filtering it for quality. The key points about this dataset are:

<mark style="color:green;">Features: The dataset consists of three features:</mark>

* `instruction`: The instruction or prompt given to the model (data type: string)
* `input`: The input text associated with the instruction (data type: string)
* `output`: The expected output or response from the model (data type: string)
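A record with these three features might look like the following. This is a hypothetical example constructed for illustration, not an actual row from the dataset:

```python
# A hypothetical record in the three-field alpaca format described above.
record = {
    "instruction": "Summarise the following text in one sentence.",
    "input": "Parquet is a columnar storage format used for large datasets.",
    "output": "Parquet is a column-oriented file format for big data.",
}

# All three features are plain strings, as the README states.
assert all(isinstance(value, str) for value in record.values())
print(sorted(record.keys()))  # → ['input', 'instruction', 'output']
```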

<mark style="color:green;">Splits: The dataset has a single split called "train," which contains:</mark>

* Number of bytes: 3,918,129
* Number of examples: 9,229
* Download size: 2,486,877 bytes
* Dataset size: 3,918,129 bytes

<mark style="color:green;">Configurations:</mark> There is only one configuration named "default," which uses the data files located at "data/train-\*" paths.

1. License: The dataset is licensed under the GPL-3.0 license.
2. Task Categories: The dataset is categorized under the "text-generation" task.
3. Tags: The dataset is tagged with "alpaca" and "llama."
4. Size Categories: The dataset falls under the "1K\<n<10K" size category, indicating that it has between 1,000 and 10,000 examples.

The AlpaGasus dataset is a filtered version of the original Alpaca dataset, where GPT-4 acts as a judge to select high-quality samples.&#x20;

The authors of the AlpaGasus paper demonstrated that models trained on this filtered dataset with only 9,000 samples can outperform models trained on the original 52,000 samples of the Alpaca dataset.

The key takeaways are:

1. AlpaGasus is an unofficial implementation that aims to improve the Alpaca dataset by filtering it using GPT-4.
2. The dataset contains 9,229 high-quality examples, significantly fewer than the original Alpaca dataset.
3. Despite having fewer examples, models trained on the AlpaGasus dataset can outperform those trained on the larger Alpaca dataset.
4. The dataset is designed for text generation tasks and is compatible with models like Alpaca and Llama.
5. The dataset is available on Hugging Face and is licensed under GPL-3.0.

</details>

### <mark style="color:blue;">Tips on Parquet files</mark>

Here are some interesting insights about using Parquet files when training large language models:

Axolotl supports loading datasets in various formats, including Parquet files.&#x20;

This allows for flexibility in how the training data is stored and accessed.

Parquet files are a popular choice for storing training datasets due to their efficiency and ability to handle large amounts of data. Parquet is a columnar storage format, which enables fast querying and retrieval of specific columns.
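The column-oriented idea can be sketched in plain Python: instead of storing complete rows one after another, a columnar layout keeps each field in its own contiguous array, so reading one column never touches the others. This is only a toy illustration of the layout, not the actual Parquet encoding:

```python
# Toy illustration of row-oriented vs column-oriented layouts.
rows = [
    {"instruction": "a", "input": "", "output": "x"},
    {"instruction": "b", "input": "", "output": "y"},
]

# Columnar layout: one list per field, analogous to how Parquet
# lays data out on disk.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Fetching a single column is a direct lookup; no row scanning needed.
print(columns["output"])  # → ['x', 'y']
```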

When specifying the dataset in the Axolotl configuration file (YAML), you can provide the path to the Parquet files or directories containing the Parquet files. For example:

```yaml
datasets:
  - path: /path/to/parquet/files/
    type: alpaca
    ds_type: parquet
```

Parquet files can be stored locally or in remote storage systems like Amazon S3 or Google Cloud Storage (GCS).&#x20;

Axolotl supports loading datasets from these remote storage systems by specifying the appropriate path format, such as `s3://path/to/data.parquet` or `gs://path/to/parquet/dir/`.

If the dataset is separated into multiple Parquet files on Hugging Face or other platforms, you can specify the number of shards to split the data into using the <mark style="color:yellow;">**`shards`**</mark> parameter in the dataset configuration.
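As an illustrative fragment (the path and the value `shards: 10` here are arbitrary, not taken from this tutorial's config), sharding could be configured like this:

```yaml
datasets:
  - path: /path/to/parquet/files/
    type: alpaca
    ds_type: parquet
    shards: 10   # split the dataset into 10 shards
```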

Preprocessing the data and converting it into Parquet or Arrow format can be beneficial for efficient storage and faster loading during training.  It allows you to filter out unnecessary data and optimise the dataset for your specific training needs.

The size of the Parquet files can vary depending on the dataset and the specific requirements of the training task.

When loading Parquet files, you can specify the <mark style="color:yellow;">**`ds_type`**</mark> as "parquet" in the dataset configuration to explicitly indicate the data type.

Overall, using Parquet files provides a scalable and efficient way to store and load training datasets for large language models.  Axolotl's support for Parquet files, along with other formats like JSON, Arrow, and CSV, offers flexibility in dataset management and allows for seamless integration with various storage systems.

### <mark style="color:blue;">fine-tuning dataset location</mark>

#### <mark style="color:green;">dataset\_prepared\_path:</mark>

This parameter indicates the <mark style="color:yellow;">path where the prepared (preprocessed and tokenized) dataset is saved</mark>, so that subsequent runs can reuse it instead of repeating the preprocessing step.&#x20;
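For example, pointing it at a local cache directory might look like this (the directory name is only illustrative):

```yaml
dataset_prepared_path: ./last_run_prepared
```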

#### <mark style="color:green;">ds\_type:</mark>

Hugging Face datasets can come in a number of formats.  You can configure the data loading process to account for the data type: csv, json, or parquet.  We will be using the Parquet format.

### <mark style="color:blue;">validation set</mark>

#### <mark style="color:green;">val\_set\_size: 0.10</mark>

**Description:** Specifies the <mark style="color:yellow;">size of the validation set as a fraction of the total dataset.</mark>

**Meaning:** This parameter determines the proportion of the dataset that will be used for validation during the training process. In this case, it is set to 0.10, so 10% of the total dataset is held out for validation.
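With the 9,229 examples in the AlpaGasus train split and `val_set_size: 0.10`, the split works out roughly as follows (the exact counts depend on how the loader rounds):

```python
total = 9229          # examples in the AlpaGasus train split
val_fraction = 0.10   # val_set_size from the config

# Approximate split: validation count rounded down, remainder for training.
val_count = int(total * val_fraction)
train_count = total - val_count

print(val_count, train_count)  # → 922 8307
```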

### <mark style="color:blue;">output files</mark>

#### <mark style="color:green;">output\_dir: ./llama3-out</mark>

**Description:** Specifies the directory where <mark style="color:yellow;">output files should be saved.</mark>

**Meaning:** This parameter designates the directory where various output files, such as trained model checkpoints or logs, should be stored.
