# Llama3 - Data Loading and Paths

## <mark style="color:green;">datasets</mark>

<pre class="language-yaml"><code class="lang-yaml"><strong>datasets:
</strong>  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files:
      - train-00000-of-00001-0c59455170918204.parquet
dataset_prepared_path:
val_set_size: 0.10
output_dir: ./llama3-out
</code></pre>

This component of the configuration file tells the platform where the dataset is located and how the dataset should be split into <mark style="color:yellow;">training and validation sets</mark>.

We will only be using a subset of these parameters for our fine-tuning.  The full set of configuration options can be found here:  [<mark style="color:orange;">**data loading and processing configurations**</mark>](https://axolotl.continuumlabs.pro/axolotl-configuration-files/data-loading-and-processing)

### <mark style="color:blue;">**We will be using the AlpaGasus dataset**</mark>

{% embed url="https://arxiv.org/abs/2307.08701" %}

<details>

<summary><mark style="color:green;"><strong>AlpaGasus README file</strong></mark></summary>

This README.md file describes a dataset hosted on Hugging Face, which is based on the AlpaGasus paper.&#x20;

This dataset is an unofficial implementation of the AlpaGasus approach, which aims to improve upon the original Alpaca dataset by filtering it for quality. The key points about this dataset are:

<mark style="color:green;">Features: The dataset consists of three features:</mark>

* `instruction`: The instruction or prompt given to the model (data type: string)
* `input`: The input text associated with the instruction (data type: string)
* `output`: The expected output or response from the model (data type: string)
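A record with these three features might look like the following. This is a hypothetical example constructed for illustration, not an actual row from the dataset:

```python
# A hypothetical record in the three-field alpaca format described above.
record = {
    "instruction": "Summarise the following text in one sentence.",
    "input": "Parquet is a columnar storage format used for large datasets.",
    "output": "Parquet is a column-oriented file format for big data.",
}

# All three features are plain strings, as the README states.
assert all(isinstance(value, str) for value in record.values())
print(sorted(record.keys()))  # → ['input', 'instruction', 'output']
```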

<mark style="color:green;">Splits: The dataset has a single split called "train," which contains:</mark>

* Number of bytes: 3,918,129
* Number of examples: 9,229
* Download size: 2,486,877 bytes
* Dataset size: 3,918,129 bytes

<mark style="color:green;">Configurations:</mark> There is only one configuration named "default," which uses the data files located at "data/train-\*" paths.

1. License: The dataset is licensed under the GPL-3.0 license.
2. Task Categories: The dataset is categorized under the "text-generation" task.
3. Tags: The dataset is tagged with "alpaca" and "llama."
4. Size Categories: The dataset falls under the "1K\<n<10K" size category, indicating that it has between 1,000 and 10,000 examples.

The AlpaGasus dataset is a filtered version of the original Alpaca dataset, where GPT-4 acts as a judge to select high-quality samples.&#x20;

The authors of the AlpaGasus paper demonstrated that models trained on this filtered dataset with only 9,000 samples can outperform models trained on the original 52,000 samples of the Alpaca dataset.

The key takeaways are:

1. AlpaGasus is an unofficial implementation that aims to improve the Alpaca dataset by filtering it using GPT-4.
2. The dataset contains 9,229 high-quality examples, significantly fewer than the original Alpaca dataset.
3. Despite having fewer examples, models trained on the AlpaGasus dataset can outperform those trained on the larger Alpaca dataset.
4. The dataset is designed for text generation tasks and is compatible with models like Alpaca and Llama.
5. The dataset is available on Hugging Face and is licensed under GPL-3.0.

</details>

### <mark style="color:blue;">Tips on Parquet files</mark>

Here are some interesting insights about using Parquet files when training large language models:

Axolotl supports loading datasets in various formats, including Parquet files.&#x20;

This allows for flexibility in how the training data is stored and accessed.

Parquet files are a popular choice for storing training datasets due to their efficiency and ability to handle large amounts of data. Parquet is a columnar storage format, which enables fast querying and retrieval of specific columns.
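The column-oriented idea can be sketched in plain Python: instead of storing complete rows one after another, a columnar layout keeps each field in its own contiguous array, so reading one column never touches the others. This is only a toy illustration of the layout, not the actual Parquet encoding:

```python
# Toy illustration of row-oriented vs column-oriented layouts.
rows = [
    {"instruction": "a", "input": "", "output": "x"},
    {"instruction": "b", "input": "", "output": "y"},
]

# Columnar layout: one list per field, analogous to how Parquet
# lays data out on disk.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Fetching a single column is a direct lookup; no row scanning needed.
print(columns["output"])  # → ['x', 'y']
```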

When specifying the dataset in the Axolotl configuration file (YAML), you can provide the path to the Parquet files or directories containing the Parquet files. For example:

```yaml
datasets:
  - path: /path/to/parquet/files/
    type: alpaca
    ds_type: parquet
```

Parquet files can be stored locally or in remote storage systems like Amazon S3 or Google Cloud Storage (GCS).&#x20;

Axolotl supports loading datasets from these remote storage systems by specifying the appropriate path format, such as `s3://path/to/data.parquet` or `gs://path/to/parquet/dir/`.

If the dataset is separated into multiple Parquet files on Hugging Face or other platforms, you can specify the number of shards to split the data into using the <mark style="color:yellow;">**`shards`**</mark> parameter in the dataset configuration.
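As an illustrative fragment (the path and the value `shards: 10` here are arbitrary, not taken from this tutorial's config), sharding could be configured like this:

```yaml
datasets:
  - path: /path/to/parquet/files/
    type: alpaca
    ds_type: parquet
    shards: 10   # split the dataset into 10 shards
```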

Preprocessing the data and converting it into Parquet or Arrow format can be beneficial for efficient storage and faster loading during training.  It allows you to filter out unnecessary data and optimise the dataset for your specific training needs.

The size of the Parquet files can vary depending on the dataset and the specific requirements of the training task.

When loading Parquet files, you can specify the <mark style="color:yellow;">**`ds_type`**</mark> as "parquet" in the dataset configuration to explicitly indicate the data type.

Overall, using Parquet files provides a scalable and efficient way to store and load training datasets for large language models.  Axolotl's support for Parquet files, along with other formats like JSON, Arrow, and CSV, offers flexibility in dataset management and allows for seamless integration with various storage systems.

### <mark style="color:blue;">fine-tuning dataset location</mark>

#### <mark style="color:green;">dataset\_prepared\_path:</mark>

This parameter indicates the <mark style="color:yellow;">path where the prepared (preprocessed and tokenized) dataset is saved</mark>, so that subsequent runs can reuse it instead of repeating the preprocessing step.&#x20;
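For example, pointing it at a local cache directory might look like this (the directory name is only illustrative):

```yaml
dataset_prepared_path: ./last_run_prepared
```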

#### <mark style="color:green;">ds\_type:</mark>

Hugging Face datasets can come in a number of formats.  You can configure the data loading process to account for the data type: csv, json, or parquet.  We will be using the Parquet format.

### <mark style="color:blue;">validation set</mark>

#### <mark style="color:green;">val\_set\_size: 0.10</mark>

**Description:** Specifies the <mark style="color:yellow;">size of the validation set as a fraction of the total dataset.</mark>

**Meaning:** This parameter determines the proportion of the dataset that will be used for validation during the training process. In this case, it is set to 0.10, so 10% of the total dataset is held out for validation.
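With the 9,229 examples in the AlpaGasus train split and `val_set_size: 0.10`, the split works out roughly as follows (the exact counts depend on how the loader rounds):

```python
total = 9229          # examples in the AlpaGasus train split
val_fraction = 0.10   # val_set_size from the config

# Approximate split: validation count rounded down, remainder for training.
val_count = int(total * val_fraction)
train_count = total - val_count

print(val_count, train_count)  # → 922 8307
```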

### <mark style="color:blue;">output files</mark>

#### <mark style="color:green;">output\_dir: ./llama3-out</mark>

**Description:** Specifies the directory where <mark style="color:yellow;">output files should be saved.</mark>

**Meaning:** This parameter designates the directory where various output files, such as trained model checkpoints or logs, should be stored.
