Download cleaned Alpaca dataset

The last instruction entered was to git clone the alpaca-cleaned dataset to the local directory:

git clone https://huggingface.co/datasets/yahma/alpaca-cleaned

This command downloaded the 42 MB Hugging Face JSON dataset into the datasets directory you created earlier.

Within datasets, the cloned repository sits in a subdirectory called alpaca-cleaned. The full path is:

your primary directory/axolotl/datasets/alpaca-cleaned
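
To confirm the download, you can load the file with a few lines of Python. This is just a quick sanity check; it assumes the JSON file inside the clone is named alpaca_data_cleaned.json, so adjust the path if your copy differs:

import json

# Path follows the directory layout described above; replace the leading part with your primary directory
dataset_path = "axolotl/datasets/alpaca-cleaned/alpaca_data_cleaned.json"

with open(dataset_path, "r", encoding="utf-8") as f:
    records = json.load(f)  # the file is a single JSON array of dictionaries

print(f"Number of records: {len(records)}")
print(records[0])  # each record contains 'instruction', 'input' and 'output' keys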

The screenshot below shows the contents of the alpaca-cleaned dataset. Note that it is in JSON format and that the training set is in Alpaca format:

What is Alpaca format?

When using instruction fine-tuning, there are various formats for the training set. The Alpaca format has become one of the 'standards' for structuring such datasets.

Data Structure in alpaca_data.json

This dataset is formatted as a JSON file, where each entry is represented as a dictionary with the following key-value pairs:

Instruction (instruction):

  • Type: String (str)

  • Description: Specifies the task to be performed by the model.

Input (input):

  • Type: String (str), optional.

  • Description: Provides additional context or information needed to perform the task described in the instruction.

  • Example: If the instruction is "Summarize the following article", the input would be the text of the article.

  • Prevalence: In the original 52k Alpaca dataset, approximately 40% of entries include an input field.

Output (output):

  • Type: String (str)

  • Description: The response generated by the text-davinci-003 model, which represents the answer or completion of the task defined in the instruction.
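
Putting these fields together, a couple of illustrative entries in this format (the text here is invented for demonstration, not taken from the dataset) would look like this:

[
  {
    "instruction": "Summarize the following article.",
    "input": "The article text goes here...",
    "output": "A short summary of the article."
  },
  {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
  }
]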

Fine-Tuning Prompts for Alpaca Model

Two distinct prompt structures were used in the fine-tuning process, depending on whether the input field is empty or not.

For Entries with Non-Empty Input Field:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:

For Entries with Empty Input Field:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
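
To see how these two templates are applied, here is a minimal Python sketch that selects the appropriate template for a record based on whether its input field is empty. Axolotl applies this formatting for you when the dataset type is set to alpaca; the helper below is purely illustrative:

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(record: dict) -> str:
    # Use the longer template only when the optional 'input' field is non-empty
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=record["input"])
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])

example = {"instruction": "Give three tips for staying healthy.", "input": "", "output": "..."}
print(build_prompt(example))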

For a full review of the different dataset techniques and structures used in Axolotl, please see the Datasets section.

A screenshot from VS Code demonstrating the contents of the alpaca-cleaned dataset