
Download cleaned Alpaca dataset

Test the engines

The last command you ran cloned the alpaca-cleaned dataset into your local directory:

git clone https://huggingface.co/datasets/yahma/alpaca-cleaned

This command downloaded the 42MB JSON dataset from Hugging Face into the datasets directory you created.

Within datasets, the dataset sits in the alpaca-cleaned directory. The full path is:

your primary directory/axolotl/datasets/alpaca-cleaned

The screenshot below shows the contents of the alpaca-cleaned dataset. Note that it is in JSON format and that the training set is in Alpaca format:

A screenshot from VS Code demonstrating the contents of the alpaca-cleaned dataset

What is Alpaca format?

With instruction fine-tuning, there are various formats for the training set. The Alpaca format has become one of the 'standards' for structuring a dataset.

Data Structure in alpaca_data.json

This dataset is formatted as a JSON file, where each entry is represented as a dictionary with the following key-value pairs:

Instruction (instruction):

  • Type: String (str)

  • Description: Specifies the task to be performed by the model.

Input (input):

  • Type: String (str), optional.

  • Description: Provides additional context or information needed to perform the task described in the instruction.

  • Example: If the instruction is "Summarize the following article", the input would be the text of the article.

Prevalence: In the original 52k Alpaca dataset, approximately 40% of the entries have a non-empty input field.

Output (output):

  • Type: String (str)

  • Description: The response generated by the text-davinci-003 model, which represents the answer or completion of the task defined in the instruction.
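As a minimal sketch of the structure described above, the snippet below parses a small hand-written sample in the same shape, checks that each entry carries the three string fields, and counts how many entries have a non-empty input. The sample records and the helper name has_input are illustrative assumptions, not taken from the dataset. To try it on the real data, replace the sample with json.load on the JSON file inside alpaca-cleaned.

```python
import json

# Hand-written sample in Alpaca format (illustrative, not taken from the dataset).
SAMPLE_JSON = """
[
  {"instruction": "Summarize the following article.",
   "input": "Axolotl is a tool for fine-tuning language models.",
   "output": "Axolotl helps fine-tune language models."},
  {"instruction": "Name three primary colors.",
   "input": "",
   "output": "Red, blue, and yellow."}
]
"""

REQUIRED_KEYS = {"instruction", "input", "output"}


def has_input(record: dict) -> bool:
    """Return True when the optional input field is present and non-blank."""
    return bool(record.get("input", "").strip())


records = json.loads(SAMPLE_JSON)

# Every entry is a dictionary with string values under the three keys.
for record in records:
    assert REQUIRED_KEYS <= record.keys()
    assert all(isinstance(record[key], str) for key in REQUIRED_KEYS)

with_input = sum(has_input(r) for r in records)
print(f"{with_input} of {len(records)} records have a non-empty input")
```

Treating a missing or whitespace-only input as empty is a design choice of this sketch; it keeps the check consistent for files that omit the key entirely.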

Fine-Tuning Prompts for Alpaca Model

Two distinct prompt structures were used in the fine-tuning process, depending on whether the input field is empty.

For Entries with Non-Empty Input Field:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:

For Entries with Empty Input Field:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
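The choice between the two templates above can be sketched in a few lines. This is an illustrative helper, not Axolotl's own implementation: the names build_prompt, PROMPT_WITH_INPUT and PROMPT_NO_INPUT are hypothetical, and the single newline after "### Response:" is an assumption about where the model's answer should begin.

```python
# Template text copied from the two prompt structures above.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Pick the template based on whether the record has a non-empty input."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Hypothetical record used only to show the call.
example = {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, and yellow."}
print(build_prompt(example))
```

During fine-tuning, the record's output would be appended after the "### Response:" marker as the target text.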

For a full review of the different dataset techniques and structures used in Axolotl, please visit datasets.
