Download cleaned Alpaca dataset
This documentation is for the Axolotl community
The previous step used git clone to copy the alpaca-cleaned dataset into your local directory.
That command downloaded the 42 MB JSON dataset from Hugging Face into the datasets directory you created.
Within datasets, the dataset is stored in the alpaca-cleaned subdirectory. The full path is:
your primary directory/axolotl/datasets/alpaca-cleaned
The screenshot below shows the contents of the alpaca-cleaned dataset. Note that it is in JSON format and that the training set is in Alpaca format:
With instruction fine-tuning, there are various formats for the training set. The Alpaca format has become one of the 'standards' for structuring a dataset.
alpaca_data.json
This dataset is formatted as a JSON file, where each entry is represented as a dictionary with the following key-value pairs:
Instruction (instruction):
Type: String (str)
Description: Specifies the task to be performed by the model.
Input (input):
Type: String (str), optional.
Description: Provides additional context or information needed to perform the task described in the instruction.
Example: If the instruction is "Summarize the following article", the input would be the text of the article.
Prevalence: In the original 52k Alpaca dataset, approximately 40% of the entries include an input field.
Output (output):
Type: String (str)
Description: The response generated by the text-davinci-003 model, which represents the answer or completion of the task defined in the instruction.
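As a minimal sketch of the structure described above, the following Python snippet parses two sample records and checks that each one carries the three expected string fields. The records are invented for illustration and are not copied from the dataset. The helper names validate_entry and input_fraction are hypothetical, and the sketch assumes that entries without extra context store an empty string in input rather than omitting the key.

```python
import json

# Two illustrative records in Alpaca format (made up for this example,
# not copied from the dataset).
SAMPLE_JSON = """
[
  {
    "instruction": "Summarize the following article",
    "input": "Axolotl is a tool for fine-tuning language models.",
    "output": "Axolotl helps fine-tune language models."
  },
  {
    "instruction": "Name three primary colors.",
    "input": "",
    "output": "Red, blue, and yellow."
  }
]
"""

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_entry(entry: dict) -> bool:
    """Return True if the entry has all three Alpaca keys with string values."""
    return all(isinstance(entry.get(key), str) for key in REQUIRED_KEYS)


def input_fraction(entries: list) -> float:
    """Return the fraction of entries whose input field is non-empty."""
    if not entries:
        return 0.0
    with_input = sum(1 for e in entries if e.get("input", "").strip())
    return with_input / len(entries)


entries = json.loads(SAMPLE_JSON)
all_valid = all(validate_entry(e) for e in entries)
fraction = input_fraction(entries)
print(all_valid, fraction)
```

To run the same checks against the downloaded data, replace json.loads(SAMPLE_JSON) with json.load on the JSON file inside your alpaca-cleaned directory.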
Two distinct prompt structures were used in the fine-tuning process, depending on whether the input field is present or not.
For Entries with Non-Empty Input Field:
For Entries with Empty Input Field:
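The two prompt structures can be sketched in Python as follows. The template wording follows the prompts published with the original Stanford Alpaca project; the constant names and the build_prompt helper are hypothetical, and Axolotl's own implementation may differ in whitespace or other details.

```python
# Template used when the entry has a non-empty input field.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

# Template used when the input field is empty.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(entry: dict) -> str:
    """Pick the template based on whether the input field is non-empty."""
    if entry.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=entry["instruction"], input=entry["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=entry["instruction"])


# Illustrative entries (made up for this example).
with_input = build_prompt(
    {
        "instruction": "Summarize the following article",
        "input": "Some article text.",
        "output": "",
    }
)
without_input = build_prompt(
    {"instruction": "Name three primary colors.", "input": "", "output": ""}
)

print(with_input)
print("---")
print(without_input)
```

During training, the output field is appended after the "### Response:" marker so that the model learns to produce it as the completion.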
For a full review of the different dataset types and structures used in Axolotl, please visit datasets.