Download cleaned Alpaca dataset
The last instruction entered was to git clone the alpaca-cleaned dataset to the local directory:
This command downloaded the 42 MB JSON dataset from Hugging Face into the datasets directory you created.
Within datasets, the files are in the alpaca-cleaned directory. The full path is:
your primary directory/axolotl/datasets/alpaca-cleaned
The screenshot below shows the contents of the alpaca-cleaned dataset. Note that it is in JSON format and that the training set is in Alpaca format:
In instruction fine-tuning, the training set can follow various formats. The Alpaca format has become one of the 'standards' for structuring such a dataset.
alpaca_data.json
This dataset is formatted as a JSON file, where each entry is represented as a dictionary with the following key-value pairs:
Instruction (instruction):
Type: String (str)
Description: Specifies the task to be performed by the model.
Input (input):
Type: String (str), optional.
Description: Provides additional context or information needed to perform the task described in the instruction.
Example: If the instruction is "Summarize the following article", the input would be the text of the article.
Prevalence: In the original 52k Alpaca dataset, approximately 40% of the entries include an input field.
Output (output):
Type: String (str)
Description: The response generated by the text-davinci-003 model, which represents the answer or completion of the task defined in the instruction.
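The field layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of Axolotl: the helper names `load_alpaca` and `summarize` are hypothetical, and the file path passed to `load_alpaca` is an assumption that should be pointed at whichever JSON file sits in your alpaca-cleaned directory. A missing `input` key is treated as an empty string, since the field is optional.

```python
import json
from pathlib import Path


def load_alpaca(path):
    """Load an Alpaca-format JSON file (a list of dictionaries).

    `path` is an assumption: pass the location of the JSON file
    inside your own datasets/alpaca-cleaned directory.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(records):
    """Validate the three fields and count entries with a non-empty input."""
    with_input = 0
    for i, rec in enumerate(records):
        # instruction and output are required strings
        for key in ("instruction", "output"):
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i}: field {key!r} is missing or not a string")
        # input is optional; a missing key is treated as an empty string
        extra = rec.get("input", "")
        if not isinstance(extra, str):
            raise ValueError(f"record {i}: field 'input' is not a string")
        if extra.strip():
            with_input += 1
    return {"total": len(records), "with_input": with_input}


# Two illustrative records (made up for this sketch, not taken from the dataset)
sample = [
    {
        "instruction": "Summarize the following article",
        "input": "Article text goes here.",
        "output": "A short summary.",
    },
    {
        "instruction": "Name three primary colors.",
        "input": "",
        "output": "Red, blue, and yellow.",
    },
]

print(summarize(sample))
```

To run the same check on the downloaded file, call `summarize(load_alpaca(<path to the JSON file>))`.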
Two distinct prompt structures were used in the fine-tuning process, depending on whether the input field is present or not.
For Entries with Non-Empty Input Field:
For Entries with Empty Input Field:
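The two templates themselves do not appear in the text of this page. As a hedged illustration, the sketch below uses the prompt wording published in the Stanford Alpaca repository; verify it against the upstream source before relying on the exact text. The function name `build_prompt` is hypothetical and is not an Axolotl API.

```python
# Template for entries with a non-empty input field
# (wording as published by Stanford Alpaca; verify against upstream).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

# Template for entries with an empty input field
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(record):
    """Pick the template based on whether the record has a non-empty input."""
    extra = record.get("input", "")
    if extra.strip():
        return PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=extra)
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Illustrative record, made up for this sketch
example = {
    "instruction": "Summarize the following article",
    "input": "Article text goes here.",
    "output": "A short summary.",
}
print(build_prompt(example))
```

During fine-tuning, the record's output field is the text the model is trained to produce after the `### Response:` marker.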
For a full review of the dataset formats and structures used in Axolotl, please visit datasets.