You should be connected to the Hugging Face Hub and have Git installed and configured on the virtual machine. If you have not done this yet, for reference:
Install Git LFS
First, ensure that Git LFS is installed on your machine. If it is not, you can download and install it from the Git LFS website.
On most systems, you can install Git LFS using a package manager. For instance, on Ubuntu, you can use:
sudo apt-get install git-lfs
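On macOS, you can install it with Homebrew instead:
brew install git-lfs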
Initialise Git LFS
After installation, you need to set up Git LFS. In your terminal, run:
git lfs install
The output should be as follows:
Updated git hooks.
Git LFS initialized.
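You can double-check the installation by printing the installed version:
git lfs version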
Navigate to the Hugging Face datasets hub and search for the dataset you wish to download. In this case, we will download the 'alpaca-cleaned' dataset.
Once on the datasets page, enter the name of the required dataset in the 'Filter by name' input box. In this case, filter by 'alpaca-cleaned'.
Why are we using the Alpaca Cleaned dataset?
The Alpaca-Cleaned dataset is a refined version of the original Alpaca Dataset from Stanford, addressing several identified issues to improve its quality and utility for instruction-tuning of language models. Key aspects of this dataset include:
Dataset Description and Corrections:
Hallucinations: Fixed instances where the original dataset's instructions caused the model to generate baseless answers, typically related to external web content.
Merged Instructions: Separated instructions that were improperly combined in the original dataset.
Empty Outputs: Addressed entries with missing outputs in the original dataset.
Missing Code Examples: Supplemented descriptions that lacked necessary code examples.
Image Generation Instructions: Removed unrealistic instructions for generating images.
N/A Outputs and Inconsistent Inputs: Corrected code snippets with N/A outputs and standardized the formatting of empty inputs.
Incorrect Answers: Identified and fixed wrong answers, particularly in math problems.
Unclear Instructions: Clarified or rewrote nonsensical or unclear instructions.
Control Characters: Removed extraneous escape and control characters present in the original dataset.
Original Alpaca Dataset Overview
Consists of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.
Aimed at instruction-tuning to enhance language models' ability to follow instructions.
Modifications from the original data generation pipeline include using text-davinci-003, a new prompt for instruction generation, and a more efficient data generation approach.
The dataset is noted for its diversity and cost-effectiveness in generation.
Dataset Structure and Contents:
Fields include instruction (task description), input (context or additional information), and output (answer generated by text-davinci-003).
The text field combines these elements using a specific prompt template (reproduced after this list).
The dataset is primarily structured for training purposes, with 52,002 instances in the training split.
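For reference, the prompt template used by the original Alpaca project to build the text field (for entries that have an input; entries with an empty input use a slightly shorter preamble) is:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}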
Intended Use and Considerations:
Primarily designed for training pretrained language models on instruction-following tasks.
The dataset is primarily in English. It may still contain errors or biases, poses potential risks such as the dissemination of harmful content, and therefore requires careful use and further refinement.
After filtering by dataset name, you will see all the datasets matching that name. We will be downloading yahma/alpaca-cleaned:
Once in the dataset repository, click on the button with three horizontal dots. This will provide the option to use git clone to download the dataset to your directory.
When you click on the three horizontal dots, a dialog box appears providing the command line for a git clone download of the dataset. Follow the instructions below to git clone the dataset into the axolotl environment.
Go into the primary axolotl directory and then enter the following command:
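# create a datasets folder and clone the dataset repository into it
mkdir -p datasets && cd datasets
git clone https://huggingface.co/datasets/yahma/alpaca-cleaned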
This command will create a folder called datasets and download the specified Hugging Face dataset into it.
Git Clone deconstruction
In the context of the command git clone https://huggingface.co/datasets/yahma/alpaca-cleaned, "yahma" refers to the username or organisation name within the Hugging Face Datasets repository. Here's a breakdown of the components of this command:
git clone: This is a Git command used to clone a repository. It makes a copy of the specified repository and downloads it to your local machine.
https://huggingface.co/datasets: This URL points to the Hugging Face Datasets repository. Hugging Face hosts machine learning models, datasets, and related tools. The /datasets part indicates that the repository being cloned is a dataset repository.
yahma: This is the username or the name of the organisation on the Hugging Face platform that owns the repository you are cloning. In this case, 'yahma' is the entity that has uploaded or maintained the dataset named 'alpaca-cleaned'.
alpaca-cleaned: This is the name of the specific dataset repository under the user or organisation 'yahma' on Hugging Face. As described above, it is a cleaned version of the original Stanford Alpaca dataset.
When you run this command, you are cloning the 'alpaca-cleaned' dataset from the 'yahma' user or organisation's space on Hugging Face to your local machine. This allows you to use or analyze the dataset directly on your computer.
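Because Git LFS was set up earlier, the large data files should download in full during the clone. If they instead appear as small pointer files, you can fetch the real content from inside the cloned repository:

cd alpaca-cleaned
git lfs pull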
Remember where you stored your dataset - you will need this path when you prepare your Axolotl configuration file.
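As a minimal sketch (the exact keys depend on your Axolotl version), assuming the clone landed in datasets/alpaca-cleaned relative to the axolotl directory, the dataset entry in your configuration YAML might look like this:

datasets:
  - path: datasets/alpaca-cleaned
    type: alpaca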