# Use Git to download dataset

You should be connected to the Hugging Face Hub and have Git installed and configured on the virtual machine. If you have not done so, for reference:

### <mark style="color:blue;">**Install Git LFS**</mark>

* First, ensure that Git LFS is installed on your machine. If it is not, you can download and install it from the [Git LFS website](https://git-lfs.github.com/).
* On most systems, you can install Git LFS using a package manager. For instance, on Ubuntu, you can use:

```bash
sudo apt-get install git-lfs
```
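Ubuntu is not the only option. As a rough sketch (the package name `git-lfs` is assumed and may differ on some distributions), you can pick an install command based on whichever package manager is available:

```bash
# Sketch: choose a Git LFS install command based on the available
# package manager. The package name "git-lfs" is an assumption;
# check your distribution's repositories if it differs.
if command -v apt-get >/dev/null 2>&1; then
  LFS_INSTALL_CMD="sudo apt-get install git-lfs"
elif command -v brew >/dev/null 2>&1; then
  LFS_INSTALL_CMD="brew install git-lfs"
elif command -v dnf >/dev/null 2>&1; then
  LFS_INSTALL_CMD="sudo dnf install git-lfs"
else
  LFS_INSTALL_CMD=""  # fall back to the manual download linked above
fi
echo "${LFS_INSTALL_CMD:-no supported package manager found}"
```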

### <mark style="color:green;">**Initialise Git LFS**</mark>

After installation, you need to set up Git LFS. In your terminal, run:

```bash
git lfs install
```

The output should be as follows:

```bash
Updated git hooks.
Git LFS initialized.
```

Navigate to the Hugging Face datasets hub and search for the dataset you wish to download. In this case, we will download the 'alpaca-cleaned' dataset.

{% embed url="https://huggingface.co/datasets" %}

When you are on the datasets page, enter the name of the required dataset in the 'Filter by name' input box. In this case, filter by <mark style="color:yellow;">'alpaca-cleaned'</mark>.

<div data-full-width="false"><figure><img src="https://148429626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FcgmygEk0ifLuns7P9aMW%2Fuploads%2FnGL54HbJgU8fbK851yHA%2FScreenshot%202023-12-24%20165803.png?alt=media&#x26;token=05389075-5268-4b56-a75d-8bb72e78b1d8" alt=""><figcaption><p>The main landing page for Huggingface Datasets</p></figcaption></figure></div>

<details>

<summary><mark style="color:green;">Why are we using the Alpaca Cleaned dataset?</mark></summary>

The Alpaca-Cleaned dataset is a <mark style="color:yellow;">refined version of the original Alpaca Dataset</mark> from Stanford, addressing several identified issues to improve its quality and utility for instruction-tuning of language models. Key aspects of this dataset include:

<mark style="color:purple;">**Dataset Description and Corrections**</mark><mark style="color:purple;">:</mark>

* <mark style="color:green;">**Hallucinations**</mark><mark style="color:green;">:</mark> Fixed instances where the original dataset's instructions caused the model to <mark style="color:blue;">generate baseless answers,</mark> typically related to external web content.
* <mark style="color:green;">**Merged Instructions**</mark><mark style="color:green;">:</mark> Separated instructions that were improperly combined in the original dataset.
* <mark style="color:green;">**Empty Outputs**</mark><mark style="color:green;">:</mark> Addressed entries with missing outputs in the original dataset.
* <mark style="color:green;">**Missing Code Examples**</mark><mark style="color:green;">:</mark> Supplemented descriptions that lacked necessary code examples.
* <mark style="color:green;">**Image Generation Instructions**</mark><mark style="color:green;">:</mark> Removed unrealistic instructions for generating images.
* <mark style="color:green;">**N/A Outputs and Inconsistent Inputs**</mark>: Corrected code snippets with N/A outputs and standardized the formatting of empty inputs.
* <mark style="color:green;">**Incorrect Answers**</mark><mark style="color:green;">:</mark> Identified and fixed wrong answers, particularly in math problems.
* <mark style="color:green;">**Unclear Instructions**</mark><mark style="color:green;">:</mark> Clarified or re-wrote non-sensical or unclear instructions.
* <mark style="color:green;">**Control Characters**</mark><mark style="color:green;">:</mark> Removed extraneous escape and control characters present in the original dataset.

<mark style="color:purple;">**Original Alpaca Dataset Overview**</mark>

* Consists of 52,000 instructions and demonstrations generated by OpenAI's `text-davinci-003` engine.
* Aimed at instruction-tuning to enhance language models' ability to follow instructions.
* Modifications from the original data generation pipeline include using `text-davinci-003`, a new prompt for instruction generation, and a more efficient data generation approach.
* The dataset is noted for its diversity and cost-effectiveness in generation.

<mark style="color:purple;">**Dataset Structure and Contents**</mark><mark style="color:purple;">:</mark>

* Fields include `instruction` (task description), `input` (context or additional information), and `output` (answer generated by `text-davinci-003`).
* The `text` field combines these elements using a specific prompt template.
* The dataset is primarily structured for training purposes, with 52,002 instances in the training split.
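For illustration, a single record in the dataset looks roughly like this (a hypothetical example in the Alpaca format, not copied verbatim from the dataset):

```json
{
  "instruction": "Give three tips for staying healthy.",
  "input": "",
  "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}
```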

<mark style="color:purple;">**Intended Use and Considerations**</mark><mark style="color:purple;">:</mark>

* Primarily designed for training pretrained language models on instruction-following tasks.
* The dataset, primarily in English, poses potential risks like harmful content dissemination and requires careful use and further refinement to address errors or biases.

</details>

After filtering by dataset name, you will see all the datasets matching that name. We will be downloading <mark style="color:yellow;">yahma/alpaca-cleaned</mark>:

<div data-full-width="false"><figure><img src="https://148429626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FcgmygEk0ifLuns7P9aMW%2Fuploads%2Fk8g5WIOvSYBMI6QsBpuu%2Falpacale.png?alt=media&#x26;token=03bf439f-362d-4189-b6b5-22bbd68b2e1c" alt=""><figcaption><p>The Alpaca_Cleaned dataset is highlighted in the yellow</p></figcaption></figure></div>

Once in the dataset repository, click on the button with the <mark style="color:yellow;">three dots</mark> positioned horizontally. This reveals the <mark style="color:yellow;">git clone</mark> command you can use to download the dataset to your directory.

<div data-full-width="false"><figure><img src="https://148429626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FcgmygEk0ifLuns7P9aMW%2Fuploads%2FUpehBkK8Y5nG63BcxEUR%2Fsss.png?alt=media&#x26;token=8a6ffafa-739e-4de6-82af-7a15cd7dbdb7" alt=""><figcaption></figcaption></figure></div>

When you click on the three horizontal dots, a <mark style="color:yellow;">dialog box appears providing the command line for a git clone download of the dataset</mark>. Follow the instructions below to git clone the dataset into the axolotl environment.

<figure><img src="https://148429626-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FcgmygEk0ifLuns7P9aMW%2Fuploads%2FCQWUAQLmmJtIZ6SSYDPY%2Fss.png?alt=media&#x26;token=5251bdb8-b113-4c38-8f84-ec99d176a3ff" alt=""><figcaption><p>git clone dataset</p></figcaption></figure>

Go into the primary axolotl directory and then enter the following command:

```bash
git clone https://huggingface.co/datasets/yahma/alpaca-cleaned
```

This command will create a folder called `alpaca-cleaned` in your current directory and download the dataset repository into it.
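If you want to confirm the folder name before cloning, note that `git clone` names the target directory after the last path segment of the URL. A minimal shell sketch:

```bash
# The clone target directory defaults to the last path segment of the URL.
REPO_URL="https://huggingface.co/datasets/yahma/alpaca-cleaned"
TARGET="${REPO_URL##*/}"  # strip everything up to the final "/"
echo "$TARGET"            # -> alpaca-cleaned
```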

<details>

<summary><mark style="color:green;">Git Clone deconstruction</mark></summary>

In the context of the command `git clone` <mark style="color:yellow;">`https://huggingface.co/datasets/yahma/alpaca-cleaned`</mark>, "yahma" refers to the username or organisation name within the Hugging Face Datasets repository. Here's a breakdown of the components of this command:

1. <mark style="color:yellow;">**`git clone`**</mark><mark style="color:yellow;">:</mark> This is a Git command used to clone a repository. It makes a copy of the specified repository and downloads it to your local machine.
2. <mark style="color:yellow;">**`https://huggingface.co/datasets`**</mark><mark style="color:yellow;">:</mark> This URL points to the Hugging Face Datasets repository. Hugging Face hosts machine learning models, datasets, and related tools. The <mark style="color:yellow;">`/datasets`</mark> part indicates that the repository being cloned is a dataset repository.
3. <mark style="color:yellow;">**`yahma`**</mark><mark style="color:yellow;">:</mark> This is the username or the name of the organisation on the Hugging Face platform that owns the repository you are cloning. In this case, 'yahma' is the entity that has uploaded or maintained the dataset named 'alpaca-cleaned'.
4. <mark style="color:yellow;">**`alpaca-cleaned`**</mark><mark style="color:yellow;">:</mark> This is the name of the specific dataset repository under the user or organisation 'yahma' on Hugging Face. As described above, it is a cleaned version of the original Stanford Alpaca dataset.

When you run this command, you are cloning the 'alpaca-cleaned' dataset from the 'yahma' user or organisation's space on Hugging Face to your local machine. This allows you to use or analyse the dataset directly on your computer.
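The breakdown above can be sketched by assembling the clone URL from its components:

```bash
# Assemble the clone URL from its parts, matching the breakdown above.
BASE="https://huggingface.co/datasets"   # Hugging Face dataset repositories
OWNER="yahma"                            # user or organisation name
REPO="alpaca-cleaned"                    # dataset repository name
URL="$BASE/$OWNER/$REPO"
echo "$URL"   # -> https://huggingface.co/datasets/yahma/alpaca-cleaned
```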

</details>

{% hint style="warning" %}

## <mark style="color:orange;">**Remember where you stored your dataset - it is required when you prepare your Axolotl configuration file**</mark>

{% endhint %}
