# Download cleaned Alpaca dataset

The <mark style="color:yellow;">last instruction</mark> you entered used `git clone` to copy the alpaca-cleaned dataset into the local directory:

```bash
git clone https://huggingface.co/datasets/yahma/alpaca-cleaned
```

This command downloaded the <mark style="color:yellow;">42MB JSON dataset</mark> from Hugging Face into the **datasets** directory you created.

**Within datasets, this directory is located at alpaca-cleaned.  The full path is:**

*<mark style="color:blue;">your primary directory</mark>*/<mark style="color:yellow;">axolotl</mark>/<mark style="color:purple;">datasets</mark>/<mark style="color:green;">alpaca-cleaned</mark>

The screenshot below shows the contents of the alpaca-cleaned dataset.  Note that it is in JSON format and that the training set is in Alpaca format:

<figure><img src="/files/BE3h4h65qb6ecRZeIMCs" alt=""><figcaption><p>A screenshot from VS Code demonstrating the contents of the alpaca-cleaned dataset</p></figcaption></figure>
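To confirm the download from Python rather than from the editor, the file can be loaded and summarised. This is a minimal sketch, not part of Axolotl: the helper names `load_alpaca_records` and `summarize` are hypothetical, and the file name `alpaca_data_cleaned.json` and the relative path are assumptions, so check them against the contents of your own alpaca-cleaned directory.

```python
import json
from pathlib import Path

# Assumed location and file name; adjust to match your local clone.
DATA_FILE = Path("datasets/alpaca-cleaned/alpaca_data_cleaned.json")


def load_alpaca_records(json_path):
    """Load a JSON file whose top level is a list of Alpaca-style records."""
    with Path(json_path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")
    return records


def summarize(records):
    """Return the record count and the sorted set of keys seen across records."""
    keys = sorted({key for record in records for key in record})
    return {"count": len(records), "keys": keys}


# Only runs when the assumed file is actually present.
if DATA_FILE.exists():
    print(summarize(load_alpaca_records(DATA_FILE)))
```

If the file is in Alpaca format, the reported keys should be `input`, `instruction` and `output`.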

### <mark style="color:blue;">What is Alpaca format?</mark>

When using <mark style="color:blue;">instruction fine-tuning</mark>, there are various formats for the training set.  The Alpaca format has become one of the 'standards' for the structure of a dataset.

#### <mark style="color:green;">**Data Structure in**</mark><mark style="color:green;">**&#x20;**</mark><mark style="color:green;">**`alpaca_data.json`**</mark>

This dataset is formatted as a <mark style="color:blue;">JSON file</mark>, where each entry is represented as a dictionary with the following key-value pairs:

**Instruction&#x20;**<mark style="color:yellow;">**(**</mark><mark style="color:yellow;">**`instruction`**</mark><mark style="color:yellow;">**)**</mark><mark style="color:yellow;">:</mark>

* Type: String <mark style="color:yellow;">(</mark><mark style="color:yellow;">`str`</mark><mark style="color:yellow;">)</mark>
* Description: Specifies the task to be performed by the model.

**Input (`input`)**:

* Type: String <mark style="color:yellow;">(</mark><mark style="color:yellow;">`str`</mark><mark style="color:yellow;">)</mark>, optional.
* Description: Provides additional context or information needed to perform the task described in the <mark style="color:yellow;">`instruction`</mark><mark style="color:yellow;">.</mark>
* Example: If the instruction is "Summarize the following article", the input would be the text of the article.

Prevalence: In the original 52k Alpaca dataset, approximately <mark style="color:yellow;">40% of the entries</mark> include an `input` field.

**Output&#x20;**<mark style="color:yellow;">**(**</mark><mark style="color:yellow;">**`output`**</mark><mark style="color:yellow;">**)**</mark><mark style="color:yellow;">:</mark>

* Type: String <mark style="color:yellow;">(</mark><mark style="color:yellow;">`str`</mark><mark style="color:yellow;">)</mark>
* Description: The answer to, or completion of, the task defined in the <mark style="color:yellow;">`instruction`</mark>. In the original Alpaca dataset, this response was generated by the text-davinci-003 model.
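The three fields above can be checked programmatically. The sketch below is illustrative only: `validate_record` and `input_share` are hypothetical helpers, and the two sample records are made up for the example rather than taken from the dataset. It treats `input` as optional, as described above, and counts an entry as having an input only when the field is a non-empty string.

```python
REQUIRED_KEYS = ("instruction", "output")


def validate_record(record):
    """Return a list of problems found in one Alpaca-style record."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], str):
            problems.append(f"{key} is not a string")
    # `input` is optional, but must be a string when present.
    if "input" in record and not isinstance(record["input"], str):
        problems.append("input is not a string")
    return problems


def input_share(records):
    """Fraction of records whose `input` field is a non-empty string."""
    if not records:
        return 0.0
    with_input = sum(
        1
        for record in records
        if isinstance(record.get("input"), str) and record["input"].strip()
    )
    return with_input / len(records)


# Made-up sample records, for illustration only.
sample = [
    {
        "instruction": "Summarize the following article",
        "input": "Article text goes here.",
        "output": "A short summary.",
    },
    {
        "instruction": "Name three primary colors.",
        "input": "",
        "output": "Red, blue, and yellow.",
    },
]

for record in sample:
    print(validate_record(record))
print(input_share(sample))
```

Running `input_share` over the full dataset gives a quick way to compare your copy against the prevalence figure quoted above.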

### <mark style="color:blue;">**Fine-Tuning Prompts for Alpaca Model**</mark>

Two distinct prompt structures were used in the fine-tuning process, depending on whether the <mark style="color:yellow;">`input`</mark> field is empty or not.

**For Entries with&#x20;**<mark style="color:yellow;">**Non-Empty Input Field**</mark>:

{% code fullWidth="false" %}

```text
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```

{% endcode %}

**For Entries with&#x20;**<mark style="color:yellow;">**Empty Input Field**</mark>:

```text
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```
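The choice between the two templates can be sketched as a small function. This is an illustration of the selection logic only, not Axolotl's own implementation, which builds prompts internally from the dataset type set in its configuration. The names `PROMPT_WITH_INPUT`, `PROMPT_NO_INPUT` and `build_prompt` are hypothetical.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:"
)


def build_prompt(record):
    """Pick the template based on whether the record has a non-empty input."""
    user_input = record.get("input") or ""
    if user_input.strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=user_input
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# Made-up record, for illustration only.
example = {
    "instruction": "Summarize the following article",
    "input": "Article text goes here.",
    "output": "A short summary.",
}
print(build_prompt(example))
```

During fine-tuning, the record's `output` is the text the model is trained to produce after `### Response:`.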

**For a full review of the different dataset techniques and structures used in Axolotl, please visit datasets.**

