Use Git to download dataset

Using git to download the datasets


You should already be logged in to the Huggingface Hub and have git installed and configured on the virtual machine. If you have not set this up, the steps below are provided for reference.

Install Git LFS

  • First, ensure that Git LFS is installed on your machine. If it is not installed, you can download and install it from the Git LFS website.

  • On most systems, you can install Git LFS using a package manager. For instance, on Ubuntu, you can use:

sudo apt-get install git-lfs

Initialise Git LFS

After installation, you need to set up Git LFS. In your terminal, run:

git lfs install

The output should be as follows:

Updated git hooks.
Git LFS initialized.

Navigate to the Huggingface datasets repository and search for the dataset you wish to download. In this case, we will download the 'alpaca-cleaned' dataset.

When you are on the datasets page, enter the name of the required dataset in the 'filter by name' input box. In this case, filter by 'alpaca-cleaned'.

Why are we using the Alpaca Cleaned dataset?

The Alpaca-Cleaned dataset is a refined version of the original Alpaca Dataset from Stanford, addressing several identified issues to improve its quality and utility for instruction-tuning of language models. Key aspects of this dataset include:

Dataset Description and Corrections:

  • Hallucinations: Fixed instances where the original dataset's instructions caused the model to generate baseless answers, typically related to external web content.

  • Merged Instructions: Separated instructions that were improperly combined in the original dataset.

  • Empty Outputs: Addressed entries with missing outputs in the original dataset.

  • Missing Code Examples: Supplemented descriptions that lacked necessary code examples.

  • Image Generation Instructions: Removed unrealistic instructions for generating images.

  • N/A Outputs and Inconsistent Inputs: Corrected code snippets with N/A outputs and standardized the formatting of empty inputs.

  • Incorrect Answers: Identified and fixed wrong answers, particularly in math problems.

  • Unclear Instructions: Clarified or re-wrote non-sensical or unclear instructions.

  • Control Characters: Removed extraneous escape and control characters present in the original dataset.

Original Alpaca Dataset Overview

  • Consists of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine.

  • Aimed at instruction-tuning to enhance language models' ability to follow instructions.

  • Modifications from the original data generation pipeline include using text-davinci-003, a new prompt for instruction generation, and a more efficient data generation approach.

  • The dataset is noted for its diversity and cost-effectiveness in generation.

Dataset Structure and Contents:

  • Fields include instruction (task description), input (context or additional information), and output (answer generated by text-davinci-003).

  • The text field combines these elements using a specific prompt template (a sketch of this template follows after this list).

  • The dataset is primarily structured for training purposes, with 52,002 instances in the training split.
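To make this structure concrete, here is a minimal sketch of how an instruction/input/output record is combined into a single prompt using the widely cited Stanford Alpaca template. The record values below are invented for illustration, and the exact template wording should be checked against the dataset card before training.

# A minimal sketch of how an Alpaca-style record is combined into a single
# training prompt. The record values below are invented for illustration.
record = {
    "instruction": "Summarise the following paragraph in one sentence.",
    "input": "Git LFS replaces large files with lightweight pointers inside Git.",
    "output": "Git LFS keeps large files outside the repository and stores pointers in Git.",
}

# The widely cited Stanford Alpaca prompt template, with one variant for
# records that carry an input and one for records that do not.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

template = PROMPT_WITH_INPUT if record["input"] else PROMPT_NO_INPUT
print(template.format(**record))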

Intended Use and Considerations:

  • Primarily designed for training pretrained language models on instruction-following tasks.

  • The dataset, primarily in English, poses potential risks like harmful content dissemination and requires careful use and further refinement to address errors or biases.

After filtering by dataset name, you will see all the datasets matching that name. We will be downloading yahma/alpaca-cleaned:

Once in the dataset repository, click on the button with three horizontal dots. This gives you the option to use git clone to download the dataset to your directory.

When you click on the three horizontal dots, a dialog box appears providing the command line for a git clone download of the dataset. Follow the instructions below to git clone the dataset into the axolotl environment.

Go into the primary axolotl directory and then enter the following command:

git clone https://huggingface.co/datasets/yahma/alpaca-cleaned

This command will create a folder called alpaca-cleaned and download the dataset files from the Huggingface repository into it.
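Once the clone has finished, you can sanity-check the download with the Hugging Face datasets library. The snippet below is a sketch only: it assumes the repository was cloned into ./alpaca-cleaned inside your current directory and that the records are stored as JSON files, so check the actual file names in the cloned folder.

import glob
from datasets import load_dataset

# Assumes the repository was cloned into ./alpaca-cleaned and that the
# instruction/input/output records are stored as JSON files.
data_files = glob.glob("alpaca-cleaned/*.json")

ds = load_dataset("json", data_files=data_files, split="train")
print(ds)     # row count and column names
print(ds[0])  # first record: instruction, input, output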

Git Clone deconstruction

In the context of the command git clone https://huggingface.co/datasets/yahma/alpaca-cleaned, "yahma" refers to the username or organisation name within the Hugging Face Datasets repository. Here's a breakdown of the components of this command:

  1. git clone: This is a Git command used to clone a repository. It makes a copy of the specified repository and downloads it to your local machine.

  2. https://huggingface.co/datasets: This URL points to the Hugging Face Datasets repository. Hugging Face hosts machine learning models, datasets, and related tools. The /datasets part indicates that the repository being cloned is a dataset repository.

  3. yahma: This is the username or the name of the organisation on the Hugging Face platform that owns the repository you are cloning. In this case, 'yahma' is the entity that has uploaded or maintained the dataset named 'alpaca-cleaned'.

  4. alpaca-cleaned: This is the name of the specific dataset repository under the user or organisation 'yahma' on Hugging Face. As described above, it is the cleaned and corrected version of the original Stanford Alpaca dataset.

When you run this command, you clone the 'alpaca-cleaned' dataset from the 'yahma' user or organisation's space on Hugging Face to your local machine. This allows you to use or analyse the dataset directly on your computer.
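If git or Git LFS is not available on your machine, the same repository can also be fetched programmatically with the huggingface_hub client. The sketch below mirrors the username/repository-name structure described above; treat the local_dir choice as an assumption you can change.

from huggingface_hub import snapshot_download

# Download the yahma/alpaca-cleaned dataset repository into a local folder.
# repo_type="dataset" is needed because the default repo_type is a model.
local_path = snapshot_download(
    repo_id="yahma/alpaca-cleaned",
    repo_type="dataset",
    local_dir="alpaca-cleaned",
)
print("Dataset files downloaded to:", local_path)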

Remember where you stored your dataset - the path is required when you prepare your Axolotl configuration file.
