
Downloading Huggingface Datasets

Downloading methods

Hugging Face datasets can be downloaded and loaded using various methods.

Here's a summary:

From Hugging Face Hub Without a Loading Script

You can load datasets directly from any dataset repository on the Hub using the load_dataset() function. Provide the repository namespace and dataset name to load the dataset.
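
As a minimal sketch, assuming the target repository is public (the repository id below is a placeholder, not a specific dataset from this guide):

```python
from datasets import load_dataset

# "namespace/dataset_name" stands in for any dataset repository on the Hub.
dataset = load_dataset("namespace/dataset_name")
print(dataset)  # DatasetDict listing the splits the repository provides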

Local Loading Script

If you have a local Hugging Face Datasets loading script, you can load the dataset by specifying the path to the loading script file or to the directory containing it.
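
For illustration, assuming a loading script exists at the (placeholder) path shown:

```python
from datasets import load_dataset

# Point load_dataset() at the script itself, or at the directory containing it.
dataset = load_dataset("path/to/my_dataset/my_dataset.py", split="train")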

Local and Remote Files

Datasets stored as CSV, JSON, TXT, Parquet, or Arrow files on your computer or remotely can be loaded using the load_dataset() function. Specify the file type and the path or URL to the data files.
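
For example, assuming a local CSV file and a remote JSON file at the (placeholder) locations shown:

```python
from datasets import load_dataset

# Local CSV file
csv_dataset = load_dataset("csv", data_files="path/to/my_data.csv")

# Remote JSON file, fetched by URL
json_dataset = load_dataset("json", data_files="https://example.com/data/my_data.json")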

In-memory Data

You can create a dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames using functions like from_dict() and from_pandas().
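
For instance:

```python
import pandas as pd
from datasets import Dataset

# From a Python dictionary of columns
dict_dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# From a Pandas DataFrame
df = pd.DataFrame({"text": ["hello", "world"], "label": [0, 1]})
pandas_dataset = Dataset.from_pandas(df)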

Offline

Datasets can be loaded offline if they are stored locally or if you have previously downloaded them.
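
One way to enforce this, assuming the dataset is already present in the local cache, is the HF_DATASETS_OFFLINE environment variable (the repository id is a placeholder):

```python
import os

# Set before importing the library so its config picks the flag up.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# Loads from the local cache only; raises an error if the files were never downloaded.
dataset = load_dataset("namespace/dataset_name")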

Specific Slice of a Split

You can load specific slices of a dataset split by using the split parameter in the load_dataset() function.
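
For example, using the slicing syntax accepted by the split parameter (the repository id is a placeholder):

```python
from datasets import load_dataset

# First 1,000 examples of the training split
head = load_dataset("namespace/dataset_name", split="train[:1000]")

# A percentage-based slice
middle = load_dataset("namespace/dataset_name", split="train[10%:20%]")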

Multiprocessing

For datasets consisting of several files, you can speed up the downloading and preparation using the num_proc parameter to set the number of processes for parallel execution.
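
For example (the process count below is an arbitrary illustration and should be tuned to your machine):

```python
from datasets import load_dataset

# Download and prepare the dataset's files with 8 parallel processes.
dataset = load_dataset("namespace/dataset_name", num_proc=8)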

SQL

Datasets can be read from SQL databases using from_sql() by specifying the URI to connect to your database.
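
A sketch, assuming a local SQLite database and that SQLAlchemy is installed (the query, table, and URI are placeholders):

```python
from datasets import Dataset

# Read the result of a SQL query into a Dataset via a connection URI.
dataset = Dataset.from_sql(
    "SELECT text, label FROM examples",
    con="sqlite:///path/to/my_database.db",
)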

Arrow Streaming Format

The Hugging Face Datasets library can load local Arrow files directly using Dataset.from_file(). This method memory-maps the Arrow file without preparing the dataset in the cache.
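
For example, assuming an Arrow file already exists on disk at the (placeholder) path shown:

```python
from datasets import Dataset

# Memory-map a local Arrow file directly, skipping the usual cache preparation step.
dataset = Dataset.from_file("path/to/data.arrow")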

Python Generator

A dataset can be created from a Python generator with from_generator(). This method supports loading data larger than available memory and can also define a sharded dataset.
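
A minimal illustration with a toy generator:

```python
from datasets import Dataset

def generate_examples():
    # Yield one example (a dict of column values) at a time,
    # so the full dataset never has to fit in memory.
    for i in range(1000):
        yield {"text": f"example {i}", "label": i % 2}

dataset = Dataset.from_generator(generate_examples)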

Key Features and Functionalities

  • Flexibility: The library can handle datasets stored in various formats and locations, including local and remote repositories, and in-memory data.

  • Dataset Splits: Data can be mapped to specific splits such as 'train', 'test', and 'validation' using the data_files parameter, which accepts file paths mapped to split names (see the sketch after this list).

  • Version Control: You can load different versions of a dataset based on Git tags, branches, or commits using the revision parameter.

  • Subset Loading: You can load only a subset of a dataset's files, which is useful for very large datasets such as C4 (around 13 TB).

  • Pattern Matching: Load files that match specific patterns or from specified directories within a dataset repository.

  • No Loading Script Required: The library allows loading datasets without the need for a custom loading script, simplifying the process.
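
As a sketch of the data_files, revision, and pattern-matching options together (the repository id, file names, tag, and glob pattern are placeholders):

```python
from datasets import load_dataset

# Map specific files in the repository to named splits,
# and pin the download to a particular Git tag, branch, or commit.
dataset = load_dataset(
    "namespace/dataset_name",
    data_files={"train": "train.jsonl", "test": "test.jsonl"},
    revision="v1.0",
)

# Alternatively, load only the files matching a glob pattern inside the repository.
subset = load_dataset("namespace/dataset_name", data_files="data/en/*.json.gz")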

Custom Datasets

  • Custom Dataset Repositories: Users can create their own dataset repositories on the Hugging Face Hub, which makes datasets easy to share and load.
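
For example, after authenticating with the Hub (e.g. via huggingface-cli login), an in-memory dataset can be pushed to your own repository (the repository id below is a placeholder):

```python
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Creates (or updates) the dataset repository under your account on the Hub.
dataset.push_to_hub("your-username/my_dataset")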
