Download the dataset

Once the model has been downloaded, the next step is to download the dataset.

Before doing so, we will document how datasets work on the Hugging Face Hub and then provide specific instructions on how to download a dataset into the Axolotl platform.

The HuggingFace Hub has numerous datasets

HuggingFace Dataset Summary

The Hugging Face Hub hosts a wide array of datasets for various tasks like translation, speech recognition, and image classification. These datasets are stored in Git repositories and contain scripts for data download and split generation.

  1. Dataset Viewer: Many datasets, like GLUE, feature a Dataset Viewer to preview data.

  2. Repository Structure: Each dataset repository has a specific structure for efficient data handling. Adhering to this structure ensures the dataset page on the Hub will have a Viewer.

  3. Search and Filter: Datasets can be searched and filtered by language, tasks, and licenses on the Hub.

  4. Privacy Settings: Dataset visibility can be toggled between private and public. For organization-owned datasets, these settings apply to all members.

  5. Dataset Cards: Each dataset is documented with a README.md file in the repository, acting as a dataset card. This card provides context and usage instructions and can include metadata like license, language, size, and tags for easy discovery.

  6. Metadata in Dataset Cards: Metadata added to the dataset card enables interactions on the Hub, such as filtering and discovering datasets and displaying licenses (see the sketch after this list).

  7. Linking to Papers: Including a paper link on arXiv in the dataset card adds the dataset to relevant tags and facilitates finding models citing the same paper.

  8. Gated Datasets: Creators can control access to their datasets. Users must agree to terms and share contact information to access these datasets. Dataset owners can view user access reports.

  9. Customizing User Access Prompts: The access request dialog can be customized with additional text and checkbox fields for specific user agreements.

  10. Manual Approval of Access: Dataset authors can choose to manually review and approve access requests. An API is available for managing access requests.

  11. Notifications: Default notification settings send daily emails for new access requests. These can be customized for real-time updates or sent to a specific email.

  12. Additional Customization: Text in the gate's heading and button can be customized to suit specific requirements.
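
As a concrete illustration of points 5 and 6, the snippet below builds a minimal dataset card and its metadata with the huggingface_hub client library. This is a sketch only: the repository name username/my-dataset is a placeholder and the metadata values are examples, not recommendations.

from huggingface_hub import DatasetCard, DatasetCardData

# Metadata that ends up in the YAML header of README.md and drives
# filtering, discovery and license display on the Hub
card_data = DatasetCardData(
    license="apache-2.0",
    language="en",
    task_categories=["text-generation"],
    tags=["instruction-tuning"],
)

# The card itself: YAML front matter followed by free-form documentation
card = DatasetCard(
    f"---\n{card_data.to_yaml()}\n---\n\n"
    "# My dataset\n\nDescribe the contents, sources and intended use here.\n"
)

# Upload README.md to the (placeholder) dataset repository
card.push_to_hub("username/my-dataset")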

Uploading datasets for future use

Uploading Huggingface Datasets

The Hugging Face Hub allows you to upload and share a wide range of datasets. Here is a summary of the key points for uploading datasets:

  1. Account Creation: Start by creating a Hugging Face Hub account.

  2. Upload Using the Hub UI: This user-friendly interface allows even non-developers to upload datasets.

  3. Creating a Repository: A repository is where your dataset files and their revision history are stored. It supports multiple dataset versions.

  4. Uploading a Dataset: After creating a repository, you can upload dataset files through the "Files and versions" tab. The Hub supports various file formats like .csv, .mp3, and .jpg.

  5. Dataset Card Creation: This is crucial for helping users discover and understand your dataset. It involves selecting important metadata tags and writing detailed documentation about your dataset.

  6. Dataset Viewer: For public datasets, the Dataset Viewer allows users to preview the data before downloading.

  7. Using the huggingface_hub Client Library: This library offers advanced features for managing repositories and uploading datasets (see the sketch below).

  8. Using Other Libraries: Libraries like 🤗 Datasets, Pandas, Dask, or DuckDB can also upload files to the Hub.

  9. Using Git: Dataset repositories, being Git repositories, allow the use of Git to push data files to the Hub.

  10. Supported File Formats: The Hub supports a variety of file formats, including CSV, JSON, Parquet, Text, Images, and Audio files. Compressed files in formats like ZIP, GZIP, and others are also supported.

  11. Downloading Datasets: The Hub also facilitates the downloading of datasets.

This summary provides a comprehensive overview of how to effectively utilize the Hugging Face Hub for uploading and managing datasets, emphasizing its accessibility and support for a wide range of file formats.
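
To make the huggingface_hub route (point 7) concrete, here is a minimal sketch of creating a dataset repository and uploading a single file. The repository name username/my-dataset and the file name train.jsonl are placeholders, and the sketch assumes you have already authenticated (for example with huggingface-cli login or an HF_TOKEN environment variable).

from huggingface_hub import HfApi

api = HfApi()

# Create the dataset repository if it does not already exist
api.create_repo("username/my-dataset", repo_type="dataset", exist_ok=True)

# Upload a local data file into the repository
api.upload_file(
    path_or_fileobj="train.jsonl",   # local file (placeholder name)
    path_in_repo="train.jsonl",      # destination path inside the repository
    repo_id="username/my-dataset",
    repo_type="dataset",
)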

Download a Huggingface dataset

Downloading datasets from the Hugging Face Hub can be accomplished through several methods.

We will be using the 'git clone' method:

Using Git

  • All datasets on the Hub are stored as Git repositories, allowing for cloning directly to the local machine.

  • This method is particularly useful for large datasets or when you require the entire dataset repository.

  • Before cloning, ensure Git Large File Storage (LFS) is installed with git lfs install.

  • Clone the dataset using the command:

git clone git@hf.co:datasets/<dataset ID>

  • Replace <dataset ID> with the actual ID of the dataset you wish to clone (e.g., git clone git@hf.co:datasets/allenai/c4).

  • If you have write-access to the dataset repository, you can also commit and push revisions.

  • To push changes or access private repositories, add your SSH public key to your user settings on Hugging Face.
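
Although this guide uses the git clone method, the same repositories can also be fetched from Python with the huggingface_hub client library. The sketch below reuses the allenai/c4 example above; the allow_patterns value is purely illustrative and should be adjusted to the repository's actual file layout (c4 is far too large to download in full).

from huggingface_hub import snapshot_download

# Download (part of) a dataset repository into the local cache and
# return the path to the downloaded snapshot
local_path = snapshot_download(
    repo_id="allenai/c4",
    repo_type="dataset",
    allow_patterns="en/*validation*",   # illustrative filter, adjust as needed
)
print(local_path)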

Reference: Using the Fast Download Library

The HF_HUB_ENABLE_HF_TRANSFER environment variable, when set to 1, enhances the speed of uploads and downloads from the Hugging Face Hub by utilizing a Rust-based package called hf_transfer. Here's a summary of its functionality and considerations:

Enhanced Speed

By default, Hugging Face Hub employs Python-based functions like requests.get and requests.post for uploads and downloads. While reliable, these methods may not be the most efficient for high-bandwidth machines. hf_transfer is a Rust-based package that optimizes bandwidth usage by splitting large files into smaller parts and transferring them concurrently with multiple threads. This approach has the potential to double the transfer speed.

Installation

To use hf_transfer, you must install it separately from PyPI (Python Package Index) and set the environment variable HF_HUB_ENABLE_HF_TRANSFER to 1. This enables the Rust-based transfer logic for faster operations.
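
As a minimal sketch, assuming hf_transfer has been installed with pip install hf_transfer, the variable can be exported in the shell before launching Python, or set at the very top of a script before huggingface_hub is imported:

import os

# Must be set before huggingface_hub is imported, otherwise the
# default Python-based transfer path is used
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Same kind of download as before, now using the Rust-based transfer logic
snapshot_download(repo_id="allenai/c4", repo_type="dataset",
                  allow_patterns="en/*validation*")  # illustrative filter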

Limitations

It is essential to be aware of certain limitations when using hf_transfer:

  • Debugging may be challenging since it is not purely Python-based.

  • hf_transfer lacks some user-friendly features such as resumable downloads and proxy support. These omissions are intentional to maintain the simplicity and speed of the Rust logic.

Default State

hf_transfer is not enabled by default in the Hugging Face Hub client, meaning that you need to explicitly set HF_HUB_ENABLE_HF_TRANSFER to 1 if you wish to utilize its enhanced transfer capabilities.

In summary, HF_HUB_ENABLE_HF_TRANSFER is an environment variable that, when activated, leverages the Rust-based hf_transfer package to significantly improve upload and download speeds from the Hugging Face Hub.

However, it's important to be aware of its limitations, including potential debugging challenges and the absence of certain features, as it is not the default transfer method in Hugging Face Hub.
