Downloading Hugging Face Datasets
Downloading methods
Hugging Face datasets can be downloaded and loaded using various methods.
Here's a summary:
From Hugging Face Hub Without a Loading Script
You can load datasets directly from any dataset repository on the Hub using the load_dataset() function. Provide the repository namespace and dataset name to load the dataset.
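A minimal sketch (the repository name "username/dataset_name" is a placeholder):

```python
from datasets import load_dataset

# Load every available split of a dataset hosted on the Hub.
# "username/dataset_name" is a hypothetical repository identifier.
dataset = load_dataset("username/dataset_name")
print(dataset)  # DatasetDict keyed by split name
```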
Local Loading Script
If you have a local HuggingFace Datasets loading script, you can load the dataset by specifying the local path to the loading script file or the directory containing it.
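For example (the paths here are hypothetical):

```python
from datasets import load_dataset

# Point load_dataset() at the script file itself...
dataset = load_dataset("path/to/my_loading_script.py", split="train")

# ...or at the directory that contains it.
dataset = load_dataset("path/to/my_loading_script")
```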
Local and Remote Files
Datasets stored as CSV, JSON, TXT, Parquet, or Arrow files on your computer or remotely can be loaded using the load_dataset() function. Specify the file type and the path or URL to the data files.
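For example, assuming hypothetical local and remote files:

```python
from datasets import load_dataset

# A local CSV file; the first positional argument names the file type.
dataset = load_dataset("csv", data_files="my_file.csv")

# A remote JSON Lines file referenced by URL.
dataset = load_dataset("json", data_files="https://example.com/data.jsonl")
```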
In-memory Data
You can create a dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames using functions like from_dict() and from_pandas().
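A short sketch of both constructors:

```python
import pandas as pd
from datasets import Dataset

# From a dictionary mapping column names to lists of values.
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# From a pandas DataFrame with the same columns.
df = pd.DataFrame({"text": ["foo", "bar"], "label": [1, 0]})
ds = Dataset.from_pandas(df)
```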
Offline
Datasets can be loaded offline if they are stored locally or if you have previously downloaded them.
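One way to do this is the HF_DATASETS_OFFLINE environment variable; note that it must be set before the library is imported (the repository name below is hypothetical):

```python
import os

# Tell the library not to contact the Hub and to use the local cache only.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# Succeeds only if the dataset was downloaded (and cached) earlier.
dataset = load_dataset("username/dataset_name")
```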
Specific Slice of a Split
You can load specific slices of a dataset split by using the split parameter of the load_dataset() function.
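The slicing syntax supports both absolute indices and percentages (hypothetical repository name):

```python
from datasets import load_dataset

# First 100 examples of the train split.
ds = load_dataset("username/dataset_name", split="train[:100]")

# Examples between the 10% and 20% marks of the train split.
ds = load_dataset("username/dataset_name", split="train[10%:20%]")
```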
Multiprocessing
For datasets consisting of several files, you can speed up downloading and preparation with the num_proc parameter, which sets the number of processes used for parallel execution.
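For example (hypothetical repository; pick num_proc to suit your machine):

```python
from datasets import load_dataset

# Download and prepare the dataset's files using 8 worker processes.
dataset = load_dataset("username/multi_file_dataset", num_proc=8)
```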
SQL
Datasets can be read from SQL databases using from_sql() by specifying a SQL query or table name along with the URI used to connect to your database.
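A sketch using a local SQLite database (the file and table names are hypothetical):

```python
from datasets import Dataset

# Read an entire table by name.
ds = Dataset.from_sql("my_table", con="sqlite:///my_database.db")

# Or run an arbitrary SQL query.
ds = Dataset.from_sql(
    "SELECT text, label FROM my_table;",
    con="sqlite:///my_database.db",
)
```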
Arrow Streaming Format
The Hugging Face Datasets library can load local Arrow files directly using Dataset.from_file(). This method memory-maps the Arrow file without preparing the dataset in the cache.
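For example (hypothetical file path):

```python
from datasets import Dataset

# Memory-maps data.arrow directly; no cache-preparation step is run.
ds = Dataset.from_file("data.arrow")
```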
Python Generator
A dataset can be created from a Python generator with from_generator(). This method supports loading data larger than available memory and can also define a sharded dataset.
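A minimal sketch; the shard list here is a stand-in for, say, one data file per shard:

```python
from datasets import Dataset

def gen(shards):
    # A real generator would typically read one data file per shard.
    for shard in shards:
        for i in range(3):
            yield {"shard": shard, "value": i}

shards = [f"shard_{i}" for i in range(4)]  # hypothetical shard identifiers

# List-valued entries in gen_kwargs are distributed across shards,
# which is what makes the dataset shardable (and parallelizable via num_proc).
ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards})
```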
Key Features and Functionalities
Flexibility: The library can handle datasets stored in various formats and locations, including local and remote repositories, and in-memory data.
Dataset Splits: Data can be mapped to specific splits like 'train', 'test', and 'validation' using the data_files parameter, which accepts file paths mapped to split names (see the combined sketch after this list).
Version Control: You can load different versions of a dataset based on Git tags, branches, or commits using the revision parameter.
Subset Loading: For very large datasets such as C4 (around 13 TB), you can load only a subset of the files rather than the whole dataset.
Pattern Matching: Load files that match specific patterns or from specified directories within a dataset repository.
No Loading Script Required: The library allows loading datasets without the need for a custom loading script, simplifying the process.
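The sketch below combines several of these features in one call (the repository name, file patterns, and revision are hypothetical):

```python
from datasets import load_dataset

dataset = load_dataset(
    "username/dataset_name",
    # Map glob patterns inside the repository to named splits.
    data_files={"train": "train/*.csv", "test": "test/*.csv"},
    # Pin the dataset to a Git tag, branch, or commit hash.
    revision="main",
)
```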
Custom Datasets
Custom Dataset Repositories: Users can create their own dataset repositories on the Hugging Face Hub, making it easy to share and load datasets.
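A sketch of sharing a dataset this way (hypothetical repository name; assumes you are logged in, e.g. via huggingface-cli login):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# Create (or update) a dataset repository under your namespace on the Hub.
ds.push_to_hub("username/my_custom_dataset")
```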