Downloading Hugging Face Datasets
Downloading methods
Hugging Face datasets can be downloaded and loaded using various methods.
Here's a summary:
From Hugging Face Hub Without a Loading Script
You can load datasets directly from any dataset repository on the Hub using the load_dataset() function. Provide the repository namespace and dataset name to load the dataset.
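A minimal sketch (the repository name "username/dataset_name" is a placeholder):

```python
from datasets import load_dataset

# Load every available split of a dataset hosted on the Hub.
# "username/dataset_name" is a hypothetical repository identifier.
dataset = load_dataset("username/dataset_name")
print(dataset)  # DatasetDict keyed by split name
```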
Local Loading Script
If you have a local HuggingFace Datasets loading script, you can load the dataset by specifying the local path to the loading script file or the directory containing it.
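For example (the paths here are hypothetical):

```python
from datasets import load_dataset

# Point load_dataset() at the script file itself...
dataset = load_dataset("path/to/my_loading_script.py", split="train")

# ...or at the directory that contains it.
dataset = load_dataset("path/to/my_loading_script")
```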
Local and Remote Files
Datasets stored as CSV, JSON, TXT, Parquet, or Arrow files on your computer or remotely can be loaded using the load_dataset() function. Specify the file type and the path or URL to the data files.
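For example, assuming hypothetical local and remote files:

```python
from datasets import load_dataset

# A local CSV file; the first positional argument names the file type.
dataset = load_dataset("csv", data_files="my_file.csv")

# A remote JSON Lines file referenced by URL.
dataset = load_dataset("json", data_files="https://example.com/data.jsonl")
```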
In-memory Data
You can create a dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames using functions like from_dict() and from_pandas().
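A short sketch of both constructors:

```python
import pandas as pd
from datasets import Dataset

# From a dictionary mapping column names to lists of values.
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# From a pandas DataFrame with the same columns.
df = pd.DataFrame({"text": ["foo", "bar"], "label": [1, 0]})
ds = Dataset.from_pandas(df)
```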
Offline
Datasets can be loaded offline if they are stored locally or if you have previously downloaded them.
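One way to do this is the HF_DATASETS_OFFLINE environment variable; note that it must be set before the library is imported (the repository name below is hypothetical):

```python
import os

# Tell the library not to contact the Hub and to use the local cache only.
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# Succeeds only if the dataset was downloaded (and cached) earlier.
dataset = load_dataset("username/dataset_name")
```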
Specific Slice of a Split
You can load specific slices of a dataset split by using the split parameter of the load_dataset() function.
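The slicing syntax supports both absolute indices and percentages (hypothetical repository name):

```python
from datasets import load_dataset

# First 100 examples of the train split.
ds = load_dataset("username/dataset_name", split="train[:100]")

# Examples between the 10% and 20% marks of the train split.
ds = load_dataset("username/dataset_name", split="train[10%:20%]")
```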
Multiprocessing
For datasets consisting of several files, you can speed up downloading and preparation with the num_proc parameter, which sets the number of processes used for parallel execution.
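For example (hypothetical repository; pick num_proc to suit your machine):

```python
from datasets import load_dataset

# Download and prepare the dataset's files using 8 worker processes.
dataset = load_dataset("username/multi_file_dataset", num_proc=8)
```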
SQL
Datasets can be read from SQL databases using from_sql() by specifying a SQL query or table name along with the URI used to connect to your database.
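A sketch using a local SQLite database (the file and table names are hypothetical):

```python
from datasets import Dataset

# Read an entire table by name.
ds = Dataset.from_sql("my_table", con="sqlite:///my_database.db")

# Or run an arbitrary SQL query.
ds = Dataset.from_sql(
    "SELECT text, label FROM my_table;",
    con="sqlite:///my_database.db",
)
```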
Arrow Streaming Format
The Hugging Face Datasets library can load local Arrow files directly using Dataset.from_file(). This method memory-maps the Arrow file without preparing the dataset in the cache.
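For example (hypothetical file path):

```python
from datasets import Dataset

# Memory-maps data.arrow directly; no cache-preparation step is run.
ds = Dataset.from_file("data.arrow")
```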
Python Generator
A dataset can be created from a Python generator with from_generator(). This method supports loading data larger than available memory and can also define a sharded dataset.
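A minimal sketch; the shard list here is a stand-in for, say, one data file per shard:

```python
from datasets import Dataset

def gen(shards):
    # A real generator would typically read one data file per shard.
    for shard in shards:
        for i in range(3):
            yield {"shard": shard, "value": i}

shards = [f"shard_{i}" for i in range(4)]  # hypothetical shard identifiers

# List-valued entries in gen_kwargs are distributed across shards,
# which is what makes the dataset shardable (and parallelizable via num_proc).
ds = Dataset.from_generator(gen, gen_kwargs={"shards": shards})
```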
Key Features and Functionalities
Flexibility: The library can handle datasets stored in various formats and locations, including local and remote repositories, and in-memory data.
Dataset Splits: Data can be mapped to specific splits like 'train', 'test', and 'validation' using the data_files parameter, which accepts file paths mapped to split names (see the combined sketch after this list).
Version Control: You can load different versions of a dataset based on Git tags, branches, or commits using the revision parameter.
Subset Loading: For very large datasets such as C4 (around 13 TB), you can load only a subset of the files rather than the whole dataset.
Pattern Matching: Load files that match specific patterns or from specified directories within a dataset repository.
No Loading Script Required: The library allows loading datasets without the need for a custom loading script, simplifying the process.
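The sketch below combines several of these features in one call (the repository name, file patterns, and revision are hypothetical):

```python
from datasets import load_dataset

dataset = load_dataset(
    "username/dataset_name",
    # Map glob patterns inside the repository to named splits.
    data_files={"train": "train/*.csv", "test": "test/*.csv"},
    # Pin the dataset to a Git tag, branch, or commit hash.
    revision="main",
)
```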
Custom Datasets
Custom Dataset Repositories: Users can create their own dataset repositories on the Hugging Face Hub, making it easy to share and load datasets.
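A sketch of sharing a dataset this way (hypothetical repository name; assumes you are logged in, e.g. via huggingface-cli login):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# Create (or update) a dataset repository under your namespace on the Hub.
ds.push_to_hub("username/my_custom_dataset")
```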