# Downloading Huggingface Datasets

Hugging Face datasets can be <mark style="color:yellow;">downloaded and loaded</mark> using various methods.&#x20;

Here's a summary:

<mark style="color:blue;">**From Hugging Face Hub Without a Loading Script**</mark>

You can load datasets directly from any dataset repository on the Hub using the <mark style="color:yellow;">**`load_dataset()`**</mark> function. Provide the repository namespace and dataset name to load the dataset.

<mark style="color:blue;">**Local Loading Script**</mark>

If you have a local HuggingFace Datasets loading script, you can load the dataset by specifying the local path to the loading script file or the directory containing it.

<mark style="color:blue;">**Local and Remote Files**</mark>

Datasets stored as CSV, JSON, TXT, Parquet, or Arrow files on your computer or remotely can be loaded using the <mark style="color:yellow;">**`load_dataset()`**</mark> function. Specify the file type and the path or URL to the data files.

<mark style="color:blue;">**In-memory Data**</mark>

You can create a dataset directly from in-memory data structures like Python dictionaries and Pandas DataFrames using functions like <mark style="color:yellow;">**`from_dict()`**</mark> and <mark style="color:yellow;">**`from_pandas()`**</mark><mark style="color:yellow;">**.**</mark>

<mark style="color:blue;">**Offline**</mark>

Datasets can be loaded offline if they are stored locally or if you have previously downloaded them.

<mark style="color:blue;">**Specific Slice of a Split**</mark>

You can load specific slices of a dataset split by using the <mark style="color:yellow;">**`split`**</mark> parameter in the <mark style="color:yellow;">**`load_dataset()`**</mark> function.

<mark style="color:blue;">**Multiprocessing**</mark>

For datasets consisting of several files, you can speed up the downloading and preparation using the <mark style="color:yellow;">**`num_proc`**</mark> parameter to set the number of processes for parallel execution.

<mark style="color:blue;">**SQL**</mark>

Datasets can be read from SQL databases using <mark style="color:yellow;">**`from_sql()`**</mark> by specifying the URI to connect to your database.

<mark style="color:blue;">**Arrow Streaming Format**</mark>

The Huggingface Datasets library can load local Arrow files directly using <mark style="color:yellow;">**`Dataset.from_file()`**</mark><mark style="color:yellow;">**.**</mark> This method memory-maps the Arrow file without preparing the dataset in the cache.

<mark style="color:blue;">**Python Generator**</mark>

A dataset can be created from a Python generator with <mark style="color:yellow;">**`from_generator()`**</mark><mark style="color:yellow;">**.**</mark> This method supports loading data larger than available memory and can also define a sharded dataset.

### <mark style="color:blue;">Key Features and Functionalities</mark>

* **Flexibility**: The library <mark style="color:yellow;">**can handle datasets stored in various formats and locations, including local and remote repositories, and in-memory data.**</mark>
* **Dataset Splits**: Data can be <mark style="color:yellow;">**mapped to specific splits like 'train', 'test', and 'validation'**</mark> using the <mark style="color:yellow;">**`data_files`**</mark> parameter. This parameter accepts file paths mapped to split names.
* **Version Control**: You can load different versions of a dataset based on Git tags, branches, or commits using the <mark style="color:yellow;">**`revision`**</mark> parameter.
* **Subset Loading**: For large datasets, you have the option to load only a subset of files, which is useful for large datasets like C4 (around 13TB).
* **Pattern Matching**: Load files that match specific patterns or from specified directories within a dataset repository.
* **No Loading Script Required**: The library allows loading datasets without the need for a custom loading script, simplifying the process.

### <mark style="color:blue;">Custom Datasets</mark>

* **Custom Dataset Repositories**: Users can create their dataset repositories on the Hugging Face Hub, facilitating easy sharing and loading of datasets.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://axolotl.continuumlabs.pro/download-the-dataset/downloading-huggingface-datasets.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
