# Datasets

### <mark style="color:blue;">Dataset preprocessing in Axolotl involves several steps</mark>

1. Parsing the dataset based on the specified dataset format.
2. Transforming the dataset according to the chosen prompt strategy.
3. Tokenizing the dataset using the configured model and tokenizer.
4. Shuffling and merging multiple datasets together if more than one is used.
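The four steps above can be sketched in Python. This is an illustrative outline only, with toy stand-ins (`parse_row`, `apply_prompt`, a whitespace tokenizer) rather than Axolotl's actual implementation:

```python
import random

def parse_row(row):
    # 1. Parse a raw record according to the dataset format (toy stand-in)
    return {"instruction": row["instruction"], "output": row["output"]}

def apply_prompt(sample):
    # 2. Render the sample with the chosen prompt strategy (toy template)
    return f"### Instruction:\n{sample['instruction']}\n### Response:\n{sample['output']}"

def tokenize(text):
    # 3. Tokenize with the configured tokenizer (whitespace split as a stand-in)
    return text.split()

def preprocess(datasets, seed=42):
    prepared = []
    for dataset in datasets:
        prepared.extend(tokenize(apply_prompt(parse_row(r))) for r in dataset)
    # 4. Shuffle the merged datasets together
    random.Random(seed).shuffle(prepared)
    return prepared

ds1 = [{"instruction": "Add 2+2", "output": "4"}]
ds2 = [{"instruction": "Capital of France?", "output": "Paris"}]
merged = preprocess([ds1, ds2])
```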

There are two ways to perform dataset preprocessing in Axolotl:

1. **Before training**, by running the command:

   ```bash
   python -m axolotl.cli.preprocess /path/to/your.yaml --debug
   ```

   This approach lets you preprocess the datasets separately from the training process.

2. **During training.** In this case, preprocessing happens automatically when you start the training run.

### <mark style="color:blue;">The benefits of preprocessing datasets include</mark>

* <mark style="color:green;">**Faster training iterations:**</mark> When training interactively or performing sweeps (restarting the trainer frequently), preprocessing the datasets beforehand can save time and avoid the frustration of waiting for preprocessing to complete each time.
* <mark style="color:green;">**Caching:**</mark> Axolotl caches the tokenized/formatted datasets based on a hash of dependent training parameters. This means that if the same preprocessing configuration is used, Axolotl can intelligently retrieve the preprocessed data from the cache, saving time and resources.
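The caching idea can be illustrated with a short sketch. The function name and parameter fields below are hypothetical, not Axolotl's internals; the point is that serializing the preprocessing-relevant parameters deterministically and hashing them yields a stable cache key:

```python
import hashlib
import json

def cache_fingerprint(params: dict) -> str:
    # Hypothetical sketch: serialize the dependent training parameters
    # deterministically (sorted keys), then hash them to key the cache.
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

# Same parameters (in any order) -> same cache key, so prepared data can
# be reused; changing any parameter produces a new key and a fresh run.
a = cache_fingerprint({"tokenizer": "llama", "sequence_len": 2048})
b = cache_fingerprint({"sequence_len": 2048, "tokenizer": "llama"})
```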

The cache location is controlled by the <mark style="color:yellow;">**`dataset_prepared_path`**</mark> parameter in the configuration YAML file.

If it is left empty, the processed dataset will be cached in the default path <mark style="color:yellow;">**`./last_run_prepared/`**</mark> during training, but any existing cached data there will be ignored on subsequent runs.

By explicitly setting <mark style="color:yellow;">**`dataset_prepared_path: ./last_run_prepared`**</mark>, the trainer will reuse the preprocessed data from the cache.
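For example, to reuse previously prepared data across runs, the configuration might include:

```yaml
# Reuse cached, preprocessed data from this directory on subsequent runs
dataset_prepared_path: ./last_run_prepared
```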

### <mark style="color:blue;">Edge cases to consider</mark>

* **Custom prompt strategies or user-defined prompt templates:** If you are writing a custom prompt strategy or using a user-defined prompt template, the trainer may not detect changes to the prompt templating logic automatically. In that case, if <mark style="color:yellow;">**`dataset_prepared_path: ...`**</mark> is set, the trainer may not pick up your changes and will continue using data cached with the old prompt.
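One way to avoid serving stale data (a sketch, assuming the default `./last_run_prepared` cache directory) is to clear the cache after editing the template and then re-run preprocessing:

```shell
# Simulate a stale cache directory (stand-in for the real prepared-data path)
mkdir -p ./last_run_prepared
# Remove it so the next run re-tokenizes with the updated prompt logic
rm -rf ./last_run_prepared
# Then re-run: python -m axolotl.cli.preprocess /path/to/your.yaml --debug
```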

Here's an example of how you can implement dataset preprocessing in Axolotl:

Define your dataset configuration in the YAML file:

```yaml
datasets:
  - path: /path/to/dataset1.json
    type: json
  - path: /path/to/dataset2.csv
    type: csv

dataset_prepared_path: ./preprocessed_data
```

<mark style="color:green;">Run the preprocessing command</mark>

```bash
python -m axolotl.cli.preprocess /path/to/your.yaml --debug
```

This command will preprocess the datasets specified in the YAML file and cache the preprocessed data in the `./preprocessed_data` directory.

<mark style="color:green;">Start the training process</mark>

```bash
python -m axolotl.cli.train /path/to/your.yaml
```

The trainer will automatically use the preprocessed data from the cache, saving time and resources during training.

Remember to be cautious when making changes to custom prompt strategies or user-defined prompt templates. If you have `dataset_prepared_path: ...` set, make sure to either update the cache by running the preprocessing command again or remove the `dataset_prepared_path:` parameter to force the trainer to preprocess the data with the updated prompt logic.

By understanding and properly implementing dataset preprocessing in Axolotl, you can optimize your training workflow, save time, and ensure that your datasets are properly prepared for training your machine learning models.
