# Datasets

### <mark style="color:blue;">Dataset preprocessing in Axolotl involves several steps</mark>

1. Parsing the dataset based on the specified dataset format.
2. Transforming the dataset according to the chosen prompt strategy.
3. Tokenizing the dataset using the configured model and tokenizer.
4. Shuffling and merging multiple datasets together if more than one is used.
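The four steps above can be sketched in Python. This is an illustrative outline only, with toy stand-ins (`parse_row`, `apply_prompt`, a whitespace tokenizer) rather than Axolotl's actual implementation:

```python
import random

def parse_row(row):
    # 1. Parse a raw record according to the dataset format (toy stand-in)
    return {"instruction": row["instruction"], "output": row["output"]}

def apply_prompt(sample):
    # 2. Render the sample with the chosen prompt strategy (toy template)
    return f"### Instruction:\n{sample['instruction']}\n### Response:\n{sample['output']}"

def tokenize(text):
    # 3. Tokenize with the configured tokenizer (whitespace split as a stand-in)
    return text.split()

def preprocess(datasets, seed=42):
    prepared = []
    for dataset in datasets:
        prepared.extend(tokenize(apply_prompt(parse_row(r))) for r in dataset)
    # 4. Shuffle the merged datasets together
    random.Random(seed).shuffle(prepared)
    return prepared

ds1 = [{"instruction": "Add 2+2", "output": "4"}]
ds2 = [{"instruction": "Capital of France?", "output": "Paris"}]
merged = preprocess([ds1, ds2])
```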

There are two ways to perform dataset preprocessing in Axolotl:

1. **Before training**, by running the command:

   ```bash
   python -m axolotl.cli.preprocess /path/to/your.yaml --debug
   ```

   This approach lets you preprocess the datasets separately from the training process.

2. **During training.** In this case, preprocessing happens automatically when you start the training run.

### <mark style="color:blue;">The benefits of preprocessing datasets include</mark>

* <mark style="color:green;">**Faster training iterations:**</mark> When training interactively or performing sweeps (restarting the trainer frequently), preprocessing the datasets beforehand can save time and avoid the frustration of waiting for preprocessing to complete each time.
* <mark style="color:green;">**Caching:**</mark> Axolotl caches the tokenized/formatted datasets based on a hash of dependent training parameters. This means that if the same preprocessing configuration is used, Axolotl can intelligently retrieve the preprocessed data from the cache, saving time and resources.
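The caching idea can be illustrated with a short sketch. The function name and parameter fields below are hypothetical, not Axolotl's internals; the point is that serializing the preprocessing-relevant parameters deterministically and hashing them yields a stable cache key:

```python
import hashlib
import json

def cache_fingerprint(params: dict) -> str:
    # Hypothetical sketch: serialize the dependent training parameters
    # deterministically (sorted keys), then hash them to key the cache.
    blob = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

# Same parameters (in any order) -> same cache key, so prepared data can
# be reused; changing any parameter produces a new key and a fresh run.
a = cache_fingerprint({"tokenizer": "llama", "sequence_len": 2048})
b = cache_fingerprint({"sequence_len": 2048, "tokenizer": "llama"})
```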

The cache location is controlled by the <mark style="color:yellow;">**`dataset_prepared_path`**</mark> parameter in the configuration YAML file.

If it is left empty, the processed dataset will be cached in the default path <mark style="color:yellow;">**`./last_run_prepared/`**</mark> during training, but any existing cached data there will be ignored on subsequent runs.

By explicitly setting <mark style="color:yellow;">**`dataset_prepared_path: ./last_run_prepared`**</mark>, the trainer will reuse the preprocessed data from the cache.
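For example, to reuse previously prepared data across runs, the configuration might include:

```yaml
# Reuse cached, preprocessed data from this directory on subsequent runs
dataset_prepared_path: ./last_run_prepared
```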

### <mark style="color:blue;">Edge cases to consider</mark>

* **Custom prompt strategies or user-defined prompt templates:** If you are writing a custom prompt strategy or using a user-defined prompt template, the trainer may not detect changes to the prompt templating logic automatically. In that case, if <mark style="color:yellow;">**`dataset_prepared_path: ...`**</mark> is set, the trainer may not pick up your changes and will continue using data cached with the old prompt.
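One way to avoid serving stale data (a sketch, assuming the default `./last_run_prepared` cache directory) is to clear the cache after editing the template and then re-run preprocessing:

```shell
# Simulate a stale cache directory (stand-in for the real prepared-data path)
mkdir -p ./last_run_prepared
# Remove it so the next run re-tokenizes with the updated prompt logic
rm -rf ./last_run_prepared
# Then re-run: python -m axolotl.cli.preprocess /path/to/your.yaml --debug
```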

Here's an example of how you can implement dataset preprocessing in Axolotl:

Define your dataset configuration in the YAML file:

```yaml
datasets:
  - path: /path/to/dataset1.json
    type: json
  - path: /path/to/dataset2.csv
    type: csv

dataset_prepared_path: ./preprocessed_data
```

<mark style="color:green;">Run the preprocessing command</mark>

```bash
python -m axolotl.cli.preprocess /path/to/your.yaml --debug
```

This command will preprocess the datasets specified in the YAML file and cache the preprocessed data in the `./preprocessed_data` directory.

<mark style="color:green;">Start the training process</mark>

```bash
python -m axolotl.cli.train /path/to/your.yaml
```

The trainer will automatically use the preprocessed data from the cache, saving time and resources during training.

Remember to be cautious when making changes to custom prompt strategies or user-defined prompt templates. If you have `dataset_prepared_path: ...` set, make sure to either update the cache by running the preprocessing command again or remove the `dataset_prepared_path:` parameter to force the trainer to preprocess the data with the updated prompt logic.

By understanding and properly implementing dataset preprocessing in Axolotl, you can optimize your training workflow, save time, and ensure that your datasets are properly prepared for training your machine learning models.
