Llama3 - Preprocessing
To execute the training run, we will first preprocess the dataset.
Axolotl allows us to optionally pre-tokenize the dataset before fine-tuning. This is recommended for large datasets.
Populate the config.yaml file
To execute the preprocessing, we have to make sure the datasets component of the config.yaml is correctly populated.
You will need to direct Axolotl to where your dataset is.
To do this, find the path to your dataset in VS Code by right-clicking the dataset and copying its relative path. Then enter it into the config YAML file below, which is located at:
your directory/axolotl/examples/llama-3/lora-8b.yml
When you have located the file, populate the datasets section below with the relative path to your dataset. Remember to include the file name.
datasets:
  - path: datasets/alpagasus/data/train-00000-of-00001-0c59455170918204.parquet
    type: alpaca
    ds_type: parquet
    data_files:
      - train-00000-of-00001-0c59455170918204.parquet
dataset_prepared_path:
val_set_size: 0.10
output_dir: ./llama3-out
The dataset type (ds_type) is parquet, and we have arbitrarily set the output directory to ./llama3-out
For a refresher on what the AlpaGasus dataset contains, revisit the dataset's documentation.
Once you have configured the datasets section of the YAML file, execute the following command, which uses the preprocess script within the Axolotl library.
python -m axolotl.cli.preprocess examples/llama-3/lora-8b.yml
The output should indicate success:
Success! Preprocessed data path: `dataset_prepared_path: ./llama-data`
If you have not set a dataset_prepared_path in the configuration file, it will default to:
"preprocess CLI called without dataset_prepared_path set, using default path: last_run_prepared"
For an analysis of the output relating to the dataset preprocessing, see below:
If you would like information on how the preprocess.py script works you can view the code at axolotl/src/axolotl/cli.
Alternatively, take the time to read our analysis of the script and the output it produced:
Analysing Arrow files
To ensure that the data is fit for fine-tuning, you can perform a more comprehensive analysis.
Here are some additional steps you can take:
Check the distribution of sequence lengths
Calculate the mean, median, and percentiles of the 'length' column.
Plot a histogram or density plot of the 'length' column to visualize the distribution.
Identify if there are any outliers or extreme values in the sequence lengths.
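The length checks above can be sketched as follows. This is a minimal, hypothetical example: in practice you would load the Arrow file from your dataset_prepared_path (e.g. with pandas and pyarrow) and compute the lengths from its 'input_ids' column; here a small synthetic list stands in for that column.

```python
from statistics import mean, median, quantiles

# Stand-in for df['input_ids'].apply(len) on the real preprocessed dataset
lengths = [32, 48, 51, 60, 64, 64, 70, 85, 90, 512]

stats = {
    "mean": mean(lengths),
    "median": median(lengths),
    "p90": quantiles(lengths, n=10)[-1],  # 90th percentile
    "max": max(lengths),
}
print(stats)

# Flag potential outliers, e.g. sequences more than 3x the median length
outliers = [length for length in lengths if length > 3 * median(lengths)]
print(outliers)
```

For a visual check, the same lengths can be passed to a histogram plot (e.g. matplotlib's hist) to see whether the distribution is skewed or has a long tail.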
Analyze the token frequency
Flatten the 'input_ids' column and create a frequency distribution of the tokens.
Identify the most common and least common tokens.
Check if the token distribution aligns with your expectations based on the original dataset.
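A token frequency distribution can be built with a simple flatten-and-count, as sketched below. The token IDs are synthetic placeholders; substitute the real 'input_ids' column from your preprocessed Arrow file.

```python
from collections import Counter

# Synthetic stand-in for the 'input_ids' column
input_ids = [
    [1, 15043, 29892, 3186, 2],
    [1, 15043, 526, 366, 2],
    [1, 1724, 338, 445, 2],
]

# Flatten all rows into one stream of tokens and count occurrences
token_counts = Counter(tid for row in input_ids for tid in row)

print(token_counts.most_common(3))      # most frequent tokens
print(token_counts.most_common()[-3:])  # least frequent tokens
```

Mapping the most frequent IDs back through your tokenizer's decode method shows whether they are special tokens (BOS/EOS, padding) or genuinely common words.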
Examine the 'labels' column
Understand the meaning of the -100 label and how it is used in your fine-tuning task.
Calculate the distribution of unique labels and their counts.
Check if the label distribution is balanced or imbalanced, and consider if any class balancing techniques are needed.
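A quick way to see how much of the data is masked out is to count the -100 entries, as in this hypothetical sketch (the label rows are synthetic; use the real 'labels' column from your Arrow file):

```python
from collections import Counter

# Synthetic stand-in for the 'labels' column; -100 marks tokens
# that are ignored by the loss (e.g. prompt or padding tokens)
labels = [
    [-100, -100, -100, 3186, 29991, 2],
    [-100, -100, 526, 366, 29973, 2],
]

label_counts = Counter(label for row in labels for label in row)
ignored = label_counts[-100]
total = sum(label_counts.values())
print(f"{ignored}/{total} tokens are masked out (-100)")
```

If nearly all tokens are -100, the model has very little signal to learn from, which usually points at a prompt-template or masking misconfiguration.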
Validate the 'attention_mask' and 'position_ids' columns
Ensure that the 'attention_mask' column correctly corresponds to the non-padding tokens in 'input_ids'.
Verify that the 'position_ids' column correctly represents the position of each token in the sequence.
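These two checks can be automated per row, as in the sketch below. The row contents and pad_token_id=0 are assumptions for illustration; use your tokenizer's actual padding ID and iterate over the real dataset rows.

```python
pad_token_id = 0  # assumption; check your tokenizer's actual pad token

row = {
    "input_ids":      [1, 15043, 3186, 2, 0, 0],
    "attention_mask": [1, 1, 1, 1, 0, 0],
    "position_ids":   [0, 1, 2, 3, 4, 5],
}

# Every non-padding token should have attention_mask == 1
mask_ok = all(
    (tid != pad_token_id) == bool(m)
    for tid, m in zip(row["input_ids"], row["attention_mask"])
)

# position_ids should count up from 0 across the sequence
positions_ok = row["position_ids"] == list(range(len(row["input_ids"])))
print(mask_ok, positions_ok)
```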
Assess the quality of the preprocessed data
Randomly sample a subset of rows from the DataFrame and manually inspect the tokenized sequences.
Check if the tokenization aligns with the expected format and content of the original instructions.
Look for any anomalies, such as truncated sequences, incorrect tokenization, or missing information.
Evaluate the data split
If your Arrow file represents the entire dataset, consider splitting it into train, validation, and test subsets.
Ensure that the data split is representative and stratified based on important characteristics, such as sequence length or label distribution.
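Axolotl already carves out a validation set via val_set_size, but if you also want an explicit held-out test set you can split the rows yourself before preprocessing. This is a minimal random-split sketch with synthetic rows; for stratified splits (by length or label), scikit-learn's train_test_split with the stratify argument is a common choice.

```python
import random

rows = list(range(100))          # stand-in for dataset rows
random.Random(42).shuffle(rows)  # fixed seed for reproducibility

n = len(rows)
train = rows[: int(0.8 * n)]            # 80% train
val = rows[int(0.8 * n): int(0.9 * n)]  # 10% validation
test = rows[int(0.9 * n):]              # 10% test
print(len(train), len(val), len(test))
```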
Consider domain-specific analysis
Depending on the nature of your instructions dataset, perform domain-specific analysis.
For example, if the instructions involve specific tasks or categories, analyze the distribution of those tasks or categories within the dataset.
Suggestions
Examine the specific meanings of the labels
The 'labels' column contains a large number of unique labels, with -100 being the most common.
It's crucial to understand the meaning of these labels in the context of your task.
The -100 label likely represents padding or non-labeled tokens, while other labels correspond to specific target tokens.
You should ensure that the labels align with your task requirements and represent the desired output or target sequence.
If the labels don't match your expectations or aren't suitable for your task, you may need to modify the preprocessing steps or the labeling scheme.
Validate the preprocessing steps
The preprocessing steps involve tokenization and label assignment.
Check if the tokenization process is performed correctly and matches the expected tokenization scheme for your model.
Verify that the 'input_ids' column contains the correct token IDs corresponding to the input sequences.
Ensure that the 'attention_mask' column accurately represents the padding and non-padding tokens.
Validate that the 'labels' column is correctly aligned with the 'input_ids' and contains the expected target labels.
If any issues are found in the preprocessing steps, you may need to revisit and modify the preprocessing code.
Consider the balance of the label distribution
The label distribution shows a significant class imbalance, with some labels being much more frequent than others.
Class imbalance can impact the model's performance, as it may bias towards the majority class.
Consider whether the class imbalance is expected and acceptable for your task, or if it needs to be addressed.
Techniques like oversampling minority classes, undersampling majority classes, or using class weights can help mitigate class imbalance.
Evaluate the impact of class imbalance on your model's performance and consider applying appropriate techniques if necessary.
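One common way to counteract imbalance at loss time is inverse-frequency class weights. The sketch below uses the heuristic behind scikit-learn's "balanced" class weights on a synthetic, heavily imbalanced label list; the resulting weights would be passed to your loss function.

```python
from collections import Counter

labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]  # synthetic, heavily imbalanced
counts = Counter(labels)
n, k = len(labels), len(counts)

# weight_c = n / (k * count_c): rare classes get larger weights
weights = {c: n / (k * counts[c]) for c in counts}
print(weights)
```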
Evaluate the quality and diversity of the input sequences
Randomly sample a subset of input sequences and manually inspect them.
Assess the quality of the sequences in terms of coherence, relevance to your task, and any potential noise or artifacts.
Check if the sequences adequately represent the desired task domain and cover a diverse range of examples.
Look for any patterns, anomalies, or issues in the input sequences that may affect the model's learning.
If the input sequences are of poor quality or lack diversity, you may need to gather additional data or refine the data collection process.
Based on your specific use case and task requirements, you should adapt the analysis accordingly. Consider the following:
Define clear criteria for assessing label suitability and alignment with your task.
Establish validation steps to ensure the preprocessing pipeline is performing as expected.
Determine acceptable levels of class imbalance and consider strategies to handle it.
Set quality standards and diversity requirements for the input sequences based on your task domain.
Remember, the provided script serves as a starting point for data exploration, and you can extend it to include additional analysis or validation steps specific to your use case.
By thoroughly examining the labels, validating the preprocessing steps, addressing class imbalance, and evaluating the quality and diversity of the input sequences, you can gain confidence in the suitability of your data for fine-tuning your large language model.
Flash Attention Issues
We had some issues with Flash Attention dependencies. For some reason the axolotl environment kept reverting to PyTorch 2.0.1, while Flash Attention needs PyTorch version 2.1.0.
We executed this command in the axolotl directory, and it upgraded PyTorch to 2.1.0.
pip install flash_attn -U --force-reinstall