
Llama3 - Training

Llama 3

accelerate launch -m axolotl.cli.train examples/llama-3/lora-8b.yml

If you have not already done so, you will be asked to enter your Weights and Biases API Key.

Enter the key at the command line prompt:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
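If you prefer not to be prompted on every run, the key can also be supplied ahead of time. The snippet below is a minimal sketch, assuming the wandb Python package is installed; the key value is a placeholder.

import os
import wandb

# Export the key so the interactive prompt never appears (placeholder value shown).
os.environ["WANDB_API_KEY"] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Or log in explicitly before launching training.
wandb.login(key=os.environ["WANDB_API_KEY"])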

An analysis of the axolotl.cli.train module

Analysis of the train.py script

The train.py script in the Axolotl platform is a Command Line Interface (CLI) tool designed for training machine learning models.

This script is structured to provide a user-friendly interface for configuring and executing model training. Here's a detailed analysis:

Script Structure and Functionality

Imports and Logger Setup

  • Essential modules like logging, pathlib.Path, fire, and transformers are imported.

  • The script sets up a logger LOG using the logging module for logging various events and statuses during the script's execution.

do_cli Function

  • Function Definition: The do_cli function is the main entry point of the script. It accepts a config argument (with a default value pointing to an "examples" directory) and **kwargs for additional arguments.

  • ASCII Art Display: print_axolotl_text_art() is called to display ASCII art, likely for aesthetic purposes.

  • Configuration Loading: load_cfg loads configuration details from the provided config path. These configurations are essential for setting up model training parameters.

  • Accelerator and User Token Checks: The script verifies the default configuration for the accelerator (such as a GPU) and checks the user token. These checks are crucial for ensuring that the hardware is correctly set up and the user is authenticated.

  • CLI Arguments Parsing: It uses transformers.HfArgumentParser to parse additional CLI arguments into data classes (TrainerCliArgs). This step allows for dynamic customization of training parameters via the command line.

  • Dataset Loading: load_datasets is called with the parsed configuration and CLI arguments. This function is responsible for loading the dataset as per the configuration, which is a critical step in the training process.

  • Model Training: The train function is invoked with the loaded configuration, CLI arguments, and dataset metadata. This function likely encompasses the core logic for model training.

Main Block

  • The script checks if it's being run as the main program (__name__ == "__main__") and not as a module in another script. If it's the main program, it uses fire.Fire(do_cli) to execute the do_cli function, enabling the script to be interacted with from the command line.
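Putting the pieces together, the control flow described above can be condensed into the sketch below. This is a paraphrase of axolotl.cli.train based on the walkthrough, not the actual source; helper names, import paths, and signatures may differ between Axolotl versions.

import logging
from pathlib import Path

import fire
import transformers

# Helper names follow the walkthrough above; exact module paths are assumptions.
from axolotl.cli import (
    check_accelerate_default_config,
    check_user_token,
    load_cfg,
    load_datasets,
    print_axolotl_text_art,
)
from axolotl.common.cli import TrainerCliArgs
from axolotl.train import train

LOG = logging.getLogger("axolotl.cli.train")


def do_cli(config: Path = Path("examples/"), **kwargs):
    # Banner and configuration
    print_axolotl_text_art()                    # ASCII art banner
    parsed_cfg = load_cfg(config, **kwargs)     # load YAML config plus CLI overrides

    # Environment sanity checks
    check_accelerate_default_config()           # accelerator (e.g. GPU) configuration
    check_user_token()                          # user authentication token

    # Parse remaining CLI flags into a dataclass
    parser = transformers.HfArgumentParser(TrainerCliArgs)
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)

    # Load datasets and run training
    dataset_meta = load_datasets(cfg=parsed_cfg, cli_args=parsed_cli_args)
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)


if __name__ == "__main__":
    fire.Fire(do_cli)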

Issues that arose

We had a problem with dependencies (what a surprise), which led us to take a closer look at xFormers.

What is xFormers?

The GitHub repository for xFormers offers a comprehensive overview of the library and its functionality. Here's an analysis of the key points:

Overview of xFormers:

  1. Customizable Building Blocks:

    • xFormers provides domain-agnostic components for Transformers, usable in various fields like vision and NLP, without requiring extensive boilerplate code.

  2. Research-Oriented:

    • The library contains cutting-edge components not yet available in mainstream libraries, indicating its focus on the latest developments in Transformer technology.

  3. Efficiency:

    • Designed with speed and memory efficiency in mind, xFormers includes custom CUDA kernels and utilizes other libraries when appropriate.

Installation:

  1. Stable Versions:

    • Recommended installation via conda or pip, depending on the environment (Linux or Windows) and CUDA version (11.8 or 12.1).

    • Specific to PyTorch versions 1.13.1, 2.0.1, or 2.1.0.

  2. Development Binaries:

    • For users who want to access the latest development features.

  3. Source Installation:

    • An option for compatibility with different or nightly versions of PyTorch.

    • Additional steps like installing ninja for faster builds and setting the TORCH_CUDA_ARCH_LIST environment variable are suggested.

Key Components:

  1. Functional Operators and Components:

    • The repository is organized into directories like ops, components, benchmarks, and triton. Each contains specific functionalities like attention mechanisms, feedforward blocks, positional embeddings, etc.

  2. Attention Mechanisms and More:

    • xFormers offers a variety of attention mechanisms, feedforward styles, positional embeddings, and other Transformer components.

  3. Optimized Building Blocks:

    • Includes memory-efficient attention mechanisms and various optimized operations, indicating a focus on performance.
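As a concrete example of one of these optimized blocks, the memory-efficient attention operator can be called directly on query/key/value tensors. This is a minimal sketch assuming a CUDA device, fp16 tensors, and the (batch, sequence, heads, head_dim) layout; see the xFormers documentation for the full set of options.

import torch
from xformers.ops import memory_efficient_attention

# Toy tensors in (batch, seq_len, num_heads, head_dim) layout.
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Dispatches to the most efficient kernel available on the current hardware.
out = memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([2, 1024, 8, 64])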

Benchmarks and Testing:

  • Benchmarks for memory-efficient multi-head attention (MHA) and other components are provided.

  • Instructions for testing the installation and verifying available kernels are included.
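The repository documents running python -m xformers.info to list the kernels that are actually available. A quick in-Python sanity check of the installation, assuming xFormers and PyTorch are already installed, might look like this:

import torch
import xformers

# Confirm the installed versions line up with the PyTorch/CUDA combination in use.
print("xformers:", xformers.__version__)
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())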

Hackability and Extensibility:

  • xFormers is designed to be hackable with composable building blocks, making it adaptable for research and development.

  • It uses Triton for some optimized parts, which are explicit and accessible, indicating ease of customization.

Install Troubleshooting:

  • The document provides troubleshooting tips for installation issues, like ensuring NVCC and CUDA runtime compatibility and setting the correct environment variables.

Axolotl's setup.py script deals with a compatibility issue between xFormers and PyTorch.

xFormers and PyTorch

The setup.py script indicates a temporary workaround for a compatibility issue with xformers. If torch==2.1.0 is in the requirements, the standard xformers dependency is removed and replaced with a specific version installed directly from the GitHub repository. This suggests that the current release of xformers does not support Torch 2.1.0 and requires using the main branch from the GitHub repo.

The special handling of xformers in setup.py suggests that the version of xformers compatible with torch==2.1.0 is not available in the standard Python package index (PyPI). Therefore, the script directly installs xformers from the main branch of its GitHub repository.

This approach indicates that the Axolotl platform is using features or fixes from xformers that are only available in the latest code and have not yet been released officially.

Users of Axolotl need to be aware that they are using a potentially unstable or untested version of xformers, which might have implications for production use.
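The snippet below is a hypothetical illustration of how such a conditional dependency swap can be written in a setup.py; the helper name and the exact requirement strings are placeholders, not Axolotl's actual code.

# Hypothetical sketch of a conditional dependency swap in setup.py.
def parse_requirements():
    install_requires = []
    with open("requirements.txt", encoding="utf-8") as requirements_file:
        lines = [line.strip() for line in requirements_file
                 if line.strip() and not line.startswith("#")]

    needs_git_xformers = any(line.startswith("torch==2.1.0") for line in lines)
    for line in lines:
        if needs_git_xformers and line.startswith("xformers"):
            # Replace the PyPI pin with the main branch of the GitHub repo
            # (PEP 508 direct reference syntax).
            install_requires.append(
                "xformers @ git+https://github.com/facebookresearch/xformers.git@main"
            )
            continue
        install_requires.append(line)
    return install_requires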

The eval_sample_packing setting

The configuration line eval_sample_packing: False in the training configuration file controls how data is managed during the evaluation phase of the training process. Here's a detailed breakdown of what this means and why it's important:

Context and Purpose

  • Sample Packing: This is a technique used in training deep learning models, especially those with sequence data (like text or time series), to optimize the utilization of computational resources like GPU memory. It involves arranging multiple sequences in a single batch in a compact way to reduce padding, which is often necessary when sequences of variable lengths are processed together.

  • Evaluation Phase: During model training, there is typically a phase called evaluation or validation where the trained model is tested against a separate dataset that was not used during the actual training. This helps in checking the model's performance and generalizability on new, unseen data. The evaluation phase is crucial for monitoring overfitting, underfitting, and for tuning the model's hyperparameters.

Impact of eval_sample_packing: False

  • Disabling Sample Packing in Evaluation: By setting eval_sample_packing to False, you instruct the training process not to use the sample packing technique during the evaluation phase. This means that the evaluation data will be processed in a straightforward, possibly less memory-efficient manner, where each sequence or data point is treated individually without attempting to optimize the batch structure by tightly packing multiple sequences together.
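To illustrate the semantics, the sketch below is schematic rather than Axolotl's implementation: the flag only affects which collation strategy the evaluation dataloader uses, while a separate sample_packing setting continues to govern training.

# Schematic: how a trainer might choose a collation strategy per phase.
cfg = {
    "sample_packing": True,        # pack sequences during training
    "eval_sample_packing": False,  # keep evaluation batches un-packed
}

def choose_collator(cfg, is_eval: bool) -> str:
    packing = cfg["eval_sample_packing"] if is_eval else cfg["sample_packing"]
    return "packed_collator" if packing else "padded_collator"

print(choose_collator(cfg, is_eval=False))  # packed_collator
print(choose_collator(cfg, is_eval=True))   # padded_collator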

Why Disable Sample Packing for Evaluation?

  • Simplicity and Debugging: Sample packing can complicate the data handling process, making debugging more difficult if things go wrong. Disabling it for evaluation can simplify the computation and make it easier to trace issues or assess the model's performance straightforwardly.

  • Memory and Compute Trade-offs: While sample packing can save memory and potentially speed up training by reducing the number of operations on padded data, it may not always provide benefits during evaluation, especially if the evaluation dataset is small or if the overhead of managing packed samples outweighs the benefits.

  • Consistency and Accuracy: In some cases, packing might introduce subtle bugs or inconsistencies (e.g., incorrect handling of sequence boundaries or masking). Evaluating the model without packing ensures that the performance metrics are obtained in a straightforward and consistent manner, closely representing how the model will operate in production (assuming production use does not involve sample packing).

Practical Implications

Setting eval_sample_packing to False typically leads to a simpler and potentially more reliable evaluation phase, at the possible cost of increased memory usage and longer computation times due to less efficient data handling. This setting helps ensure that the evaluation metrics reflect the true performance of the model under standard operating conditions.

Sample packing in more detail

Sample packing, and how it relates to other hyperparameters, deserves a closer look.

Sample packing is a technique used in natural language processing (NLP) to efficiently utilize the available computational resources, particularly when training large language models. It involves combining multiple shorter sequences into a single batch to maximize the utilization of the GPU memory and computational capacity.

In the context of training a language model like the one you are working with (based on the Meta-Llama model), sample packing helps in the following ways:

  1. GPU Memory Utilization: Language models often have a fixed input sequence length (e.g., 4096 tokens in your configuration). However, not all input sequences in a batch may have the same length. Sample packing allows you to pack multiple shorter sequences together to fill up the available sequence length in a batch. This way, you can make the most efficient use of the GPU memory by minimizing padding and ensuring that each batch contains a maximum number of actual tokens.

  2. Computational Efficiency: By packing multiple sequences into a single batch, you can process more examples in parallel, leading to faster training times. This is because GPUs are designed to perform well on parallelizable tasks, and processing a larger batch size allows for better utilization of the GPU's computational resources.

  3. Training Stability: Sample packing can help stabilize the training process by providing a more consistent batch size. When sequences of varying lengths are processed individually, the effective batch size may fluctuate, which can impact the stability of the gradients and the overall training dynamics. Sample packing helps maintain a more consistent batch size, leading to more stable training.
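To make the memory argument concrete, here is a toy sketch (not Axolotl's actual packing algorithm) that greedily packs variable-length sequences into fixed 4096-token rows and compares the padding waste with and without packing:

def pack_greedy(lengths, max_len):
    """Greedily place each sequence into the first row that still has room."""
    rows = []
    for n in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + n <= max_len:
                row.append(n)
                break
        else:
            rows.append([n])
    return rows

lengths = [700, 350, 1200, 90, 2048, 512, 300, 4000]   # token counts of 8 example sequences
max_len = 4096

real_tokens = sum(lengths)
padded_tokens = len(lengths) * max_len                 # one sequence per row, padded to max_len
rows = pack_greedy(lengths, max_len)
packed_tokens = len(rows) * max_len                    # rows needed after packing

print(f"rows: {len(lengths)} unpacked vs {len(rows)} packed")
print(f"padding waste: {1 - real_tokens / padded_tokens:.0%} unpacked vs {1 - real_tokens / packed_tokens:.0%} packed")

With these example lengths, packing cuts the batch from 8 rows to 3 and drops the padding waste from roughly 72% to 25%.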

Now, let's discuss how sample packing relates to other hyperparameters:

  • Sequence Length: Sample packing is directly related to the sequence length hyperparameter (sequence_len in your configuration). The sequence length determines the maximum number of tokens that can be processed in a single batch. Sample packing tries to fill up this sequence length by combining multiple shorter sequences. If the sequence length is too small, it may limit the effectiveness of sample packing.

  • Batch Size: The batch size (micro_batch_size in your configuration) determines the number of sequences processed in parallel during training. Sample packing aims to maximize the number of sequences that can fit within a batch while staying within the memory constraints of the GPU. The larger the batch size, the more opportunities there are for sample packing to be effective.

  • GPU Memory: The available GPU memory is a crucial factor in determining the feasibility of sample packing. Sample packing allows you to utilize the GPU memory more efficiently by minimizing padding and maximizing the number of actual tokens processed in each batch. However, if the GPU memory is limited, you may need to adjust the batch size or sequence length accordingly.
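A quick back-of-the-envelope calculation shows how these settings interact; the values below are illustrative, not taken from the configuration file.

# Illustrative values only.
sequence_len = 4096                 # maximum tokens per packed row
micro_batch_size = 2                # rows processed per GPU per forward pass
gradient_accumulation_steps = 4     # micro-batches accumulated before an optimizer step
num_gpus = 1

tokens_per_optimizer_step = sequence_len * micro_batch_size * gradient_accumulation_steps * num_gpus
print(tokens_per_optimizer_step)    # 32768 tokens per step when every row is fully packed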

In this case, the error message indicated that the evaluation dataset split was too small for sample packing. This means that the number of sequences in the evaluation dataset is not sufficient to apply sample packing effectively. Setting eval_sample_packing: false disables sample packing for the evaluation dataset, which resolves the issue.

It's important to note that sample packing is more commonly used during training rather than evaluation. During evaluation, you typically want to process sequences individually to get accurate metrics and predictions for each example.

In short, sample packing improves throughput during training, while disabling it for evaluation keeps the metrics simple and consistent with how the model will actually be used.


This documentation is for the Axolotl community