Merge Lora Instructions

merge_lora

Training LoRA (Low-Rank Adapter)

  • Axolotl allows you to fine-tune a base model using LoRA, which is a parameter-efficient fine-tuning method.

  • You can train a LoRA adapter on top of the base model using a configuration file that specifies the training details, such as the dataset, hyperparameters, and LoRA-specific settings.

Example configuration snippet

load_in_8bit: true
load_in_4bit: false
strict: false
sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
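
The snippet above covers quantization, sequence length, and packing; the LoRA-specific settings live in the same configuration file. A minimal, illustrative block is shown below (the option names are standard Axolotl keys, but the values are placeholders you would tune for your own run):

adapter: lora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true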

Merging LoRA with the Base Model

  • After training the LoRA adapter, you need to merge it with the base model to create a single, fine-tuned model.

  • Axolotl provides a command to merge the LoRA adapter using the axolotl.cli.merge_lora command.

  • Typical command to merge a local LoRA:

python3 -m axolotl.cli.merge_lora examples/llama-3/lora-8b.yml --lora_model_dir="llama4-out" 
  • If the LoRA model is not stored locally, you may need to download it first and specify the local directory using the --lora_model_dir argument.

  • If you encounter CUDA memory issues during merging, you can try merging in system RAM by setting CUDA_VISIBLE_DEVICES="" before the merge command.
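
For example, to run the same merge entirely on the CPU:

CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora examples/llama-3/lora-8b.yml --lora_model_dir="llama4-out"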

merge_lora script analysis

The script is a command-line interface (CLI) tool that merges a trained LoRA (Low-Rank Adaptation) adapter into a base model.

It imports the following external classes and modules:

  • Path from the pathlib module: This class provides an object-oriented interface for working with file and directory paths.

  • fire module: This is a library for automatically generating command-line interfaces from Python functions and classes.

  • transformers module: This is the Hugging Face Transformers library, which provides state-of-the-art pre-trained models for natural language processing tasks.

  • do_merge_lora, load_cfg, and print_axolotl_text_art from the axolotl.cli module: These are custom functions specific to the Axolotl project, likely related to merging LoRA models and loading configuration files.

  • TrainerCliArgs from the axolotl.common.cli module: This is likely a custom class defining the command-line arguments for the trainer.

The script defines a do_cli function that takes a config parameter (default is Path("examples/")) and any additional keyword arguments (**kwargs).

This function is the main entry point for the CLI.

Inside the do_cli function:

  • It prints the Axolotl text art using the print_axolotl_text_art function.

  • It creates a transformers.HfArgumentParser instance with TrainerCliArgs to parse the command-line arguments.

  • It parses the command-line arguments into parsed_cli_args using the parse_args_into_dataclasses method.

  • It sets parsed_cli_args.merge_lora to True.

  • It loads the configuration using the load_cfg function with the provided config path and additional keyword arguments.

  • It performs some validation and sets default values for the lora_model_dir and output_dir based on the loaded configuration.

  • It sets load_in_4bit, load_in_8bit, and flash_attention to False, and deepspeed and fsdp to None, since quantized loading and distributed training are not used during the merge.

  • It calls the do_merge_lora function with the loaded configuration (parsed_cfg) and parsed command-line arguments (parsed_cli_args).

Finally, if the script is run as the main module (__name__ == "__main__"), it uses the fire.Fire function to automatically generate a command-line interface for the do_cli function.
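
Putting the analysis together, the script's structure looks roughly like the outline below. This is a reconstruction from the description above, not a copy of the source, so exact argument names and defaults may differ between Axolotl versions:

"""CLI to merge a trained LoRA adapter into its base model."""
from pathlib import Path

import fire
import transformers

from axolotl.cli import do_merge_lora, load_cfg, print_axolotl_text_art
from axolotl.common.cli import TrainerCliArgs


def do_cli(config: Path = Path("examples/"), **kwargs):
    # Print the Axolotl banner, then parse CLI arguments into a TrainerCliArgs dataclass
    print_axolotl_text_art()
    parser = transformers.HfArgumentParser((TrainerCliArgs))
    parsed_cli_args, _ = parser.parse_args_into_dataclasses(return_remaining_strings=True)
    parsed_cli_args.merge_lora = True

    # Load the YAML config; quantized loading and distributed features are disabled for the merge
    # (validation of the lora_model_dir / output_dir defaults is omitted in this sketch)
    parsed_cfg = load_cfg(
        config,
        merge_lora=True,
        load_in_8bit=False,
        load_in_4bit=False,
        flash_attention=False,
        deepspeed=None,
        fsdp=None,
        **kwargs,
    )

    # Merge the adapter weights into the base model and write the result to the output directory
    do_merge_lora(cfg=parsed_cfg, cli_args=parsed_cli_args)


if __name__ == "__main__":
    # fire generates the command-line interface (config path plus key=value overrides) from do_cli
    fire.Fire(do_cli)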

To learn more about the external classes and modules used in this script:

  • For the fire module, refer to the Fire documentation: https://github.com/google/python-fire

  • For the transformers module, refer to the Hugging Face Transformers documentation: https://huggingface.co/docs/transformers/

  • axolotl.cli and axolotl.common.cli are custom modules specific to the Axolotl project; refer to the project's documentation or codebase for more information.

  • If you trained a QLoRA (Quantized LoRA) model that can only fit into GPU memory at 4-bit quantization, merging can be challenging due to memory constraints.

  • To merge a QLoRA model, you need to ensure that the model remains quantized throughout the merging process.

  • Modify the merge script to load the model with the appropriate quantization configuration, such as using the bitsandbytes library for 4-bit quantization.

  • Use libraries like accelerate for managing device memory and offloading parts of the model to CPU or disk if necessary.
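
A rough sketch of that approach, using the Hugging Face transformers, peft, and bitsandbytes APIs, is shown below. The base model name and directories are placeholders (the adapter directory reuses the "llama4-out" example from above), and whether merge_and_unload succeeds on a quantized model depends on your peft version, so treat this as a starting point rather than a drop-in script:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model in 4-bit so it fits in GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",              # accelerate places layers on GPU/CPU as memory allows
)

# Attach the trained QLoRA adapter and merge it back into the base weights
model = PeftModel.from_pretrained(base_model, "llama4-out")
merged = model.merge_and_unload()
merged.save_pretrained("merged-out")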

Warnings and Considerations

  • Axolotl may raise warnings related to sample packing without flash attention or SDP attention, indicating that it does not handle cross-attention in those cases.

  • It is recommended to set load_in_8bit: true for LoRA fine-tuning, even if the warning suggests otherwise.

  • Merging quantized models, especially with parameter-efficient fine-tuning methods like QLoRA, can be complex and may require adjustments to the standard merging scripts.

Merging Duration

  • The time taken to merge a LoRA adapter back to the base model depends on the model size and hardware.

  • For a 70B parameter model fine-tuned on 4 A100 GPUs, the merging process can take a significant amount of time (an hour or more).

These are the key ideas and considerations when using model merging with Axolotl. It's important to carefully configure the training and merging processes, handle quantization appropriately, and be aware of potential memory constraints and warnings.

Consulting the official Axolotl documentation and seeking guidance from the Axolotl community can provide further assistance in navigating the model merging process.

Tips and Tricks

Ensure that you have generated the LoRA model before attempting to merge it.

The typical workflow is to first train the LoRA model and then merge it in a separate command.

Use the --merge_lora flag along with the --lora_model_dir flag to specify the directory containing the trained LoRA model.

For example:

accelerate launch scripts/finetune.py examples/openllama-3b/qlora.yml --merge_lora --lora_model_dir="./qlora-out"

Set --load_in_8bit=False and --load_in_4bit=False when merging the LoRA model to avoid compatibility issues. The 4-bit and 8-bit loading options are not supported for merging.

If you encounter the error "ValueError: .to is not supported for 4-bit or 8-bit bitsandbytes models. Please use the model as it is", try using the old command for merging:

python3 scripts/finetune.py examples/code-llama/7b/qlora.yml --merge_lora --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

Make sure you have the latest version of Axolotl installed, as some issues might have been resolved in newer versions.

If you encounter CUDA-related errors, try setting the CUDA_VISIBLE_DEVICES environment variable to specify the desired GPU device. For example:

CUDA_VISIBLE_DEVICES="0" accelerate launch scripts/finetune.py examples/openllama-3b/qlora.yml --merge_lora --lora_model_dir="./qlora-out" --load_in_8bit=False --load_in_4bit=False

After merging the LoRA adapter, the output directory typically contains only the merged weights (pytorch_model.bin), so attempting to quantize it directly may fail.

To quantize the merged model, you may need to copy additional files (e.g., tokenizer.model) from the original base model into the output directory and use external tools such as llama.cpp for quantization.
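
For instance, a typical llama.cpp workflow looks something like this (the script and binary names vary between llama.cpp versions, and the paths are placeholders):

# Convert the merged Hugging Face model directory to GGUF
python convert_hf_to_gguf.py ./merged-out --outfile merged-f16.gguf

# Quantize the GGUF file, e.g. to 4-bit
./llama-quantize merged-f16.gguf merged-q4_k_m.gguf Q4_K_M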

If you encounter issues with missing files or directories, double-check the paths specified in the command and ensure that the necessary files and directories exist.

Remember to refer to the Axolotl documentation and the GitHub issues for the most up-to-date information and troubleshooting steps.

