Memory-Efficient Fine-Tuning Techniques for Large Language Models
Fine-tuning large language models (LLMs) can be a memory-intensive process, especially as models and datasets grow larger.
It is therefore increasingly important to use memory-efficient techniques to make the most of the available hardware resources.
In this documentation, we will explore several strategies that can significantly reduce the memory footprint during fine-tuning while maintaining or even improving training speed.
Gradient Accumulation
Gradient accumulation is a powerful technique that allows you to effectively increase the batch size without the need for additional GPU memory.
The idea is to break down the batch into smaller mini-batches and perform multiple forward and backward passes, accumulating the gradients in the process. Once the desired number of gradients has been accumulated, the optimizer step is performed.
Here's how you can enable gradient accumulation using the Trainer class in the Transformers library:
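The following is a minimal sketch; the checkpoint name, dataset, and hyperparameters are placeholders for illustration, not recommendations.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder model and data for illustration only; substitute your own.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

train_dataset = load_dataset("imdb", split="train[:1%]")
train_dataset = train_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,   # process one example at a time
    gradient_accumulation_steps=4,   # accumulate gradients over 4 mini-batches
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```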
In this example, per_device_train_batch_size is set to 1 and gradient_accumulation_steps is set to 4.
This means that the model will perform 4 forward and backward passes, accumulating the gradients, before updating the weights.
The effective batch size becomes 4 (1 * 4), while the memory usage remains equivalent to processing a batch size of 1.
Gradient accumulation can significantly reduce memory usage at the cost of slightly slower training speed due to the additional forward and backward passes. However, it allows you to train with larger effective batch sizes that would otherwise not fit into GPU memory.
Gradient Checkpointing
Gradient checkpointing is another technique that helps reduce memory usage during the backward pass.
During a typical backward pass, all the activations from the forward pass are stored in memory to compute the gradients. This can lead to a significant memory overhead, especially for deep models.
Gradient checkpointing offers a compromise by strategically saving a subset of activations at checkpoints throughout the computational graph.
During the backward pass, the activations are recomputed from these checkpoints as needed, reducing the memory footprint at the cost of some additional computation.
To enable gradient checkpointing in the Trainer, simply pass the gradient_checkpointing flag to the TrainingArguments:
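The sketch below shows only the relevant arguments and assumes the same model and dataset setup as the gradient accumulation example above.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,  # recompute activations during the backward pass
)
```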
Gradient checkpointing can provide memory savings of up to 20-30% but may slow down training by around 20% due to the recomputation of activations.
It is particularly useful when dealing with large models that have deep computational graphs.
Mixed Precision Training (FP16)
Mixed precision training, also known as FP16 training, is a technique that leverages the reduced precision of 16-bit floating-point numbers to speed up computations and save memory.
By using half-precision (FP16) for storing activations and performing computations, while keeping the model weights in full precision (FP32), mixed precision training can significantly reduce memory usage and improve training speed.
To enable mixed precision training in the Trainer, set the fp16 flag to True:
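Again, only the relevant arguments are shown; this requires a GPU with FP16 support.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    fp16=True,  # store activations and run most computations in half precision
)
```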
Mixed precision training can provide up to 2x speedup compared to full precision training while reducing memory usage. However, it's important to note that not all models and hardware support FP16 training, so it's crucial to check compatibility before enabling this feature.
Optimizer Choices
The choice of optimizer can also impact memory usage during fine-tuning.
The commonly used Adam optimizer stores rolling averages of the gradients and of their squares for every parameter, which adds a significant memory footprint, especially for large models with millions or billions of parameters.
Adafactor
Adafactor is an alternative optimizer that reduces memory usage by storing only aggregated information (row-wise and column-wise sums) of the rolling averages instead of the full matrices.
This can lead to substantial memory savings without sacrificing much in terms of convergence speed.
To use Adafactor in the Trainer, set the optim argument to "adafactor":
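A minimal sketch, again showing only the relevant arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    optim="adafactor",  # swap the default AdamW optimizer for Adafactor
)
```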
Adafactor can provide memory savings of around 50% compared to Adam, making it a good choice for memory-constrained environments. However, it's worth noting that in some cases, Adafactor may exhibit slower convergence compared to Adam, so experimentation is recommended.
8-bit Adam
8-bit Adam is another memory-efficient optimizer: it keeps the full optimizer state but quantizes it to 8-bit precision, reducing memory usage.
It strikes a balance between memory efficiency and convergence speed.
To use 8-bit Adam, you need to install the bitsandbytes library and pass a custom optimizer to the Trainer:
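The sketch below reuses the `model` and `train_dataset` placeholders from the first example; the learning rate is an illustrative value.

```python
import bitsandbytes as bnb  # pip install bitsandbytes
from transformers import Trainer, TrainingArguments

# 8-bit Adam from bitsandbytes, applied to all model parameters.
adam_8bit = bnb.optim.Adam8bit(model.parameters(), lr=2e-5)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(adam_8bit, None),  # (optimizer, scheduler); the Trainer creates a default scheduler
)
trainer.train()
```

Recent versions of the Transformers library also expose this optimizer directly through the optim="adamw_bnb_8bit" argument of TrainingArguments, which avoids constructing it by hand.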
8-bit Adam can provide memory savings similar to Adafactor while maintaining convergence speed closer to Adam. It's a good choice when memory is limited, but convergence speed is still a priority.
Combining Techniques
The real power of these memory-efficient techniques lies in combining them to achieve optimal memory usage and training speed.
By using gradient accumulation, gradient checkpointing, mixed precision training, and memory-efficient optimizers together, you can significantly reduce the memory footprint of fine-tuning large language models.
Here's an example that combines all the techniques:
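This is a sketch that combines the techniques above; it reuses the `model` and `train_dataset` placeholders from the first example and assumes a GPU with FP16 support.

```python
import bitsandbytes as bnb
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,   # small per-device batch
    gradient_accumulation_steps=4,   # effective batch size of 4
    gradient_checkpointing=True,     # recompute activations in the backward pass
    fp16=True,                       # mixed precision training
)

adam_8bit = bnb.optim.Adam8bit(model.parameters(), lr=2e-5)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    optimizers=(adam_8bit, None),    # 8-bit Adam in place of the default optimizer
)
trainer.train()
```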
In this setup, we use a per-device batch size of 1 with 4 gradient accumulation steps, and enable gradient checkpointing, mixed precision training, and the 8-bit Adam optimizer.
This combination can lead to memory savings of up to 3x compared to the baseline while maintaining or even improving training speed.
Conclusion
Fine-tuning large language models can be a memory-intensive process, but by leveraging techniques like gradient accumulation, gradient checkpointing, mixed precision training, and memory-efficient optimizers, you can significantly reduce the memory footprint and make the most of your hardware resources.
Experiment with different combinations of these techniques to find the optimal balance between memory usage and training speed for your specific use case.
Remember to profile your training process and monitor memory usage to ensure that you are within the limits of your hardware.
Additionally, keep in mind that while these techniques can provide substantial memory savings, they may introduce slight performance overheads or impact convergence speed in some cases. It's always a good idea to validate the results and compare them against the baseline to ensure that you are achieving the desired performance.
By applying these memory-efficient techniques, you can push the boundaries of fine-tuning large language models on resource-constrained environments and unlock the potential of LLMs for a wide range of applications.