NCCL

Troubleshooting NCCL issues from the Axolotl documentation


The NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is a library that facilitates and optimises multi-GPU communication operations, such as broadcast, all-gather, reduce and all-reduce.

Broadly, NCCL configuration is highly environment-specific and is configured via several environment variables.

NCCL Explanation

NCCL, which stands for NVIDIA Collective Communications Library, is a library developed by NVIDIA to facilitate efficient communication and synchronization between multiple GPUs.

It plays a crucial role in enabling fast and scalable training of deep neural networks across multiple GPUs and nodes.

One of the key features of NCCL is its topology-aware design.

It takes into account the underlying hardware architecture and interconnects, such as PCIe, NVLink, InfiniBand, and Ethernet, to optimise communication paths and minimise latency.

This is particularly important in deep learning workloads where large amounts of data need to be exchanged between GPUs during training.

NCCL provides a set of fundamental collective communication primitives that are commonly used in deep learning frameworks. These include:

  1. AllReduce: This operation combines the input tensors from all GPUs, performs a specified reduction operation (e.g., sum, max, min), and distributes the result back to all GPUs. It is extensively used in distributed training for aggregating gradients and updating model parameters.

  2. Broadcast: This operation broadcasts the input tensor from a single source GPU to all other GPUs. It is useful for synchronizing model parameters across GPUs.

  3. Reduce: This operation performs a reduction operation on the input tensors from all GPUs and stores the result on a specified root GPU.

  4. AllGather: This operation gathers the input tensors from all GPUs and concatenates them along a specified dimension, making the result available on all GPUs.

  5. ReduceScatter: This operation performs a reduction operation on the input tensors from all GPUs and scatters the result across all GPUs, such that each GPU receives a portion of the reduced tensor.

NCCL also supports point-to-point communication primitives, such as send and receive, which enable more flexible communication patterns like scatter, gather, and all-to-all operations.
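
As a rough illustration of how these primitives surface in a training framework, the sketch below uses PyTorch's torch.distributed package with the NCCL backend. It is a minimal example rather than anything Axolotl-specific; the script name, tensor contents and two-GPU launch are assumptions for illustration, and it would be started with one process per GPU (for example via torchrun).

# Minimal torch.distributed sketch of NCCL collectives and point-to-point calls.
# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=2 nccl_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")               # NCCL handles the GPU communication
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
rank, world = dist.get_rank(), dist.get_world_size()

# AllReduce: sum a tensor across all GPUs, as done when aggregating gradients
grads = torch.ones(4, device="cuda") * rank
dist.all_reduce(grads, op=dist.ReduceOp.SUM)

# Broadcast: copy rank 0's tensor to every other rank, e.g. initial model parameters
params = torch.arange(4.0, device="cuda") if rank == 0 else torch.zeros(4, device="cuda")
dist.broadcast(params, src=0)

# Point-to-point: rank 0 sends a tensor to rank 1
if world > 1:
    if rank == 0:
        dist.send(params, dst=1)
    elif rank == 1:
        dist.recv(params, src=0)

dist.destroy_process_group()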

One of the key advantages of NCCL is its ability to perform collective operations in a single kernel, combining communication and computation. This tight synchronization minimises the overhead of launching multiple kernels and memory copies, allowing NCCL to achieve near-peak bandwidth utilization.

From a programming perspective, NCCL provides a simple and intuitive API that closely follows the popular Message Passing Interface (MPI) standard. This makes it easy for developers familiar with MPI to adopt NCCL in their deep learning frameworks. NCCL seamlessly integrates with the CUDA programming model, allowing collectives to be launched directly from CUDA streams.

The impact of NCCL on the deep learning community has been significant. It has become a fundamental building block for many popular deep learning frameworks, such as TensorFlow, PyTorch, and MXNet. These frameworks leverage NCCL to efficiently scale the training of large neural networks across multiple GPUs and nodes, enabling faster convergence and handling larger datasets.

NCCL's performance optimisations and topology-aware design have greatly accelerated the training of deep neural networks, allowing researchers and practitioners to train more complex models in shorter timeframes. It has become an essential tool for anyone working on large-scale deep learning projects.

A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort:

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.

Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised.

Nvidia recommends disabling PCI access control services (ACS) as a possible solution if this is available to you.

Forcing cross-GPU communication via NVLink may help without increasing timeouts.

To verify that your configuration is leveraging NVLink, run the following command:

nvidia-smi nvlink --status

To force NCCL to use NVLink, simply set this in the environment:

export NCCL_P2P_LEVEL=NVL

If NVLink is not available in your environment, there are other options for NCCL_P2P_LEVEL, described below:

PIX: P2P data transfers pass through no more than a single PCIe bridge. Faster data transfer rates than paths involving multiple bridges, but slower than direct GPU-to-GPU communication.

PXB: P2P data transfers pass through multiple PCIe bridges but do not go through the PCIe Host Bridge; this path involves more complex routing and can incur a moderate level of latency.

PHB: P2P data transfers occur over PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but might introduce additional latency compared with more direct paths (e.g. PIX, NVL).

To validate that acceptable data transfer speeds exist for your training job, running the NCCL Tests suite can help pinpoint bottlenecks. For example, the all-reduce benchmark below sweeps message sizes from 8 bytes to 128 MB, doubling each step, across 3 GPUs:

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3

When debugging NCCL communication timeouts, it can be useful to activate additional logging in both PyTorch and NCCL:

export NCCL_DEBUG=INFO                                  # NCCL log verbosity
export NCCL_DEBUG_SUBSYS=ALL                            # log every NCCL subsystem
export TORCH_DISTRIBUTED_DEBUG=INFO                     # extra torch.distributed diagnostics
export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log  # where torchelastic records error traces


Finally, if you believe your training job needs more time, you can increase the timeout past 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See the PyTorch init_process_group documentation for details on this value.
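
For reference, ddp_timeout is expressed in seconds and ultimately feeds the timeout argument of torch.distributed.init_process_group. A minimal sketch of the equivalent PyTorch call follows, assuming an illustrative two-hour limit (the 7200-second value is an example, not a recommendation):

# Illustrative only: raising the collective timeout at the PyTorch level.
# In Axolotl this is driven by the ddp_timeout value (in seconds) in the YAML config,
# e.g. ddp_timeout: 7200 instead of the 1800-second (30-minute) default.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=7200),   # collectives may run up to 2 hours before the watchdog aborts
)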
