
NCCL

Troubleshooting NCCL issues from the Axolotl documentation

NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is a library that facilitates and optimises multi-GPU communication operations such as broadcast, all-gather, reduce, and all-reduce.

Broadly, NCCL behaviour is highly environment-specific and is configured via several environment variables.

NCCL Explanation

NCCL, which stands for NVIDIA Collective Communications Library, is a library developed by NVIDIA to facilitate efficient communication and synchronization between multiple GPUs.

It plays a crucial role in enabling fast and scalable training of deep neural networks across multiple GPUs and nodes.

One of the key features of NCCL is its topology-aware design.

It takes into account the underlying hardware architecture and interconnects, such as PCIe, NVLink, InfiniBand, and Ethernet, to optimise communication paths and minimise latency.

This is particularly important in deep learning workloads where large amounts of data need to be exchanged between GPUs during training.

NCCL provides a set of fundamental collective communication primitives that are commonly used in deep learning frameworks. These include the following (see the benchmark sketch after this list for a quick way to exercise each one):

  1. AllReduce: This operation combines the input tensors from all GPUs, performs a specified reduction operation (e.g., sum, max, min), and distributes the result back to all GPUs. It is extensively used in distributed training for aggregating gradients and updating model parameters.

  2. Broadcast: This operation broadcasts the input tensor from a single source GPU to all other GPUs. It is useful for synchronizing model parameters across GPUs.

  3. Reduce: This operation performs a reduction operation on the input tensors from all GPUs and stores the result on a specified root GPU.

  4. AllGather: This operation gathers the input tensors from all GPUs and concatenates them along a specified dimension, making the result available on all GPUs.

  5. ReduceScatter: This operation performs a reduction operation on the input tensors from all GPUs and scatters the result across all GPUs, such that each GPU receives a portion of the reduced tensor.
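
Each of these collectives has a matching benchmark binary in NVIDIA's nccl-tests suite (covered later on this page), which can be a convenient way to exercise and measure them in isolation. A minimal sketch, assuming the tests have already been built under ./build/ and that two GPUs are available on the node (adjust -g to your GPU count):

# -b/-e set the minimum/maximum message size, -f the size multiplication factor, -g the GPU count
./build/all_reduce_perf      -b 8 -e 128M -f 2 -g 2   # AllReduce
./build/broadcast_perf       -b 8 -e 128M -f 2 -g 2   # Broadcast
./build/reduce_perf          -b 8 -e 128M -f 2 -g 2   # Reduce
./build/all_gather_perf      -b 8 -e 128M -f 2 -g 2   # AllGather
./build/reduce_scatter_perf  -b 8 -e 128M -f 2 -g 2   # ReduceScatter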

NCCL also supports point-to-point communication primitives, such as send and receive, which enable more flexible communication patterns like scatter, gather, and all-to-all operations.
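
Recent versions of nccl-tests also ship benchmarks for these point-to-point patterns, for example (same assumptions as above):

./build/sendrecv_perf  -b 8 -e 128M -f 2 -g 2   # paired send/receive
./build/alltoall_perf  -b 8 -e 128M -f 2 -g 2   # all-to-all built on send/receive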

One of the key advantages of NCCL is its ability to perform collective operations in a single kernel, combining communication and computation. This tight synchronization minimises the overhead of launching multiple kernels and memory copies, allowing NCCL to achieve near-peak bandwidth utilization.

From a programming perspective, NCCL provides a simple and intuitive API that closely follows the popular Message Passing Interface (MPI) standard. This makes it easy for developers familiar with MPI to adopt NCCL in their deep learning frameworks. NCCL seamlessly integrates with the CUDA programming model, allowing collectives to be launched directly from CUDA streams.

The impact of NCCL on the deep learning community has been significant. It has become a fundamental building block for many popular deep learning frameworks, such as TensorFlow, PyTorch, and MXNet. These frameworks leverage NCCL to efficiently scale the training of large neural networks across multiple GPUs and nodes, enabling faster convergence and handling larger datasets.

NCCL's performance optimisations and topology-aware design have greatly accelerated the training of deep neural networks, allowing researchers and practitioners to train more complex models in shorter timeframes. It has become an essential tool for anyone working on large-scale deep learning projects.

A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort:

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.

Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised.
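
One way to spot this pattern is to watch utilization and power draw while the job appears to be hung. A sketch using nvidia-smi's query mode, sampling every five seconds:

nvidia-smi --query-gpu=index,utilization.gpu,power.draw --format=csv -l 5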

NVIDIA recommends disabling PCI Access Control Services (ACS) as a possible solution, if this is available to you.
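
To check whether ACS is currently enabled on your PCIe bridges, one option (a sketch; the output format varies by platform and reading extended capabilities generally requires root) is:

sudo lspci -vvv | grep -i acsctl

Lines showing SrcValid+ indicate ACS is active on that bridge; disabling it is typically done in the system BIOS or with setpci.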

Forcing cross-GPU communication via NVLink may help without increasing timeouts.

To verify that your configuration is leveraging NVLink, run the following command:

nvidia-smi nvlink --status
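
It can also be helpful to inspect the full interconnect topology, which reports the connection type between each pair of GPUs (NV# for NVLink, or PIX/PXB/PHB/SYS for PCIe paths):

nvidia-smi topo -m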

To force NCCL to use NVLink, simply set this in the environment:

export NCCL_P2P_LEVEL=NVL

If NVLink is not available in your environment, there are other options for NCCL_P2P_LEVEL, described below:

PIX: P2P data transfers go through no more than a single PCIe bridge. This gives faster transfer rates than paths involving multiple bridges, but is slower than direct GPU-to-GPU communication.

PXB: P2P data transfers go through multiple PCIe bridges, but not through the PCIe Host Bridge; this more complex routing can incur a moderate amount of latency.

PHB: P2P data transfers go over PCIe and through a PCIe Host Bridge, typically involving the CPU; this can facilitate direct memory access but may introduce additional latency compared to more direct paths (e.g. PIX, NVL).
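
For example, to allow P2P traffic to cross a host bridge when that is the best path available in your topology (an illustrative choice; pick the level that matches the output of nvidia-smi topo -m):

export NCCL_P2P_LEVEL=PHB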

To validate that acceptable data transfer speeds exist for your training job, running the NCCL Tests suite can help pinpoint bottlenecks. For example:

./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
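
If the test binaries are not already present, they can be built from NVIDIA's nccl-tests repository. A sketch, assuming CUDA (and optionally a custom NCCL install) is available locally:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make   # or, if needed: make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl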

When debugging NCCL communication timeouts, it can be useful to activate additional logging in both PyTorch and NCCL:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export TORCH_DISTRIBUTED_DEBUG=INFO
export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log
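
NCCL can also write its debug log to a file instead of stderr, keeping it separate from the training output (the path below is a placeholder):

export NCCL_DEBUG_FILE=/PATH/TO/nccl-debug.log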

Finally, if you believe your training job genuinely needs more time, you can increase the timeout beyond the default 30 minutes by setting the ddp_timeout value in the Axolotl configuration. See the PyTorch init_process_group documentation for details on this value.
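
For example, the relevant line in the Axolotl YAML config might look like this (an illustrative value; assuming the setting maps to the underlying trainer's ddp_timeout argument in seconds, 7200 would be two hours):

ddp_timeout: 7200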
