NCCL
Troubleshooting NCCL issues from the Axolotl documentation
NVIDIA's Collective Communications Library (NCCL, pronounced "Nickel") is a library that facilitates and optimises multi-GPU communication operations such as broadcast, all-gather, reduce, and all-reduce.
Broadly, NCCL configuration is highly environment-specific and is controlled via several environment variables.
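For example, NCCL behaviour can be tuned entirely from the shell before launching a training job. The variable names below are real NCCL settings; the values shown are illustrative, not recommendations:

```shell
# Illustrative NCCL environment settings.
export NCCL_DEBUG=INFO          # verbosity of NCCL's own logging (VERSION, WARN, INFO, TRACE)
export NCCL_SOCKET_IFNAME=eth0  # restrict NCCL to a specific network interface (example interface name)
export NCCL_P2P_LEVEL=NVL       # allow peer-to-peer transfers only over NVLink
echo "NCCL_P2P_LEVEL=$NCCL_P2P_LEVEL"
```

These must be set in the environment of every rank before the training process starts, typically via the launcher or job script.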
A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort.
Often, this timeout occurs after 30 minutes (the default) and is accompanied by below-average power consumption despite near-100% GPU utilization before the error is raised.
NVIDIA recommends disabling PCI Access Control Services (ACS) as a possible solution, if this is available to you.
Alternatively, forcing cross-GPU communication over NVLink may resolve the issue without increasing timeouts.
To verify that your configuration is leveraging NVLink, run the following command:
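The command itself appears to have been lost from this page; `nvidia-smi` can report per-GPU NVLink status. The snippet below is guarded so it degrades gracefully on machines without an NVIDIA driver:

```shell
# Query NVLink status for all GPUs (prints link state and speed per GPU).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi nvlink --status
else
  echo "nvidia-smi not available on this machine"
fi
```

If NVLink is present and active, each GPU will list its links and their speeds; no output for a GPU generally means no NVLink connectivity.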
To force NCCL to use NVLink, simply set this in the environment:
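The variable being referenced is `NCCL_P2P_LEVEL`; setting it to `NVL` restricts peer-to-peer transfers to NVLink connections:

```shell
# Restrict NCCL peer-to-peer transfers to NVLink.
export NCCL_P2P_LEVEL=NVL
```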
If NVLink is not available in your environment, there are other options for NCCL_P2P_LEVEL, described below:

PIX: P2P data transfers traverse at most a single PCIe bridge. Faster transfer rates than paths involving multiple bridges, but slower than direct GPU-to-GPU communication.

PXB: P2P data transfers traverse multiple PCIe bridges but do not go through the PCIe Host Bridge; this path involves more complex routing and can incur moderate latency.

PHB: P2P data transfers go over PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but may introduce additional latency compared to more direct paths (e.g. PIX, NVL).
When debugging NCCL communication timeouts, it can be useful to activate additional logging in both PyTorch and NCCL:
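The specific settings appear to have been lost from this page; the environment variables below are the standard knobs for PyTorch distributed and NCCL logging (the subsystem filter value `COLL`, for collectives, is one of several NCCL accepts):

```shell
# PyTorch distributed debug output (OFF, INFO, or DETAIL).
export TORCH_DISTRIBUTED_DEBUG=INFO
# NCCL's own logging level (VERSION, WARN, INFO, or TRACE).
export NCCL_DEBUG=INFO
# Optionally narrow NCCL logging to specific subsystems, e.g. collective operations.
export NCCL_DEBUG_SUBSYS=COLL
```

Expect a significant increase in log volume with these enabled; they are best turned on only while reproducing the timeout.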
To validate that acceptable data transfer speeds exist for your training job, running NVIDIA's nccl-tests benchmarks can help pinpoint bottlenecks, for example:
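A typical invocation of the nccl-tests all-reduce benchmark looks like the following; the byte range and GPU count are illustrative, and the snippet is guarded so it only runs where the benchmark has been built:

```shell
# Run the nccl-tests all-reduce benchmark from 8 bytes to 128 MiB,
# doubling the message size each step (-f 2), across 2 GPUs (-g 2).
# Requires nccl-tests to be built first: https://github.com/NVIDIA/nccl-tests
if [ -x ./build/all_reduce_perf ]; then
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
else
  echo "all_reduce_perf not built; see the nccl-tests repository"
fi
```

The reported bus bandwidth at large message sizes should be compared against the theoretical bandwidth of your interconnect; a large gap suggests P2P traffic is falling back to a slower path.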
Finally, if you believe your training job genuinely needs more time, you can increase the timeout beyond the 30-minute default by setting the ddp_timeout value in the Axolotl configuration. See the Axolotl configuration reference for documentation on this value.
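In the Axolotl YAML config this is a single top-level key; the value below (one hour, expressed in seconds) is illustrative:

```yaml
# Raise the distributed-communication timeout from the 30-minute
# default to one hour (value is in seconds).
ddp_timeout: 3600
```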