NCCL
Troubleshooting NCCL issues when training with Axolotl.
NVIDIA's Collective Communications Library (NCCL, pronounced “Nickel”) is a library that facilitates and optimises multi-GPU communication operations, such as broadcast, all-gather, reduce, and all-reduce.
Broadly, NCCL configuration is highly environment-specific and is controlled via several environment variables.
A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort.
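The PyTorch distributed watchdog typically reports an error resembling the following; the exact wording, fields, and values vary by PyTorch version and collective operation:

```text
Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1806948 milliseconds before timing out.
```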
Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised.
Nvidia recommends disabling PCI access control services (ACS) as a possible solution, if this is available to you.
Forcing cross-GPU communication via NVLink may help without increasing timeouts.
To verify that your configuration is leveraging NVLink, run the following command:
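```bash
# Show per-link NVLink status and bandwidth for each GPU
nvidia-smi nvlink --status
```

If NVLink is active, this prints the status and bandwidth of each link per GPU; on systems without NVLink it typically prints nothing.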
To force NCCL to use NVLink, simply set this in the environment:
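```bash
# NVL restricts P2P to GPU pairs connected via NVLink
export NCCL_P2P_LEVEL=NVL
```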
If NVLink is not available in your environment, there are other options for `NCCL_P2P_LEVEL` in the table below:
| NCCL_P2P_LEVEL | Description |
|---|---|
| PIX | P2P data transfers through no more than a single PCIe bridge. Faster data transfer rates than paths involving multiple bridges, but slower compared to direct GPU-to-GPU communication. |
| PXB | P2P data transfers through multiple PCIe bridges, but not through the PCIe Host Bridge; this path involves a complex routing process, potentially incurring a moderate level of latency. |
| PHB | P2P data transfers over PCIe and through a PCIe Host Bridge, typically involving the CPU, which can facilitate direct memory access but might introduce additional latency compared to more direct paths (e.g., PIX, NVL). |
When debugging NCCL communication timeouts, it can be useful to activate additional logging in both PyTorch and NCCL:
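A minimal set of environment variables for verbose diagnostics; the log-file path is a placeholder to adjust for your setup:

```bash
export NCCL_DEBUG=INFO                # NCCL-level logging
export NCCL_DEBUG_SUBSYS=ALL          # log all NCCL subsystems (INIT, COLL, NET, ...)
export TORCH_DISTRIBUTED_DEBUG=INFO   # extra logging/checks in torch.distributed
export TORCHELASTIC_ERROR_FILE=/PATH/TO/torcherror.log  # persist torchelastic error traces
```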
To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:
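A sketch using the `all_reduce_perf` binary from NVIDIA's nccl-tests suite; the byte range, step factor, and GPU count below are illustrative and should be adjusted to your hardware:

```bash
# Measure all-reduce bandwidth from 8 bytes to 128 MB, doubling each step, across 3 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
```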
Finally, if you believe your training job needs more time, you can increase the timeout past the 30-minute default by setting the `ddp_timeout` value in the Axolotl configuration. See the Axolotl configuration reference for documentation on this value.
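For example, to allow collective operations up to two hours (the value is in seconds; 7200 is an illustrative choice, not a recommendation):

```yaml
# Axolotl config (YAML)
ddp_timeout: 7200  # raise the collective timeout from the 1800 s default
```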