NCCL
Troubleshooting NCCL issues from the Axolotl documentation
NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is a library that facilitates and optimises multi-GPU communication operations, such as broadcast, all-gather, reduce, and all-reduce.
Broadly, NCCL configuration is highly environment-specific and is configured via several environment variables.
A common NCCL-related problem occurs when a long-running operation times out, causing the training process to abort.
Often, this timeout will happen after 30 minutes (the default setting) and is accompanied by below-average power consumption with near 100% GPU utilization before the error is raised.
NVIDIA recommends disabling PCI Access Control Services (ACS) as a possible solution, if this option is available to you.
Forcing cross-GPU communication via NVLink may help without increasing timeouts.
To verify that your configuration is leveraging NVLink, you can query the link status on the machine.
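A quick check is to query NVLink status with nvidia-smi (the guard around the call is only so the snippet runs cleanly on machines without the NVIDIA driver):

```shell
# Show the state of each NVLink for every GPU; requires the NVIDIA driver.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi nvlink --status
else
  echo "nvidia-smi not found (NVIDIA driver not installed)"
fi
```

Active links are listed per GPU with their per-link bandwidth; if no links are reported, traffic will fall back to PCIe.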
To force NCCL to use NVLink, simply set this in the environment:
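Setting the NCCL_P2P_LEVEL environment variable to NVL restricts peer-to-peer transfers to NVLink connections:

```shell
# Restrict NCCL peer-to-peer communication to NVLink.
export NCCL_P2P_LEVEL=NVL
```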
If NVLink is not available in your environment, there are other options for NCCL_P2P_LEVEL, summarized in the table below:

| NCCL_P2P_LEVEL | Description |
| --- | --- |
| PIX | P2P data transfers through no more than a single PCIe bridge. Faster than paths involving multiple bridges, but slower than direct GPU-to-GPU communication. |
| PXB | P2P data transfers through multiple PCIe bridges, but not through the PCIe Host Bridge; this path involves more complex routing and can incur moderate latency. |
| PHB | P2P data transfers over PCIe and through a PCIe Host Bridge, typically involving the CPU; this can facilitate direct memory access but may introduce additional latency compared to more direct paths (e.g. PIX, NVL). |
To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:
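One way to do this, assuming CUDA and NCCL are already installed, is NVIDIA's open-source nccl-tests suite; the -g value below is an illustrative choice and should match your GPU count:

```shell
# Build the NCCL benchmark suite.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
# All-reduce bandwidth test: message sizes from 8 bytes to 128 MB,
# doubling each step (-f 2), across 4 GPUs (-g 4).
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```

The reported bus bandwidth should be close to the rated bandwidth of your interconnect; a large shortfall points to a misconfigured or slow P2P path.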
It can be useful when debugging NCCL communication timeouts to activate additional logging in both PyTorch and NCCL:
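As a sketch, the following environment variables enable verbose logging; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are read by NCCL itself, while TORCH_DISTRIBUTED_DEBUG is read by PyTorch:

```shell
# Verbose NCCL logging: INFO level, all subsystems.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
# Extra logging from torch.distributed (accepts INFO or DETAIL).
export TORCH_DISTRIBUTED_DEBUG=INFO
```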
Finally, if you believe your training job needs more time, you can increase the timeout past 30 minutes by setting the ddp_timeout value (in seconds) in the Axolotl configuration. See the PyTorch init_process_group documentation for details on this value.
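For example, to allow up to two hours (7200 is an illustrative value; ddp_timeout is specified in seconds), add the following to your Axolotl YAML config:

```yaml
# Raise the distributed collective timeout from the 1800-second default.
ddp_timeout: 7200
```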