When training large machine learning models across multiple GPUs, the way your GPUs are connected can have a huge impact on your training speed. One of the key technologies that can help here is NVLink, developed by NVIDIA.
In simple terms, NVLink is a high-speed connection between GPUs. It’s much faster than the traditional PCIe connection, allowing GPUs to share data more quickly. Think of it as a superhighway for your data, compared to a regular road.
When GPUs need to exchange a lot of data during training, a faster connection helps them work more efficiently. For example, in data-parallel training every GPU has to synchronize its gradients with the others after each step, and that traffic runs over the GPU interconnect (see the sketch below).
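As a rough illustration (not code from this post), here is a minimal PyTorch sketch of that gradient synchronization. The tensor size and launch command are illustrative; the `all_reduce` call is the kind of inter-GPU traffic that NCCL routes over NVLink when it is available.

```python
# Minimal sketch of data-parallel gradient synchronization (illustrative sizes).
# Launch with two GPUs, e.g.: torchrun --nproc_per_node=2 this_script.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # NCCL uses NVLink automatically when present
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Stand-in for the gradients of a ~100M-parameter model (~400 MB in fp32).
grads = torch.randn(100_000_000, device="cuda")

# Each step, every worker averages its gradients with the others.
# This collective is the traffic that a faster interconnect speeds up.
dist.all_reduce(grads, op=dist.ReduceOp.SUM)
grads /= dist.get_world_size()

dist.destroy_process_group()
```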
You can check how your GPUs are connected by running:
```
nvidia-smi topo -m
```
Here’s an example from my company’s setup:
```
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      NV4     SYS     SYS     0,2,4,6,8,10    0
GPU1    NV4      X      SYS     SYS     0,2,4,6,8,10    0
GPU2    SYS     SYS      X      NV4     1,3,5,7,9,11    1
GPU3    SYS     SYS     NV4      X      1,3,5,7,9,11    1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
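In this matrix, GPU0↔GPU1 and GPU2↔GPU3 are connected with NV4 (a bonded set of four NVLinks), while the cross pairs only see each other through SYS, i.e., across the CPU interconnect. If you prefer to check this programmatically, one quick option is PyTorch's peer-access query, sketched below; on a box like the one above you would typically expect the NVLink pairs to report True, though the exact result depends on the system and driver.

```python
# Report which GPU pairs can access each other directly (peer-to-peer).
# On the topology above, the NVLink pairs (0,1) and (2,3) are the ones
# expected to benefit most; results vary by system.
import itertools
import torch

num_gpus = torch.cuda.device_count()
for a, b in itertools.combinations(range(num_gpus), 2):
    p2p = torch.cuda.can_device_access_peer(a, b)
    print(f"GPU{a} <-> GPU{b}: peer access = {p2p}")
```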
Here’s a quick comparison of training a GPT-2 model with and without NVLink, taken from [1]:
| NVLink | Training Time |
| --- | --- |
| Yes | 101 seconds |
| No | 131 seconds |
This shows that NVLink cuts training time by roughly 23% in this small benchmark.
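If you want to get a feel for the difference on your own hardware, one hedged way to run a similar with/without comparison is to time a large `all_reduce` across two GPUs and repeat the run with `NCCL_P2P_DISABLE=1`, which tells NCCL not to use direct GPU-to-GPU transfers. The script name and sizes below are illustrative, and this is a micro-benchmark, not the exact GPT-2 setup from [1].

```python
# Rough interconnect micro-benchmark (not the exact benchmark from [1]).
# Run on an NVLink pair, then again with peer-to-peer disabled, and compare:
#   CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 bench_allreduce.py
#   CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 torchrun --nproc_per_node=2 bench_allreduce.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.randn(64 * 1024 * 1024, device="cuda")  # 256 MB of fp32 per GPU

for _ in range(5):                     # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if rank == 0:
    print(f"all_reduce of 256 MB: {elapsed / iters * 1e3:.1f} ms per call")

dist.destroy_process_group()
```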
The benefit of NVLink depends on how often your GPUs need to talk to each other: workloads that synchronize gradients on every step (for example, plain data-parallel training) gain the most, while workloads that communicate less often (for example, training with large gradient-accumulation steps) gain less. Before launching a multi-GPU job, check the topology with
```
nvidia-smi topo -m
```
and plan your GPU usage accordingly. For example, in the setup above, GPU0 and GPU1 work best as a pair and GPU2 and GPU3 as another, so a 2-GPU job should be launched on one of those pairs (e.g., `CUDA_VISIBLE_DEVICES=0,1`) rather than split across them.

[1] https://huggingface.co/transformers/v4.9.2/performance.html