Multi-GPU Fine-tuning
In this latest video, I cover the fundamentals of fine-tuning large language models across multiple GPUs, focusing on:
1️⃣ Naive Model Parallel: The simplest approach, splitting layers across GPUs. Works well for smaller models (see the device-map sketch just after this list).
2️⃣ Distributed Data Parallel (DDP): Replicating the model on each GPU and splitting data across them. Ideal when the model fits on one GPU.
3️⃣ Fully Sharded Data Parallel (FSDP): Sharding the model weights, gradients, and optimizer states across GPUs. The go-to for large models.
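To make the naive model parallel idea concrete, here is a minimal sketch using the Transformers device_map option to spread layers across the available GPUs. The model name is just a placeholder, not necessarily the one used in the video:

```python
# Naive model parallel sketch: layers are split across the visible GPUs,
# but only one GPU is active at a time during the forward/backward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # split layers across all available GPUs
    torch_dtype=torch.bfloat16,
)

# Inputs go to the GPU holding the embedding layer (usually cuda:0)
inputs = tokenizer("Hello, multi-GPU world!", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```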
I dive deep into the VRAM requirements for each approach, and how techniques like LoRA, quantization, and gradient checkpointing can dramatically reduce memory needs.
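For a rough sense of how those techniques stack, here is a sketch combining 4-bit quantization (bitsandbytes), LoRA (peft) and gradient checkpointing. The model name, LoRA rank and target modules are illustrative rather than the exact settings from the video:

```python
# Sketch: 4-bit quantization + LoRA + gradient checkpointing to cut VRAM.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Casts norms to float, enables input grads, and turns on gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # model-dependent choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```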
Then, I walk through code examples of how to implement DDP and FSDP in PyTorch using the Hugging Face Accelerate library. I show the key changes needed, like setting up the device map, enabling re-entrant gradient checkpointing, and gathering the full state dict for saving FSDP models.
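Here is a minimal sketch of what that looks like with Accelerate - a simplified stand-in for the repo scripts, with a dummy dataset and placeholder model. The re-entrant checkpointing flag and the full-state-dict gather at the end are the FSDP-specific pieces:

```python
# DDP/FSDP training-loop sketch with Hugging Face Accelerate.
# Run with `accelerate launch train.py` after choosing DDP or FSDP in `accelerate config`.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

# Re-entrant gradient checkpointing
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})

# For FSDP, wrap the model before creating the optimizer
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy data so the sketch is self-contained; replace with your tokenized dataset
ids = torch.randint(0, 32000, (8, 128))
dataset = [{"input_ids": x, "labels": x.clone()} for x in ids]
collate = lambda batch: {k: torch.stack([b[k] for b in batch]) for k in batch[0]}
train_dataloader = DataLoader(dataset, batch_size=2, collate_fn=collate)

optimizer, train_dataloader = accelerator.prepare(optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    loss = model(**batch).loss
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# With FSDP, gather the sharded weights into a full state dict before saving
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)
accelerator.unwrap_model(model).save_pretrained(
    "finetuned-model",
    state_dict=state_dict,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)
```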
Lastly, I share results comparing naive model parallel vs. FSDP for fine-tuning a 1B and a 34B model. FSDP delivers significant speedups by enabling much higher GPU utilization.
Trelis Grants
For talented developers, Trelis now provides fast $500 grants - applications are open here.
Cheers, Ronan
Ronan McGovern, Trelis Research
➡️ ADVANCED-fine-tuning Repo (and individual multi-gpu scripts)