Multi-GPU Fine-tuning
In this latest video, I cover the fundamentals of fine-tuning large language models across multiple GPUs, focusing on:
1️⃣ Naive Model Parallel: The simplest approach, splitting layers across GPUs. Works well for smaller models (see the device-map sketch just after this list).
2️⃣ Distributed Data Parallel (DDP): Replicating the model on each GPU and splitting data across them. Ideal when the model fits on one GPU.
3️⃣ Fully Sharded Data Parallel (FSDP): Sharding the model weights, gradients, and optimizer states across GPUs. The go-to for large models.
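To make the naive model parallel idea concrete, here is a minimal sketch using the Transformers device_map option to spread layers across the available GPUs. The model name is just a placeholder, not necessarily the one used in the video:

```python
# Naive model parallel sketch: layers are split across the visible GPUs,
# but only one GPU is active at a time during the forward/backward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # split layers across all available GPUs
    torch_dtype=torch.bfloat16,
)

# Inputs go to the GPU holding the embedding layer (usually cuda:0)
inputs = tokenizer("Hello, multi-GPU world!", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```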
I dive deep into the VRAM requirements for each approach, and how techniques like LoRA, quantization, and gradient checkpointing can dramatically reduce memory needs.
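For a rough sense of how those techniques stack, here is a sketch combining 4-bit quantization (bitsandbytes), LoRA (peft) and gradient checkpointing. The model name, LoRA rank and target modules are illustrative rather than the exact settings from the video:

```python
# Sketch: 4-bit quantization + LoRA + gradient checkpointing to cut VRAM.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Casts norms to float, enables input grads, and turns on gradient checkpointing
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # model-dependent choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```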
Then, I walk through code examples of how to implement DDP and FSDP in PyTorch using the Hugging Face Accelerate library. I show the key changes needed, like setting up the device map, enabling re-entrant gradient checkpointing, and gathering the full state dict for saving FSDP models.
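Here is a minimal sketch of what that looks like with Accelerate - a simplified stand-in for the repo scripts, with a dummy dataset and placeholder model. The re-entrant checkpointing flag and the full-state-dict gather at the end are the FSDP-specific pieces:

```python
# DDP/FSDP training-loop sketch with Hugging Face Accelerate.
# Run with `accelerate launch train.py` after choosing DDP or FSDP in `accelerate config`.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder

# Re-entrant gradient checkpointing
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})

# For FSDP, wrap the model before creating the optimizer
model = accelerator.prepare(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy data so the sketch is self-contained; replace with your tokenized dataset
ids = torch.randint(0, 32000, (8, 128))
dataset = [{"input_ids": x, "labels": x.clone()} for x in ids]
collate = lambda batch: {k: torch.stack([b[k] for b in batch]) for k in batch[0]}
train_dataloader = DataLoader(dataset, batch_size=2, collate_fn=collate)

optimizer, train_dataloader = accelerator.prepare(optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    loss = model(**batch).loss
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# With FSDP, gather the sharded weights into a full state dict before saving
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)
accelerator.unwrap_model(model).save_pretrained(
    "finetuned-model",
    state_dict=state_dict,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)
```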
Lastly, I share results comparing naive model parallel vs. FSDP for fine-tuning a 1B and a 34B model. FSDP delivers significant speedups by enabling much higher GPU utilization.
Trelis Grants
For talented developers, Trelis now provides fast $500 grants - applications are open here.
Cheers, Ronan
Ronan McGovern, Trelis Research
➡️ ADVANCED-fine-tuning Repo (and individual multi-gpu scripts)