Here’s a new video walking through how to do Distributed Data Parallel training to get a multi-GPU speed-up with Unsloth.
Notes:
1. 2x faster than Transformers + scales with the number of GPUs.
2. Distributed Data Parallel* only. FSDP not yet supported.
3. Recommended where your model fits fully on a single GPU.
*Doing a batch size larger than one remains a little tricky, but support should arrive soon.
Cheers, Ronan
🤖 Purchase ADVANCED-fine-tuning Repo Access
🗝️ Get the Trelis Multi-Repo Bundle:
Access all SEVEN Trelis GitHub Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via GitHub Issues & Trelis’ Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Faster training with multiple GPUs
0:39 Video Overview
1:24 Data Parallel versus Pipeline Parallel versus Fully Sharded Data Parallel
6:38 Downloading a Jupyter notebook as a Python script for multi-GPU, e.g. an Unsloth notebook
7:44 Unsloth vs Transformers for multi-GPU
8:13 Modifying a fine-tuning script for Distributed Data Parallel
9:03 Starting up a GPU in one click for fine-tuning
10:27 Converting a Jupyter notebook to a Python script
11:30 Installation notes for Unsloth, TensorBoard, and uv
13:32 Script modifications required for DDP
18:50 Training script run-through, for LoRA
22:46 Setting gradient accumulation steps
24:07 Dataset loading
26:22 Setting up the run name and training parameters
29:30 Running without multi-GPU (single-GPU check)
35:47 Running with multiple GPUs using accelerate config (note: torchrun can result in hangs)
41:02 Sanity check of running with accelerate and a single GPU
44:48 Open (at time of recording) issues with loss reporting and with using Unsloth at batch sizes larger than one
53:11 Conclusion and shout-outs to spr1nter and rakshith
Distributed Data Parallel Training with Unsloth: Technical Guide
Distributed Data Parallel (DDP) training allows you to scale model training across multiple GPUs by replicating the model on each device. This article explains how to implement DDP with the Unsloth library, which offers approximately 2x faster training compared to the Transformers library.
Key Concepts
DDP works by (a minimal code sketch follows this list):
Creating a full copy of the model on each GPU
Splitting training batches across GPUs
Accumulating and averaging gradients across devices
Keeping model weights synchronized during training
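For orientation, here is a minimal plain-PyTorch sketch of that pattern, independent of Unsloth. MyModel and train_dataset are placeholders; the script is assumed to be started by a launcher (accelerate or torchrun) that spawns one process per GPU and sets LOCAL_RANK. The DDP wrapper averages gradients across GPUs, and the DistributedSampler gives each process its own shard of the data.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
local_rank = int(os.environ["LOCAL_RANK"])   # one process per GPU; set by the launcher
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)
model = MyModel().to(local_rank)             # full model copy on this GPU (placeholder model)
model = DDP(model, device_ids=[local_rank])  # gradients are averaged across GPUs automatically
sampler = DistributedSampler(train_dataset)  # each rank sees a distinct slice of the data (placeholder dataset)
loader = DataLoader(train_dataset, batch_size=1, sampler=sampler)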
Implementation Requirements
To run DDP training with Unsloth:
Model must fit on a single GPU (a quick memory check is sketched after this list)
Training script must be a .py file (not a notebook)
Accelerate library needed for multi-GPU coordination
Batch size currently limited to 1 per device due to tensor view constraints
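A quick way to confirm the first requirement before launching anything, using plain PyTorch with no Unsloth involved:
import torch
# List each visible GPU and its total memory, to confirm the model fits on one device
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")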
Technical Setup
Convert notebook to Python script: notebook_to_script.py input.ipynb output.py
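If no helper script is to hand, Jupyter's built-in converter does the same job: jupyter nbconvert --to script input.ipynb produces input.py.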
Configure environment variables:
import os
# Set these before importing unsloth so the flags take effect when the library initialises
os.environ["UNSLOTH_DISABLE_TRAINER_PATCHING"] = "1"
os.environ["UNSLOTH_NO_CUDA_EXTENSIONS"] = "1"
Set device mapping:
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # per-process GPU index, set by the launcher
device_map = local_rank  # load this process's model copy onto its own GPU
Configure DDP parameters:
training_args = SFTConfig(
    ddp_find_unused_parameters=False,  # skip the per-step unused-parameter scan (unnecessary when all trainable parameters are used each step)
    per_device_train_batch_size=1,     # current per-device limit with Unsloth DDP
)
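With those changes in place, the script is launched through Accelerate rather than plain python. A typical invocation looks like the following (the script name is a placeholder; as noted in the video, torchrun can hang, so accelerate launch is the route shown here):
accelerate config                                    # one-time interactive setup describing the multi-GPU machine
accelerate launch --num_processes 2 train_ddp.py     # one process per GPU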
Performance Considerations
DDP scales nearly linearly with GPU count when properly configured
Unsloth's 2x speed advantage combines with multi-GPU scaling
Batch size limitations currently cap throughput (see the effective batch size arithmetic below)
Communication overhead is minimal since only gradients are exchanged between GPUs
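A useful sanity check on throughput: the effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs. The numbers below are illustrative, not taken from the video.
per_device_batch = 1        # current Unsloth DDP limit
grad_accum_steps = 8        # illustrative value
num_gpus = 2
effective_batch = per_device_batch * grad_accum_steps * num_gpus   # 1 * 8 * 2 = 16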
Known Limitations
Current constraints include:
Batch size must be 1 per device due to tensor view operations (for now)
Custom loss averaging is needed for proper gradient accumulation (a generic cross-rank averaging pattern is sketched after this list)
Some logging inconsistencies are being addressed in the Transformers library
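The exact fix is still evolving, but a common pattern for reporting a loss averaged over all ranks (a generic sketch, not the author's patch; `loss` is assumed to be this rank's scalar loss tensor) is to all-reduce a detached copy:
import torch.distributed as dist
loss_for_logging = loss.detach().clone()           # do not touch the tensor used for backward
dist.all_reduce(loss_for_logging, op=dist.ReduceOp.SUM)
loss_for_logging /= dist.get_world_size()          # mean across GPUs, for logging only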
Practical Tips
For optimal results:
Test single-GPU script before scaling to multiple GPUs
Monitor training and validation losses across devices
Clear compiled caches between runs (see the snippet below)
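For the last tip, a minimal sketch, assuming Unsloth's compiled cache lives in the default unsloth_compiled_cache directory in the working directory (adjust the path if yours differs):
import shutil
# Remove the compiled cache between runs; Unsloth regenerates it on the next launch
shutil.rmtree("unsloth_compiled_cache", ignore_errors=True)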