Here’s a new video walking through how to do Distributed Data Parallel training to get a multi-GPU speed-up with Unsloth.
Notes:
1. 2x faster than Transformers + scales with the number of GPUs.
2. Distributed Data Parallel* only. FSDP not yet supported.
3. Recommended where your model fits fully on a single GPU.
*Doing a batch size larger than one remains a little tricky, but support should arrive soon.
Cheers, Ronan
🤖 Purchase ADVANCED-fine-tuning Repo Access
🗝️ Get the Trelis Multi-Repo Bundle:
Access all SEVEN Trelis GitHub Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via GitHub Issues & Trelis’ Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Faster training with multiple GPUs
0:39 Video Overview
1:24 Data Parallel versus Pipeline Parallel versus Fully Sharded Data Parallel
6:38 Downloading a Jupyter notebook as a Python script for multi-GPU, e.g. an Unsloth notebook
7:44 Unsloth vs Transformers for multi-GPU
8:13 Modifying a fine-tuning script for Distributed Data Parallel
9:03 Starting up a GPU in one click for fine-tuning
10:27 Converting a Jupyter notebook to a Python script
11:30 Installation notes for Unsloth, TensorBoard, and uv
13:32 Script modifications required for DDP
18:50 Training script run-through, for LoRA
22:46 Setting gradient accumulation steps
24:07 Dataset loading
26:22 Setting up the run name and training parameters
29:30 Running without multi-GPU (single-GPU check)
35:47 Running with multiple GPUs using accelerate config (note: torchrun can result in hangs)
41:02 Sanity check of running with accelerate and a single GPU
44:48 Open (at time of recording) issues with loss reporting and with using Unsloth at batch sizes larger than one
53:11 Conclusion and shout-outs to spr1nter and rakshith
Distributed Data Parallel Training with Unsloth: Technical Guide
Distributed Data Parallel (DDP) training allows you to scale model training across multiple GPUs by replicating the model on each device. This article explains how to implement DDP with the Unsloth library, which offers approximately 2x faster training compared to the Transformers library.
Key Concepts
DDP works by (a minimal code sketch follows this list):
Creating a full copy of the model on each GPU
Splitting training batches across GPUs
Accumulating and averaging gradients across devices
Keeping model weights synchronized during training
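For orientation, here is a minimal plain-PyTorch sketch of that pattern, independent of Unsloth. MyModel and train_dataset are placeholders; the script is assumed to be started by a launcher (accelerate or torchrun) that spawns one process per GPU and sets LOCAL_RANK. The DDP wrapper averages gradients across GPUs, and the DistributedSampler gives each process its own shard of the data.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
local_rank = int(os.environ["LOCAL_RANK"])   # one process per GPU; set by the launcher
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)
model = MyModel().to(local_rank)             # full model copy on this GPU (placeholder model)
model = DDP(model, device_ids=[local_rank])  # gradients are averaged across GPUs automatically
sampler = DistributedSampler(train_dataset)  # each rank sees a distinct slice of the data (placeholder dataset)
loader = DataLoader(train_dataset, batch_size=1, sampler=sampler)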
Implementation Requirements
To run DDP training with Unsloth:
Model must fit on a single GPU (a quick memory check is sketched after this list)
Training script must be a .py file (not a notebook)
Accelerate library needed for multi-GPU coordination
Batch size currently limited to 1 per device due to tensor view constraints
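A quick way to confirm the first requirement before launching anything, using plain PyTorch with no Unsloth involved:
import torch
# List each visible GPU and its total memory, to confirm the model fits on one device
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")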
Technical Setup
Convert notebook to Python script: notebook_to_script.py input.ipynb output.py
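If no helper script is to hand, Jupyter's built-in converter does the same job: jupyter nbconvert --to script input.ipynb produces input.py.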
Configure environment variables:
import os
# Set these before importing unsloth so the flags take effect when the library initialises
os.environ["UNSLOTH_DISABLE_TRAINER_PATCHING"] = "1"
os.environ["UNSLOTH_NO_CUDA_EXTENSIONS"] = "1"
Set device mapping:
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # per-process GPU index, set by the launcher
device_map = local_rank  # load this process's model copy onto its own GPU
Configure DDP parameters:
training_args = SFTConfig(
    ddp_find_unused_parameters=False,  # skip the per-step unused-parameter scan (unnecessary when all trainable parameters are used each step)
    per_device_train_batch_size=1,     # current per-device limit with Unsloth DDP
)
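With those changes in place, the script is launched through Accelerate rather than plain python. A typical invocation looks like the following (the script name is a placeholder; as noted in the video, torchrun can hang, so accelerate launch is the route shown here):
accelerate config                                    # one-time interactive setup describing the multi-GPU machine
accelerate launch --num_processes 2 train_ddp.py     # one process per GPU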
Performance Considerations
DDP scales nearly linearly with GPU count when properly configured
Unsloth's 2x speed advantage combines with multi-GPU scaling
Batch size limitations currently cap throughput (see the effective batch size arithmetic below)
Communication overhead is minimal since only gradients are exchanged between GPUs
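A useful sanity check on throughput: the effective batch size per optimizer step is the per-device batch size times the gradient accumulation steps times the number of GPUs. The numbers below are illustrative, not taken from the video.
per_device_batch = 1        # current Unsloth DDP limit
grad_accum_steps = 8        # illustrative value
num_gpus = 2
effective_batch = per_device_batch * grad_accum_steps * num_gpus   # 1 * 8 * 2 = 16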
Known Limitations
Current constraints include:
Batch size must be 1 per device due to tensor view operations (for now)
Custom loss averaging is needed for proper gradient accumulation (a generic cross-rank averaging pattern is sketched after this list)
Some logging inconsistencies are being addressed in the Transformers library
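The exact fix is still evolving, but a common pattern for reporting a loss averaged over all ranks (a generic sketch, not the author's patch; `loss` is assumed to be this rank's scalar loss tensor) is to all-reduce a detached copy:
import torch.distributed as dist
loss_for_logging = loss.detach().clone()           # do not touch the tensor used for backward
dist.all_reduce(loss_for_logging, op=dist.ReduceOp.SUM)
loss_for_logging /= dist.get_world_size()          # mean across GPUs, for logging only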
Practical Tips
For optimal results:
Test single-GPU script before scaling to multiple GPUs
Monitor training and validation losses across devices
Clear compiled caches between runs (see the snippet below)
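For the last tip, a minimal sketch, assuming Unsloth's compiled cache lives in the default unsloth_compiled_cache directory in the working directory (adjust the path if yours differs):
import shutil
# Remove the compiled cache between runs; Unsloth regenerates it on the next launch
shutil.rmtree("unsloth_compiled_cache", ignore_errors=True)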