Here’s the recipe for high-quality voice clones:
3 hours of audio data (ideally clean)
de-noise it
level the volume
chunk into 30-60s segments, respecting sentence boundaries!
Fine-tune (LoRA is fine, full fine-tuning too)
Inference with cloning AND the fine-tune
Quality gets up there with ElevenLabs. Have a listen to see if you can spot the difference.
Cheers, Ronan
Video Links:
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Timestamps:
0:00 Fine-tuning Text-to-Speech Models with Unsloth
0:53 Video Overview
1:47 Video Resources
2:26 Voice Quality Examples: ElevenLabs vs Open Source
4:52 The recipe for professional quality voice cloning
6:23 How do token-based text-to-speech models work?
14:08 Data Preparation and Training Overview
16:02 Data preparation, cleaning and chunking for voice cloning
24:05 Audio transcription from uploaded audio
25:42 Dataset chunking and pushing to HuggingFace Hub
29:49 Loading Sesame CSM-1B and LoRA adapters (full fine-tuning also possible! And in the repo)
34:36 Dataset loading and creating an eval split
37:42 Training Hyperparameters
40:08 Running inference on the fine-tuned model, and evaluating
43:57 LoRA fine-tuning of Orpheus by Canopy Labs - Data loading is very different!
50:27 Running inference and Listening to the quality with Orpheus
53:15 Professional Voice Cloning with ElevenLabs
56:18 Examining tensorboard logs from the Sesame LoRA fine-tuning
57:27 Upcoming video on serving Orpheus with vLLM
58:10 Conclusion
Professional Voice Cloning with Open Source Models
Training a model to replicate a specific voice requires careful data preparation and model fine-tuning. This article explains how to achieve professional-quality voice cloning using open source models, comparing them with commercial offerings.
Data Requirements and Preparation
Approximately 3 hours of high-quality voice recordings needed
Audio must be cleaned and normalized:
High-pass filter removes low-frequency noise
FFT-based denoiser reduces background noise
Loudness normalization standardizes volume
Dynamic audio normalization smooths out volume variation over time
100 ms of silence padding added to clip endings
Audio resampled to 24 kHz to match the models' expected input
Text transcripts generated using the Whisper Turbo model
Data chunked into 30-60 second segments, respecting sentence boundaries (see the preparation sketch below)
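To make this concrete, here is a minimal preparation sketch, assuming ffmpeg and the openai-whisper package are installed. The filter settings and the greedy chunking threshold are illustrative assumptions, not the exact values from the video, and the file names raw_voice.wav/clean_voice.wav are placeholders.

```python
# Minimal data-prep sketch: clean the audio with ffmpeg, transcribe with
# Whisper Turbo, then pack segments into 30-60s chunks on sentence-ish
# boundaries. Filter values below are illustrative, not the video's exact ones.
import subprocess
import whisper  # pip install openai-whisper

RAW, CLEAN = "raw_voice.wav", "clean_voice.wav"  # placeholder file names

# Cleaning chain mirroring the list above: high-pass filter, FFT denoiser
# (afftdn), loudness normalization, dynamic audio normalization, 100 ms of
# trailing silence, and a resample to 24 kHz mono.
subprocess.run([
    "ffmpeg", "-y", "-i", RAW,
    "-af", "highpass=f=80,afftdn,loudnorm,dynaudnorm,apad=pad_dur=0.1",
    "-ar", "24000", "-ac", "1",
    CLEAN,
], check=True)

# Whisper returns segments that end on natural pauses, which is a usable
# proxy for sentence boundaries.
segments = whisper.load_model("turbo").transcribe(CLEAN)["segments"]

# Greedily pack consecutive segments until a chunk passes 30 s; chunks never
# split a segment, so edges stay on sentence-ish boundaries and most chunks
# land in the 30-60 s window.
chunks, texts, start = [], [], None
for seg in segments:
    start = seg["start"] if start is None else start
    texts.append(seg["text"].strip())
    if seg["end"] - start >= 30:
        chunks.append({"start": start, "end": seg["end"], "text": " ".join(texts)})
        texts, start = [], None
if texts:
    chunks.append({"start": start, "end": segments[-1]["end"], "text": " ".join(texts)})

for c in chunks:
    print(f"{c['start']:7.1f}s-{c['end']:7.1f}s  {c['text'][:60]}")
```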
Model Architecture
Two main open source models compared:
CSM-1B (1.6B parameters)
Orpheus (3.3B parameters)
Both use token-based approaches:
Audio divided into time windows
Each window represented by multiple tokens in a hierarchy:
CSM-1B: 1 coarse token + 31 detail tokens
Orpheus: 7 tokens in series per timestamp
Models trained to predict the next audio tokens given text input (illustrated below)
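The token layout is easiest to see as array shapes. The sketch below is purely illustrative: the 12.5 Hz frame rate is an assumption borrowed from the Mimi codec used by CSM, and Orpheus's SNAC codec actually runs at its own rates; only the per-frame token counts come from the description above.

```python
# Illustration of the two token layouts for ~2 s of audio. The frame rate is
# an assumed value; only the tokens-per-frame counts come from the text above.
import numpy as np

frames = int(2.0 * 12.5)  # ~2 s at an assumed 12.5 Hz codec frame rate

# CSM-1B: 1 coarse (semantic) token + 31 detail tokens per frame, i.e. the
# model predicts a 2-D grid of codebook indices, one row per time window.
csm_tokens = np.zeros((frames, 1 + 31), dtype=np.int64)

# Orpheus: 7 tokens in series per timestamp, flattened into a single 1-D
# stream that a plain next-token language model can predict.
orpheus_tokens = np.zeros(frames * 7, dtype=np.int64)

print(csm_tokens.shape, orpheus_tokens.shape)  # (25, 32) (175,)
```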
Fine-tuning Process
LoRA adapters used instead of full fine-tuning
Only 1-1.75% of model parameters trained
Training settings (sketched in code below):
Batch size: 16 with gradient accumulation (effective batch 32)
Learning rate: 2e-4
5 warmup steps
3 epochs total
Progress monitored via TensorBoard:
Training loss
Validation loss
Gradient norm
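Here is a hedged sketch of that configuration using peft and transformers (argument names follow recent library versions). The hyperparameters match the list above; the LoRA rank and target modules are assumptions, since the article only states that 1-1.75% of parameters are trained.

```python
# LoRA + trainer configuration sketch. Hyperparameters match the list above;
# the rank and target modules are assumptions, not the repo's exact choices.
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=32,                       # assumed rank, sized for ~1-1.75% trainable params
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.0,
)
# model = get_peft_model(base_model, lora_cfg)   # base_model: CSM-1B or Orpheus
# model.print_trainable_parameters()             # sanity-check the 1-1.75% figure

args = TrainingArguments(
    output_dir="voice-clone-lora",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # 16 x 2 = effective batch of 32
    learning_rate=2e-4,
    warmup_steps=5,
    num_train_epochs=3,
    eval_strategy="epoch",          # logs validation loss each epoch
    logging_steps=1,
    report_to="tensorboard",        # training loss, eval loss, grad norm
)
```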
Performance Comparison
Quality tested against the commercial ElevenLabs model:
CSM-1B achieved quality comparable to ElevenLabs - in a VERY loose comparison
Orpheus showed good results but slightly lower quality, likely due to:
Suboptimal sequence length padding
Fewer training epochs
Both models improved significantly with:
Clean, normalized audio data
Proper sentence boundary chunking
Combined fine-tuning and cloning approach (sketched below)
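As a final sketch, here is what "cloning AND fine-tuning" looks like at inference time, written against the generator interface from the SesameAILabs csm repo. It assumes the LoRA adapter has already been merged into the checkpoint that load_csm_1b picks up, and reference.wav plus its transcript are placeholders you supply.

```python
# Inference sketch combining a cloning prompt with the fine-tuned weights.
# Assumes the csm repo is on the path and the LoRA adapter has been merged
# into the checkpoint that load_csm_1b loads.
import torchaudio
from generator import load_csm_1b, Segment  # from the SesameAILabs csm repo

generator = load_csm_1b(device="cuda")

# Cloning context: a reference clip of the target voice plus its transcript.
ref_audio, sr = torchaudio.load("reference.wav")  # placeholder file
ref_audio = torchaudio.functional.resample(
    ref_audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
)
context = [Segment(text="Transcript of the reference clip.", speaker=0, audio=ref_audio)]

# The context prompt steers the voice at inference time, and the fine-tuned
# weights reinforce it, which is where the quality jump comes from.
audio = generator.generate(
    text="Hello! This should sound like the cloned voice.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("clone.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```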
The results demonstrate that open source models can now achieve professional-quality voice cloning, though care must be taken with data preparation and training parameters.