Here’s the recipe for high-quality voice clones:
3 hours of audio data (ideally clean)
de-noise it
level the volume
chunk into 30-60s segments, respecting sentence boundaries!
Fine-tune (LoRA is fine, full fine-tuning too)
Inference with cloning AND the fine-tune
Quality gets up there with ElevenLabs. Have a listen to see if you can spot the difference.
Cheers, Ronan
Video Links:
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Timestamps:
0:00 Fine-tuning Text-to-Speech Models with Unsloth
0:53 Video Overview
1:47 Video Resources
2:26 Voice Quality Examples: ElevenLabs vs Open Source
4:52 The recipe for professional quality voice cloning
6:23 How do token-based text-to-speech models work?
14:08 Data Preparation and Training Overview
16:02 Data preparation, cleaning and chunking for voice cloning
24:05 Audio transcription from uploaded audio
25:42 Dataset chunking and pushing to HuggingFace Hub
29:49 Loading Sesame CSM-1B and LoRA adapters (full fine-tuning also possible! And in the repo)
34:36 Dataset loading and creating an eval split
37:42 Training Hyperparameters
40:08 Running inference on the fine-tuned model, and evaluating
43:57 LoRA fine-tuning of Orpheus by Canopy Labs - Data loading is very different!
50:27 Running inference and Listening to the quality with Orpheus
53:15 Professional Voice Cloning with ElevenLabs
56:18 Examining tensorboard logs from the Sesame LoRA fine-tuning
57:27 Upcoming video on serving Orpheus with vLLM
58:10 Conclusion
Professional Voice Cloning with Open Source Models
Training a model to replicate a specific voice requires careful data preparation and model fine-tuning. This article explains how to achieve professional-quality voice cloning using open source models, comparing them with commercial offerings.
Data Requirements and Preparation
Approximately 3 hours of high-quality voice recordings needed
Audio must be cleaned and normalized:
High-pass filter removes low-frequency noise
FFT-based denoiser reduces background noise
Loudness normalization standardizes volume
Dynamic audio normalization smooths out volume variation over time
100 ms of silence padding added to clip endings
Audio resampled to 24 kHz to match the models' expected input
Text transcripts generated using the Whisper Turbo model
Data chunked into 30-60 second segments, respecting sentence boundaries (see the preparation sketch below)
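To make this concrete, here is a minimal preparation sketch, assuming ffmpeg and the openai-whisper package are installed. The filter settings and the greedy chunking threshold are illustrative assumptions, not the exact values from the video, and the file names raw_voice.wav/clean_voice.wav are placeholders.

```python
# Minimal data-prep sketch: clean the audio with ffmpeg, transcribe with
# Whisper Turbo, then pack segments into 30-60s chunks on sentence-ish
# boundaries. Filter values below are illustrative, not the video's exact ones.
import subprocess
import whisper  # pip install openai-whisper

RAW, CLEAN = "raw_voice.wav", "clean_voice.wav"  # placeholder file names

# Cleaning chain mirroring the list above: high-pass filter, FFT denoiser
# (afftdn), loudness normalization, dynamic audio normalization, 100 ms of
# trailing silence, and a resample to 24 kHz mono.
subprocess.run([
    "ffmpeg", "-y", "-i", RAW,
    "-af", "highpass=f=80,afftdn,loudnorm,dynaudnorm,apad=pad_dur=0.1",
    "-ar", "24000", "-ac", "1",
    CLEAN,
], check=True)

# Whisper returns segments that end on natural pauses, which is a usable
# proxy for sentence boundaries.
segments = whisper.load_model("turbo").transcribe(CLEAN)["segments"]

# Greedily pack consecutive segments until a chunk passes 30 s; chunks never
# split a segment, so edges stay on sentence-ish boundaries and most chunks
# land in the 30-60 s window.
chunks, texts, start = [], [], None
for seg in segments:
    start = seg["start"] if start is None else start
    texts.append(seg["text"].strip())
    if seg["end"] - start >= 30:
        chunks.append({"start": start, "end": seg["end"], "text": " ".join(texts)})
        texts, start = [], None
if texts:
    chunks.append({"start": start, "end": segments[-1]["end"], "text": " ".join(texts)})

for c in chunks:
    print(f"{c['start']:7.1f}s-{c['end']:7.1f}s  {c['text'][:60]}")
```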
Model Architecture
Two main open source models compared:
CSM-1B (1.6B parameters)
Orpheus (3.3B parameters)
Both use token-based approaches:
Audio divided into time windows
Each window represented by multiple tokens in a hierarchy:
CSM-1B: 1 coarse token + 31 detail tokens
Orpheus: 7 tokens in series per timestamp
Models trained to predict the next audio tokens given text input (illustrated below)
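The token layout is easiest to see as array shapes. The sketch below is purely illustrative: the 12.5 Hz frame rate is an assumption borrowed from the Mimi codec used by CSM, and Orpheus's SNAC codec actually runs at its own rates; only the per-frame token counts come from the description above.

```python
# Illustration of the two token layouts for ~2 s of audio. The frame rate is
# an assumed value; only the tokens-per-frame counts come from the text above.
import numpy as np

frames = int(2.0 * 12.5)  # ~2 s at an assumed 12.5 Hz codec frame rate

# CSM-1B: 1 coarse (semantic) token + 31 detail tokens per frame, i.e. the
# model predicts a 2-D grid of codebook indices, one row per time window.
csm_tokens = np.zeros((frames, 1 + 31), dtype=np.int64)

# Orpheus: 7 tokens in series per timestamp, flattened into a single 1-D
# stream that a plain next-token language model can predict.
orpheus_tokens = np.zeros(frames * 7, dtype=np.int64)

print(csm_tokens.shape, orpheus_tokens.shape)  # (25, 32) (175,)
```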
Fine-tuning Process
LoRA adapters used instead of full fine-tuning
Only 1-1.75% of model parameters trained
Training settings (sketched in code below):
Batch size: 16 with gradient accumulation (effective batch 32)
Learning rate: 2e-4
5 warmup steps
3 epochs total
Progress monitored via TensorBoard:
Training loss
Validation loss
Gradient norm
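Here is a hedged sketch of that configuration using peft and transformers (argument names follow recent library versions). The hyperparameters match the list above; the LoRA rank and target modules are assumptions, since the article only states that 1-1.75% of parameters are trained.

```python
# LoRA + trainer configuration sketch. Hyperparameters match the list above;
# the rank and target modules are assumptions, not the repo's exact choices.
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=32,                       # assumed rank, sized for ~1-1.75% trainable params
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.0,
)
# model = get_peft_model(base_model, lora_cfg)   # base_model: CSM-1B or Orpheus
# model.print_trainable_parameters()             # sanity-check the 1-1.75% figure

args = TrainingArguments(
    output_dir="voice-clone-lora",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # 16 x 2 = effective batch of 32
    learning_rate=2e-4,
    warmup_steps=5,
    num_train_epochs=3,
    eval_strategy="epoch",          # logs validation loss each epoch
    logging_steps=1,
    report_to="tensorboard",        # training loss, eval loss, grad norm
)
```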
Performance Comparison
Quality tested against the commercial ElevenLabs model:
CSM-1B achieved quality comparable to ElevenLabs - in a VERY loose comparison
Orpheus showed good results but slightly lower quality, likely due to:
Suboptimal sequence length padding
Fewer training epochs
Both models improved significantly with:
Clean, normalized audio data
Proper sentence boundary chunking
Combined fine-tuning and cloning approach (sketched below)
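As a final sketch, here is what "cloning AND fine-tuning" looks like at inference time, written against the generator interface from the SesameAILabs csm repo. It assumes the LoRA adapter has already been merged into the checkpoint that load_csm_1b picks up, and reference.wav plus its transcript are placeholders you supply.

```python
# Inference sketch combining a cloning prompt with the fine-tuned weights.
# Assumes the csm repo is on the path and the LoRA adapter has been merged
# into the checkpoint that load_csm_1b loads.
import torchaudio
from generator import load_csm_1b, Segment  # from the SesameAILabs csm repo

generator = load_csm_1b(device="cuda")

# Cloning context: a reference clip of the target voice plus its transcript.
ref_audio, sr = torchaudio.load("reference.wav")  # placeholder file
ref_audio = torchaudio.functional.resample(
    ref_audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
)
context = [Segment(text="Transcript of the reference clip.", speaker=0, audio=ref_audio)]

# The context prompt steers the voice at inference time, and the fine-tuned
# weights reinforce it, which is where the quality jump comes from.
audio = generator.generate(
    text="Hello! This should sound like the cloned voice.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("clone.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```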
The results demonstrate that open source models can now achieve professional-quality voice cloning, though care must be taken with data preparation and training parameters.