Whisper Data Preparation and Fine-tuning with Unsloth
Advanced Data Preparation Techniques
I go through the details of preparing data for fine-tuning by:
Recording audio
Transcribing with whisper-timestamped (word timestamps!)
Laying out clean segments of text and audio
Correcting transcripts, manually or automatically
Fine-tuning
Cheers, Ronan
🤖 Purchase ADVANCED-audio Repo Access
Access all SEVEN Trelis GitHub Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via GitHub Issues & Trelis’ Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Whisper preparation and fine-tuning with Unsloth
0:40 Resources: Trelis.com/ADVANCED-audio
1:23 One-click GPU and Jupyter Notebook Setup
3:37 Whisper vs Voxtral vs Kyutai
4:48 Installation of Unsloth and Whisper Timestamped
7:52 Using Whisper Large versus Turbo
8:53 Video Overview / Layout - How to prepare data and train
11:33 Audio recording and transcription (with whisper timestamped)
13:06 Whisper vs Whisper-Timestamped and the motivation for word timestamps
15:34 Creating text/audio segments using word-timestamped transcripts
18:46 Segment time-stamps using whisper (not easy to then chunk to <30s!!!)
19:50 Word time-stamps with Whisper Timestamped
20:56 Automated vs manual transcript cleanup techniques
28:48 Dataset creation from audio and text segments
30:53 Fine-tuning with Unsloth
33:26 Word Error Rate - Teacher forcing versus predict_with_generate
36:11 Training hyperparameters and losses / results
37:33 Evaluating base and fine-tuned model performance
39:15 Merging, pushing to hub and preparing for inference
40:21 Conclusion
Fine-tuning Whisper Speech-to-Text Models with Unsloth
This guide covers data preparation and fine-tuning for Whisper transcription models using Unsloth, which provides approximately 2x faster training compared to the transformers library.
Model Selection
Whisper remains a practical base model for custom transcription applications despite newer alternatives. The model was trained with a pre-2023 cutoff date, meaning it lacks familiarity with terms that emerged afterward.
For comparison, three model families are available:
Whisper: Better tooling support, including faster-whisper and continuous batching servers
Voxtral: Currently achieves higher performance but has limited support in Transformers and Unsloth
Kyutai: Streaming models trained on Whisper timestamp data, more compute-efficient for real-time captions but lower accuracy
This tutorial focuses on fine-tuning Whisper Large V3 Turbo, a distilled version with fewer layers that runs faster while maintaining near-equivalent accuracy to the full model.
Data Recording and Transcription
The process begins with recording custom audio. For this demonstration, audio consists of AI-related terms that post-date Whisper’s training cutoff (GPT-5, Claude 2.0, Mixtral 8x7B, etc.). The training set contains approximately 2 minutes of audio, with a similar-sized validation set.
Word-Level Timestamps
Whisper provides segment-level timestamps by default, covering multiple words. While Whisper can generate word timestamps from attention matrices using heuristics, this approach can miss transcript chunks.
The solution uses Whisper-timestamped, which includes a separate alignment model alongside Whisper for more accurate word-level timestamps. This accuracy matters because it enables controlled chunk lengths during data preparation.
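As a minimal sketch (assuming the whisper-timestamped package, a local recording named recording.wav, and the large-v3 model size), transcription with word-level timestamps looks roughly like this:

```python
# Sketch: word-level timestamps with whisper-timestamped.
# The audio path and model size are placeholders; adjust to your setup.
import whisper_timestamped as whisper

audio = whisper.load_audio("recording.wav")              # loaded and resampled to 16 kHz
model = whisper.load_model("large-v3", device="cuda")
result = whisper.transcribe(model, audio, language="en")

# Each segment carries a "words" list with per-word start/end times.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f}  {word["text"]}')
```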
Segment Assembly
The data preparation workflow:
Transcribe audio using Whisper-timestamped to obtain word-level timestamps
Assemble words into segments following these rules:
Add words until reaching at least 20 seconds
Close the segment upon encountering a full stop, question mark, exclamation mark, or long pause
Force segment closure at 30 seconds regardless (Whisper’s input limit)
This approach creates segments of between 20 and 30 seconds that align with natural sentence boundaries, which improves training quality; a sketch of the segmentation logic follows below. Starting training segments mid-sentence reduces model performance because the model lacks preceding context.
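A minimal sketch of these rules, assuming `words` is the flat list of word dicts (with "text", "start", "end" keys) collected from whisper-timestamped; the 1-second pause threshold is an assumption:

```python
MIN_LEN, MAX_LEN, LONG_PAUSE = 20.0, 30.0, 1.0   # seconds; pause threshold is an assumption

def assemble_segments(words):
    segments, current = [], []

    def close():
        segments.append({
            "start": current[0]["start"],
            "end": current[-1]["end"],
            "text": " ".join(w["text"].strip() for w in current),
        })
        current.clear()

    for i, word in enumerate(words):
        current.append(word)
        duration = current[-1]["end"] - current[0]["start"]
        nxt = words[i + 1] if i + 1 < len(words) else None
        pause = (nxt["start"] - word["end"]) if nxt else 0.0
        ends_sentence = word["text"].strip().endswith((".", "?", "!"))
        # Close at a sentence end or long pause once past the minimum length,
        # and force-close at Whisper's 30-second input limit.
        if (duration >= MIN_LEN and (ends_sentence or pause >= LONG_PAUSE)) or duration >= MAX_LEN:
            close()
    if current:
        close()
    return segments
```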
Transcript Cleaning
After initial transcription, the raw output requires correction. Two approaches exist:
LLM-based cleaning: Feed transcription lines, along with a keyword list, to an LLM for correction (a sketch follows below). This fixes misspellings but cannot recover missing words or identify added words.
Human review: Manual correction while listening to audio. This method proved significantly more effective, reducing word error rate from 40% to 20% in testing.
The human review process involves listening to each segment and correcting the VTT file directly. While this is less scalable, smaller high-quality datasets outperform larger low-quality ones.
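For the LLM-based option above, a hedged sketch using the OpenAI client as an illustration (the model name and keyword list are placeholders; the notebook may use a different provider):

```python
from openai import OpenAI

KEYWORDS = ["GPT-5", "Mixtral 8x7B", "WizardLM v1.0"]    # placeholder keyword list
client = OpenAI()

def clean_line(line: str) -> str:
    prompt = (
        "Correct misspellings in this transcript line. Do not add or remove words.\n"
        f"Known terms: {', '.join(KEYWORDS)}\n\n{line}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                             # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```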
Dataset Preparation
Each segment becomes a row in the dataset, pairing audio with its cleaned transcription text. The audio sampling rate is 16,000 Hz (Whisper’s requirement). Audio is converted to log-Mel spectrogram features, and text is tokenized.
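A sketch of this step using the Hugging Face datasets and transformers libraries (file names and transcript text are placeholders, not the notebook's exact code):

```python
from datasets import Audio, Dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3-turbo", language="English", task="transcribe"
)

# One row per segment: a path to the clipped audio plus its cleaned transcript.
dataset = Dataset.from_dict({
    "audio": ["segments/seg_000.wav"],                   # placeholder path
    "text": ["GPT-5 Turbo was released alongside..."],   # placeholder transcript
})
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]                                  # log-Mel spectrogram
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

dataset = dataset.map(prepare, remove_columns=["audio", "text"])
```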
Training Configuration
Training uses LoRA adapters with these parameters (a config sketch follows the list):
Rank: 32
RS-LoRA enabled (increases effective learning rate of adapters)
Targets: attention and linear layers
4-bit quantization supported
Language: English
Task: transcription
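Expressed with PEFT’s LoraConfig as an illustration (the notebook itself goes through Unsloth’s wrapper; the alpha value and target module names are assumptions):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,                                       # assumption; not specified above
    use_rslora=True,                                     # rank-stabilized LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    lora_dropout=0.0,
)
```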
Training arguments (sketched below):
Epochs: 2
Learning rate: constant schedule
Brain float 16 precision (if supported)
Evaluation during training
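A sketch of these arguments with transformers’ Seq2SeqTrainingArguments (batch size, learning rate, and eval cadence are assumptions; older transformers versions call eval_strategy evaluation_strategy):

```python
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-ft",
    num_train_epochs=2,
    learning_rate=1e-4,                                  # assumption
    lr_scheduler_type="constant",
    bf16=torch.cuda.is_bf16_supported(),                 # bfloat16 if the GPU supports it
    per_device_train_batch_size=4,                       # assumption
    eval_strategy="epoch",                               # evaluation during training
    predict_with_generate=True,
    report_to="none",
)
```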
The data collator handles padding and ensures padding tokens are ignored during loss computation.
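A common collator pattern for Whisper fine-tuning, shown here as a sketch (the notebook’s version may differ): pad audio features and labels separately, then mask label padding with -100 so the loss ignores it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel features to a uniform batch.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label token ids and replace padding with -100 (ignored by the loss).
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch
```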
Teacher Forcing vs Generation
Two evaluation modes exist:
Teacher forcing: During prediction of each token, the model receives the correct text history plus audio input. This produces lower word error rates.
Generation mode: The model receives only a start token and audio, relying solely on audio input without text history. This produces higher word error rates but reflects actual inference conditions.
The notebook defaults to generation mode for realistic performance metrics. Teacher forcing can be more robust for long sequences but doesn’t represent real-world performance.
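A sketch of a WER metric in generation mode, assuming predict_with_generate=True and the processor defined earlier (the evaluate library’s "wer" metric is used as an illustration):

```python
import evaluate

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id   # undo the loss mask
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}
```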
Training Results
Training completed in 36 seconds on an A40 GPU. Results showed:
Validation and training loss decreased steadily
Word error rate decreased substantially by the end of epoch 1
Evaluation comparisons:
Before training: “Anthropic Mixed Rally by 7B Instruct” → After training: “Anthropic. Mixtral 8x7B”
Before training: various formatting errors with GPT-5 and other model names → After training: “GPT-5 Turbo” correctly formatted
Some issues remained, such as missing spaces between certain words (WizardLMv1.0.Spin instead of WizardLM v1.0. Spin). This likely occurred because each term appeared only once in training data, and tokenization differs based on whether spaces precede words.
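A quick way to see the space sensitivity noted above (model name as an illustration): Whisper’s BPE tokenizer produces different token ids for a term with and without a leading space.

```python
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v3-turbo")
print(tok.encode(" WizardLM", add_special_tokens=False))   # leading space
print(tok.encode("WizardLM", add_special_tokens=False))    # no leading space: different ids
```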
Model Export
The trained model (1.62 GB for the turbo version) can be exported in multiple formats (the merge-and-push step is sketched after the list):
Merged and pushed to Hugging Face Hub
Converted to OpenAI format
Saved as a CTranslate2 model for faster-whisper inference
Deployed with continuous batching server for highest throughput
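A sketch of the merge-and-push step, assuming `model` is the trained PEFT/LoRA model and the repo name is a placeholder:

```python
merged = model.merge_and_unload()                        # fold LoRA adapters into the base weights
merged.push_to_hub("your-username/whisper-large-v3-turbo-ft")
processor.push_to_hub("your-username/whisper-large-v3-turbo-ft")

# For faster-whisper, the merged checkpoint can then be converted to CTranslate2
# format, e.g. with ctranslate2's converter CLI (paths are placeholders):
#   ct2-transformers-converter --model your-username/whisper-large-v3-turbo-ft \
#       --output_dir whisper-turbo-ct2 --quantization float16
```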
Key Findings
Word-level timestamps enable clean segment boundaries at sentence endings
Segments between 20-30 seconds optimize training
Human-reviewed transcripts significantly outperform LLM-cleaned versions
Small, high-quality datasets exceed large, low-quality datasets
Multiple appearances per term (3-5 instances) improve generalization
Training without proper segmentation produced word error rates of 60-100%+
Proper data preparation reduced word error rates to 20%
The complete workflow—from audio recording through deployment—demonstrates that careful data preparation matters more than training duration for achieving quality results with limited data.

