Whisper Data Preparation and Fine-tuning with Unsloth
Advanced Data Preparation Techniques
I go through the details of preparing data for fine-tuning by:
Recording audio
Transcribing with whisper-timestamped (word timestamps!)
Laying out clean segments of text and audio
Correcting transcripts, manually or automatically
Fine-tuning
Cheers, Ronan
🤖 Purchase ADVANCED-audio Repo Access
Access all SEVEN Trelis GitHub Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via GitHub Issues & Trelis’ Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Whisper preparation and fine-tuning with Unsloth
0:40 Resources: Trelis.com/ADVANCED-audio
1:23 One-click GPU and Jupyter Notebook Setup
3:37 Whisper vs Voxtral vs Kyutai
4:48 Installation of Unsloth and Whisper Timestamped
7:52 Using Whisper Large versus Turbo
8:53 Video Overview / Layout - How to prepare data and train
11:33 Audio recording and transcription (with whisper timestamped)
13:06 Whisper vs Whisper-Timestamped and the motivation for word timestamps
15:34 Creating text/audio segments using word-timestamped transcripts
18:46 Segment time-stamps using whisper (not easy to then chunk to <30s!!!)
19:50 Word time-stamps with Whisper Timestamped
20:56 Automated vs manual transcript cleanup techniques
28:48 Dataset creation from audio and text segments
30:53 Fine-tuning with Unsloth
33:26 Word Error Rate - Teacher forcing versus predict_with_generate
36:11 Training hyperparameters and losses / results
37:33 Evaluating base and fine-tuned model performance
39:15 Merging, pushing to hub and preparing for inference
40:21 Conclusion
Fine-tuning Whisper Speech-to-Text Models with Unsloth
This guide covers data preparation and fine-tuning for Whisper transcription models using Unsloth, which provides approximately 2x faster training compared to the transformers library.
Model Selection
Whisper remains a practical base model for custom transcription applications despite newer alternatives. The model was trained with a pre-2023 cutoff date, meaning it lacks familiarity with terms that emerged afterward.
For comparison, three model families are available:
Whisper: Better tooling support, including faster-whisper and continuous batching servers
Voxtral: Currently achieves higher performance but has limited support in Transformers and Unsloth
Kyutai: Streaming models trained on Whisper timestamp data, more compute-efficient for real-time captions but lower accuracy
This tutorial focuses on fine-tuning Whisper Large V3 Turbo, a distilled version with fewer layers that runs faster while maintaining near-equivalent accuracy to the full model.
Data Recording and Transcription
The process begins with recording custom audio. For this demonstration, audio consists of AI-related terms that post-date Whisper’s training cutoff (GPT-5, Claude 2.0, Mixtral 8x7B, etc.). The training set contains approximately 2 minutes of audio, with a similar-sized validation set.
Word-Level Timestamps
Whisper provides segment-level timestamps by default, covering multiple words. While Whisper can generate word timestamps from attention matrices using heuristics, this approach can miss transcript chunks.
The solution uses Whisper-timestamped, which includes a separate alignment model alongside Whisper for more accurate word-level timestamps. This accuracy matters because it enables controlled chunk lengths during data preparation.
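As a minimal sketch (assuming the whisper-timestamped package, a local recording named recording.wav, and the large-v3 model size), transcription with word-level timestamps looks roughly like this:

```python
# Sketch: word-level timestamps with whisper-timestamped.
# The audio path and model size are placeholders; adjust to your setup.
import whisper_timestamped as whisper

audio = whisper.load_audio("recording.wav")              # loaded and resampled to 16 kHz
model = whisper.load_model("large-v3", device="cuda")
result = whisper.transcribe(model, audio, language="en")

# Each segment carries a "words" list with per-word start/end times.
for segment in result["segments"]:
    for word in segment["words"]:
        print(f'{word["start"]:7.2f} {word["end"]:7.2f}  {word["text"]}')
```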
Segment Assembly
The data preparation workflow:
Transcribe audio using Whisper-timestamped to obtain word-level timestamps
Assemble words into segments following these rules:
Add words until reaching at least 20 seconds
Close the segment upon encountering a full stop, question mark, exclamation mark, or long pause
Force segment closure at 30 seconds regardless (Whisper’s input limit)
This approach creates segments of between 20 and 30 seconds that align with natural sentence boundaries, which improves training quality; a sketch of the segmentation logic follows below. Starting training segments mid-sentence reduces model performance because the model lacks preceding context.
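A minimal sketch of these rules, assuming `words` is the flat list of word dicts (with "text", "start", "end" keys) collected from whisper-timestamped; the 1-second pause threshold is an assumption:

```python
MIN_LEN, MAX_LEN, LONG_PAUSE = 20.0, 30.0, 1.0   # seconds; pause threshold is an assumption

def assemble_segments(words):
    segments, current = [], []

    def close():
        segments.append({
            "start": current[0]["start"],
            "end": current[-1]["end"],
            "text": " ".join(w["text"].strip() for w in current),
        })
        current.clear()

    for i, word in enumerate(words):
        current.append(word)
        duration = current[-1]["end"] - current[0]["start"]
        nxt = words[i + 1] if i + 1 < len(words) else None
        pause = (nxt["start"] - word["end"]) if nxt else 0.0
        ends_sentence = word["text"].strip().endswith((".", "?", "!"))
        # Close at a sentence end or long pause once past the minimum length,
        # and force-close at Whisper's 30-second input limit.
        if (duration >= MIN_LEN and (ends_sentence or pause >= LONG_PAUSE)) or duration >= MAX_LEN:
            close()
    if current:
        close()
    return segments
```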
Transcript Cleaning
After initial transcription, the raw output requires correction. Two approaches exist:
LLM-based cleaning: Feed transcription lines, along with a keyword list, to an LLM for correction (a sketch follows below). This fixes misspellings but cannot recover missing words or identify added words.
Human review: Manual correction while listening to audio. This method proved significantly more effective, reducing word error rate from 40% to 20% in testing.
The human review process involves listening to each segment and correcting the VTT file directly. While this is less scalable, smaller high-quality datasets outperform larger low-quality ones.
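For the LLM-based option above, a hedged sketch using the OpenAI client as an illustration (the model name and keyword list are placeholders; the notebook may use a different provider):

```python
from openai import OpenAI

KEYWORDS = ["GPT-5", "Mixtral 8x7B", "WizardLM v1.0"]    # placeholder keyword list
client = OpenAI()

def clean_line(line: str) -> str:
    prompt = (
        "Correct misspellings in this transcript line. Do not add or remove words.\n"
        f"Known terms: {', '.join(KEYWORDS)}\n\n{line}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                             # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```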
Dataset Preparation
Each segment becomes a row in the dataset, pairing audio with its cleaned transcription text. The audio sampling rate is 16,000 Hz (Whisper’s requirement). Audio is converted to log-Mel spectrogram features, and text is tokenized.
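A sketch of this step using the Hugging Face datasets and transformers libraries (file names and transcript text are placeholders, not the notebook's exact code):

```python
from datasets import Audio, Dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-large-v3-turbo", language="English", task="transcribe"
)

# One row per segment: a path to the clipped audio plus its cleaned transcript.
dataset = Dataset.from_dict({
    "audio": ["segments/seg_000.wav"],                   # placeholder path
    "text": ["GPT-5 Turbo was released alongside..."],   # placeholder transcript
})
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]                                  # log-Mel spectrogram
    example["labels"] = processor.tokenizer(example["text"]).input_ids
    return example

dataset = dataset.map(prepare, remove_columns=["audio", "text"])
```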
Training Configuration
Training uses LoRA adapters with these parameters (a config sketch follows the list):
Rank: 32
RS-LoRA enabled (increases effective learning rate of adapters)
Targets: attention and linear layers
4-bit quantization supported
Language: English
Task: transcription
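Expressed with PEFT’s LoraConfig as an illustration (the notebook itself goes through Unsloth’s wrapper; the alpha value and target module names are assumptions):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,                                       # assumption; not specified above
    use_rslora=True,                                     # rank-stabilized LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
    lora_dropout=0.0,
)
```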
Training arguments (sketched below):
Epochs: 2
Learning rate: constant schedule
Brain float 16 precision (if supported)
Evaluation during training
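A sketch of these arguments with transformers’ Seq2SeqTrainingArguments (batch size, learning rate, and eval cadence are assumptions; older transformers versions call eval_strategy evaluation_strategy):

```python
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-turbo-ft",
    num_train_epochs=2,
    learning_rate=1e-4,                                  # assumption
    lr_scheduler_type="constant",
    bf16=torch.cuda.is_bf16_supported(),                 # bfloat16 if the GPU supports it
    per_device_train_batch_size=4,                       # assumption
    eval_strategy="epoch",                               # evaluation during training
    predict_with_generate=True,
    report_to="none",
)
```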
The data collator handles padding and ensures padding tokens are ignored during loss computation.
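A common collator pattern for Whisper fine-tuning, shown here as a sketch (the notebook’s version may differ): pad audio features and labels separately, then mask label padding with -100 so the loss ignores it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel features to a uniform batch.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label token ids and replace padding with -100 (ignored by the loss).
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch
```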
Teacher Forcing vs Generation
Two evaluation modes exist:
Teacher forcing: During prediction of each token, the model receives the correct text history plus audio input. This produces lower word error rates.
Generation mode: The model receives only a start token and audio, relying solely on audio input without text history. This produces higher word error rates but reflects actual inference conditions.
The notebook defaults to generation mode for realistic performance metrics. Teacher forcing can be more robust for long sequences but doesn’t represent real-world performance.
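A sketch of a WER metric in generation mode, assuming predict_with_generate=True and the processor defined earlier (the evaluate library’s "wer" metric is used as an illustration):

```python
import evaluate

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id   # undo the loss mask
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": 100 * wer_metric.compute(predictions=pred_str, references=label_str)}
```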
Training Results
Training completed in 36 seconds on an A40 GPU. Results showed:
Validation and training loss decreased steadily
Word error rate decreased substantially by the end of epoch 1
Evaluation comparisons:
Before training: “Anthropic Mixed Rally by 7B Instruct” → After training: “Anthropic. Mixtral 8x7B”
Before training: various formatting errors with GPT-5 and other model names → After training: “GPT-5 Turbo” correctly formatted
Some issues remained, such as missing spaces between certain words (WizardLMv1.0.Spin instead of WizardLM v1.0. Spin). This likely occurred because each term appeared only once in training data, and tokenization differs based on whether spaces precede words.
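A quick way to see the space sensitivity noted above (model name as an illustration): Whisper’s BPE tokenizer produces different token ids for a term with and without a leading space.

```python
from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v3-turbo")
print(tok.encode(" WizardLM", add_special_tokens=False))   # leading space
print(tok.encode("WizardLM", add_special_tokens=False))    # no leading space: different ids
```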
Model Export
The trained model (1.62 GB for the turbo version) can be exported in multiple formats (the merge-and-push step is sketched after the list):
Merged and pushed to Hugging Face Hub
Converted to OpenAI format
Saved as a CTranslate2 model for faster-whisper inference
Deployed with continuous batching server for highest throughput
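A sketch of the merge-and-push step, assuming `model` is the trained PEFT/LoRA model and the repo name is a placeholder:

```python
merged = model.merge_and_unload()                        # fold LoRA adapters into the base weights
merged.push_to_hub("your-username/whisper-large-v3-turbo-ft")
processor.push_to_hub("your-username/whisper-large-v3-turbo-ft")

# For faster-whisper, the merged checkpoint can then be converted to CTranslate2
# format, e.g. with ctranslate2's converter CLI (paths are placeholders):
#   ct2-transformers-converter --model your-username/whisper-large-v3-turbo-ft \
#       --output_dir whisper-turbo-ct2 --quantization float16
```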
Key Findings
Word-level timestamps enable clean segment boundaries at sentence endings
Segments between 20-30 seconds optimize training
Human-reviewed transcripts significantly outperform LLM-cleaned versions
Small, high-quality datasets exceed large, low-quality datasets
Multiple appearances per term (3-5 instances) improve generalization
Training without proper segmentation produced word error rates of 60-100%+
Proper data preparation reduced word error rates to 20%
The complete workflow—from audio recording through deployment—demonstrates that careful data preparation matters more than training duration for achieving quality results with limited data.

