Audio Dataset Cleaning
All that glisters is not Gold
So-called “high quality” audio datasets are not always high quality, which can leave you puzzled when training doesn’t make your transcription or TTS model any better.
The most robust approach is to start from basics: measure the quality of your training dataset, and ask whether it is better or worse than the model you plan to fine-tune. If the data is worse, then you need to turn to either filtering OR to drafting labels with a stronger teacher model. I give some pointers on both AND explain the effects of data filtering on training.
Cheers, Ronan
🎙️ Enterprise Voice AI Services (ASR, TTS, Agents)
Timestamps:
0:00 Introduction: Clean transcripts crucial for speech recognition training
0:02 Problem: Popular datasets like Vox Populi contain transcription errors
0:45 Recommendation: Hand transcribe 50-100 examples to measure dataset error rate
1:04 Critical check: Compare base model performance vs training dataset quality
1:30 Example: 12% dataset error rate vs 9% base model error rate
2:04 Two filtering methods: edge error detection and CTC alignment
3:07 CTC alignment: Using confidence scores to identify misaligned characters
4:08 Baseline results: Whisper Large scores on multiple test sets
4:23 Raw Vox Populi training degrades out-of-domain metrics significantly
4:50 Filtering reduces degradation while maintaining target improvements
5:28 Filtered dataset results: Improved Vox Populi with minimal benchmark degradation
6:29 Alternative approach: Use strong open-source models for draft labels
6:32 Final emphasis: Human annotation essential for assessing data quality
Filtering Audio Training Data to Improve Model Performance
When training audio models for transcription or text-to-speech, the accuracy of the transcripts paired with the audio significantly affects outcomes. Some datasets that appear clean contain errors. Vox Populi, a dataset of European Parliament recordings, contains transcription errors that can lead to poor training results even when users expect high-quality transcripts.
Measuring Dataset Quality
Before using any training dataset, measure its quality. This requires hand-transcribing 50 to 100 examples and calculating the word error rate of the dataset's transcripts against these ground-truth transcriptions. This measurement provides a baseline for dataset quality.
When fine-tuning a pre-existing base model such as Whisper or Qwen, compare the error rate of your base model against that of the dataset. Run the base model over the same audio and measure its word error rate against the same hand-transcribed examples. If the starting model performs better than the dataset, improving performance through training on that dataset becomes difficult.
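As a minimal sketch of this measurement, the snippet below computes a corpus-level word error rate in plain Python. The normalization rule (lowercase, strip punctuation) and the two example sentences are assumptions made for illustration, not something taken from the dataset; a real evaluation would typically use an established text normalizer.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references: list[str], hypotheses: list[str]) -> float:
    """Total word errors divided by total reference words across all examples."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref)
        total_errors += word_edit_distance(ref_words, normalize(hyp))
        total_words += len(ref_words)
    return total_errors / total_words if total_words else 0.0


# Hypothetical examples: hand transcriptions vs. the dataset's own transcripts.
human_transcripts = [
    "The committee will vote on the proposal tomorrow.",
    "Thank you, Madam President.",
]
dataset_transcripts = [
    "the committee will vote on a proposal tomorrow",
    "thank you madam president",
]

dataset_wer = corpus_wer(human_transcripts, dataset_transcripts)
print(f"Dataset WER: {dataset_wer:.1%}")
```

The same function can score a base model or a draft model: pass the human transcriptions as references and the model's outputs as hypotheses.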
For example, the word error rate of a dataset measured by 50 or 100 manually annotated examples might be 12%, while the starting model scores 9% word error rate. This scenario presents two options:
Filter the raw dataset to keep only high-quality samples, potentially reducing its word error rate to 7%. This creates a gap in which the filtered data can improve the pre-trained model
Use an open-source model to generate pseudolabels or draft labels by running it over the dataset audio, then measure the word error rate of these labels against the 50 or 100 human-labeled examples
A strong draft model might score 5% word error rate, making it a better training source than the labeled data from sources like the European Parliament.
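The comparison above can be sketched as a small helper that keeps only the label sources that beat the base model. The function name and dictionary layout are hypothetical; the figures are the illustrative ones from this section, and all of them are assumed to be measured against the same human-labeled examples.

```python
def rank_training_sources(base_model_wer: float,
                          candidates: dict[str, float]) -> list[tuple[str, float]]:
    """Return the label sources with a lower WER than the base model, best first."""
    useful = [(name, wer) for name, wer in candidates.items() if wer < base_model_wer]
    return sorted(useful, key=lambda item: item[1])


# Illustrative word error rates from the example above.
candidate_wers = {
    "raw labels": 0.12,
    "filtered labels": 0.07,
    "draft labels": 0.05,
}

ranked = rank_training_sources(base_model_wer=0.09, candidates=candidate_wers)
for name, wer in ranked:
    print(f"{name}: {wer:.0%}")
```

An empty result would mean that none of the label sources is cleaner than the model you are starting from.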
Filtering Methods
When a draft model cannot outperform the raw labels or the pre-trained model, two methods can be used to filter the data.
Edge Error Detection
Run a draft model over the audio to generate another set of labels. Compare the ends of these transcripts with the raw transcripts. Drop rows where mismatches occur.
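A rough sketch of this check, assuming the draft transcripts have already been generated and stored next to the raw ones. The row layout (`raw` and `draft` keys), the `n_edge_words` parameter, and the choice to compare both the first and last few words are assumptions for illustration.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def edges_match(raw: str, draft: str, n_edge_words: int = 2) -> bool:
    """True if the first and last n words of the two transcripts agree."""
    raw_words, draft_words = words(raw), words(draft)
    if not raw_words or not draft_words:
        return False
    starts_agree = raw_words[:n_edge_words] == draft_words[:n_edge_words]
    ends_agree = raw_words[-n_edge_words:] == draft_words[-n_edge_words:]
    return starts_agree and ends_agree


def filter_edge_errors(rows: list[dict], n_edge_words: int = 2) -> list[dict]:
    """Keep only rows whose raw and draft transcripts agree at both edges."""
    return [row for row in rows if edges_match(row["raw"], row["draft"], n_edge_words)]


# Hypothetical rows: "raw" is the dataset label, "draft" is the draft model output.
rows = [
    {"id": 1, "raw": "Thank you, Madam President.",
     "draft": "thank you madam president"},
    {"id": 2, "raw": "the vote will take place tomorrow",
     "draft": "the vote will take place tomorrow at noon"},
    {"id": 3, "raw": "we strongly support this amendment today",
     "draft": "we strongly oppose this amendment today"},
]

kept_ids = [row["id"] for row in filter_edge_errors(rows)]
print(kept_ids)
```

Row 2 is dropped because the raw label stops before the draft transcript does. Row 3 is kept even though the two transcripts disagree mid-sentence, since this check only looks at the edges.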
CTC Alignment
Use a CTC model with both text and audio to perform alignment and calculate the confidence of each character in the transcript. Drop rows containing low-confidence characters that don’t align well with the audio. The confidence threshold can be tuned.
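The filtering step on its own can be sketched as follows. This assumes a forced-alignment tool has already produced one confidence value per transcript character; the `char_confidences` field, the `threshold` default, and the example values are hypothetical, and the alignment itself is not shown.

```python
def low_confidence_chars(text: str, char_confidences: list[float],
                         threshold: float) -> list[tuple[int, str, float]]:
    """List (position, character, confidence) for characters below the threshold."""
    return [(i, ch, conf)
            for i, (ch, conf) in enumerate(zip(text, char_confidences))
            if conf < threshold]


def filter_by_ctc_confidence(rows: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep rows in which every aligned character meets the confidence threshold."""
    kept = []
    for row in rows:
        confidences = row["char_confidences"]
        # Rows with no alignment output are treated as bad rows.
        if confidences and min(confidences) >= threshold:
            kept.append(row)
    return kept


# Hypothetical rows with precomputed per-character alignment confidences.
rows = [
    {"id": "a", "text": "hello", "char_confidences": [0.98, 0.95, 0.97, 0.99, 0.96]},
    {"id": "b", "text": "world", "char_confidences": [0.97, 0.12, 0.20, 0.95, 0.90]},
    {"id": "c", "text": "", "char_confidences": []},
]

kept_ids = [row["id"] for row in filter_by_ctc_confidence(rows, threshold=0.5)]
suspect = low_confidence_chars(rows[1]["text"], rows[1]["char_confidences"], 0.5)
print(kept_ids, suspect)
```

Lowering the threshold keeps more rows at the cost of letting more misaligned text through, so it is worth re-measuring word error rate on the human-labeled examples after each change.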
Combined Filtering Process
Combine these techniques in sequence:
Start with the input raw dataset
Apply draft edge detection and drop bad rows
Apply CTC model filtering and drop bad rows
The result is a dataset with a lower word error rate than the pre-trained model
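The sequence above can be sketched as a small pipeline that applies each filter in turn and records how many rows survive each stage. The row fields (`edges_match`, `min_char_confidence`) are hypothetical precomputed values, and the 0.5 threshold is an arbitrary placeholder.

```python
from typing import Callable

Row = dict
RowFilter = Callable[[Row], bool]


def run_filter_pipeline(rows: list[Row],
                        stages: list[tuple[str, RowFilter]]
                        ) -> tuple[list[Row], list[tuple[str, int]]]:
    """Apply each (name, keep_row) stage in order; report the row count after each stage."""
    report = [("input", len(rows))]
    current = list(rows)
    for name, keep_row in stages:
        current = [row for row in current if keep_row(row)]
        report.append((name, len(current)))
    return current, report


# Hypothetical rows with precomputed results from the two checks.
rows = [
    {"id": 1, "edges_match": True, "min_char_confidence": 0.90},
    {"id": 2, "edges_match": False, "min_char_confidence": 0.95},
    {"id": 3, "edges_match": True, "min_char_confidence": 0.20},
    {"id": 4, "edges_match": True, "min_char_confidence": 0.80},
]

stages = [
    ("draft edge detection", lambda row: row["edges_match"]),
    ("CTC confidence", lambda row: row["min_char_confidence"] >= 0.5),
]

cleaned, report = run_filter_pipeline(rows, stages)
for stage_name, remaining in report:
    print(f"{stage_name}: {remaining} rows")
```

The per-stage counts show how much data each filter costs. Whether the result really has a lower word error rate than the pre-trained model still has to be confirmed against the human-labeled examples.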
Trelis created a dataset called Vox Populi Platinum using multiple filtration steps on the Vox Populi dataset to reduce word error rate.
Training Results with Whisper Large
Testing on Whisper Large showed measurable differences between filtered and unfiltered data:
Base model performance without fine-tuning:
8% on withheld Vox Populi test set
4% on FLEURS
8% on Common Voice
10-11% on Common Voice Spontaneous
Training on raw Vox Populi transcripts improved performance on Vox Populi but significantly degraded the model on the other benchmarks. The bad rows degraded overall model quality.
Training with proper filtering:
Further improved performance on Vox Populi
Reduced degradation on other metrics
Some degradation remains, which may be unavoidable when domains differ. Scripted and spontaneous transcription are different tasks, making it difficult to perform well on both simultaneously. Filtering maintains significantly better performance on out-of-domain benchmarks.
Results with Stronger Models
Performance patterns change with stronger models like Qwen 3 ASR. The base performance is already strong. Training on raw Vox Populi transcripts degraded performance across all benchmarks.
Training on a filtered dataset:
Improved performance significantly on the Vox Populi withheld test set
Caused relatively small degradation across other benchmarks
These two techniques, CTC alignment and draft-model edge detection, remove bad rows from datasets.
Alternative Approach Using Draft Labels
For those unable to spend money on curation or purchasing curated datasets, one approach works well: take a strong off-the-shelf open-source model (currently Parakeet or Qwen) and run it over the dataset to generate draft labels. In many cases, this performs nearly as well as filtered data.
Start by assessing data quality with human-annotated examples to establish ground truth about data quality. Compare this against the model being used for continued pre-training and against other options like draft labels or filtered datasets.

