Align Audio and Text for Speech Recognition
A key element of data preparation for voice model training
To train audio models, whether text-to-speech (TTS) or speech-to-text (STT), you typically need to align the spoken words with their positions in the audio.
This is true whether you wish to train a transcription model that generates timestamps OR you wish to prepare clean chunks of text-audio data for model training.
Surprisingly, few alignment models are available, particularly ones that may be used commercially.
I show how alignment works AND recommend specific models and methods to use.
Cheers, Ronan
🗝️ Trelis All Access (7 Github Repos, Support via Github Issues & Private Discord)
Timestamps:
0:00 Introduction to audio-text alignment for model training
1:05 Demo of Audio Alignment interface with two models
2:06 MMS-FA model and commercially-licensed CTC aligner alternative
3:10 Word timestamps enable sentence detection for clean training chunks
4:21 Multi-step alignment process: normalization, emissions, and character probabilities
5:21 Viterbi process calculates most likely path for final alignment
6:44 Trelis Studio data preparation workflow with audio upload
7:45 Realignment process creates clean 20-30 second chunks with sentence boundaries
8:46 Review of resulting dataset with clean chunks and word timestamps
10:01 Fine-tuning Whisper without timestamps causes catastrophic forgetting
12:07 Emissions are character probabilities generated per audio frame
13:09 Wav2Vec2 models and text normalization process
15:49 Torch Audio forced aligner non-commercial license restriction
18:56 Viterbi method for mapping ground truth text to audio windows
21:54 Emissions model training using unlabeled data approach
23:34 Multiple valid sequences kept during alignment process
25:36 Wav2Vec2 pre-training uses masked audio prediction
26:36 Conclusion with repository reference at Trelis.com/advanced-audio
Aligning Text and Audio for Speech Model Training
Aligning text with audio is necessary for training text-to-speech and speech-to-text models, particularly when generating transcripts with timestamps or preparing training data that respects sentence boundaries. Token-based models like certain text-to-speech systems also require this alignment. When preparing voice training data, chunking audio at sentence boundaries rather than arbitrary time intervals produces cleaner training examples.
Two Alignment Models
The demonstration uses two different alignment models. The first is MMS-FA, the default forced-alignment model from Torch Audio, which is slated for deprecation. The second uses the CTC aligner package with a commercially friendly license. Facebook's Wav2Vec2 base (960h) model is openly licensed for commercial use, while the Torch Audio default model has a non-commercial license.
How Alignment Works
Alignment operates on normalized text: lowercase, with punctuation removed. To preserve the original formatting, the process maintains a mapping from the original text to the normalized version. After alignment, this mapping is reversed to restore capitalization and punctuation. Without this mapping, elements like “10.7b” would be lost during normalization.
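A minimal sketch of this normalize-and-map step (the regex and word-level mapping here are assumptions for illustration, not the repository's exact implementation):

```python
import re

def normalize_with_mapping(text):
    """Lowercase and strip punctuation, keeping for each normalized word
    the index of the original word it came from."""
    original_words = text.split()
    normalized_words, mapping = [], []
    for i, word in enumerate(original_words):
        norm = re.sub(r"[^a-z0-9']", "", word.lower())
        if norm:  # drop words that normalize away entirely
            normalized_words.append(norm)
            mapping.append(i)
    return normalized_words, original_words, mapping

norm, orig, mapping = normalize_with_mapping("Hello, World! It's a 10.7B model.")
# norm is what gets aligned against the audio; mapping restores formatting
restored = [orig[mapping[i]] for i in range(len(norm))]
```

The aligner only ever sees the normalized words; once spans come back, the index map recovers the original “10.7B” with its capital and decimal point intact.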
The alignment process has multiple steps:
Normalize the text
Process the audio through a neural network to generate emissions
Apply a Viterbi process to calculate the most likely alignment path
Emissions are probability distributions over characters at every audio frame. For each audio window, the model outputs probabilities for characters A, B, C, etc., plus a blank token. Alignment does not simply select the most likely character at each frame in isolation. Instead, the Viterbi process finds the most likely complete path through the emissions that is consistent with the target text.
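The steps above can be sketched in miniature. The toy emissions below are hand-made per-frame character probabilities (in practice they come from the neural network); the dynamic program is the standard CTC Viterbi, with blanks interleaved between target characters:

```python
import math

def viterbi_align(emissions, target, blank=0):
    """Most likely frame-level path through CTC emissions that collapses
    to `target`. emissions[t][c] is the probability of token c at frame t."""
    ext = [blank]                      # interleave blanks: cat -> -c-a-t-
    for tok in target:
        ext += [tok, blank]
    T, S = len(emissions), len(ext)
    dp = [[float("-inf")] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    dp[0][0] = math.log(emissions[0][ext[0]])
    dp[0][1] = math.log(emissions[0][ext[1]])
    for t in range(1, T):
        for s in range(S):
            cands = [(dp[t - 1][s], s)]                    # stay on same state
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))    # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))    # skip over a blank
            best, prev = max(cands)
            dp[t][s] = best + math.log(emissions[t][ext[s]])
            back[t][s] = prev
    s = max((S - 1, S - 2), key=lambda i: dp[T - 1][i])    # end on blank or last char
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]

# 5 frames, token ids: 0 = blank, 1 = 'c', 2 = 'a', 3 = 't'
emissions = [
    [0.1, 0.7, 0.1, 0.1],   # 'c' likely
    [0.1, 0.7, 0.1, 0.1],   # 'c' likely
    [0.7, 0.1, 0.1, 0.1],   # blank likely
    [0.1, 0.1, 0.7, 0.1],   # 'a' likely
    [0.1, 0.1, 0.1, 0.7],   # 't' likely
]
path = viterbi_align(emissions, [1, 2, 3])   # align against "cat"
# path -> [1, 1, 0, 2, 3]: 'c' spans frames 0-1, 'a' frame 3, 't' frame 4
```

Production aligners (such as Torch Audio's forced-alignment function) implement exactly this recurrence, just vectorized over real emissions.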
Practical Application in Trelis Studio
In the Trelis Studio data preparation workflow, audio files are first transcribed using Fireworks. The transcription initially produces 30-second chunks that often end mid-sentence. After alignment generates word timestamps, the system can create new chunks of 20-30 seconds that respect sentence boundaries.
The process produces nine chunks from the demonstration audio, each starting and ending at a clean sentence break. Where no sentence ends between 20 and 30 seconds, a chunk may still split a sentence, but the system generally produces consistent chunk lengths with proper boundaries.
Word Timestamps for Model Training
Word timestamps serve multiple purposes beyond creating clean chunks. Some speech-to-text systems require explicit timestamps. When training Whisper with timestamps, timestamp tokens must be injected into the segments so the model learns temporal alignment. Whisper models are trained with timestamps in their data, which is why they can generate timestamps during inference. Light fine-tuning without timestamp data often preserves this ability, but fine-tuning that shifts the weights significantly can catastrophically forget timestamp generation unless timestamp data is included.
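As a sketch, injecting timestamp tokens into a training target might look like the following. The `<|t.tt|>` token format and the 0.02-second grid follow Whisper's convention; the segment data is made up:

```python
def ts_token(seconds):
    """Whisper-style timestamp token, snapped to the 0.02 s grid."""
    return f"<|{round(seconds / 0.02) * 0.02:.2f}|>"

def build_target(segments):
    """Wrap each segment's text in start/end timestamp tokens."""
    return "".join(
        ts_token(seg["start"]) + seg["text"] + ts_token(seg["end"])
        for seg in segments
    )

segments = [
    {"start": 0.0, "end": 2.4, "text": " Hello there."},
    {"start": 2.4, "end": 5.8, "text": " Welcome to the demo."},
]
target = build_target(segments)
# -> "<|0.00|> Hello there.<|2.40|><|2.40|> Welcome to the demo.<|5.80|>"
```

Targets built this way keep the timestamp tokens in the loss, so fine-tuning reinforces rather than erodes the model's timing ability.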
Technical Implementation
The alignment code loads a CTC model for a specified language. Many Wav2Vec2 fine-tunes exist on HuggingFace for different languages. The English implementation uses Facebook's wav2vec2-base-960h model. The system also supports Hindi and Arabic models, though the English model works reasonably well for languages with similar alphabets.
When processing long audio, the CTC aligner library splits the waveform into chunks of up to 30 seconds. These chunks are batched together for GPU processing, making efficient use of hardware acceleration. The model generates emissions for the entire sequence in batched pieces.
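A pure-Python sketch of that chunk-and-batch step (the real library operates on tensors, but the bookkeeping is the same; the 16 kHz sample rate and 30 s chunk length are the usual values for Wav2Vec2-style models):

```python
SAMPLE_RATE = 16_000      # Wav2Vec2-style models expect 16 kHz audio
CHUNK_SECONDS = 30

def chunk_waveform(samples, sample_rate=SAMPLE_RATE, chunk_s=CHUNK_SECONDS):
    """Split a waveform (list of samples) into chunks of at most chunk_s seconds."""
    size = sample_rate * chunk_s
    return [samples[i:i + size] for i in range(0, len(samples), size)]

def pad_batch(chunks):
    """Zero-pad every chunk to the longest one so they stack into one batch."""
    longest = max(len(c) for c in chunks)
    return [c + [0.0] * (longest - len(c)) for c in chunks]

samples = [0.0] * (SAMPLE_RATE * 75)       # 75 s of (silent) audio
chunks = chunk_waveform(samples)           # 30 s + 30 s + 15 s
batch = pad_batch(chunks)                  # three 30 s rows, one GPU pass
```

Batching the padded chunks lets the emissions model run one forward pass over all of them instead of one pass per chunk.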
After generating emissions, the system combines them with normalized text to produce alignment. The process matches probability distributions across the audio with the ground truth text, spreading the text to best fit the character probabilities. Once alignment is complete, the system calculates timing spans for each word, reverses the normalization mapping, and returns the complete alignment data.
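A sketch of the span computation: collapse the frame-level path into character runs, group runs into words, and convert frames to seconds. The 20 ms frame duration is an assumption typical of Wav2Vec2-style encoders, and the path and words are made up:

```python
def char_runs(frame_path, blank="-"):
    """Collapse a frame-level path into [char, first_frame, last_frame] runs."""
    runs = []
    for t, ch in enumerate(frame_path):
        if ch == blank:
            continue
        if runs and runs[-1][0] == ch and runs[-1][2] == t - 1:
            runs[-1][2] = t               # same emission continuing
        else:
            runs.append([ch, t, t])       # a new character begins
    return runs

def word_spans(frame_path, words, frame_s=0.02):
    """Per-word (word, start_s, end_s) spans from a frame-level path."""
    runs = char_runs(frame_path)
    spans, i = [], 0
    for word in words:
        first = runs[i][1]
        last = runs[i + len(word) - 1][2]
        spans.append((word, first * frame_s, (last + 1) * frame_s))
        i += len(word)
    return spans

# Frame path for "hi bo": 'h' on frames 0-1, 'i' on 3, 'b' on 7, 'o' on 9-10
spans = word_spans(list("hh-i---b-oo"), ["hi", "bo"])
# -> roughly [("hi", 0.0, 0.08), ("bo", 0.14, 0.22)]
```

Reversing the normalization mapping over these spans then yields word timestamps against the original, fully formatted text.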
Training Emissions Models
Emissions models are trained on unaligned data: audio paired with its transcript text, but with no timing labels. The audio generates probability distributions for each window, and training considers only the frame sequences that collapse to the ground-truth character order. For the text “THE”, valid sequences (writing “–” for the blank token) include “–T–H–E–” or “T-T-H-E” or “T-H-E-E”, while any sequence collapsing to a different order, such as “H-E-T”, is invalid.
The training maximizes the sum of probabilities for valid sequences while ignoring invalid ones. This approach works because there are typically few valid alignments that respect the correct character order. Knowing only the correct sequence of characters in the ground truth text is sufficient to tune the model.
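This objective can be checked by brute force on a toy case. Shrinking the example to the target “he”, three audio frames, and an alphabet of {blank, h, e} (all probabilities made up), we enumerate every frame sequence, keep those that collapse to “he”, and sum their probabilities:

```python
from itertools import product

def collapse(path, blank="-"):
    """CTC collapse: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev:
            out.append(s)
        prev = s
    return "".join(s for s in out if s != blank)

# Per-frame probabilities over {blank, 'h', 'e'} for 3 frames (made up)
probs = [
    {"-": 0.1, "h": 0.8, "e": 0.1},
    {"-": 0.6, "h": 0.2, "e": 0.2},
    {"-": 0.1, "h": 0.1, "e": 0.8},
]

target = "he"
valid, total = [], 0.0
for path in product("-he", repeat=3):
    if collapse(path) == target:
        p = 1.0
        for t, s in enumerate(path):
            p *= probs[t][s]
        valid.append("".join(path))
        total += p

# Only 5 of the 27 possible paths are valid; total is their summed
# probability (~0.672). CTC loss is the negative log of this sum.
```

Training maximizes this total, which real CTC implementations compute with dynamic programming rather than enumeration, since the number of paths grows exponentially with length.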
This represents post-training. Pre-training uses models like Facebook's wav2vec 2.0, which are trained with masked prediction: the system feeds audio into the model, masks spans of the latent representation, and asks the model to identify the content of the masked portions, with a contrastive loss grading how well it does so. This pre-trained model handles audio well, and post-training adapts it to predict character probabilities for alignment.
Running Time and Caching
The first alignment run loads the model, which takes noticeable time. Subsequent alignments run faster because the model is cached locally on the machine. The system runs entirely locally but can also run on GPUs for faster processing.
Code Access
The alignment demonstration and commercial implementation code are available in the Trelis Advanced Audio Repository at Trelis.com/advanced-audio. The repository includes the alignment scripts, model configurations, and detailed explanations of the process.

