Train and Deploy Transcription Models - Zero to Hero
Introducing Trelis Studio - an early release
This is something I’ve been working on for the past two weeks to make voice model data preparation, training and deployment (serving) much easier.
This is relevant for developers trying to train transcription models (and soon text to speech models too).
Data quality matters a lot to get good transcription results (or synthetic voices). So do training hyperparameters (which can be fiddly to set).
This isn’t quite at the level where someone with no knowledge of models could use it, but it makes things much easier for developers - and it builds on familiar platforms for reviewing data quality (Hugging Face datasets and Weights & Biases).
Cheers, Ronan
🤖 Project link - Trelis Studio
🗝️ Trelis All Access (7 Github Repos, Support via Github Issues & Private Discord)
Timestamps:
0:00 Introduction and overview of pipeline
1:11 Data requirements: audio recordings and transcripts
2:18 Uploading audio-text pairs and dataset preparation
3:17 Saving word swaps and transcribing with model for better training
4:20 Warning about clean text being out-of-distribution for small models
5:20 Setting up Hugging Face token and Weights & Biases key
6:21 Creating validation set using ChatGPT to rephrase text
7:40 Configuring training settings and advanced parameters
8:41 Baseline evaluation shows 7.09% word error rate
9:46 Training begins with falling loss and word error rate
10:47 Model training progress and high grad norm observation
11:53 Model and logs pushed to Hugging Face Hub
12:53 Inspecting evaluation results and specific corrections
14:11 Spelling improvements and regressions in fine-tuned model
15:16 Deploying model to endpoint with keep warm feature
16:17 Auto-sleep containers and API key access options
17:23 Testing endpoint and transcript download formats
18:26 Evaluation tab features and future text-to-speech plans
Training and Deploying Custom Whisper Models with Trelis Studio
Trelis Studio is a platform for training and deploying voice models, starting with transcription models based on Whisper. The platform handles the complete workflow: data preparation, model training, evaluation, and deployment to auto-scaling endpoints.
When to Fine-Tune a Transcription Model
Fine-tuning is relevant when standard transcription produces errors. Common causes include:
Specific accents that the base model handles poorly
Domain-specific vocabulary not well-represented in the base model’s training data
Specialized terminology or proper nouns
Data Requirements
Training requires paired audio and text data. The audio should ideally come from the same speaker, or speakers similar to those in the production use case.
If transcripts don’t exist, Trelis Studio can generate them using a third-party endpoint (Fireworks). Users can then review and correct the generated transcripts.
The platform will later support generating synthetic training data from text alone using voice synthesis.
Data Preparation Workflow
The data preparation interface accepts audio files with or without corresponding transcripts. For files without transcripts, users can:
Generate transcripts automatically via the transcribe button
Review transcripts using keyboard shortcuts (Option + arrow keys on Mac, Alt + arrow keys on Windows move 5 seconds forward/backward)
Make manual corrections
Save word swaps for reuse on future datasets
Word swaps are stored and can be applied to clean up subsequent transcriptions, improving efficiency over time.
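A saved word swap is essentially a mapping of recurring mis-transcriptions to their corrections. As a rough sketch (the swap entries below are hypothetical; Trelis Studio maintains its own store), applying them could look like:

```python
import re

def apply_word_swaps(transcript: str, swaps: dict[str, str]) -> str:
    """Replace whole-word matches so saved corrections carry over."""
    for wrong, right in swaps.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right, transcript)
    return transcript

# Hypothetical swaps: corrections saved from earlier review sessions.
swaps = {"trellis": "Trelis", "whisper": "Whisper"}
print(apply_word_swaps("The trellis whisper model", swaps))
# The Trelis Whisper model
```

Because the swaps are whole-word replacements, they can be re-applied safely to every new batch of transcripts.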
The platform processes uploaded data by:
Aligning audio with text
Splitting content into 20-30 second chunks
Pushing the processed dataset to Hugging Face
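The chunking step can be sketched as grouping aligned segments until a chunk reaches roughly the 20-30 second target. The segment format below (start, end, text) is an assumption for illustration; the actual pipeline may differ.

```python
def chunk_segments(segments, max_len=30.0):
    """Group (start, end, text) segments into chunks of at most max_len seconds."""
    chunks, current, start = [], [], None
    for seg_start, seg_end, text in segments:
        if start is None:
            start = seg_start
        if seg_end - start > max_len and current:
            chunks.append(" ".join(current))   # flush the current chunk
            current, start = [], seg_start
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks

segments = [(0.0, 12.0, "first part"), (12.0, 24.0, "second part"),
            (24.0, 36.0, "third part")]
print(chunk_segments(segments))
# ['first part second part', 'third part']
```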
For smaller models, using transcribed-then-corrected text may perform better than clean manual transcripts because the formatting matches the model’s distribution more closely.
Creating Validation Sets
The platform can automatically split large datasets into training and validation portions. For smaller datasets where each word appears only once, separate validation sets are needed.
One approach demonstrated: take training text, use ChatGPT to rephrase while maintaining meaning and vocabulary, then record audio reading the rephrased text. This creates validation data with the same vocabulary but different phrasing and ordering.
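For the automatic split on larger datasets, a reproducible train/validation split is the core idea. A minimal sketch, where the 90/10 ratio and seed are illustrative choices, not the platform's actual defaults:

```python
import random

def split_dataset(samples, val_fraction=0.1, seed=42):
    """Shuffle with a fixed seed, then carve off a validation slice."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]   # train, validation

train, val = split_dataset(list(range(100)))
print(len(train), len(val))  # 90 10
```

Fixing the seed makes the split repeatable, so re-running training evaluates against the same held-out examples.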
Training Configuration
Training settings include:
Model size selection (larger models provide better quality)
Hugging Face token for authentication
Organization name for model storage
Weights & Biases key for training logs
Batch size (automatically set to 1 for very small datasets, increases with more data)
Number of epochs
Learning rate parameters
The platform provides recommended settings that can be reset with one click.
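For orientation, the settings above could be collected into a config like the following. Every value here is an assumption for a small dataset, not the platform's actual recommended settings.

```python
# Illustrative defaults for the training settings listed above
# (values are assumptions, not Trelis Studio's recommendations).
config = {
    "model_size": "small",     # larger models give better quality
    "batch_size": 1,           # auto-set to 1 for very small datasets
    "num_epochs": 10,
    "learning_rate": 1e-5,     # a typical magnitude for fine-tuning
    "warmup_steps": 50,
}
print(config["batch_size"])  # 1
```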
Training Pipeline Stages
The training process follows these stages:
Baseline evaluation: Tests the base model on validation data to establish starting performance
Training: Fine-tunes the model using the prepared dataset
Model conversion: Converts to CTranslate2 format for efficient deployment
Upload: Pushes both standard and CTranslate2 versions to Hugging Face
Post-training evaluation: Re-tests on validation data to measure improvement
Performance Monitoring
Weights & Biases tracks several metrics during training:
Training loss
Evaluation loss
Word error rate on validation set
Gradient norm
In the demonstration, a model trained for only 2 epochs regressed to a word error rate of 7.84%, worse than the baseline's 7.09%. Training with the recommended settings (more epochs) brought the word error rate to 7.08%, slightly better than the baseline despite the small dataset.
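The word error rate reported here is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal pure-Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") across six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1666...
```

So a WER of 7.09% means roughly 7 word-level errors per 100 reference words.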
Deployment
Trained models can be deployed to auto-scaling endpoints with configurable keep-warm periods. The keep-warm setting determines how long the server stays active after receiving a request before going to sleep.
First-time deployment takes 30-60 seconds. After initial caching, cold starts take 5-10 seconds. Longer keep-warm periods provide lower latency at higher cost, while shorter periods reduce costs but increase latency.
Endpoints are billed per second of active time. When no requests are incoming, containers automatically sleep to avoid charges.
API Access
Deployed models can be accessed via:
Web interface for quick testing
API keys for programmatic access
Python scripts
cURL commands
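Programmatic access generally amounts to an authenticated HTTP request with an audio file attached. The sketch below only assembles the pieces such a request would need; the endpoint URL, header name, and payload shape are assumptions for illustration, not the documented Trelis Studio API.

```python
def build_transcription_request(api_key: str, endpoint_url: str,
                                audio_path: str) -> dict:
    """Assemble what an HTTP client (requests, cURL) would send.
    All field names here are hypothetical."""
    return {
        "url": endpoint_url,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "files": {"audio": audio_path},
    }

req = build_transcription_request("sk-example",
                                  "https://example.com/transcribe",
                                  "meeting.wav")
print(req["headers"]["Authorization"])  # Bearer sk-example
```

The platform's example code shows the actual endpoint and payload for each deployed model.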
The platform provides example code for integration. Transcription results can be downloaded as plain text, VTT, or SRT formats.
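As a sketch of one download format, SRT numbers each segment and uses `HH:MM:SS,mmm` timestamps. The segment structure below (start, end, text) is an assumption for illustration.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((seconds - int(seconds)) * 1000))
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render (start, end, text) segments as numbered SRT blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{to_srt_timestamp(start)} --> "
                      f"{to_srt_timestamp(end)}\n{text}")
    return "\n\n".join(blocks)

print(segments_to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "Welcome.")]))
```

VTT is similar but uses `.` instead of `,` in timestamps and begins with a `WEBVTT` header.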
Evaluation Tool
A separate evaluation interface allows comparing different models on the same dataset. Users can:
Specify multiple model names for comparison
Select dataset and split
Configure batch size for faster inference
Limit evaluation to a subset of samples (useful for large datasets)
Push evaluation results to Hugging Face for detailed inspection
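Limiting evaluation to a subset can be sketched as sampling a fixed number of examples. The sampling strategy (seeded random draw rather than first-N) is an assumption here, not necessarily what the platform does.

```python
import random

def evaluation_subset(samples, limit, seed=0):
    """Draw a reproducible subset for faster evaluation on large datasets."""
    if limit >= len(samples):
        return list(samples)
    return random.Random(seed).sample(samples, limit)

subset = evaluation_subset(list(range(1000)), 50)
print(len(subset))  # 50
```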
Pricing
The platform offers $5 in free credits for new users. Pricing structure:
GPU training: $4.99 per hour (single GPU)
Data preparation: $4.99 per hour of cleaned audio plus text
Platform Access
Trelis Studio is available at studio.Trelis.com. The platform includes built-in feedback submission through a Help button and automated error reporting.
Future planned features include support for text-to-speech model training and enhanced data preparation capabilities.

