Train and Deploy Transcription Models - Zero to Hero
Introducing Trelis Studio - an early release
This is something I’ve been working on for the past two weeks to make voice model data preparation, training and deployment (serving) much easier.
This is relevant for developers trying to train transcription models (and soon text to speech models too).
Data quality matters a lot to get good transcription results (or synthetic voices). So do training hyperparameters (which can be fiddly to set).
This isn’t quite at the level where someone with no knowledge of models could use it, but it makes things much easier for developers - and it builds on familiar platforms for reviewing data quality (Hugging Face datasets and Weights & Biases).
Cheers, Ronan
🤖 Project link - Trelis Studio
🗝️ Trelis All Access (7 Github Repos, Support via Github Issues & Private Discord)
Timestamps:
0:00 Introduction and overview of pipeline
1:11 Data requirements: audio recordings and transcripts
2:18 Uploading audio-text pairs and dataset preparation
3:17 Saving word swaps and transcribing with model for better training
4:20 Warning about clean text being out-of-distribution for small models
5:20 Setting up Hugging Face token and Weights & Biases key
6:21 Creating validation set using ChatGPT to rephrase text
7:40 Configuring training settings and advanced parameters
8:41 Baseline evaluation shows 7.09% word error rate
9:46 Training begins with falling loss and word error rate
10:47 Model training progress and high grad norm observation
11:53 Model and logs pushed to Hugging Face Hub
12:53 Inspecting evaluation results and specific corrections
14:11 Spelling improvements and regressions in fine-tuned model
15:16 Deploying model to endpoint with keep warm feature
16:17 Auto-sleep containers and API key access options
17:23 Testing endpoint and transcript download formats
18:26 Evaluation tab features and future text-to-speech plans
Training and Deploying Custom Whisper Models with Trelis Studio
Trelis Studio is a platform for training and deploying voice models, starting with transcription models based on Whisper. The platform handles the complete workflow: data preparation, model training, evaluation, and deployment to auto-scaling endpoints.
When to Fine-Tune a Transcription Model
Fine-tuning is relevant when standard transcription produces errors. Common causes include:
Specific accents that the base model handles poorly
Domain-specific vocabulary not well-represented in the base model’s training data
Specialized terminology or proper nouns
Data Requirements
Training requires paired audio and text data. The audio should ideally come from the same speaker, or speakers similar to those in the production use case.
If transcripts don’t exist, Trelis Studio can generate them using a third-party endpoint (Fireworks). Users can then review and correct the generated transcripts.
The platform will later support generating synthetic training data from text alone using voice synthesis.
Data Preparation Workflow
The data preparation interface accepts audio files with or without corresponding transcripts. For files without transcripts, users can:
Generate transcripts automatically via the transcribe button
Review transcripts using keyboard shortcuts (Option + arrow keys on Mac, Alt + arrow keys on Windows move 5 seconds forward/backward)
Make manual corrections
Save word swaps for reuse on future datasets
Word swaps are stored and can be applied to clean up subsequent transcriptions, improving efficiency over time.
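A saved word swap is essentially a mapping of recurring mis-transcriptions to their corrections. As a rough sketch (the swap entries below are hypothetical; Trelis Studio maintains its own store), applying them could look like:

```python
import re

def apply_word_swaps(transcript: str, swaps: dict[str, str]) -> str:
    """Replace whole-word matches so saved corrections carry over."""
    for wrong, right in swaps.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right, transcript)
    return transcript

# Hypothetical swaps: corrections saved from earlier review sessions.
swaps = {"trellis": "Trelis", "whisper": "Whisper"}
print(apply_word_swaps("The trellis whisper model", swaps))
# The Trelis Whisper model
```

Because the swaps are whole-word replacements, they can be re-applied safely to every new batch of transcripts.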
The platform processes uploaded data by:
Aligning audio with text
Splitting content into 20-30 second chunks
Pushing the processed dataset to Hugging Face
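The chunking step can be sketched as grouping aligned segments until a chunk reaches roughly the 20-30 second target. The segment format below (start, end, text) is an assumption for illustration; the actual pipeline may differ.

```python
def chunk_segments(segments, max_len=30.0):
    """Group (start, end, text) segments into chunks of at most max_len seconds."""
    chunks, current, start = [], [], None
    for seg_start, seg_end, text in segments:
        if start is None:
            start = seg_start
        if seg_end - start > max_len and current:
            chunks.append(" ".join(current))   # flush the current chunk
            current, start = [], seg_start
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks

segments = [(0.0, 12.0, "first part"), (12.0, 24.0, "second part"),
            (24.0, 36.0, "third part")]
print(chunk_segments(segments))
# ['first part second part', 'third part']
```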
For smaller models, using transcribed-then-corrected text may perform better than clean manual transcripts because the formatting matches the model’s distribution more closely.
Creating Validation Sets
The platform can automatically split large datasets into training and validation portions. For smaller datasets where each word appears only once, separate validation sets are needed.
One approach demonstrated: take training text, use ChatGPT to rephrase while maintaining meaning and vocabulary, then record audio reading the rephrased text. This creates validation data with the same vocabulary but different phrasing and ordering.
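For the automatic split on larger datasets, a reproducible train/validation split is the core idea. A minimal sketch, where the 90/10 ratio and seed are illustrative choices, not the platform's actual defaults:

```python
import random

def split_dataset(samples, val_fraction=0.1, seed=42):
    """Shuffle with a fixed seed, then carve off a validation slice."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]   # train, validation

train, val = split_dataset(list(range(100)))
print(len(train), len(val))  # 90 10
```

Fixing the seed makes the split repeatable, so re-running training evaluates against the same held-out examples.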
Training Configuration
Training settings include:
Model size selection (larger models provide better quality)
Hugging Face token for authentication
Organization name for model storage
Weights & Biases key for training logs
Batch size (automatically set to 1 for very small datasets, increases with more data)
Number of epochs
Learning rate parameters
The platform provides recommended settings that can be reset with one click.
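For orientation, the settings above could be collected into a config like the following. Every value here is an assumption for a small dataset, not the platform's actual recommended settings.

```python
# Illustrative defaults for the training settings listed above
# (values are assumptions, not Trelis Studio's recommendations).
config = {
    "model_size": "small",     # larger models give better quality
    "batch_size": 1,           # auto-set to 1 for very small datasets
    "num_epochs": 10,
    "learning_rate": 1e-5,     # a typical magnitude for fine-tuning
    "warmup_steps": 50,
}
print(config["batch_size"])  # 1
```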
Training Pipeline Stages
The training process follows these stages:
Baseline evaluation: Tests the base model on validation data to establish starting performance
Training: Fine-tunes the model using the prepared dataset
Model conversion: Converts to CTranslate2 format for efficient deployment
Upload: Pushes both standard and CTranslate2 versions to Hugging Face
Post-training evaluation: Re-tests on validation data to measure improvement
Performance Monitoring
Weights & Biases tracks several metrics during training:
Training loss
Evaluation loss
Word error rate on validation set
Gradient norm
In the demonstration, a model trained for only 2 epochs regressed to a word error rate of 7.84%, worse than the baseline's 7.09%. Training with the recommended settings (more epochs) brought the word error rate to 7.08%, slightly better than the baseline despite the small dataset.
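The word error rate reported here is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal pure-Python sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") across six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1666...
```

So a WER of 7.09% means roughly 7 word-level errors per 100 reference words.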
Deployment
Trained models can be deployed to auto-scaling endpoints with configurable keep-warm periods. The keep-warm setting determines how long the server stays active after receiving a request before going to sleep.
First-time deployment takes 30-60 seconds. After initial caching, cold starts take 5-10 seconds. Longer keep-warm periods provide lower latency at higher cost, while shorter periods reduce costs but increase latency.
Endpoints are billed per second of active time. When no requests are incoming, containers automatically sleep to avoid charges.
API Access
Deployed models can be accessed via:
Web interface for quick testing
API keys for programmatic access
Python scripts
cURL commands
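Programmatic access generally amounts to an authenticated HTTP request with an audio file attached. The sketch below only assembles the pieces such a request would need; the endpoint URL, header name, and payload shape are assumptions for illustration, not the documented Trelis Studio API.

```python
def build_transcription_request(api_key: str, endpoint_url: str,
                                audio_path: str) -> dict:
    """Assemble what an HTTP client (requests, cURL) would send.
    All field names here are hypothetical."""
    return {
        "url": endpoint_url,
        "headers": {"Authorization": f"Bearer {api_key}"},
        "files": {"audio": audio_path},
    }

req = build_transcription_request("sk-example",
                                  "https://example.com/transcribe",
                                  "meeting.wav")
print(req["headers"]["Authorization"])  # Bearer sk-example
```

The platform's example code shows the actual endpoint and payload for each deployed model.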
The platform provides example code for integration. Transcription results can be downloaded as plain text, VTT, or SRT formats.
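As a sketch of one download format, SRT numbers each segment and uses `HH:MM:SS,mmm` timestamps. The segment structure below (start, end, text) is an assumption for illustration.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((seconds - int(seconds)) * 1000))
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render (start, end, text) segments as numbered SRT blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{to_srt_timestamp(start)} --> "
                      f"{to_srt_timestamp(end)}\n{text}")
    return "\n\n".join(blocks)

print(segments_to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "Welcome.")]))
```

VTT is similar but uses `.` instead of `,` in timestamps and begins with a `WEBVTT` header.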
Evaluation Tool
A separate evaluation interface allows comparing different models on the same dataset. Users can:
Specify multiple model names for comparison
Select dataset and split
Configure batch size for faster inference
Limit evaluation to a subset of samples (useful for large datasets)
Push evaluation results to Hugging Face for detailed inspection
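Limiting evaluation to a subset can be sketched as sampling a fixed number of examples. The sampling strategy (seeded random draw rather than first-N) is an assumption here, not necessarily what the platform does.

```python
import random

def evaluation_subset(samples, limit, seed=0):
    """Draw a reproducible subset for faster evaluation on large datasets."""
    if limit >= len(samples):
        return list(samples)
    return random.Random(seed).sample(samples, limit)

subset = evaluation_subset(list(range(1000)), 50)
print(len(subset))  # 50
```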
Pricing
The platform offers $5 in free credits for new users. Pricing structure:
GPU training: $4.99 per hour (single GPU)
Data preparation: $4.99 per hour of cleaned audio plus text
Platform Access
Trelis Studio is available at studio.Trelis.com. The platform includes built-in feedback submission through a Help button and automated error reporting.
Future planned features include support for text-to-speech model training and enhanced data preparation capabilities.

