FAST Streaming Speech-to-Text Models
Kyutai STT vs Whisper
Whisper has long been the go-to speech-to-text model. Now Voxtral from Mistral is higher quality, and Kyutai’s STT - with a streaming dual-channel architecture - is faster.
I explain how to run inference locally or set up a remote Rust-based server that can handle simultaneous requests.
Cheers, Ronan
🤖 Purchase ADVANCED-audio Repo Access
Access all SEVEN Trelis GitHub Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via GitHub Issues & Trelis’ Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Streaming Speech-to-Text Demo with Kyutai STT
0:42 Demo in French
1:05 Video Overview
2:42 Resources & Repo
3:15 Running Kyutai STT on your Mac
5:15 Run streaming STT in a notebook
5:58 Word timestamping
8:52 Text and Audio Assisted Transcription
11:46 Fast STREAMING STT server with Rust
15:27 Streaming STT vs Whisper vs Voxtral
19:53 Theory of Timestamping
22:55 Whisper vs Kyutai STT architectures
24:34 How Kyutai is trained (with Whisper-timestamped data)
25:50 Wrap up
Real-Time Speech-to-Text: Comparing Kyutai, Whisper, and Voxtral Architectures
Kyutai's speech-to-text model offers distinct advantages for real-time transcription compared to alternatives like Whisper and Voxtral. The key differences lie in their architectural approaches and how they handle streaming audio.
Core Architectural Differences
Kyutai uses a decoder-only architecture that processes audio in parallel channels:
One channel handles incoming audio
A second channel generates text with a fixed delay (0.5s for the small model, 2s for the large model) - see the toy sketch after this list
Built-in voice activity detection for faster end-of-phrase processing
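To make the dual-channel idea concrete, here is a toy sketch of the delayed-streams layout. This is not Kyutai's actual token format, and the 12.5 Hz frame rate is an assumption based on the Mimi codec: the point is only that audio and text channels advance in lock-step, with text trailing by a fixed number of frames.

```python
# Toy sketch of a delayed dual-stream layout (NOT Kyutai's real token format).
# Assumption: 12.5 Hz audio frame rate (Mimi codec). The text channel trails
# the audio channel by a fixed delay, so each text token "sees" ~delay seconds
# of future audio without waiting for the utterance to end.

FRAME_RATE_HZ = 12.5
DELAY_S = 0.5                                  # small model's delay, per the post
DELAY_FRAMES = round(DELAY_S * FRAME_RATE_HZ)  # ~6 frames

audio_frames = [f"a{i}" for i in range(12)]    # incoming audio tokens
text_tokens = ["<pad>"] * DELAY_FRAMES + ["he", "llo", " wor", "ld", "<eos>"]

# One decoder-only transformer consumes both channels step by step.
for step, (a, t) in enumerate(zip(audio_frames, text_tokens)):
    print(f"step {step:2d}  audio={a:<4}  text={t}")
```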
In contrast:
Whisper uses an encoder-decoder architecture requiring full audio context
Voxtral employs a decoder-only approach but without the fixed delay buffer
Model Specifications
Kyutai offers two models:
Small model: ~1B parameters, 0.5s delay
Large model: 2.6B parameters, 2s delay
Training data: pre-trained on 2.5M hours of audio pseudo-labeled with Whisper-timestamped transcripts
Performance Characteristics
Latency and throughput advantages of Kyutai:
Autoregressive generation with fixed delay
Automatic word timestamping from delay alignment (sketched after this list)
Voice activity detection reduces end-of-phrase latency
Rust server implementation for high-speed processing
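Because the text stream runs a fixed delay behind the audio stream, a word's start time can be read straight off its emission step. A minimal sketch of the arithmetic, again assuming a 12.5 Hz frame rate:

```python
# Sketch: recover word start times from emission steps under a fixed delay.
# Assumption: 12.5 Hz frame rate (Mimi codec); adjust if the model differs.

FRAME_RATE_HZ = 12.5
DELAY_S = 2.0  # large model's delay, per the post

def word_start_time(emission_frame: int) -> float:
    """A word first emitted at frame k was spoken roughly DELAY_S earlier."""
    return max(0.0, emission_frame / FRAME_RATE_HZ - DELAY_S)

print(word_start_time(40))  # emitted at step 40 (3.2s) -> spoken at ~1.2s
```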
Whisper limitations:
Must recompute the full audio history when used for streaming (see the sketch after this list)
Higher computational overhead due to encoder-decoder architecture
Requires attention map analysis for word-level timestamps
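To see why, here is the naive way to stream with the openai-whisper package: re-transcribe the growing buffer on every new chunk. Each call re-encodes everything heard so far, so total compute grows roughly quadratically with audio length, whereas a decoder-only stream like Kyutai's only pays for the new frames. The whisper calls are real; the chunk loop is schematic.

```python
import numpy as np
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16_000                    # Whisper expects 16 kHz mono float32
buffer = np.zeros(0, dtype=np.float32)

def on_new_chunk(chunk: np.ndarray) -> str:
    """Naive streaming: append the chunk, then re-run Whisper on the FULL buffer.
    Every call re-encodes the entire history in 30s windows, so per-call cost
    keeps growing as the audio gets longer."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    return model.transcribe(buffer, fp16=False)["text"]

# Feed three 1-second chunks and watch each call redo all earlier work.
for _ in range(3):
    print(on_new_chunk(np.zeros(SAMPLE_RATE, dtype=np.float32)))
```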
Implementation Options
Local inference:
Supports CPU-only operation
MLX optimization available for Mac
PyTorch implementation for broader compatibility (see the sketch after this list)
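As one concrete local route (an alternative to the scripts in Kyutai's own repo), recent Hugging Face transformers releases ship a Kyutai STT integration. A minimal sketch; the class names and the `-trfs` checkpoint id follow that integration but are assumptions here - verify them against the model card before relying on this:

```python
# Sketch: offline transcription via the Hugging Face transformers integration.
# Assumption: class names and checkpoint id per the HF Kyutai STT integration
# (transformers >= 4.53) -- double-check against the kyutai model cards.
import soundfile as sf
from transformers import (
    KyutaiSpeechToTextProcessor,
    KyutaiSpeechToTextForConditionalGeneration,
)

model_id = "kyutai/stt-1b-en_fr-trfs"  # transformers-format checkpoint (assumed id)
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # CPU-only also works, slower
)

audio, sr = sf.read("sample.wav")   # mono float; model expects 24 kHz (Mimi)
inputs = processor(audio).to(model.device)
tokens = model.generate(**inputs)
print(processor.batch_decode(tokens, skip_special_tokens=True)[0])
```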
Server deployment:
Rust server implementation
Exposes TCP port 8080
Supports real-time audio streaming (client sketch after this list)
Configurable real-time factor for faster processing
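A minimal client sketch against a locally running server. The port comes from the post; the websocket route and message schema below are placeholders I have assumed - check the moshi-server documentation for the actual streaming protocol before using this.

```python
# Sketch: stream PCM chunks to the Rust server and print results as they arrive.
# ASSUMPTIONS: the /api/asr-streaming route and the msgpack message shape are
# placeholders -- consult moshi-server's docs for the real protocol.
import asyncio
import numpy as np
import msgpack     # pip install msgpack
import websockets  # pip install websockets

URL = "ws://localhost:8080/api/asr-streaming"  # port 8080 per the post; path assumed

async def stream(pcm: np.ndarray, chunk: int = 1920):  # 1920 samples = 80ms at 24 kHz
    async with websockets.connect(URL) as ws:
        for i in range(0, len(pcm), chunk):
            payload = {"type": "Audio", "pcm": pcm[i : i + chunk].tolist()}  # assumed shape
            await ws.send(msgpack.packb(payload))
        async for msg in ws:  # the server pushes words back as it decodes
            print(msgpack.unpackb(msg))

asyncio.run(stream(np.zeros(24_000, dtype=np.float32)))  # 1s of silence at 24 kHz
```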
Language Support
Current language capabilities:
English
French
Practical Considerations
Accuracy trade-offs:
Quality limited by Whisper-based training data
Main advantage is streaming performance, not transcription accuracy
Fixed delay provides stability in output compared to Voxtral's immediate transcription
Error correction:
Supports text and audio-assisted correction
Requires both audio sample and correct text for optimal results
Useful for handling domain-specific terminology

