FAST Streaming Speech-to-Text Models
Kyutai STT vs Whisper
Whisper has long been the go-to speech-to-text model. Now Voxtral from Mistral is higher quality, and Kyutai’s STT - with a streaming dual-channel architecture - is faster.
I explain how to run inference locally or set up a remote Rust-based server that can handle simultaneous requests.
Cheers, Ronan
🤖 Purchase ADVANCED-audio Repo Access
Access all SEVEN Trelis GitHub Repos (-robotics, -vision, -evals, -fine-tuning, -inference, -voice, -time-series)
Support via GitHub Issues & Trelis’ Private Discord
Early access to Trelis videos via Discord
TIMESTAMPS:
0:00 Streaming Speech-to-Text Demo with Kyutai STT
0:42 Demo in French
1:05 Video Overview
2:42 Resources & Repo
3:15 Running Kyutai STT on your Mac
5:15 Run streaming STT in a notebook
5:58 Word timestamping
8:52 Text and Audio Assisted Transcription
11:46 Fast STREAMING STT server with Rust
15:27 Streaming STT vs Whisper vs Voxtral
19:53 Theory of Timestamping
22:55 Whisper vs Kyutai STT architectures
24:34 How Kyutai is trained (with Whisper-timestamped data)
25:50 Wrap up
Real-Time Speech-to-Text: Comparing Kyutai, Whisper, and Voxtral Architectures
Kyutai's speech-to-text model offers distinct advantages for real-time transcription compared to alternatives like Whisper and Voxtral. The key differences lie in their architectural approaches and how they handle streaming audio.
Core Architectural Differences
Kyutai uses a decoder-only architecture that processes audio in parallel channels:
One channel handles incoming audio
A second channel generates text with a fixed delay (0.5s for the small model, 2s for the large model) - see the toy sketch after this list
Built-in voice activity detection for faster end-of-phrase processing
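To make the dual-channel idea concrete, here is a toy sketch of the delayed-streams layout. This is not Kyutai's actual token format, and the 12.5 Hz frame rate is an assumption based on the Mimi codec: the point is only that audio and text channels advance in lock-step, with text trailing by a fixed number of frames.

```python
# Toy sketch of a delayed dual-stream layout (NOT Kyutai's real token format).
# Assumption: 12.5 Hz audio frame rate (Mimi codec). The text channel trails
# the audio channel by a fixed delay, so each text token "sees" ~delay seconds
# of future audio without waiting for the utterance to end.

FRAME_RATE_HZ = 12.5
DELAY_S = 0.5                                  # small model's delay, per the post
DELAY_FRAMES = round(DELAY_S * FRAME_RATE_HZ)  # ~6 frames

audio_frames = [f"a{i}" for i in range(12)]    # incoming audio tokens
text_tokens = ["<pad>"] * DELAY_FRAMES + ["he", "llo", " wor", "ld", "<eos>"]

# One decoder-only transformer consumes both channels step by step.
for step, (a, t) in enumerate(zip(audio_frames, text_tokens)):
    print(f"step {step:2d}  audio={a:<4}  text={t}")
```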
In contrast:
Whisper uses an encoder-decoder architecture requiring full audio context
Voxtral employs a decoder-only approach but without the fixed delay buffer
Model Specifications
Kyutai offers two models:
Small model: ~1B parameters, 0.5s delay
Large model: 2.6B parameters, 2s delay
Training data: pre-trained on 2.5M hours of audio pseudo-labeled with Whisper-timestamped transcripts
Performance Characteristics
Latency and throughput advantages of Kyutai:
Autoregressive generation with fixed delay
Automatic word timestamping from delay alignment (sketched after this list)
Voice activity detection reduces end-of-phrase latency
Rust server implementation for high-speed processing
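Because the text stream runs a fixed delay behind the audio stream, a word's start time can be read straight off its emission step. A minimal sketch of the arithmetic, again assuming a 12.5 Hz frame rate:

```python
# Sketch: recover word start times from emission steps under a fixed delay.
# Assumption: 12.5 Hz frame rate (Mimi codec); adjust if the model differs.

FRAME_RATE_HZ = 12.5
DELAY_S = 2.0  # large model's delay, per the post

def word_start_time(emission_frame: int) -> float:
    """A word first emitted at frame k was spoken roughly DELAY_S earlier."""
    return max(0.0, emission_frame / FRAME_RATE_HZ - DELAY_S)

print(word_start_time(40))  # emitted at step 40 (3.2s) -> spoken at ~1.2s
```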
Whisper limitations:
Must recompute the full audio history when used for streaming (see the sketch after this list)
Higher computational overhead due to encoder-decoder architecture
Requires attention map analysis for word-level timestamps
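To see why, here is the naive way to stream with the openai-whisper package: re-transcribe the growing buffer on every new chunk. Each call re-encodes everything heard so far, so total compute grows roughly quadratically with audio length, whereas a decoder-only stream like Kyutai's only pays for the new frames. The whisper calls are real; the chunk loop is schematic.

```python
import numpy as np
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
SAMPLE_RATE = 16_000                    # Whisper expects 16 kHz mono float32
buffer = np.zeros(0, dtype=np.float32)

def on_new_chunk(chunk: np.ndarray) -> str:
    """Naive streaming: append the chunk, then re-run Whisper on the FULL buffer.
    Every call re-encodes the entire history in 30s windows, so per-call cost
    keeps growing as the audio gets longer."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    return model.transcribe(buffer, fp16=False)["text"]

# Feed three 1-second chunks and watch each call redo all earlier work.
for _ in range(3):
    print(on_new_chunk(np.zeros(SAMPLE_RATE, dtype=np.float32)))
```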
Implementation Options
Local inference:
Supports CPU-only operation
MLX optimization available for Mac
PyTorch implementation for broader compatibility (see the sketch after this list)
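As one concrete local route (an alternative to the scripts in Kyutai's own repo), recent Hugging Face transformers releases ship a Kyutai STT integration. A minimal sketch; the class names and the `-trfs` checkpoint id follow that integration but are assumptions here - verify them against the model card before relying on this:

```python
# Sketch: offline transcription via the Hugging Face transformers integration.
# Assumption: class names and checkpoint id per the HF Kyutai STT integration
# (transformers >= 4.53) -- double-check against the kyutai model cards.
import soundfile as sf
from transformers import (
    KyutaiSpeechToTextProcessor,
    KyutaiSpeechToTextForConditionalGeneration,
)

model_id = "kyutai/stt-1b-en_fr-trfs"  # transformers-format checkpoint (assumed id)
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # CPU-only also works, slower
)

audio, sr = sf.read("sample.wav")   # mono float; model expects 24 kHz (Mimi)
inputs = processor(audio).to(model.device)
tokens = model.generate(**inputs)
print(processor.batch_decode(tokens, skip_special_tokens=True)[0])
```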
Server deployment:
Rust server implementation
Exposes TCP port 8080
Supports real-time audio streaming (client sketch after this list)
Configurable real-time factor for faster processing
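A minimal client sketch against a locally running server. The port comes from the post; the websocket route and message schema below are placeholders I have assumed - check the moshi-server documentation for the actual streaming protocol before using this.

```python
# Sketch: stream PCM chunks to the Rust server and print results as they arrive.
# ASSUMPTIONS: the /api/asr-streaming route and the msgpack message shape are
# placeholders -- consult moshi-server's docs for the real protocol.
import asyncio
import numpy as np
import msgpack     # pip install msgpack
import websockets  # pip install websockets

URL = "ws://localhost:8080/api/asr-streaming"  # port 8080 per the post; path assumed

async def stream(pcm: np.ndarray, chunk: int = 1920):  # 1920 samples = 80ms at 24 kHz
    async with websockets.connect(URL) as ws:
        for i in range(0, len(pcm), chunk):
            payload = {"type": "Audio", "pcm": pcm[i : i + chunk].tolist()}  # assumed shape
            await ws.send(msgpack.packb(payload))
        async for msg in ws:  # the server pushes words back as it decodes
            print(msgpack.unpackb(msg))

asyncio.run(stream(np.zeros(24_000, dtype=np.float32)))  # 1s of silence at 24 kHz
```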
Language Support
Current language capabilities:
English
French
Practical Considerations
Accuracy trade-offs:
Quality limited by Whisper-based training data
Main advantage is streaming performance, not transcription accuracy
Fixed delay provides stability in output compared to Voxtral's immediate transcription
Error correction:
Supports text and audio-assisted correction
Requires both audio sample and correct text for optimal results
Useful for handling domain-specific terminology

