As far as I know, running a continuous-batching server with vLLM isn’t turnkey for text-to-speech models.
In principle, vLLM supports Orpheus, but the generated tokens come out garbled: the output tokens are decoded with the text tokeniser rather than the SNAC audio tokeniser.
I’ve addressed this by re-encoding the text output back into token IDs and then decoding them with SNAC. This gives you a high-throughput vLLM endpoint for a TTS model.
Cheers, Ronan
Timestamps:
0:00 Serving Orpheus Text-to-Speech model with continuous batching
0:44 Setup Demo with a one-click template from Runpod
4:12 Running inference on a fine-tuned model (poor quality, maybe don’t use fp8, and tune more)
5:25 Inference on the default Orpheus model with the “tara” voice
7:37 How vLLM works with Orpheus and how to decode audio tokens
12:38 Conclusion and Resources
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
High-Throughput Text-to-Speech Using vLLM and Orpheus
vLLM enables efficient text-to-speech inference on GPUs through continuous batching. Here's how to implement it with the Orpheus model for production use.
Setup Requirements
vLLM's latest Docker image
Orpheus fine-tuned (FT) model (available via Unsloth, no Hugging Face token needed)
GPU with sufficient VRAM (H100 recommended for faster-than-real-time processing)
Configuration Details
Key parameters (see the configuration sketch after this list):
Max model length: 2048 tokens
Quantization: FP8 precision
Trust remote code: Enabled for tokenizer loading
Concurrency: Up to 300 requests with 2048 sequence length
Port: 8000 (OpenAI-style API)
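The video launches this via vLLM's Docker image, which exposes an OpenAI-style server on port 8000. The same settings also map onto vLLM's Python API; a minimal sketch, assuming the Unsloth checkpoint name unsloth/orpheus-3b-0.1-ft (substitute whatever model you are actually serving):

```python
# Sketch of the configuration above via vLLM's Python API; the Docker image
# used in the video exposes equivalent flags on the OpenAI-style server.
from vllm import LLM

llm = LLM(
    model="unsloth/orpheus-3b-0.1-ft",  # assumed Unsloth Orpheus FT checkpoint
    quantization="fp8",                 # FP8 precision (see the quality caveat above)
    max_model_len=2048,                 # max model length of 2048 tokens
    trust_remote_code=True,             # required for tokenizer loading
    max_num_seqs=300,                   # up to 300 concurrent sequences
)
```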
Implementation Process
The workflow involves several key steps:
Text Tokenization
Input text is formatted with voice prefix (e.g., "Ronan:" or "tara:")
Special start/end tokens are added
vLLM tokenizes the prompt using the text tokenizer (see the request sketch below)
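A sketch of that request flow against the running server. The endpoint and parameters follow vLLM's OpenAI-style completions API; the specific special-token IDs (128259 start, 128009/128260 end) follow the Orpheus reference code and should be treated as assumptions if your checkpoint was trained differently:

```python
# Sketch: build a voice-prefixed prompt with Orpheus-style special tokens and
# send it to the vLLM OpenAI-style endpoint on port 8000.
import requests
from transformers import AutoTokenizer

MODEL = "unsloth/orpheus-3b-0.1-ft"           # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)

voice, text = "tara", "Hello there, how are you today?"
ids = tok(f"{voice}: {text}").input_ids        # text tokenizer handles the prompt
ids = [128259] + ids + [128009, 128260]        # add start / end special tokens
prompt = tok.decode(ids)                       # back to a string for the HTTP API

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": MODEL, "prompt": prompt,
          "max_tokens": 1200, "temperature": 0.6, "top_p": 0.9},
)
generated_text = resp.json()["choices"][0]["text"]  # contains <custom_token_N> strings
```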
Token Processing
vLLM generates token IDs
Tokens are initially decoded using text tokenizer
Solution:
Re-encoding the decoded text recovers the correct audio token IDs
Token IDs reorganized into groups of 7 (Orpheus requirement)
SNAC model decodes the reorganized tokens into audio
Output saved as a WAV file (see the decoding sketch below)
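A sketch of that decode path, continuing from generated_text and tok in the request sketch above. The offsets (audio-token base 128266, plus 4096 per position within each group of 7) follow the Orpheus reference decoding and are assumptions for other checkpoints:

```python
# Sketch of post-processing: re-encode text output, regroup into frames of 7,
# decode with SNAC, and save a WAV file.
import torch
import soundfile as sf
from snac import SNAC  # pip install snac

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# 1. Re-encode the decoded text back into token IDs with the text tokenizer.
ids = tok(generated_text).input_ids
# 2. Keep only audio tokens and shift them into SNAC code space.
codes = [t - 128266 for t in ids if t >= 128266]
codes = codes[: (len(codes) // 7) * 7]         # trim to whole frames of 7

# 3. Reorganize each group of 7 into SNAC's three codebook layers.
l1, l2, l3 = [], [], []
for i in range(0, len(codes), 7):
    f = codes[i : i + 7]
    l1.append(f[0])
    l2 += [f[1] - 4096, f[4] - 4 * 4096]
    l3 += [f[2] - 2 * 4096, f[3] - 3 * 4096, f[5] - 5 * 4096, f[6] - 6 * 4096]

layers = [torch.tensor(l).unsqueeze(0) for l in (l1, l2, l3)]
with torch.no_grad():
    audio = snac_model.decode(layers)          # (1, 1, samples) at 24 kHz

sf.write("output.wav", audio.squeeze().numpy(), 24000)
```

The regrouping step is the “groups of 7” requirement above: each 7-token frame splits into SNAC’s three codebook layers (1 + 2 + 4 codes) before decoding to 24 kHz audio.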
Performance Metrics
LLM inference: 4-6 seconds per request
Post-processing decode: ~1.7 seconds
H100 GPU can achieve faster-than-real-time processing
Supports concurrent processing of up to 300 requests at a 2048-token sequence length (see the concurrency sketch below)
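A sketch of exercising that concurrency from the client side, reusing MODEL and prompt from the request sketch above; 300 here simply mirrors the server’s max-concurrency setting:

```python
# Sketch: fire many requests at once so continuous batching can kick in.
import asyncio
import httpx

async def synthesize(client, prompt):
    r = await client.post(
        "http://localhost:8000/v1/completions",
        json={"model": MODEL, "prompt": prompt,
              "max_tokens": 1200, "temperature": 0.6, "top_p": 0.9},
        timeout=300,
    )
    return r.json()["choices"][0]["text"]

async def main():
    limits = httpx.Limits(max_connections=300)
    async with httpx.AsyncClient(limits=limits) as client:
        results = await asyncio.gather(*[synthesize(client, prompt) for _ in range(300)])
    print(f"Received {len(results)} completions")

asyncio.run(main())
```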