As far as I know, running a continuous-batching server with vLLM isn’t turnkey for text-to-speech models.
In principle, vLLM supports Orpheus, but the generated tokens come out garbled: the output tokens are decoded with the text tokeniser rather than the SNAC audio tokeniser.
I’ve addressed this by re-encoding the text output back into token IDs and then decoding them with SNAC. This gives you a high-throughput vLLM endpoint for a TTS model.
Cheers, Ronan
Timestamps:
0:00 Serving Orpheus Text-to-Speech model with continuous batching
0:44 Setup Demo with a one-click template from Runpod
4:12 Running inference on a fine-tuned model (poor quality, maybe don’t use fp8, and tune more)
5:25 Inference on the default Orpheus model with the “tara” voice
7:37 How vLLM works with Orpheus and how to decode audio tokens
12:38 Conclusion and Resources
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
High-Throughput Text-to-Speech Using vLLM and Orpheus
vLLM enables efficient text-to-speech inference on GPUs through continuous batching. Here's how to implement it with the Orpheus model for production use.
Setup Requirements
vLLM's latest Docker image
Orpheus fine-tuned (FT) model (available via Unsloth, no Hugging Face token needed)
GPU with sufficient VRAM (H100 recommended for faster-than-real-time processing)
Configuration Details
Key parameters (see the configuration sketch after this list):
Max model length: 2048 tokens
Quantization: FP8 precision
Trust remote code: Enabled for tokenizer loading
Concurrency: Up to 300 requests with 2048 sequence length
Port: 8000 (OpenAI-style API)
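The video launches this via vLLM's Docker image, which exposes an OpenAI-style server on port 8000. The same settings also map onto vLLM's Python API; a minimal sketch, assuming the Unsloth checkpoint name unsloth/orpheus-3b-0.1-ft (substitute whatever model you are actually serving):

```python
# Sketch of the configuration above via vLLM's Python API; the Docker image
# used in the video exposes equivalent flags on the OpenAI-style server.
from vllm import LLM

llm = LLM(
    model="unsloth/orpheus-3b-0.1-ft",  # assumed Unsloth Orpheus FT checkpoint
    quantization="fp8",                 # FP8 precision (see the quality caveat above)
    max_model_len=2048,                 # max model length of 2048 tokens
    trust_remote_code=True,             # required for tokenizer loading
    max_num_seqs=300,                   # up to 300 concurrent sequences
)
```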
Implementation Process
The workflow involves several key steps:
Text Tokenization
Input text is formatted with voice prefix (e.g., "Ronan:" or "tara:")
Special start/end tokens are added
vLLM tokenizes the prompt using the text tokenizer (see the request sketch below)
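A sketch of that request flow against the running server. The endpoint and parameters follow vLLM's OpenAI-style completions API; the specific special-token IDs (128259 start, 128009/128260 end) follow the Orpheus reference code and should be treated as assumptions if your checkpoint was trained differently:

```python
# Sketch: build a voice-prefixed prompt with Orpheus-style special tokens and
# send it to the vLLM OpenAI-style endpoint on port 8000.
import requests
from transformers import AutoTokenizer

MODEL = "unsloth/orpheus-3b-0.1-ft"           # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)

voice, text = "tara", "Hello there, how are you today?"
ids = tok(f"{voice}: {text}").input_ids        # text tokenizer handles the prompt
ids = [128259] + ids + [128009, 128260]        # add start / end special tokens
prompt = tok.decode(ids)                       # back to a string for the HTTP API

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": MODEL, "prompt": prompt,
          "max_tokens": 1200, "temperature": 0.6, "top_p": 0.9},
)
generated_text = resp.json()["choices"][0]["text"]  # contains <custom_token_N> strings
```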
Token Processing
vLLM generates token IDs
Tokens are initially decoded using text tokenizer
Solution:
Re-encoding the decoded text recovers the correct audio token IDs
Token IDs reorganized into groups of 7 (Orpheus requirement)
SNAC model decodes the reorganized tokens into audio
Output saved as a WAV file (see the decoding sketch below)
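A sketch of that decode path, continuing from generated_text and tok in the request sketch above. The offsets (audio-token base 128266, plus 4096 per position within each group of 7) follow the Orpheus reference decoding and are assumptions for other checkpoints:

```python
# Sketch of post-processing: re-encode text output, regroup into frames of 7,
# decode with SNAC, and save a WAV file.
import torch
import soundfile as sf
from snac import SNAC  # pip install snac

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# 1. Re-encode the decoded text back into token IDs with the text tokenizer.
ids = tok(generated_text).input_ids
# 2. Keep only audio tokens and shift them into SNAC code space.
codes = [t - 128266 for t in ids if t >= 128266]
codes = codes[: (len(codes) // 7) * 7]         # trim to whole frames of 7

# 3. Reorganize each group of 7 into SNAC's three codebook layers.
l1, l2, l3 = [], [], []
for i in range(0, len(codes), 7):
    f = codes[i : i + 7]
    l1.append(f[0])
    l2 += [f[1] - 4096, f[4] - 4 * 4096]
    l3 += [f[2] - 2 * 4096, f[3] - 3 * 4096, f[5] - 5 * 4096, f[6] - 6 * 4096]

layers = [torch.tensor(l).unsqueeze(0) for l in (l1, l2, l3)]
with torch.no_grad():
    audio = snac_model.decode(layers)          # (1, 1, samples) at 24 kHz

sf.write("output.wav", audio.squeeze().numpy(), 24000)
```

The regrouping step is the “groups of 7” requirement above: each 7-token frame splits into SNAC’s three codebook layers (1 + 2 + 4 codes) before decoding to 24 kHz audio.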
Performance Metrics
LLM inference: 4-6 seconds per request
Post-processing decode: ~1.7 seconds
H100 GPU can achieve faster-than-real-time processing
Supports concurrent processing of up to 300 requests at a 2048-token sequence length (see the concurrency sketch below)
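A sketch of exercising that concurrency from the client side, reusing MODEL and prompt from the request sketch above; 300 here simply mirrors the server’s max-concurrency setting:

```python
# Sketch: fire many requests at once so continuous batching can kick in.
import asyncio
import httpx

async def synthesize(client, prompt):
    r = await client.post(
        "http://localhost:8000/v1/completions",
        json={"model": MODEL, "prompt": prompt,
              "max_tokens": 1200, "temperature": 0.6, "top_p": 0.9},
        timeout=300,
    )
    return r.json()["choices"][0]["text"]

async def main():
    limits = httpx.Limits(max_connections=300)
    async with httpx.AsyncClient(limits=limits) as client:
        results = await asyncio.gather(*[synthesize(client, prompt) for _ in range(300)])
    print(f"Received {len(results)} completions")

asyncio.run(main())
```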