Open-source end-to-end models that natively handle text and audio (and, most likely, images soon) are emerging. Llama 4 is expected to have these capabilities, making it a capable multimodal model along the lines of GPT-4o.
These models are highly versatile: they can do text-to-speech (TTS), speech-to-text (STT) and speech-to-speech. Last week, OpenAI released specialised fine-tunes of GPT-4o and 4o-mini for TTS and STT.
What I expect now is significant downward pressure on the pricing of audio services. On a token-for-token basis, audio applications remain very expensive compared to text-based models.
I show how the architecture of these multimodal models explains the pricing dynamics in the market for TTS, STT and conversational AI. I also compare today's pricing, and what I think it could become, for OpenAI, Deepgram and ElevenLabs.
Cheers, Ronan
Trelis Links
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Video Links:
- ChatGPT Chat: https://chatgpt.com/share/67e19d1b-245c-8003-bbe2-c2b4fe115a1f
- Canopy Labs: https://canopylabs.ai/model-releases
- OpenAI Pricing: https://platform.openai.com/docs/pricing#transcription-and-speech-generation
- OpenAI API Pricing: https://openai.com/api/pricing/
- Deepgram: https://deepgram.com/pricing
- ElevenLabs: https://elevenlabs.io/pricing
- Fireworks: https://fireworks.ai/pricing#overview
TIMESTAMPS:
00:00 Introduction to Audio Service Pricing
00:41 Understanding Audio Tokens
02:30 Comparing Audio Service Providers
03:27 OpenAI Pricing Analysis
08:39 Deepgram and ElevenLabs Pricing
12:48 Future Pricing Trends
15:08 Conclusion and Final Thoughts
Audio AI Pricing Analysis: OpenAI vs Deepgram vs ElevenLabs
Token Economics of Audio vs Text
Audio processing requires significantly more tokens than text processing. Speech at 120 words per minute corresponds to roughly 2 text tokens per second, whereas an audio representation needs approximately 100 tokens per second to capture both semantics and acoustics. For example, the Moshi model uses 12.5 frames per second with 8 tokens per frame, which gives 12.5 × 8 = 100 tokens per second.
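The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration of the numbers quoted above: the function and constant names are made up for this example, and the one-token-per-word ratio is an assumption chosen to match the "roughly 2 text tokens per second" figure.

```python
# Figures quoted in the text; names are illustrative only.
WORDS_PER_MINUTE = 120       # speaking rate used in the text
TEXT_TOKENS_PER_WORD = 1.0   # assumption: about one text token per word
FRAMES_PER_SECOND = 12.5     # Moshi rate quoted in the text
TOKENS_PER_FRAME = 8         # Moshi tokens per frame quoted in the text


def text_tokens_per_second(wpm=WORDS_PER_MINUTE, tokens_per_word=TEXT_TOKENS_PER_WORD):
    """Text tokens needed to represent one second of speech."""
    return wpm / 60 * tokens_per_word


def audio_tokens_per_second(frames=FRAMES_PER_SECOND, tokens_per_frame=TOKENS_PER_FRAME):
    """Audio tokens needed to represent one second of speech."""
    return frames * tokens_per_frame


text_rate = text_tokens_per_second()
audio_rate = audio_tokens_per_second()
ratio = audio_rate / text_rate

print(f"text:  {text_rate:.1f} tokens/s")
print(f"audio: {audio_rate:.1f} tokens/s")
print(f"audio needs about {ratio:.0f}x more tokens than text")
```

Under these assumptions, the same second of speech costs about 50 times more tokens as audio than as text, which is the gap the pricing discussion builds on.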
Current Pricing Structure
Text-to-Speech (per minute)
OpenAI: $0.03
Deepgram (Aura): $0.01
ElevenLabs: $0.06
Speech-to-Text (per minute)
OpenAI GPT-4o Transcription: $0.01
OpenAI GPT-4o mini Transcription: $0.006
Deepgram Nova 3: $0.0035
ElevenLabs: $0.0037
Fireworks Whisper: $0.0009
OpenAI Whisper: $0.006
Conversational AI (per minute)
OpenAI GPT-4o: $0.05
OpenAI GPT-4o mini: $0.0125
Deepgram: $0.05
ElevenLabs: $0.096
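To make the lists above easier to compare, here is a small Python sketch that converts the per-minute prices to per-hour prices and reports the spread between the cheapest and most expensive option for each service. The prices are copied from the lists above; the dictionary layout, the shortened provider labels and the `price_spread` helper are illustrative choices, not anything the providers publish.

```python
# Per-minute list prices in USD, copied from the lists above.
# Labels are shortened for readability and are illustrative.
PRICES_PER_MINUTE = {
    "text-to-speech": {
        "OpenAI": 0.03,
        "Deepgram Aura": 0.01,
        "ElevenLabs": 0.06,
    },
    "speech-to-text": {
        "OpenAI": 0.01,
        "OpenAI mini": 0.006,
        "Deepgram Nova 3": 0.0035,
        "ElevenLabs": 0.0037,
        "Fireworks Whisper": 0.0009,
        "OpenAI Whisper": 0.006,
    },
    "conversational": {
        "OpenAI": 0.05,
        "OpenAI mini": 0.0125,
        "Deepgram": 0.05,
        "ElevenLabs": 0.096,
    },
}


def price_spread(prices):
    """Return (cheapest provider, priciest provider, priciest / cheapest)."""
    cheapest = min(prices, key=prices.get)
    priciest = max(prices, key=prices.get)
    return cheapest, priciest, prices[priciest] / prices[cheapest]


for service, prices in PRICES_PER_MINUTE.items():
    cheapest, priciest, ratio = price_spread(prices)
    print(
        f"{service}: {cheapest} at ${prices[cheapest] * 60:.3f}/hour, "
        f"{priciest} at ${prices[priciest] * 60:.3f}/hour, spread {ratio:.1f}x"
    )
```

The spread is only a rough indicator, since the products differ in quality, latency and features.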
Market Dynamics and Future Trends
Technical developments suggest current prices are likely to decrease.
Open Source Competition
Small models such as Orpheus (3B parameters) and CSM-1B demonstrate high-quality speech generation
Models of this size can theoretically be served for single-digit cents per million tokens
Service-Specific Optimization
Text-to-speech and speech-to-text services can use specialized smaller models
These services don't require full dense models with extensive reasoning capabilities
This should enable significant cost reductions
Conversational AI Complexity
Requires larger models for strong reasoning capabilities
May involve model chaining (potentially used by Deepgram and ElevenLabs)
OpenAI likely uses a multimodal end-to-end approach with specific optimizations for real-time performance
Cost Structure Implications
The current pricing suggests significant profit margins, particularly given:
Small models like CSM-1B and Orpheus demonstrate capable performance
Running small models (up to roughly 4B parameters) costs single-digit cents per million tokens
Current prices are often 10x+ higher than estimated operating costs
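As a rough sanity check on the points above, the sketch below estimates what a minute of generated audio might cost to serve and compares that with list prices. It assumes about 100 audio tokens per second (the figure from the token economics section) and a hypothetical serving cost of $0.05 per million tokens, which is simply one value inside the "single-digit cents" range. It ignores everything else a provider pays for (idle capacity, latency targets, staff, margins on other products), so treat the output as an order-of-magnitude illustration rather than a real cost model.

```python
AUDIO_TOKENS_PER_SECOND = 100        # figure from the token economics section
ASSUMED_COST_PER_MILLION = 0.05      # hypothetical: USD per million tokens served

# Text-to-speech list prices per minute (USD), copied from the pricing section.
TTS_PRICE_PER_MINUTE = {"OpenAI": 0.03, "Deepgram Aura": 0.01, "ElevenLabs": 0.06}


def cost_per_audio_minute(cost_per_million_tokens, tokens_per_second=AUDIO_TOKENS_PER_SECOND):
    """Estimated serving cost (USD) for one minute of audio tokens."""
    tokens_per_minute = tokens_per_second * 60
    return tokens_per_minute / 1_000_000 * cost_per_million_tokens


def implied_markup(price_per_minute, cost_per_million_tokens):
    """List price divided by the estimated serving cost."""
    return price_per_minute / cost_per_audio_minute(cost_per_million_tokens)


estimated_cost = cost_per_audio_minute(ASSUMED_COST_PER_MILLION)
print(f"estimated serving cost: ${estimated_cost:.4f} per minute")
for provider, price in TTS_PRICE_PER_MINUTE.items():
    markup = implied_markup(price, ASSUMED_COST_PER_MILLION)
    print(f"{provider}: list price is about {markup:.0f}x the estimated cost")
```

Changing `ASSUMED_COST_PER_MILLION` to anything up to $0.09 still leaves every list price more than 10x above the estimate, which is in line with the "10x+" observation.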
The market appears positioned for substantial price reductions, particularly in text-to-speech and speech-to-text services where smaller, specialized models can suffice.