Are Audio Tokens Overpriced?

OpenAI vs ElevenLabs vs DeepGram

Trelis Research

Mar 26, 2025

Open-source end-to-end models are emerging that can natively handle text and audio (and, most likely, images soon). Llama 4 is expected to have these capabilities and to be a capable multimodal model along the lines of GPT-4o.

These models are highly versatile: they can do text-to-speech (TTS), speech-to-text (STT) and speech-to-speech. Last week, OpenAI released specialised fine-tunes of GPT-4o and GPT-4o mini for TTS and STT.

What I expect now is significant downward pressure on the pricing of audio services. On a token-for-token basis, audio applications remain very expensive compared to text-based models.

I explain how the architecture of these multimodal models drives the pricing dynamics in the market for TTS, STT and conversational AI. I also compare what OpenAI, Deepgram and ElevenLabs charge today with what I think pricing could be.

Cheers, Ronan

Video Links:

- ChatGPT Chat: https://chatgpt.com/share/67e19d1b-245c-8003-bbe2-c2b4fe115a1f
- Canopy Labs: https://canopylabs.ai/model-releases
- OpenAI Pricing: https://platform.openai.com/docs/pricing#transcription-and-speech-generation
- OpenAI API Pricing: https://openai.com/api/pricing/
- Deepgram: https://deepgram.com/pricing
- ElevenLabs: https://elevenlabs.io/pricing
- Fireworks: https://fireworks.ai/pricing#overview

TIMESTAMPS:

00:00 Introduction to Audio Service Pricing

00:41 Understanding Audio Tokens

02:30 Comparing Audio Service Providers

03:27 OpenAI Pricing Analysis

08:39 DeepGram and ElevenLabs Pricing

12:48 Future Pricing Trends

15:08 Conclusion and Final Thoughts

Audio AI Pricing Analysis: OpenAI vs Deepgram vs ElevenLabs

Token Economics of Audio vs Text

Audio processing requires significantly more tokens than text processing. Speech at 120 words per minute corresponds to roughly 2 text tokens per second, whereas an audio representation needs approximately 100 tokens per second to capture both semantics and acoustics. For example, the Moshi model uses 12.5 frames (sampling points) per second with 8 tokens per frame, which works out to 12.5 × 8 = 100 tokens per second.
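The arithmetic behind those two rates can be sketched in a few lines of Python. The speaking rate, frame rate and tokens per frame come from the paragraph above; `TOKENS_PER_WORD` is an assumption made for illustration (real tokenizers often produce slightly more than one token per word).

```python
# Rough token-rate arithmetic: speech as text tokens vs. speech as audio tokens.

WORDS_PER_MINUTE = 120    # speaking rate quoted above
TOKENS_PER_WORD = 1.0     # assumption: about one text token per word
FRAMES_PER_SECOND = 12.5  # Moshi frame rate quoted above
TOKENS_PER_FRAME = 8      # tokens per frame quoted above


def text_tokens_per_second(wpm: float, tokens_per_word: float) -> float:
    """Text tokens needed per second of speech."""
    return wpm / 60 * tokens_per_word


def audio_tokens_per_second(frames_per_second: float, tokens_per_frame: int) -> float:
    """Audio tokens needed per second of speech."""
    return frames_per_second * tokens_per_frame


text_rate = text_tokens_per_second(WORDS_PER_MINUTE, TOKENS_PER_WORD)
audio_rate = audio_tokens_per_second(FRAMES_PER_SECOND, TOKENS_PER_FRAME)
ratio = audio_rate / text_rate

print(f"text:  {text_rate:.1f} tokens/s")
print(f"audio: {audio_rate:.1f} tokens/s")
print(f"audio needs about {ratio:.0f}x more tokens than text")
```

With these inputs the audio representation needs about 50 times as many tokens as the same speech written out as text.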

Current Pricing Structure

Text-to-Speech (per minute)

OpenAI: $0.03
Deepgram (Aura): $0.01
ElevenLabs: $0.06

Speech-to-Text (per minute)

OpenAI GPT-4o Transcribe: $0.01
OpenAI GPT-4o mini Transcribe: $0.006
Deepgram Nova 3: $0.0035
ElevenLabs: $0.0037
Fireworks Whisper: $0.0009
OpenAI Whisper: $0.006

Conversational AI (per minute)

OpenAI GPT-4o: $0.05
OpenAI GPT-4o mini: $0.0125
Deepgram: $0.05
ElevenLabs: $0.096
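To compare these per-minute prices with text models on a token-for-token basis, they can be converted into an implied price per million audio tokens. The sketch below assumes roughly 100 audio tokens per second (the approximate figure from the token economics section); providers do not publish their actual token rates, so treat the output as a rough illustration. Only a subset of the prices listed above is included.

```python
# Convert per-minute audio prices into an implied price per million audio tokens.

AUDIO_TOKENS_PER_SECOND = 100  # assumption: approximate rate for audio tokens
TOKENS_PER_MINUTE = AUDIO_TOKENS_PER_SECOND * 60  # 6,000 tokens per minute

# Per-minute prices in USD, copied from the lists above.
PRICE_PER_MINUTE = {
    "TTS": {"OpenAI": 0.03, "Deepgram (Aura)": 0.01, "ElevenLabs": 0.06},
    "STT": {"Deepgram Nova 3": 0.0035, "Fireworks Whisper": 0.0009, "OpenAI Whisper": 0.006},
}


def usd_per_million_tokens(price_per_minute: float) -> float:
    """Implied USD per one million audio tokens for a given per-minute price."""
    return price_per_minute / TOKENS_PER_MINUTE * 1_000_000


implied = {
    service: {name: usd_per_million_tokens(price) for name, price in providers.items()}
    for service, providers in PRICE_PER_MINUTE.items()
}

for service, providers in implied.items():
    for name, usd in providers.items():
        print(f"{service:<4} {name:<18} ${usd:,.2f} per 1M audio tokens")
```

Under this assumption, a price of $0.03 per minute corresponds to about $5 per million audio tokens.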

Market Dynamics and Future Trends

Technical developments suggest current prices are likely to decrease.

Open Source Competition
1. Models such as Orpheus (3B parameters) and CSM-1B show that small models can generate high-quality speech
2. Models of this size can, in theory, be served for single-digit cents per million tokens

Service-Specific Optimization
1. Text-to-speech and speech-to-text services can use smaller, specialized models
2. These services do not require full dense models with extensive reasoning capabilities
3. This should enable significant cost reductions

Conversational AI Complexity
1. Conversational AI requires larger models for strong reasoning capabilities
2. It may involve chaining models (potentially the approach used by Deepgram and ElevenLabs)
3. OpenAI likely uses a multimodal end-to-end approach with specific optimizations for real-time performance

Cost Structure Implications

The current pricing suggests significant profit margins, particularly given that:

- Small models like CSM-1B and Orpheus demonstrate capable performance
- Running small models (up to about 4B parameters) costs single-digit cents per million tokens
- Current prices are often 10x or more above estimated operating costs
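The gap between price and estimated cost can be roughed out as follows. This is a sketch under stated assumptions, not a cost model: it assumes about 100 audio tokens per second, uses hypothetical serving costs spanning "single-digit cents" per million tokens, and ignores everything other than token compute (audio encoding and decoding, idle capacity, latency guarantees, support). The example prices are taken from the per-minute lists earlier in this summary.

```python
# Estimate how many times higher a per-minute price is than a token-based serving cost.

AUDIO_TOKENS_PER_MINUTE = 100 * 60  # assumption: about 100 audio tokens per second


def serving_cost_per_minute(cost_per_million_tokens: float) -> float:
    """USD cost to process one minute of audio at a given cost per million tokens."""
    return cost_per_million_tokens * AUDIO_TOKENS_PER_MINUTE / 1_000_000


def price_to_cost_multiple(price_per_minute: float, cost_per_million_tokens: float) -> float:
    """Ratio of the per-minute price to the estimated per-minute serving cost."""
    return price_per_minute / serving_cost_per_minute(cost_per_million_tokens)


# Hypothetical serving costs (USD per million tokens) for a small model.
ASSUMED_COSTS = [0.01, 0.05, 0.09]
# Example per-minute prices (USD) from the TTS and STT lists.
EXAMPLE_PRICES = [0.0035, 0.01, 0.03]

for price in EXAMPLE_PRICES:
    for cost in ASSUMED_COSTS:
        multiple = price_to_cost_multiple(price, cost)
        print(f"price ${price}/min at ${cost:.2f} per 1M tokens: about {multiple:.0f}x cost")
```

The multiple is very sensitive to the assumed serving cost, so the useful output is the range across `ASSUMED_COSTS` rather than any single figure.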

The market appears positioned for substantial price reductions, particularly in text-to-speech and speech-to-text services where smaller, specialized models can suffice.