
Are Audio Tokens Overpriced?

OpenAI vs ElevenLabs vs Deepgram

Open-source end-to-end models are emerging that natively handle text and audio (and, most likely, images soon). Llama 4 is expected to have these capabilities and to offer a capable multimodal model comparable to GPT-4o.

These models are highly versatile: they can do text-to-speech (TTS), speech-to-text (STT) and speech-to-speech. Last week, OpenAI released specialised fine-tunes of GPT-4o and GPT-4o mini for TTS and STT.

What I expect now is significant downward pressure on the pricing of audio services. On a token-for-token basis, audio applications remain very expensive compared to text-based models.

I explain how the architecture of these multimodal models accounts for the pricing dynamics in the market for TTS, STT and conversational AI. I also compare today's pricing from OpenAI, Deepgram and ElevenLabs with where I think it could go.

Cheers, Ronan


Trelis Links

🤝 Work for Trelis

💡 Need Technical or Market Assistance? Book a Consult Here

💸 Starting a New Project/Venture? Apply for a Trelis Grant


Video Links:

  • ChatGPT Chat: https://chatgpt.com/share/67e19d1b-245c-8003-bbe2-c2b4fe115a1f

  • Canopy Labs: https://canopylabs.ai/model-releases

  • OpenAI Pricing: https://platform.openai.com/docs/pricing#transcription-and-speech-generation

  • OpenAI API Pricing: https://openai.com/api/pricing/

  • Deepgram: https://deepgram.com/pricing

  • ElevenLabs: https://elevenlabs.io/pricing

  • Fireworks: https://fireworks.ai/pricing#overview

TIMESTAMPS:

00:00 Introduction to Audio Service Pricing

00:41 Understanding Audio Tokens

02:30 Comparing Audio Service Providers

03:27 OpenAI Pricing Analysis

08:39 Deepgram and ElevenLabs Pricing

12:48 Future Pricing Trends

15:08 Conclusion and Final Thoughts


Audio AI Pricing Analysis: OpenAI vs Deepgram vs ElevenLabs

Token Economics of Audio vs Text

Audio needs far more tokens than text for the same speech. At 120 words per minute, speech corresponds to roughly 2 text tokens per second, whereas representing the audio itself takes approximately 100 tokens per second to capture both semantics and acoustics. For example, the Moshi model uses 12.5 frames per second with 8 tokens per frame, which works out to 100 tokens per second.
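The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the constant names are made up for this example, and the value of one text token per word is an assumption chosen to match the "roughly 2 text tokens per second" estimate (real tokenizers usually produce somewhat more than one token per word).

```python
# Rough token-rate comparison for the same stretch of speech.
# All figures are estimates from the text, not measurements.

WORDS_PER_MINUTE = 120      # speaking rate used in the text
TOKENS_PER_WORD = 1.0       # assumption: ~1 text token per word
FRAMES_PER_SECOND = 12.5    # Moshi frame rate quoted in the text
TOKENS_PER_FRAME = 8        # audio tokens per frame quoted in the text


def text_tokens_per_second(wpm=WORDS_PER_MINUTE, tokens_per_word=TOKENS_PER_WORD):
    """Text tokens needed per second of speech."""
    return wpm / 60 * tokens_per_word


def audio_tokens_per_second(fps=FRAMES_PER_SECOND, tokens_per_frame=TOKENS_PER_FRAME):
    """Audio tokens needed per second of speech."""
    return fps * tokens_per_frame


text_rate = text_tokens_per_second()
audio_rate = audio_tokens_per_second()
ratio = audio_rate / text_rate

print(f"text:  {text_rate:.1f} tokens/s")
print(f"audio: {audio_rate:.1f} tokens/s")
print(f"audio needs about {ratio:.0f}x more tokens than text")
```

Under these assumptions, one minute of speech is about 120 text tokens but about 6,000 audio tokens, a gap of roughly 50x.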

Current Pricing Structure

Text-to-Speech (per minute)

  1. OpenAI: $0.03

  2. Deepgram (Aura): $0.01

  3. ElevenLabs: $0.06
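To compare these per-minute prices with text-model pricing, they can be converted into an implied price per million audio tokens. The sketch below assumes the figure of roughly 100 audio tokens per second discussed earlier. Providers do not publish their internal token rates, so the results are only an order-of-magnitude guide, and the function name is hypothetical.

```python
# Convert a per-minute audio price into an implied price per million audio tokens.
# Assumption: ~100 audio tokens per second, i.e. 6,000 tokens per minute.

AUDIO_TOKENS_PER_MINUTE = 100 * 60


def price_per_million_tokens(price_per_minute, tokens_per_minute=AUDIO_TOKENS_PER_MINUTE):
    """Implied USD price per one million audio tokens."""
    return price_per_minute / tokens_per_minute * 1_000_000


# Text-to-speech prices per minute, as listed above (USD).
tts_prices = {
    "OpenAI": 0.03,
    "Deepgram (Aura)": 0.01,
    "ElevenLabs": 0.06,
}

implied = {
    name: round(price_per_million_tokens(price), 2)
    for name, price in tts_prices.items()
}

for name, value in implied.items():
    print(f"{name}: ~${value:.2f} per million audio tokens")
```

On this basis the listed TTS prices correspond to somewhere between about $1.67 and $10 per million audio tokens.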

Speech-to-Text (per minute)

  1. OpenAI GPT-4o Transcription: $0.01

  2. OpenAI GPT-4o mini Transcription: $0.006

  3. Deepgram Nova 3: $0.0035

  4. ElevenLabs: $0.0037

  5. Fireworks Whisper: $0.0009

  6. OpenAI Whisper: $0.006

Conversational AI (per minute)

  1. OpenAI GPT-4o: $0.05

  2. OpenAI GPT-4o mini: $0.0125

  3. Deepgram: $0.05

  4. ElevenLabs: $0.096

Market Dynamics and Future Trends

Technical developments suggest current prices are likely to decrease.

  1. Open Source Competition

    1. Models like Orpheus (3B parameters) and CSM-1B demonstrate high-quality speech generation with smaller models

    2. These smaller models can, in principle, be served at a cost of single-digit cents per million tokens

  2. Service-Specific Optimization

    1. Text-to-speech and speech-to-text services can use specialized smaller models

    2. These services don't require full dense models with extensive reasoning capabilities

    3. This should enable significant cost reductions

  3. Conversational AI Complexity

    1. Requires larger models for strong reasoning capabilities

    2. May involve model chaining (potentially used by Deepgram and ElevenLabs)

    3. OpenAI likely uses a multimodal end-to-end approach with specific optimizations for real-time performance

Cost Structure Implications

The current pricing suggests significant profit margins, particularly given:

  1. Small models like CSM-1B and Orpheus demonstrate capable performance

  2. Running small models (up to roughly 4B parameters) costs single-digit cents per million tokens

  3. Current prices are often 10x+ higher than estimated operating costs

The market appears positioned for substantial price reductions, particularly in text-to-speech and speech-to-text services where smaller, specialized models can suffice.
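The margin claim can be checked with a back-of-the-envelope calculation. This sketch assumes roughly 100 audio tokens per second and takes 9 cents per million tokens as the upper end of "single-digit cents". It ignores everything except token-serving cost (latency guarantees, voice licensing, idle capacity, support), so it overstates the real margin. The function names are hypothetical.

```python
# Estimate how far a per-minute list price sits above an assumed serving cost.
# Assumptions: ~100 audio tokens/s; serving cost of $0.09 per million tokens
# (the top of the "single-digit cents" range for small models).

AUDIO_TOKENS_PER_MINUTE = 100 * 60
HIGH_COST_PER_MILLION = 0.09  # USD, deliberately pessimistic


def operating_cost_per_minute(cost_per_million_tokens):
    """Assumed serving cost in USD for one minute of audio."""
    return cost_per_million_tokens * AUDIO_TOKENS_PER_MINUTE / 1_000_000


def markup(list_price_per_minute, cost_per_million_tokens=HIGH_COST_PER_MILLION):
    """Ratio of list price to assumed serving cost."""
    return list_price_per_minute / operating_cost_per_minute(cost_per_million_tokens)


# Lowest and highest text-to-speech prices per minute from the lists above.
cheapest_tts = 0.01
priciest_tts = 0.06

print(f"cheapest TTS markup: {markup(cheapest_tts):.1f}x")
print(f"priciest TTS markup: {markup(priciest_tts):.1f}x")
```

Even with the pessimistic cost assumption, the cheapest listed TTS price comes out well above 10x the estimated serving cost, which is consistent with the "10x+" figure.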
