What DeepSeek V3 got right
I haven't made model-release videos in a while, but this model had too many key insights not to...
- Basically, DeepSeek cracked the main problem with Mixture of Experts => load balancing, and this unlocks much smaller (active) models that train stably...
They also add:
- FP8 training
- Compressed / latent attention from DeepSeek v2
- Multi-token prediction (which can also serve as speculative decoding - with an 80%+ token acceptance rate!!!)
I go deep on what we know about how frontier models are trained today - through an analogy involving libraries.
Hope ye enjoy it - I also burned $800 (oops) running the model myself on 8x H200 GPUs!!!
Cheers, Ronan
More resources at Trelis.com
DeepSeek's V3: A Much More Efficient Path to Frontier AI Performance
*New model achieves competitive results with ~10x fewer GPUs than previous approaches*
DeepSeek released their V3 model on December 24th, demonstrating performance competitive with GPT-4o and Claude Sonnet while using significantly less compute. The model was trained for approximately $5.5 million (at $2 per GPU-hour) on ~2,000 H800 GPUs - compared to Meta's Llama 3.1 405B, which required 16,000 GPUs.
Benchmark Performance
On key metrics like SWE-bench Verified (GitHub issue resolution), DeepSeek V3 places second only to Claude Sonnet while outperforming GPT-4o. The model achieves top performance on the Codeforces (competitive programming) and AIME (math) benchmarks among both open and closed-source models. On GPQA Diamond and MMLU Pro, it trails slightly behind Claude Sonnet but remains competitive with other leading models.
Architecture Innovations
DeepSeek employs a Mixture of Experts (MoE) architecture with 256 routed experts and one shared expert. The model activates 8 routed experts per token, with each token's experts restricted to at most 4 nodes to limit communication overhead. This allows the model to activate only ~37B parameters per token despite having 671B total parameters.
The really big improvement is in a) using a shared expert and b) using a per-expert bias to adjust expert choice so that load stays balanced. This replaces the auxiliary-loss approaches used previously, which led to large imbalances in GPU usage and slower training convergence.
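To make the bias idea concrete, here is a minimal sketch of bias-adjusted top-k routing in PyTorch. It assumes sigmoid affinity scores and a simple sign-based bias update with step size `gamma`, following the description in the V3 paper; the function names, shapes, and hyperparameters are illustrative, not DeepSeek's actual training code.

```python
import torch

def route_tokens(scores, bias, k=8):
    """Select top-k experts using bias-adjusted scores, but weight expert outputs
    with the original (unbiased) scores."""
    biased = scores + bias                         # bias influences *selection* only
    topk_idx = biased.topk(k, dim=-1).indices      # chosen experts per token
    gate = torch.gather(scores, -1, topk_idx)      # gating weights from original scores
    gate = gate / gate.sum(dim=-1, keepdim=True)   # normalize over the selected experts
    return topk_idx, gate

def update_bias(bias, topk_idx, num_experts, gamma=1e-3):
    """After each step: push bias down for overloaded experts, up for underloaded ones."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    return bias - gamma * torch.sign(load - load.mean())

# Toy usage: 16 tokens routed over 256 experts
scores = torch.sigmoid(torch.randn(16, 256))
bias = torch.zeros(256)
idx, gate = route_tokens(scores, bias)
bias = update_bias(bias, idx, num_experts=256)
```

Because the bias only affects which experts are picked, not how their outputs are weighted, the router can be nudged toward balanced GPU load without an auxiliary loss term competing with the language-modeling objective.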
Other Technical Optimisations
1. FP8 Training: Uses 8-bit (FP8) precision instead of 16-bit, roughly halving memory and compute requirements, while maintaining stability by keeping accumulations and sensitive operations in higher precision (see the first sketch after this list)
2. Compressed Attention: Reduces the memory needed for the key-value cache by ~20x by projecting keys and values into a lower-dimensional latent space (see the second sketch after this list)
3. Multi-Token Prediction: Adds auxiliary prediction heads for future tokens, improving training quality and enabling speculative decoding with 85-90% acceptance rates.
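Two quick sketches to make items 1 and 2 concrete.

First, fine-grained FP8 scaling: store weights and activations as 1-byte FP8 values with one higher-precision scale per small block, and dequantize where extra precision matters. This is a simplified picture rather than DeepSeek's kernels: the block size, function names, and round-trip test are illustrative, and it assumes a PyTorch version (2.1+) that exposes the `float8_e4m3fn` dtype.

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn

def fp8_quantize_blockwise(x, block=128):
    """Cast to FP8 with one FP32 scale per block of 128 values (fine-grained scaling)."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (x / scale).to(torch.float8_e4m3fn)      # stored at 1 byte per value
    return q, scale                              # scales stay in full precision

def fp8_dequantize(q, scale):
    return q.to(torch.float32) * scale           # back to high precision for sensitive ops

x = torch.randn(4, 1024)
q, s = fp8_quantize_blockwise(x)
err = (fp8_dequantize(q, s).reshape(4, 1024) - x).abs().mean()
print(f"mean abs error after FP8 round-trip: {err:.4f}")
```

Second, the low-rank key-value idea behind compressed ("latent") attention: cache one small latent vector per token and re-expand it into keys and values at attention time. Dimensions, layer names, and the omission of causal masking and rotary embeddings are all simplifications for illustration - this is not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy attention layer that caches a compressed latent instead of full K and V."""

    def __init__(self, d_model=1024, d_latent=64, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress: only this goes in the cache
        self.k_up = nn.Linear(d_latent, d_model)      # decompress at attention time
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y), latent                    # the latent is the new KV cache
```

At these toy sizes the cache holds 64 values per token instead of the 2 x 1024 a standard KV cache would need, a ~32x reduction - the same order of magnitude as the savings described above.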
Inference Performance
Independent testing shows:
- 44-50 tokens/second on DeepSeek API
- Comparable to Claude Sonnet (51 tokens/second)
- Slightly slower than GPT-4o (72 tokens/second)
- Theoretical capability of 100-150 tokens/second with optimized implementation
- Potential for 200+ tokens/second with speculative decoding
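As a rough sanity check on that last figure: with one extra prediction head used as a drafter (one draft token per step) and an acceptance rate of 85-90%, each full forward pass yields on average 1 + acceptance tokens. The baseline speed below is an assumption taken from the 100-150 tokens/second estimate above, not a measurement.

```python
# Back-of-envelope speedup from using the multi-token-prediction head for speculative decoding.
# Assumes 1 draft token per step and that draft verification adds negligible extra cost.
base_tps = 125                                   # assumed optimized single-token speed (tok/s)
for acceptance in (0.85, 0.90):
    tokens_per_step = 1 + acceptance             # the model's own token + accepted draft
    print(f"acceptance={acceptance:.0%}: ~{base_tps * tokens_per_step:.0f} tokens/sec")
# -> roughly 230-240 tokens/sec, consistent with the 200+ estimate.
```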
Deployment Considerations
The full model requires:
- Minimum 8 H200 GPUs (or equivalent)
- Over 600GB VRAM at 8-bit weights
- Additional VRAM for KV cache with long contexts
- Approximately $32/hour operating cost at current GPU prices
- 15+ minutes for initial model loading
This represents a significant deployment cost despite training efficiency gains, suggesting most developers should use API access rather than self-hosting.
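For a rough sense of where those numbers come from, here is the back-of-envelope memory math. The per-GPU HBM figure is the public H200 spec; everything else follows from the parameter count, and the split is approximate.

```python
# Rough VRAM budget for self-hosting DeepSeek V3 at 8-bit weights on 8x H200.
total_params = 671e9                   # all experts must be resident, even though only ~37B are active per token
weights_gb = total_params * 1 / 1e9    # 1 byte per parameter at 8-bit -> ~671 GB
h200_hbm_gb, num_gpus = 141, 8
total_hbm_gb = h200_hbm_gb * num_gpus  # ~1128 GB across the node
print(f"weights ~{weights_gb:.0f} GB on {total_hbm_gb} GB of HBM, "
      f"leaving ~{total_hbm_gb - weights_gb:.0f} GB for KV cache and activations")
```

That leftover ~460 GB is what long-context KV caches eat into, which is why VRAM headroom beyond the weights still matters.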
Implications for Model Development
DeepSeek V3 demonstrates that frontier-level performance can be achieved with substantially less compute, **because they made Mixture of Experts work without load-balancing issues**.
This should help drive frontier model costs down to below $1 per million tokens.
Thank you for sharing, and happy New Year!
One possible inconsistency: DeepSeek's official API intentionally limits the speed for non-enterprise users (I checked this with support), so it may not match the quoted throughput.
That said, the rest of the analysis still holds up nicely!