What DeepSeek V3 got right
I haven't made model-release videos in a while, but this model had too many key insights not to...
- Basically, DeepSeek cracked the main problem with Mixture of Experts => load balancing, and this unlocks much smaller (active) models that train stably...
They also add:
- FP8 training
- Compressed / latent attention from DeepSeek v2
- Multi-token prediction (which can also serve as speculative decoding - with an 80%+ token acceptance rate!!!)
I go deep on what we know about how frontier models are trained today - through an analogy involving libraries.
Hope ye enjoy it - I also burned $800 (oops) running the model myself on 8x H200 GPUs!!!
Cheers, Ronan
More resources at Trelis.com
DeepSeek's V3: A Much More Efficient Path to Frontier AI Performance
*New model achieves competitive results with ~10x fewer GPUs than previous approaches*
DeepSeek released their V3 model on December 24th, demonstrating performance competitive with GPT-4o and Claude Sonnet while using significantly less compute. The model was trained for approximately $5.5 million (at $2 per GPU-hour) on ~2,000 H800 GPUs - compared to Meta's Llama 3.1 405B, which required 16,000 GPUs.
Benchmark Performance
On key metrics like SWE-bench Verified (GitHub issue resolution), DeepSeek V3 places second only to Claude Sonnet while outperforming GPT-4o. The model achieves top performance on the Codeforces (competitive programming) and AIME (math) benchmarks among both open and closed-source models. On GPQA Diamond and MMLU Pro, it trails slightly behind Claude Sonnet but remains competitive with other leading models.
Architecture Innovations
DeepSeek employs a Mixture of Experts (MoE) architecture with 256 routed experts and one shared expert. The model activates 8 routed experts per token, with each token's experts restricted to at most 4 nodes to limit communication overhead. This allows the model to activate only ~37B parameters per token despite having 671B total parameters.
The really big improvement is in a) using a shared expert and b) using a per-expert bias to adjust expert choice so that load stays balanced. This replaces the auxiliary-loss approaches used previously, which led to large imbalances in GPU usage and slower training convergence.
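To make the bias idea concrete, here is a minimal sketch of bias-adjusted top-k routing in PyTorch. It assumes sigmoid affinity scores and a simple sign-based bias update with step size `gamma`, following the description in the V3 paper; the function names, shapes, and hyperparameters are illustrative, not DeepSeek's actual training code.

```python
import torch

def route_tokens(scores, bias, k=8):
    """Select top-k experts using bias-adjusted scores, but weight expert outputs
    with the original (unbiased) scores."""
    biased = scores + bias                         # bias influences *selection* only
    topk_idx = biased.topk(k, dim=-1).indices      # chosen experts per token
    gate = torch.gather(scores, -1, topk_idx)      # gating weights from original scores
    gate = gate / gate.sum(dim=-1, keepdim=True)   # normalize over the selected experts
    return topk_idx, gate

def update_bias(bias, topk_idx, num_experts, gamma=1e-3):
    """After each step: push bias down for overloaded experts, up for underloaded ones."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    return bias - gamma * torch.sign(load - load.mean())

# Toy usage: 16 tokens routed over 256 experts
scores = torch.sigmoid(torch.randn(16, 256))
bias = torch.zeros(256)
idx, gate = route_tokens(scores, bias)
bias = update_bias(bias, idx, num_experts=256)
```

Because the bias only affects which experts are picked, not how their outputs are weighted, the router can be nudged toward balanced GPU load without an auxiliary loss term competing with the language-modeling objective.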
Other Technical Optimisations
1. FP8 Training: Uses 8-bit (FP8) precision instead of 16-bit, roughly halving memory and compute requirements, while maintaining stability by keeping accumulations and sensitive operations in higher precision (see the first sketch after this list)
2. Compressed Attention: Reduces the memory needed for the key-value cache by ~20x by projecting keys and values into a lower-dimensional latent space (see the second sketch after this list)
3. Multi-Token Prediction: Adds auxiliary prediction heads for future tokens, improving training quality and enabling speculative decoding with 85-90% acceptance rates.
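Two quick sketches to make items 1 and 2 concrete.

First, fine-grained FP8 scaling: store weights and activations as 1-byte FP8 values with one higher-precision scale per small block, and dequantize where extra precision matters. This is a simplified picture rather than DeepSeek's kernels: the block size, function names, and round-trip test are illustrative, and it assumes a PyTorch version (2.1+) that exposes the `float8_e4m3fn` dtype.

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in torch.float8_e4m3fn

def fp8_quantize_blockwise(x, block=128):
    """Cast to FP8 with one FP32 scale per block of 128 values (fine-grained scaling)."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (x / scale).to(torch.float8_e4m3fn)      # stored at 1 byte per value
    return q, scale                              # scales stay in full precision

def fp8_dequantize(q, scale):
    return q.to(torch.float32) * scale           # back to high precision for sensitive ops

x = torch.randn(4, 1024)
q, s = fp8_quantize_blockwise(x)
err = (fp8_dequantize(q, s).reshape(4, 1024) - x).abs().mean()
print(f"mean abs error after FP8 round-trip: {err:.4f}")
```

Second, the low-rank key-value idea behind compressed ("latent") attention: cache one small latent vector per token and re-expand it into keys and values at attention time. Dimensions, layer names, and the omission of causal masking and rotary embeddings are all simplifications for illustration - this is not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Toy attention layer that caches a compressed latent instead of full K and V."""

    def __init__(self, d_model=1024, d_latent=64, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress: only this goes in the cache
        self.k_up = nn.Linear(d_latent, d_model)      # decompress at attention time
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent)
        if latent_cache is not None:
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y), latent                    # the latent is the new KV cache
```

At these toy sizes the cache holds 64 values per token instead of the 2 x 1024 a standard KV cache would need, a ~32x reduction - the same order of magnitude as the savings described above.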
Inference Performance
Independent testing shows:
- 44-50 tokens/second on DeepSeek API
- Comparable to Claude Sonnet (51 tokens/second)
- Slightly slower than GPT-4o (72 tokens/second)
- Theoretical capability of 100-150 tokens/second with optimized implementation
- Potential for 200+ tokens/second with speculative decoding
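As a rough sanity check on that last figure: with one extra prediction head used as a drafter (one draft token per step) and an acceptance rate of 85-90%, each full forward pass yields on average 1 + acceptance tokens. The baseline speed below is an assumption taken from the 100-150 tokens/second estimate above, not a measurement.

```python
# Back-of-envelope speedup from using the multi-token-prediction head for speculative decoding.
# Assumes 1 draft token per step and that draft verification adds negligible extra cost.
base_tps = 125                                   # assumed optimized single-token speed (tok/s)
for acceptance in (0.85, 0.90):
    tokens_per_step = 1 + acceptance             # the model's own token + accepted draft
    print(f"acceptance={acceptance:.0%}: ~{base_tps * tokens_per_step:.0f} tokens/sec")
# -> roughly 230-240 tokens/sec, consistent with the 200+ estimate.
```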
Deployment Considerations
The full model requires:
- Minimum 8 H200 GPUs (or equivalent)
- Over 600GB VRAM at 8-bit weights
- Additional VRAM for KV cache with long contexts
- Approximately $32/hour operating cost at current GPU prices
- 15+ minutes for initial model loading
This represents a significant deployment cost despite training efficiency gains, suggesting most developers should use API access rather than self-hosting.
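For a rough sense of where those numbers come from, here is the back-of-envelope memory math. The per-GPU HBM figure is the public H200 spec; everything else follows from the parameter count, and the split is approximate.

```python
# Rough VRAM budget for self-hosting DeepSeek V3 at 8-bit weights on 8x H200.
total_params = 671e9                   # all experts must be resident, even though only ~37B are active per token
weights_gb = total_params * 1 / 1e9    # 1 byte per parameter at 8-bit -> ~671 GB
h200_hbm_gb, num_gpus = 141, 8
total_hbm_gb = h200_hbm_gb * num_gpus  # ~1128 GB across the node
print(f"weights ~{weights_gb:.0f} GB on {total_hbm_gb} GB of HBM, "
      f"leaving ~{total_hbm_gb - weights_gb:.0f} GB for KV cache and activations")
```

That leftover ~460 GB is what long-context KV caches eat into, which is why VRAM headroom beyond the weights still matters.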
Implications for Model Development
DeepSeek V3 demonstrates that frontier-level performance can be achieved with substantially less compute, **because they made Mixture of Experts work without load-balancing issues**.
This should help drive frontier model costs down to below $1 per million tokens.
Thank you for sharing, and happy New Year!
One possible inconsistency: DeepSeek's official API intentionally limits the speed for non-enterprise users (I checked this with support), so it may not match the quoted throughput.
That said, the rest of the analysis still holds up nicely!