I wouldn't rule it out (and it's not a binary question), but I am not confident that DeepSeek v3’s performance primarily, or even largely, stems from distillation.
Reasons Against.
1. Distillation on logits is not possible with private APIs (and difficult with mismatching tokenisers). Distillation works, but it works best - particularly for broad-based knowledge - when done using KL divergence on the logits (token probabilities) rather than cross-entropy on completions (video explanation). KL requires all (or at least many) of the logits for each token in the vocabulary. Unsupervised or supervised fine-tuning on completions requires only the chosen token (and assigns it 100% probability and every other token 0%); the two objectives are contrasted in the first sketch after this list. Private model APIs (very intentionally, for exactly these reverse-engineering reasons) do not expose these logits. Even were the logits available, distilling a private model A with token vocabulary tA onto a model B with vocabulary tB is not straightforward, although there are more recent methods to overcome the mismatch.
2. SFT improvements on synthetic data are reasonably domain-specific. Even if KL is not possible on private models, then yes, supervised fine-tuning on completions from a private API is effective (see this s1 paper using just 1k samples to improve performance); a sketch of how such a dataset might be assembled follows this list. However, the improvements are domain-specific (e.g. MATH) - as is typical with fine-tuning (and, incidentally, with reinforcement learning). DeepSeek v3 did well on narrow benchmarks, but it did well on many narrow benchmarks AND on general vibes. In short, I don't believe a narrow fine-tuning intervention would get you to a DeepSeek v3 model.
3. It is less obvious how to generate synthetic questions than synthetic answers. To go even further, let us assume that DeepSeek v3 generated a substantial pretraining (or continued-pretraining) corpus using OpenAI's API. It is worth bearing in mind that it is easier to generate answers than to automatically generate questions. Perhaps one could identify a set of knowledge domains and have OpenAI's models spit out questions and answers - but it is not obvious this would be performant. Perhaps the closest one might get - absent being able to do KL - is some form of preference tuning (ORPO or DPO) between a standard pre-training corpus and OpenAI completions. Even this isn't trivial, because what you use as a "standard pre-training corpus" will very much affect performance, and private models perform well in significant part because their pretraining data goes through filtering and pre-processing that is not publicly available. [There's a technical/pricing point that if you wanted to generate a pre-training dataset token-for-token, it would cost more than just ~(#tokens x $/MM input tokens), because you cannot autoregressively decode more than one token without diverging from the target text - each target token needs its true preceding context re-sent as input (although presumably one would accept some trade-off there). The cost formula would be "average input length x corpus size x $/MM input tokens + corpus size x $/MM output tokens". For quality, one would want a minimum input length of 4096 (ideally matching the model context length, so longer would be better, and perhaps 32k would be needed for DeepSeek v3). A rough calculation of cost might be 2T tokens x ($3/MM x 4096 + $15/MM) ≈ $24.6B; see the last sketch after this list]. The point is not the exact calculation/numbers, but the scalings.
4. Distillation can bring you close to, but not on par with or above, the teacher model. At the time DeepSeek v3 was released, it was competitive with private APIs on many measures. Even the best distillation approaches will bring you towards the teacher, but not right up to or beyond the teacher's performance.
5. DeepSeek engineers are strong. DeepSeek's engineers are clearly exceptional, based on their technical papers, model architecture and inference techniques. It stands to reason they are also exceptional at the preparation and filtering of a pre-training corpus - which is what gives broad-based strong behaviour.
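To make point 1 concrete, here is a minimal sketch (assuming PyTorch; tensor shapes and the temperature value are illustrative) contrasting KL-divergence distillation on full teacher logits with hard-label fine-tuning on completions, where only the chosen token is observed. Note the KL objective also implicitly assumes the student and teacher share a vocabulary.

```python
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft distillation: needs the teacher's full logit vector at every position,
    which private APIs do not expose (and which assumes matching vocabularies)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), conventionally rescaled by t^2
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

def sft_loss(student_logits, completion_ids):
    """Hard-label fine-tuning on a sampled completion: the chosen token is treated
    as 100% probable and every other token as 0%."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.view(-1, vocab), completion_ids.view(-1))

if __name__ == "__main__":
    B, T, V = 2, 16, 32000                    # batch, sequence length, vocab size
    student = torch.randn(B, T, V)
    teacher = torch.randn(B, T, V)            # unavailable in practice from a private API
    completion = torch.randint(0, V, (B, T))  # all a completions API gives you
    print(kl_distillation_loss(student, teacher).item())
    print(sft_loss(student, completion).item())
```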
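For point 2, a hypothetical sketch of how one might assemble a narrow SFT dataset from a private API. The model name, prompts and file name are purely illustrative assumptions, not a claim about what anyone actually did; the point is that data collected this way is question-conditioned within a chosen domain, which is part of why the resulting gains tend to be domain-specific.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative domain-specific prompts; in practice these would come from an
# existing question set (which is exactly the rub of point 3).
math_prompts = [
    "Prove that the sum of two odd integers is even.",
    "Solve for x: 3x^2 - 12 = 0, showing each step.",
]

records = []
for prompt in math_prompts:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative teacher model
        messages=[{"role": "user", "content": prompt}],
    )
    records.append({"prompt": prompt,
                    "completion": response.choices[0].message.content})

with open("sft_math_subset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```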
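And for the pricing aside in point 3, a back-of-the-envelope script reproducing the scaling above. The prices, context length and corpus size are the stated assumptions, not measured figures.

```python
corpus_tokens = 2e12      # 2T target tokens
avg_input_len = 4096      # context re-sent for each generated token
price_in = 3 / 1e6        # $3 per MM input tokens
price_out = 15 / 1e6      # $15 per MM output tokens

# Each target token pays for its full input context plus one output token.
cost = corpus_tokens * (avg_input_len * price_in + price_out)
print(f"${cost / 1e9:.1f}B")  # ~$24.6B
```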
Reasons in Favour.
OpenAI reported having to shut down certain IP addresses for high and aberrant API usage. The specifics here matter and I don't know if they are conclusive.
Behavioural similarities (although OpenAI-generated data now being widespread on the web is a confounder).
I don't rule out that - if you generated ~1T tokens, or maybe even 100B tokens of synthetic data - you could gain significant benefit from a Western API.
DeepSeek’s reported costs were low, but those costs were only for GPU rental for the training run - not for data prep or ablations.
Conclusion.
It's possible that DeepSeek v3’s strong performance can be attributed to synthetic data from OpenAI - but I wouldn't go as far as to say that it is likely. Of the possibilities, the use of synthetic data for narrower, domain-specific improvements is more likely than its use to build broad-based capabilities.