DeepSeek is a Chinese company that has released an AI model that is approaching the power of OpenAI models, but developed at a fraction of the cost. I won't recap further as enough has been said elsewhere (and in this newsletter).
Q: Did DeepSeek copy outputs/answers from OpenAI to train their model?
This is a nuanced question.
The internet now has trillions of words that were written by AI - most of them probably from OpenAI, since they have been dominant so far. Since Large Language Models are trained on internet data, yes, all models inherently are trained on OpenAI data (and Anthropic, Google, DeepSeek - to a lesser degree etc.).
For this reason, large language models often think they are ChatGPT / GPT 4. This behaviour is probably more reflective of the type of data filtering that is done prior to training (i.e. omitting phrases with the word chatgpt or gpt4) than of whether training was directly done on OpenAI outputs, although it’s hard to distinguish.
Q: Did DeepSeek use OpenAI answers/outputs to train their latest reasoning/R1 model?
Reasoning models give answers in two parts. The first part is enclosed in <thinking></thinking> tags, and the final/summarized answer then follows. Here's a sample answer:
"<thinking>Let me see, in order to add one plus one plus one, I should first add one plus one, and then add on another one.</thinking>One plus one is two, and then two plus one is three, so 1+1+1=3."
OpenAI hides everything that goes in the <thinking>. They don't share that publicly because they don't want competitors training on it. So, DeepSeek (or others) can't train on that part. DeepSeek (and Google, to a lesser degree) do share the <thinking> part. [It will be surprising if OpenAI don’t now open up their “thinking”, see Manifold market here.]
To train a model to reason, you need to generate many reasoning problems where you have answers. It is not trivial to figure out how to do this. DeepSeek have not disclosed their methods or questions/answers. Neither have OpenAI. Could OpenAI's o1 or o3 models be used to help do this? Yes, possibly, it could make sense. If so, would it be possible to prove definitively? Maybe not easily, because this kind of data is spread around the internet and may be incorporated directly or indirectly in training data.
FWIW, it's possible that - in some ways - all of this AI data on the internet is helping to improve the quality of models because AI data tends to be clean and organised (although can be less diverse).... the increasing AI slop on the internet can go both ways.
Q: Is the DeepSeek model really faster/cheaper?
Yes, it is. Probably by 10X, at least compared to the next best open source models. My guess is that private models have tricks that - while maybe not the same - provide some of these speed ups. At the same time, my estimate is that DeepSeek has tricks beyond the private state of the art - simply because the incentives are strong for private model providers to overstate their capabilities.
There are two big reasons for DeepSeek being faster/cheaper, each published in papers earlier in 2024:
DeepSeek is MUCH cheaper at processing long sequences. Large language models predict the next word by looking back at all previous tokens in the sequence. This gets expensive if you are looking back at 100,000 words (which is the length many models can process). DeepSeek manage to compress this information by a factor of about ten, without losing performance/quality. Actually, they slightly improve quality because spends more of its storage on things that are important and heavily compresses what isn't.
DeepSeek is actually 257 smaller models, not one huge model. Every question (technically, every token prediction in every layer) is answered by only 9 out of those models. Each model is specialised. This allows DeepSeek to build a very big mother-model, but to run it for a fraction of the cost by only calling on the best sub-models when answers are needed. This is called Mixture of Experts - it's not new - but it's very hard to pull off. During model training, there is a tendency for knowledge to accumulate in only a few experts - leaving many experts weak or redundant. If Meta had figured out how to balance experts during training they would not have trained a 405 billion parameter model (although, as mentioned by the DeepSeek CEO, that architecture is likely 1-2 years behind OpenAI/Anthropic/Gemini approaches). They would have done like DeepSeek and trained maybe 256 2-billion parameter sub-models. DeepSeek found a technical unlock (using a bias instead of aux loss for choosing experts) that made this mixture of experts work. [Actually, running these experts on computers is also complicated, especially when you are limited - due to export controls - to computers that can communicate with each other at only half of the speed, so DeepSeek also had to come up with some tricks here too].
If you really want to dig into the technical design of DeepSeek, I made a video here:
As a side note, export controls likely ARE slowing Chinese companies, given this quote from DeepSeek CEO:
Money has never been the problem for us; bans on shipments of advanced chips are the problem.
Q: Is the Chinese government just subsidising DeepSeek and that's why they are cheaper?
I don't know what the Chinese government is or isn't doing, but DeepSeek is probably cheap (~$0.20 per million input tokens, versus ~$3+ per million tokens) because of:
a) the improvements above and
b) they are smart at setting up computers to run models. I say this because any company can run DeepSeek - the weights are open, but Fireworks - a US company - are charging $0.9 per million tokens to run the same model, which is 3-4x the price charged by DeepSeek. Probably DeepSeek are smarter on running the computers for inference. Also, here is the DeepSeek CEO, from a highly recommended Nov 2024 interview:
Our principle is that we don’t subsidize nor make exorbitant profits. This price point gives us just a small profit margin above costs.
FWIW Google is cheaper than DeepSeek for inference. One reason Google is cheap is because they don't rely only on Nvidia. Google has their own computers (TPUs, Tensor Processing Units) and has been investing in them before the current AI cycle.
Q: Is AI a bubble?
Yes, obviously AI is a bubble. Money is being piled into hiring engineers and buying computers largely because other companies are doing the same AND there is no clear revenue-based justification relative to valuation. Rather, there is an abstract and shared goal of increasing intelligence/discoveries.
Is this bad? When there are minds and money focused on something, there are bound to be new unexpected ideas that pop up (e.g. DeepSeek's improvements). In a bubble, it's irrational NOT to get involved in some way. The fact that others are working on the same broad category means there is the coordination. This increases your own chances of success. But, while bubbles can be good as a whole, that doesn't mean individual players will make money in the long run.
So, is this bubble bad? Yes, in certain ways it is bad. If you have a lot of your net worth in Nvidia and didn't take gains, that is a concentrated risk. If you have a lot of your retirement money in the S&P500 where now seven companies make up over 30% of the value, then that is also a concentrated risk - although maybe not too bad of one.
Q: So did DeepSeek cause Nvidia's stock price to crash?
One theory is that - realising the technical improvements DeepSeek made - markets realised that there is more to AI than just having lots of computers. So, perhaps the demand for Nvidia chips will go down.
I think the picture is a lot more complicated than that and I don't have an explanation for exactly why Nvidia stock dropped (and recovered a bit) when it did.
However, there are a few general things to say:
Markets and technologies are not a one dimensional game. There are unlimited dimensions along which to play. Maybe it takes some time, but if there are obvious opportunities to make money, new players will emerge that find new dimensions on which to compete. Finding tricks like DeepSeek is one dimension, but there are other dimensions that are not even possible to think about now.
A lot of the support for Nvidia's stock price is around the supply of GPUs not being able to meet demand. I think it's Taleb who said something like "A supply glut may or may not be followed by a shortage, but I've never seen a shortage that is not followed by a glut". Translation: There is a shortage of computers now, but probably that will not go on forever. Furthermore, Nvidia's computers are made by TSMC and Taiwan Semiconductor have 60%+ market share. This is a big weakness for Nvidia's bargaining power.
Trump is talking about putting tariffs on imports from Taiwan. Maybe leakage of that was part of the news affecting the market. I just don't know. The problem is under-constrained and it's foolish to draw specific conclusions from price moves. If anything, the conclusion to draw is just the reminder that there are many dimensions for discovery and progress.
Q: But what about Jevon's paradox?
Jevon's paradox, applied here, is that - as AI methods and computers get cheaper and more efficient - people actually use them more, so the overall sales of AI could increase - even if efficiency and cost per unit is falling.
However, you can't stop at Jevon's paradox. We can very well have:
a) AI gets much more efficient to run (e.g. DeepSeek tricks)
b) This results in the total sales of computers going up in dollar terms (e.g. the growth in unit sales outweighs the fall in unit price), i.e. Jevon's paradox
but:
c) Nvidia makes very little money because i) other vendors gain market share, ii) TSMC takes most of the profits, iii) some other unthought of approach to building models or computers emerges (which btw, the bigger the AI bubble, the more likely this option iii. is to happen!).
Q: Is OpenAI in trouble because of DeepSeek?
Yes, OpenAI's current (publicly announced) projects are in trouble from a profitability standpoint because they will have to cut their prices a lot on their CURRENT products, and write off capital invested in training.
However, that's what they would have expected. This is a business where they expect to be selling a radically different, and improved product each year.
OpenAI is perhaps playing this bubble optimally. OpenAI has no idea of the specifics that are going to happen and is acting accordingly by diversifying its bets.
OpenAI had planned to release GPT-5 last year, but didn't. If that was their only plan, they would have been in trouble. Instead, they released a multi-modal model (voice, text, speech) AND they ended up launching reasoning models (o1, o3) - approaches that looked like they would NOT work well. On the other hand, there is this overconfident 2023 snippet here from Sam Altman saying he thinks smaller companies with lower budget have little chance. That Sam Altman is a large investor in Helion, a nuclear fusion startup with scant public research papers, is also perhaps of some concern from a judgement/approach standpoint on large capital spending. But again, Sam Altman is maybe not always right but he is always diversified.
Anthropic planned to release Claude Opus, but it seems they are finding better performance/quality trade offs with a smaller/faster Sonnet model.
In a bubble, you are trying to raise as much money as possible and take on as many parallel big bets as possible, and OpenAI is probably winning on this metric. Although, it is far from a guaranteed strategy. There will be DeepSeeks that - while individually less diversified than OpenAI - will on aggregate be much more powerful in terms of discovery. This makes the the future of the AI market exciting and unpredictable.