All about Reasoning
I cover o1/o3 from OpenAI, Flash Thinking from Google, QwQ from Qwen, and R1-Lite from DeepSeek.
Then, I give some updates on the new Minimax model and compare it to DeepSeek v3!
Repo Updates
For those with access to ADVANCED-evals, I’ve upgraded the repo to run on LiteLLM. This means you can now plug in any OpenAI-compatible endpoint (e.g. from llama.cpp, LM Studio, Fireworks, etc.).
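In practice, pointing LiteLLM at a local or hosted OpenAI-compatible server looks roughly like this (the model name, URL and API key below are placeholders, not the repo’s actual configuration):

```python
# Minimal LiteLLM sketch: the model name, base URL and key are placeholders.
from litellm import completion

response = completion(
    model="openai/my-local-model",        # "openai/" prefix routes to any OpenAI-compatible endpoint
    api_base="http://localhost:8080/v1",  # e.g. a llama.cpp or LM Studio server
    api_key="not-needed-for-local",       # many local servers accept any key
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```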
NEW: Annual Subscriptions for Repos
You can now get access to Trelis’ ADVANCED repos via an annual subscription option:
You’ll also see an annual subscription option for the Multi-repo Bundle (which includes all five repos plus DISCORD access)!
That’s it for this week. A written summary of the video follows below!
Cheers, Ronan
The Evolution of AI Reasoning Models: From ChatGPT to Chinese Labs
Recent developments in AI reasoning capabilities mark a significant advancement beyond standard language models. While early models like ChatGPT could engage in basic conversations, newer reasoning-focused models can (start to) identify inconsistencies in documents and construct higher-quality analytical outputs.
Performance Metrics
The ARC AGI benchmark demonstrates the quantitative improvement in reasoning capabilities:
- Earlier models (GPT-4o, Claude 3.5) scored well below 20%
- OpenAI's o1 series achieved ~30%
- Latest o3 models score above 75%
Technical Architecture Changes
The key technical innovations enabling better reasoning include:
- Training on reasoning traces (documented problem-solving steps)
- Verification capabilities to check solution validity
- Parallel exploration of multiple solution paths (see the sketch below)
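To make the verification and parallel-exploration ideas concrete, here is a minimal best-of-n sketch. The sampling and verification functions are stubs made up for illustration; they are not how any of these labs actually implement reasoning.

```python
# Toy best-of-n sketch: sample several candidate solution paths, score each
# with a verifier, keep the best. All functions here are illustrative stubs.
import random

def sample_solution(problem: str) -> str:
    # Stand-in for a model generating one reasoning trace plus a final answer.
    return f"candidate-answer-{random.randint(0, 9)}"

def verify(problem: str, candidate: str) -> float:
    # Stand-in for a verifier (e.g. checking arithmetic, running unit tests,
    # or a learned reward model) returning a score for the candidate.
    return random.random()

def best_of_n(problem: str, n: int = 8) -> str:
    candidates = [sample_solution(problem) for _ in range(n)]  # parallel exploration
    scores = [verify(problem, c) for c in candidates]          # verification
    return candidates[scores.index(max(scores))]

print(best_of_n("What is 17 * 24?"))
```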
While reasoning capabilities have improved, key constraints remain:
- High compute costs (on the order of $2,000 per complex reasoning task)
- Limited tool-calling abilities for agent applications (support is expected soon)
The rapid progress demonstrates how technical innovations spread quickly across labs once core capabilities are proven possible, even with export controls and compute limitations in place.
Chinese Lab Developments
Several Chinese labs have now achieved state-of-the-art performance despite compute limitations:
DeepSeek:
- Achieved GPT-4o-level performance at a ~$5M training cost
- Implemented an efficient mixture-of-experts (MoE) architecture (see the sketch after this list)
- Released on Christmas Eve 2024
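For readers unfamiliar with mixture-of-experts, the core idea is that each token is routed to only a few “expert” sub-networks rather than through the whole model. The following is a generic top-k routing sketch, not DeepSeek’s actual implementation or hyperparameters:

```python
# Generic top-k mixture-of-experts routing sketch (illustrative only,
# not DeepSeek's implementation).
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Route a token vector x to its top-k experts and mix their outputs."""
    logits = x @ gate_w                      # one routing logit per expert
    topk = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                 # softmax over the chosen experts only
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
print(moe_forward(rng.normal(size=d), experts, gate_w))
```

Only k experts run per token, which is how MoE models keep inference cost low while scaling total parameter count.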
Now, Minimax:
- Large MoE like DeepSeek v3 (likely with reasonable expert load balancing achieved during training)
- Matches top model performance metrics
- Handles 1M token context length
- Uses a hybrid attention design: full softmax attention only every 8th layer, with cheaper lightning (linear) attention in the layers between (sketched below)
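As a rough sketch of that layer pattern (the layer count and naming are illustrative, not MiniMax’s actual code):

```python
# Illustrative layer pattern: full attention every 8th layer, cheaper attention elsewhere.
def build_layers(num_layers: int = 80, full_attn_every: int = 8):
    layers = []
    for i in range(num_layers):
        if (i + 1) % full_attn_every == 0:
            layers.append("full softmax attention")        # every 8th layer
        else:
            layers.append("lightning (linear) attention")  # the other 7 layers
    return layers

print(build_layers(16))
```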