All about Reasoning
I cover o1/o3 from OpenAI, Flash Thinking from Google, QwQ from Qwen, and R1-Lite from DeepSeek.
Then, I give some updates on the new Minimax model and compare it to DeepSeek v3!
Repo Updates
For those with access to ADVANCED-evals, I’ve upgraded the repo to run on LiteLLM. This means you can now plug in any OpenAI-compatible endpoint (e.g. from llama.cpp, LM Studio, Fireworks, etc.).
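In practice, pointing LiteLLM at a local or hosted OpenAI-compatible server looks roughly like this (the model name, URL and API key below are placeholders, not the repo’s actual configuration):

```python
# Minimal LiteLLM sketch: the model name, base URL and key are placeholders.
from litellm import completion

response = completion(
    model="openai/my-local-model",        # "openai/" prefix routes to any OpenAI-compatible endpoint
    api_base="http://localhost:8080/v1",  # e.g. a llama.cpp or LM Studio server
    api_key="not-needed-for-local",       # many local servers accept any key
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```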
NEW: Annual Subscriptions for Repos
You can now get access to Trelis’ ADVANCED repos via an annual subscription option:
You’ll also see an annual subscription option for the Multi-repo Bundle (which includes all five repos plus DISCORD access)!
That’s it for this week. A written summary of the video follows below!
Cheers, Ronan
The Evolution of AI Reasoning Models: From ChatGPT to Chinese Labs
Recent developments in AI reasoning capabilities mark a significant advancement beyond standard language models. While early models like ChatGPT could engage in basic conversations, newer reasoning-focused models can (start to) identify inconsistencies in documents and construct higher-quality analytical outputs.
Performance Metrics
The ARC AGI benchmark demonstrates the quantitative improvement in reasoning capabilities:
- Earlier models (GPT-4o, Claude 3.5) scored well below 20%
- OpenAI's o1 series achieved ~30%
- Latest o3 models score above 75%
Technical Architecture Changes
The key technical innovations enabling better reasoning include:
- Training on reasoning traces (documented problem-solving steps)
- Verification capabilities to check solution validity
- Parallel exploration of multiple solution paths (see the sketch below)
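To make the verification and parallel-exploration ideas concrete, here is a minimal best-of-n sketch. The sampling and verification functions are stubs made up for illustration; they are not how any of these labs actually implement reasoning.

```python
# Toy best-of-n sketch: sample several candidate solution paths, score each
# with a verifier, keep the best. All functions here are illustrative stubs.
import random

def sample_solution(problem: str) -> str:
    # Stand-in for a model generating one reasoning trace plus a final answer.
    return f"candidate-answer-{random.randint(0, 9)}"

def verify(problem: str, candidate: str) -> float:
    # Stand-in for a verifier (e.g. checking arithmetic, running unit tests,
    # or a learned reward model) returning a score for the candidate.
    return random.random()

def best_of_n(problem: str, n: int = 8) -> str:
    candidates = [sample_solution(problem) for _ in range(n)]  # parallel exploration
    scores = [verify(problem, c) for c in candidates]          # verification
    return candidates[scores.index(max(scores))]

print(best_of_n("What is 17 * 24?"))
```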
While reasoning capabilities have improved, key constraints remain:
- High compute costs (on the order of $2,000 per complex reasoning task)
- Limited tool-calling abilities for agent applications (support is expected soon)
The rapid progress demonstrates how technical innovations spread quickly across labs once core capabilities are proven possible, even with export controls and compute limitations in place.
Chinese Lab Developments
Several Chinese labs have now achieved state-of-the-art performance despite compute limitations:
DeepSeek:
- Achieved GPT-4o-level performance at a ~$5M training cost
- Implemented an efficient mixture-of-experts (MoE) architecture (see the sketch after this list)
- Released on Christmas Eve 2024
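For readers unfamiliar with mixture-of-experts, the core idea is that each token is routed to only a few “expert” sub-networks rather than through the whole model. The following is a generic top-k routing sketch, not DeepSeek’s actual implementation or hyperparameters:

```python
# Generic top-k mixture-of-experts routing sketch (illustrative only,
# not DeepSeek's implementation).
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Route a token vector x to its top-k experts and mix their outputs."""
    logits = x @ gate_w                      # one routing logit per expert
    topk = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                 # softmax over the chosen experts only
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(d, d)): x @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
print(moe_forward(rng.normal(size=d), experts, gate_w))
```

Only k experts run per token, which is how MoE models keep inference cost low while scaling total parameter count.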
Now, Minimax:
- Large MoE like DeepSeek v3 (likely with reasonable expert load balancing achieved during training)
- Matches top model performance metrics
- Handles 1M token context length
- Uses a hybrid attention design: full softmax attention only every 8th layer, with cheaper lightning (linear) attention in the layers between (sketched below)
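As a rough sketch of that layer pattern (the layer count and naming are illustrative, not MiniMax’s actual code):

```python
# Illustrative layer pattern: full attention every 8th layer, cheaper attention elsewhere.
def build_layers(num_layers: int = 80, full_attn_every: int = 8):
    layers = []
    for i in range(num_layers):
        if (i + 1) % full_attn_every == 0:
            layers.append("full softmax attention")        # every 8th layer
        else:
            layers.append("lightning (linear) attention")  # the other 7 layers
    return layers

print(build_layers(16))
```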