I describe:
+ Overview of the Qwen3 family of models
+ Running inference with vLLM and SGLang
+ Using Qwen3 with MCP agents
Cheers, Ronan
P.S. The scripts are in the ADVANCED-inference repo:
Video Links:
llmperf: https://github.com/ray-project/llmperf
Qwen3 blog: https://qwenlm.github.io/blog/qwen3/
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Alibaba Releases Qwen 3 Models: Technical Analysis and Benchmarks
Alibaba has released a new series of open-source text models called Qwen 3, with performance comparable to DeepSeek V3 and DeepSeek R1. The release includes both large and small models optimized for different deployment scenarios.
Model Architecture and Sizes
Largest model: 235B parameters (22B activated) using mixture-of-experts
Mid-size model: 32B parameters (dense), plus a sparse mixture-of-experts sibling (30B-A3B) with 30B total and 3B activated parameters
Smaller models: Range from 14B down to 0.6B parameters
FP8 versions available for faster inference on modern GPUs like H100s
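To see why FP8 matters for fitting these sizes on GPUs, here is a back-of-envelope weight-memory calculation (weights only; KV cache and activations come on top):

```python
# Rough weight memory for the Qwen3 sizes above.
# 1B parameters at 1 byte/param ~= 1 GB.
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

for name, params in [("Qwen3-32B", 32), ("Qwen3-235B-A22B", 235)]:
    print(f"{name}: {weight_gb(params, 2):.0f} GB in BF16, "
          f"{weight_gb(params, 1):.0f} GB in FP8")
# → Qwen3-32B: 64 GB in BF16, 32 GB in FP8
# → Qwen3-235B-A22B: 470 GB in BF16, 235 GB in FP8
```

This is why the 235B model needs 8x H100s (640 GB) in BF16 but can fit on 4 (320 GB) in FP8, with headroom left for KV cache.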
Benchmark Performance
The 235B parameter model achieves:
Comparable or slightly better scores than DeepSeek R1 on the Arena Hard, AIME, and Aider benchmarks
Lower performance than Gemini 2.5 Pro
Roughly 2-3x faster inference for the sparse 30B-A3B model (3B activated) than for the dense 32B model
Technical Implementation
Uses the Hermes-style tool-calling format
Dynamic thinking mode that can be toggled per turn with the /think and /no_think soft switches
Training process:
Pre-training of base models
Chain-of-thought training
Reinforcement learning phase
Combined thinking/non-thinking data training
General reinforcement learning
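In the Hermes-style format mentioned above, the model emits tool calls as JSON wrapped in tags, which a server like vLLM can parse for you (via `--tool-call-parser hermes`). A minimal sketch of parsing such output yourself, assuming the `<tool_call>` tag convention and an illustrative `get_weather` tool:

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract Hermes-style tool calls: JSON objects wrapped in <tool_call> tags."""
    calls = []
    for match in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        calls.append(json.loads(match))
    return calls

# Example model output (get_weather is a hypothetical tool):
output = (
    "Let me check that for you.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Dublin"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(output))
# → [{'name': 'get_weather', 'arguments': {'city': 'Dublin'}}]
```

In practice you would let the inference server do this parsing and return OpenAI-style `tool_calls` objects, but the sketch shows what the raw format looks like.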
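The thinking toggle can be driven two ways: a soft switch appended to the user turn, or a chat-template argument. A small sketch (the `enable_thinking` kwarg follows the Qwen3 model card; verify against the version you deploy):

```python
def add_soft_switch(user_message: str, think: bool) -> str:
    """Append the /think or /no_think soft switch to a user turn."""
    return f"{user_message} {'/think' if think else '/no_think'}"

messages = [{"role": "user", "content": add_soft_switch("What is 2+2?", think=False)}]
print(messages[0]["content"])  # → What is 2+2? /no_think

# With Hugging Face transformers, the same toggle is a chat-template kwarg
# (per the Qwen3 model card):
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True,
#     enable_thinking=False,  # suppresses the <think>...</think> block
# )
```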
Deployment Requirements
235B model: Runs on 4-8 H100 GPUs
32B dense model: Single GPU deployment
30B-A3B sparse model: Single GPU, with significantly faster inference than the dense 32B model at similar quality
FP8 precision support for optimized performance
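As a rough sketch, the deployments above map to launch commands like these (model repo names follow the Qwen3 release naming; check them on Hugging Face, and match `--tensor-parallel-size` to your GPU count):

```shell
# 235B MoE model sharded across 8 GPUs with vLLM:
vllm serve Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 8

# 30B-A3B sparse model on a single GPU with SGLang:
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --port 30000
```

Both servers expose an OpenAI-compatible endpoint, so client code is the same regardless of which one you pick.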