I describe:
+ Overview of the Qwen3 family of models
+ Running inference with vLLM and SGLang
+ Using Qwen3 with MCP agents
Cheers, Ronan
P.S. The scripts are in the ADVANCED-inference repo:
Video Links:
llmperf: https://github.com/ray-project/llmperf
Qwen3 blog: https://qwenlm.github.io/blog/qwen3/
Trelis Links:
🤝 Are you a talented developer? Work for Trelis
💡 Need Technical or Market Assistance? Book a Consult Here
💸 Starting a New Project/Venture? Apply for a Trelis Grant
Alibaba Releases Qwen 3 Models: Technical Analysis and Benchmarks
Alibaba has released a new series of open-source text models called Qwen 3, with performance comparable to DeepSeek V3 and DeepSeek R1. The release includes both large and small models optimized for different deployment scenarios.
Model Architecture and Sizes
Largest model: 235B parameters (22B activated) using mixture-of-experts
Mid-size model: 32B parameters (dense), plus a sparse mixture-of-experts sibling (30B-A3B) with 30B total and 3B activated parameters
Smaller models: Range from 14B down to 0.6B parameters
FP8 versions available for faster inference on modern GPUs like H100s
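To see why FP8 matters for fitting these sizes on GPUs, here is a back-of-envelope weight-memory calculation (weights only; KV cache and activations come on top):

```python
# Rough weight memory for the Qwen3 sizes above.
# 1B parameters at 1 byte/param ~= 1 GB.
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

for name, params in [("Qwen3-32B", 32), ("Qwen3-235B-A22B", 235)]:
    print(f"{name}: {weight_gb(params, 2):.0f} GB in BF16, "
          f"{weight_gb(params, 1):.0f} GB in FP8")
# → Qwen3-32B: 64 GB in BF16, 32 GB in FP8
# → Qwen3-235B-A22B: 470 GB in BF16, 235 GB in FP8
```

This is why the 235B model needs 8x H100s (640 GB) in BF16 but can fit on 4 (320 GB) in FP8, with headroom left for KV cache.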
Benchmark Performance
The 235B parameter model achieves:
Comparable or slightly better scores than DeepSeek R1 on the Arena Hard, AIME, and Aider benchmarks
Lower performance than Gemini 2.5 Pro
Roughly 2-3x faster inference for the sparse 30B-A3B model (3B activated) than for the dense 32B model
Technical Implementation
Uses the Hermes-style tool-calling format
Dynamic thinking mode that can be toggled per turn with the /think and /no_think soft switches
Training process:
Pre-training of base models
Chain-of-thought training
Reinforcement learning phase
Combined thinking/non-thinking data training
General reinforcement learning
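In the Hermes-style format mentioned above, the model emits tool calls as JSON wrapped in tags, which a server like vLLM can parse for you (via `--tool-call-parser hermes`). A minimal sketch of parsing such output yourself, assuming the `<tool_call>` tag convention and an illustrative `get_weather` tool:

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract Hermes-style tool calls: JSON objects wrapped in <tool_call> tags."""
    calls = []
    for match in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL):
        calls.append(json.loads(match))
    return calls

# Example model output (get_weather is a hypothetical tool):
output = (
    "Let me check that for you.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Dublin"}}\n'
    "</tool_call>"
)
print(parse_tool_calls(output))
# → [{'name': 'get_weather', 'arguments': {'city': 'Dublin'}}]
```

In practice you would let the inference server do this parsing and return OpenAI-style `tool_calls` objects, but the sketch shows what the raw format looks like.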
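The thinking toggle can be driven two ways: a soft switch appended to the user turn, or a chat-template argument. A small sketch (the `enable_thinking` kwarg follows the Qwen3 model card; verify against the version you deploy):

```python
def add_soft_switch(user_message: str, think: bool) -> str:
    """Append the /think or /no_think soft switch to a user turn."""
    return f"{user_message} {'/think' if think else '/no_think'}"

messages = [{"role": "user", "content": add_soft_switch("What is 2+2?", think=False)}]
print(messages[0]["content"])  # → What is 2+2? /no_think

# With Hugging Face transformers, the same toggle is a chat-template kwarg
# (per the Qwen3 model card):
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True,
#     enable_thinking=False,  # suppresses the <think>...</think> block
# )
```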
Deployment Requirements
235B model: Runs on 4-8 H100 GPUs
32B dense model: Single GPU deployment
30B-A3B sparse model: Single GPU, with significantly faster inference than the dense 32B model at similar quality
FP8 precision support for optimized performance
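As a rough sketch, the deployments above map to launch commands like these (model repo names follow the Qwen3 release naming; check them on Hugging Face, and match `--tensor-parallel-size` to your GPU count):

```shell
# 235B MoE model sharded across 8 GPUs with vLLM:
vllm serve Qwen/Qwen3-235B-A22B-FP8 --tensor-parallel-size 8

# 30B-A3B sparse model on a single GPU with SGLang:
python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --port 30000
```

Both servers expose an OpenAI-compatible endpoint, so client code is the same regardless of which one you pick.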