A 27B model that fits on an H100 with 32K context
I review the Gemma 3 technical paper
- Easily fits on one 80 GB GPU
- 5x smaller KV cache than Gemma 2
- Similar performance to Gemini 1.5 Pro
- Roughly comparable to Qwen 2.5 VL on vision tasks
A fine-tuning and inference video is coming soon.
Cheers, Ronan
💡 Consulting (Technical Assistance OR Market Insights)
🤝 Join the Trelis Team / Trelis Developer Colabs
Google Releases Gemma 3: Technical Analysis of Latest Open-Weight Models
Google has released Gemma 3, a new series of open-weight models ranging from 1B to 27B parameters. The largest model (27B) has achieved a top-10 ranking on Chatbot Arena, a blind human-evaluation benchmark, alongside models like o1 and Gemini 2.0 Flash.
Architecture Details
- Traditional transformer architecture with a key modification to attention
- Sliding window attention (1,024-token window) in five out of every six layers (see the sketch after this list)
- Full global attention in every sixth layer
- 262,000-token vocabulary
- Context length support up to 128K tokens (32K for the 1B model)
- Vision encoder included in the 4B, 12B, and 27B variants (not in the 1B)
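A way to picture the attention pattern: most layers use a causal mask clipped to the last 1,024 positions, while every sixth layer uses the full causal mask. The sketch below is purely illustrative (it builds boolean masks rather than Gemma's actual attention kernels); the window size and layer pattern are taken from the description above.

```python
# Illustrative sketch of the local/global attention interleaving (not the
# official implementation): sliding-window causal masks in most layers,
# full causal (global) attention in every sixth layer.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True where a query position may attend to a key position (standard causal mask)
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    return k <= q

def sliding_window_mask(seq_len: int, window: int = 1024) -> torch.Tensor:
    # Causal, but each query only sees itself and the previous (window - 1) tokens
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    return (k <= q) & (k > q - window)

def layer_masks(n_layers: int, seq_len: int, window: int = 1024, global_every: int = 6):
    # Sliding-window attention everywhere except every `global_every`-th layer,
    # which gets full (global) causal attention
    return [
        causal_mask(seq_len) if (layer + 1) % global_every == 0
        else sliding_window_mask(seq_len, window)
        for layer in range(n_layers)
    ]

# Example: per-layer count of allowed (query, key) pairs for a 2,048-token prompt.
# The global layers stand out with far larger counts.
masks = layer_masks(n_layers=12, seq_len=2048)
print([int(m.sum()) for m in masks])
```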
Training Infrastructure
- Trained on TPU v5e and v5p hardware
- Models available in BF16 format
- 4-bit and 8-bit quantized versions also provided
- Memory footprint for the 27B model at 32K context (a back-of-the-envelope savings estimate follows this list):
  - BF16: 54 GB of weights, roughly 72 GB total including the KV cache
  - 8-bit weights: 27.4 GB of weights, 46.1 GB total including the KV cache
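The "5x smaller KV cache" headline follows directly from the interleaving described under Architecture Details: only the occasional global layer caches keys and values for the full context, while sliding-window layers cache at most 1,024 positions. Here is a back-of-the-envelope count; the layer count below is an assumed round number for illustration, and the ratio barely depends on it.

```python
# Illustrative sketch: count how many key/value positions each layer must cache,
# comparing the local/global interleaving with a hypothetical model that uses
# global attention in every layer.
def cached_positions(seq_len: int, n_layers: int, window: int, global_every: int) -> int:
    total = 0
    for layer in range(n_layers):
        is_global = (layer + 1) % global_every == 0
        # Sliding-window layers only keep the most recent `window` tokens
        total += seq_len if is_global else min(seq_len, window)
    return total

seq_len, n_layers = 32_768, 60  # 32K context; assumed layer count for illustration
interleaved = cached_positions(seq_len, n_layers, window=1024, global_every=6)
all_global  = cached_positions(seq_len, n_layers, window=1024, global_every=1)
print(f"KV cache reduction: {all_global / interleaved:.1f}x")  # ~5x
```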
Training Process
- Pre-training token counts:
  - 27B model: 14 trillion tokens
  - 12B model: 12 trillion tokens
  - 4B model: 4 trillion tokens
  - 1B model: 2 trillion tokens
- Three-phase training:
  - Pre-training
  - Distillation from stronger teacher models (a minimal loss sketch follows this list)
  - Instruction tuning, including:
    - Human feedback
    - Code execution feedback
    - Ground-truth rewards for math problems
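On the distillation phase: the core idea is to train the smaller model to match a stronger teacher's next-token distribution rather than only the one-hot training labels. Below is a minimal sketch of plain logit distillation in PyTorch; it illustrates the idea only and is not the exact procedure described in the Gemma 3 paper.

```python
# Minimal logit-distillation sketch (illustrative only, not Gemma's recipe):
# the student minimizes KL(teacher || student) over the vocabulary.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # Both inputs: (batch, seq_len, vocab_size)
    s = F.log_softmax(student_logits / temperature, dim=-1).flatten(0, 1)
    t = F.softmax(teacher_logits / temperature, dim=-1).flatten(0, 1)
    # "batchmean" averages the KL divergence over all (batch * seq_len) positions
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# Toy usage with random logits standing in for real model outputs
vocab_size = 262_000  # rounded vocabulary size quoted in this post
student = torch.randn(2, 8, vocab_size, requires_grad=True)
teacher = torch.randn(2, 8, vocab_size)
loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow into the student only
```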
Benchmark Performance
- MMLU Pro scores:
  - Gemma 3 (27B): 67.5%
  - Gemini 1.5 Flash: 67.3%
- Key comparisons to other models:
  - MATH (zero-shot):
    - Gemma 3: 89%
    - Qwen 2.5 VL: 83%
  - GPQA Diamond:
    - Gemma 3: 42%
    - Qwen 2.5 VL: 49%
Technical Improvements
- Regurgitation (memorization) rate reduced to below 0.001%, down from about 0.01% in Gemma 2
- Sliding window attention provides roughly 5x KV-cache memory savings
- Vision performance is relatively consistent across model sizes because the vision encoder is shared (a quick inference sketch follows this list)
- Performance scales well with context length up to 128K tokens
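Since the larger checkpoints ship with the vision encoder, image-plus-text prompting should work through the standard Hugging Face multimodal pipeline. The snippet below is a hedged sketch: the model ID, the required transformers version (Gemma 3 support landed only in recent releases), and the chat-message format are assumptions to verify against the model card, not details from the paper.

```python
# Hedged sketch of multimodal inference via Hugging Face transformers.
# Assumes: a recent transformers release with Gemma 3 support, access to the
# gated "google/gemma-3-27b-it" checkpoint, and enough GPU memory (see above).
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",              # multimodal generation pipeline
    model="google/gemma-3-27b-it",     # assumed model ID; check the model card
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},  # placeholder URL
            {"type": "text", "text": "Summarize this chart in one sentence."},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=64)
print(output[0]["generated_text"][-1]["content"])
```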