I recommend the best small and large VLMs to choose, both for raw usage and for fine-tuning.
I demo Moondream inference and SmolVLM fine-tuning.
I then record myself one-shotting a fine-tune of Qwen2.5-VL with my brain set in agentic RL mode to recover from errors.
Includes fine-tuning and performance benchmarks on a chess piece recognition task.
In short:
- Moondream (by @vikhyatk): new gaze detection, structured outputs, and improved OCR
- SmolVLM (by Hugging Face): 250M/500M parameter variants using pixel shuffle for efficiency
- Qwen2.5-VL: 3B/7B/72B models that outperform private APIs on benchmarks
With thanks to Andi Marafioti for tips on WebGPU with SmolVLM.
For those of you new to Trelis, I also recommend looking at Florence 2, which is still, in my opinion, the highest quality of the smallest models.
For data preparation (and for understanding how these vision language models work), I recommend the original LLaVA video, which I think is now the most viewed video on the Trelis YouTube channel.
Cheers, Ronan
🛠 Explore Fine-tuning, Vision, Audio, and Inference Tools
💡 Consulting (Technical Assistance OR Market Insights)
Vision Language Models in 2025
A detailed analysis of current vision language models, focusing on Moondream, SmolVLM, and Qwen2.5-VL, with recommendations for different use cases.
Model Selection Guide
For on-device deployment:
- Florence 2 (encoder-decoder): Highest quality responses
- SmolVLM: Available in 250M and 500M parameter versions
- Moondream: Includes gaze recognition capability
For maximum quality:
- Qwen2.5-VL series: 3B, 7B, and 72B parameter models
- Outperforms many private models on benchmarks
Moondream Updates
Key features in January 2025 release:
- Structured output support (XML, JSON, YAML)
- Gaze detection for tracking where subjects are looking
- Improved OCR and text recognition in images
Technical capabilities:
- Object detection with bounding boxes
- Point-to-object functionality
- Face detection with gaze tracking
- Performance comparable to Qwen 2 2B
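As a minimal sketch of these capabilities, here is inference following the usage shown on the vikhyatk/moondream2 model card (2025-01-09 revision). The image path and prompts are placeholders, method names may shift between revisions, and gaze detection and captioning have their own methods not shown here:

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code is required because Moondream ships its own modelling code
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",   # January 2025 release
    trust_remote_code=True,
)

image = Image.open("board.jpg")  # placeholder image path

# Visual question answering; the structured-output request is made in the prompt
print(model.query(image, "List the chess pieces you see as a JSON array.")["answer"])

# Object detection returns bounding boxes for the named object
print(model.detect(image, "chess piece")["objects"])

# Point-to-object returns (x, y) coordinates instead of boxes
print(model.point(image, "white king")["points"])
```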
SmolVLM Improvements
Technical updates:
- New 250M and 500M parameter versions
- Smaller vision encoder without performance degradation
- Pixel shuffle technique for improved efficiency (see the sketch after this list)
- Increased share of OCR training data (41%, up from 25%)
- Optimized for document comprehension
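To make the pixel shuffle idea concrete, here is an illustrative PyTorch snippet (not SmolVLM's actual implementation): a space-to-depth rearrangement folds each r x r block of patch embeddings into one wider token, so the language model sees r² times fewer visual tokens.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Space-to-depth on a grid of vision tokens.

    x: (batch, seq, dim), where seq = h * w is a square grid of patch embeddings.
    Returns (batch, seq // r**2, dim * r**2): each r x r block of neighbouring
    tokens is merged into one wider token, so for r=2 the LLM sees 4x fewer tokens.
    """
    b, seq, d = x.shape
    h = w = int(seq ** 0.5)
    x = x.view(b, h, w, d)
    x = x.view(b, h // r, r, w // r, r, d)        # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each block together
    return x.view(b, (h // r) * (w // r), d * r * r)

tokens = torch.randn(1, 1024, 768)                # a 32x32 patch grid
print(pixel_shuffle_tokens(tokens, r=2).shape)    # torch.Size([1, 256, 3072])
```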
Qwen2.5-VL Technical Details
Architecture improvements:
- Dynamic visual token allocation based on image resolution (see the snippet after this list)
- Precise object grounding with sub-labels
- Enhanced text recognition
- Document parsing with HTML conversion
- Video understanding capabilities
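A short sketch of how that token budget is exposed in practice, following the Qwen2.5-VL model card. The repo id, pixel bounds, and image path are illustrative, placeholder handling comes from the model's own chat template, and a recent transformers version is assumed:

```python
from PIL import Image
from transformers import AutoProcessor

# min_pixels / max_pixels bound how many visual tokens the processor produces:
# each image is resized so its patch count stays inside this budget
# (28 x 28 pixels corresponds to one visual patch before the 2x2 token merge).
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1024 * 28 * 28,
)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What chess piece is shown?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("piece.png")], return_tensors="pt")

# image_grid_thw reports the (temporal, height, width) patch grid actually used,
# i.e. how many visual tokens this particular image was allocated
print(inputs["image_grid_thw"])
```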
Performance metrics:
- Outperforms Claude 3.5 Sonnet and GPT-4o on visual question answering
- Strong results on video understanding
- Leading performance on visual agent tasks
Fine-tuning Results
Test case using chess piece recognition:
- SmolVLM: 2/7 pieces correctly identified
- Florence 2: 5/7 pieces correctly identified
- Qwen2.5-VL (3B): 5/7 pieces correctly identified
Memory optimization techniques:
- Image resolution reduction
- LoRA fine-tuning
- Selective module training (exclude lm_head and embed_tokens so the largest matrices stay frozen)
- Gradient checkpointing with the re-entrant variant (see the sketch below)
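A minimal sketch of these memory optimizations using the PEFT library, assuming a recent transformers version. The rank, alpha, and target module names follow common Qwen2.5 settings and should be checked against the loaded model:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
)

# LoRA on the attention/MLP projections only; lm_head and embed_tokens are
# deliberately left out of target_modules and modules_to_save, which keeps
# the largest matrices frozen and out of the optimizer state.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Gradient checkpointing; use_reentrant=True matches the "re-entrant" setting
# mentioned above (some training stacks need the non-reentrant variant instead).
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": True}
)
```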