I recommend the best small and large VLMs to choose, both for raw usage and for fine-tuning.
I demo Moondream inference and SmolVLM fine-tuning.
I then record myself one-shotting a fine-tune of Qwen2.5-VL with my brain set in agentic RL mode to recover from errors.
Includes fine-tuning and performance benchmarks on a chess piece recognition task.
In short:
- Moondream (by @vikhyatk): new gaze detection, structured outputs, and improved OCR
- SmolVLM (by Hugging Face): 250M/500M parameter variants using pixel shuffle for efficiency
- Qwen2.5-VL: 3B/7B/72B models that outperform private APIs on benchmarks
With thanks to Andi Marafioti for tips on WebGPU with SmolVLM.
For those of you new to Trelis, I also recommend looking at Florence 2, which is still, in my opinion, the highest quality of the smallest models.
For data preparation (and for understanding how these vision language models work), I recommend the original LLaVA video, which I think is now the most viewed video on the Trelis YouTube channel.
Cheers, Ronan
🛠 Explore Fine-tuning, Vision, Audio, and Inference Tools
💡 Consulting (Technical Assistance OR Market Insights)
Vision Language Models in 2025
A detailed analysis of current vision language models, focusing on Moondream, SmolVLM, and Qwen2.5-VL, with recommendations for different use cases.
Model Selection Guide
For on-device deployment:
- Florence 2 (encoder-decoder): Highest quality responses
- SmolVLM: Available in 250M and 500M parameter versions
- Moondream: Includes gaze recognition capability
For maximum quality:
- Qwen2.5-VL series: 3B, 7B, and 72B parameter models
- Outperforms many private models on benchmarks
Moondream Updates
Key features in January 2025 release:
- Structured output support (XML, JSON, YAML)
- Gaze detection for tracking where subjects are looking
- Improved OCR and text recognition in images
Technical capabilities:
- Object detection with bounding boxes
- Point-to-object functionality
- Face detection with gaze tracking
- Performance comparable to Qwen 2 2B
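As a minimal sketch of these capabilities, here is inference following the usage shown on the vikhyatk/moondream2 model card (2025-01-09 revision). The image path and prompts are placeholders, method names may shift between revisions, and gaze detection and captioning have their own methods not shown here:

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code is required because Moondream ships its own modelling code
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",   # January 2025 release
    trust_remote_code=True,
)

image = Image.open("board.jpg")  # placeholder image path

# Visual question answering; the structured-output request is made in the prompt
print(model.query(image, "List the chess pieces you see as a JSON array.")["answer"])

# Object detection returns bounding boxes for the named object
print(model.detect(image, "chess piece")["objects"])

# Point-to-object returns (x, y) coordinates instead of boxes
print(model.point(image, "white king")["points"])
```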
SmolVLM Improvements
Technical updates:
- New 250M and 500M parameter versions
- Smaller vision encoder without performance degradation
- Pixel shuffle technique for improved efficiency (see the sketch after this list)
- Increased share of OCR training data (41%, up from 25%)
- Optimized for document comprehension
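To make the pixel shuffle idea concrete, here is an illustrative PyTorch snippet (not SmolVLM's actual implementation): a space-to-depth rearrangement folds each r x r block of patch embeddings into one wider token, so the language model sees r² times fewer visual tokens.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Space-to-depth on a grid of vision tokens.

    x: (batch, seq, dim), where seq = h * w is a square grid of patch embeddings.
    Returns (batch, seq // r**2, dim * r**2): each r x r block of neighbouring
    tokens is merged into one wider token, so for r=2 the LLM sees 4x fewer tokens.
    """
    b, seq, d = x.shape
    h = w = int(seq ** 0.5)
    x = x.view(b, h, w, d)
    x = x.view(b, h // r, r, w // r, r, d)        # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # gather each block together
    return x.view(b, (h // r) * (w // r), d * r * r)

tokens = torch.randn(1, 1024, 768)                # a 32x32 patch grid
print(pixel_shuffle_tokens(tokens, r=2).shape)    # torch.Size([1, 256, 3072])
```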
Qwen2.5-VL Technical Details
Architecture improvements:
- Dynamic visual token allocation based on image resolution (see the snippet after this list)
- Precise object grounding with sub-labels
- Enhanced text recognition
- Document parsing with HTML conversion
- Video understanding capabilities
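A short sketch of how that token budget is exposed in practice, following the Qwen2.5-VL model card. The repo id, pixel bounds, and image path are illustrative, placeholder handling comes from the model's own chat template, and a recent transformers version is assumed:

```python
from PIL import Image
from transformers import AutoProcessor

# min_pixels / max_pixels bound how many visual tokens the processor produces:
# each image is resized so its patch count stays inside this budget
# (28 x 28 pixels corresponds to one visual patch before the 2x2 token merge).
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=256 * 28 * 28,
    max_pixels=1024 * 28 * 28,
)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What chess piece is shown?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("piece.png")], return_tensors="pt")

# image_grid_thw reports the (temporal, height, width) patch grid actually used,
# i.e. how many visual tokens this particular image was allocated
print(inputs["image_grid_thw"])
```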
Performance metrics:
- Outperforms Claude 3.5 Sonnet and GPT-4o on visual question answering
- Strong results on video understanding
- Leading performance on visual agent tasks
Fine-tuning Results
Test case using chess piece recognition:
- SmolVLM: 2/7 pieces correctly identified
- Florence 2: 5/7 pieces correctly identified
- Qwen2.5-VL (3B): 5/7 pieces correctly identified
Memory optimization techniques:
- Image resolution reduction
- LoRA fine-tuning
- Selective module training (exclude lm_head and embed_tokens so the largest matrices stay frozen)
- Gradient checkpointing with the re-entrant variant (see the sketch below)
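A minimal sketch of these memory optimizations using the PEFT library, assuming a recent transformers version. The rank, alpha, and target module names follow common Qwen2.5 settings and should be checked against the loaded model:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
)

# LoRA on the attention/MLP projections only; lm_head and embed_tokens are
# deliberately left out of target_modules and modules_to_save, which keeps
# the largest matrices frozen and out of the optimizer state.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Gradient checkpointing; use_reentrant=True matches the "re-entrant" setting
# mentioned above (some training stacks need the non-reentrant variant instead).
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": True}
)
```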