Fine-tune Vision Models for Object and Bounding Box Detection
Florence 2 is a phenomenal multi-modal model from Microsoft.
It does everything:
- Captioning
- Bounding boxes
- Segmentation via polygons
And it's only 0.5 GB (base) or 1.5 GB (large) in size.
I provide a Colab notebook to run all these types of inference, but you can run the notebook on your CPU too.
And then I show how to prepare a dataset to fine-tune a model for bounding box detection.
Last of all, I walk through how to fine-tune Florence 2 using LoRA. And I get good results detecting specific chess pieces.
Cheers, Ronan
AI Summary: Fine-Tuning Florence 2 for Object Detection and Bounding Boxes
Trelis presents a comprehensive walkthrough of object detection and bounding box generation using Microsoft's Florence 2 model. The guide covers both inference and fine-tuning, with particular emphasis on creating custom datasets for specialized applications.
Understanding Florence 2's Architecture
Florence 2 stands out as a relatively small but powerful multimodal model, available in two versions: a large variant at 1.5GB and a base model at 500MB. The architecture employs separate encoders for image and text inputs, followed by a decoder that generates outputs autoregressively. Unlike some recent models such as Pixtral or LLaVA, Florence 2 retains a distinct encoder-decoder architecture rather than using a decoder-only approach.
Key Technical Components
The model processes images by splitting them into patches and embedding each patch's pixel values, along with positional information, into vectors. These vectors are then transformed by a vision transformer that may incorporate both convolution and attention mechanisms. Coordinates are normalized to a 0-999 grid, which ensures consistent handling of different image sizes and aspect ratios.
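Since everything downstream depends on that 0-999 grid, a small helper pair makes the mapping concrete (a minimal sketch; the clamping and truncation choices here are assumptions, not necessarily Florence 2's exact quantization):

```python
def normalize_box(box, width, height):
    """Map a pixel-space box (x1, y1, x2, y2) onto the 0-999 grid,
    so boxes are comparable across image sizes and aspect ratios."""
    x1, y1, x2, y2 = box
    return (
        int(min(max(x1 / width * 1000, 0), 999)),
        int(min(max(y1 / height * 1000, 0), 999)),
        int(min(max(x2 / width * 1000, 0), 999)),
        int(min(max(y2 / height * 1000, 0), 999)),
    )

def denormalize_box(box, width, height):
    """Inverse mapping: 0-999 grid back to approximate pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)
```

For example, a box at pixel coordinates (160, 120, 480, 360) in a 640x480 image maps to (250, 250, 750, 750), and would map to the same grid values at any resolution with the same relative placement.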
Practical Implementation
The tutorial demonstrates three primary tasks:
- Basic caption generation
- Dense region caption generation (with bounding boxes)
- Referring expression segmentation (polygon outlines)
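For the detection-style tasks, the raw generated text can be decoded into labeled boxes along these lines (a minimal sketch; the exact `<loc_N>` token layout is assumed from Florence 2's detection output convention, and in practice the Hugging Face processor's `post_process_generation` method handles this parsing for you):

```python
import re

# Florence 2 emits detections as a label followed by four <loc_N> tokens,
# giving (x1, y1, x2, y2) on the 0-999 grid.
PATTERN = re.compile(r"([^<]+)((?:<loc_\d+>){4})")

def parse_detections(text):
    """Turn generated text into a list of {'label': ..., 'box': ...} dicts."""
    detections = []
    for label, locs in PATTERN.findall(text):
        coords = [int(n) for n in re.findall(r"<loc_(\d+)>", locs)]
        detections.append({"label": label.strip(), "box": tuple(coords)})
    return detections
```

For example, `parse_detections("black king<loc_100><loc_200><loc_300><loc_400>")` returns a single detection labeled "black king" with box (100, 200, 300, 400) on the normalized grid.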
Custom Dataset Creation
A significant portion of the tutorial focuses on dataset annotation using a custom tool developed for the Advanced Vision repository. The tool allows efficient labeling of images with bounding boxes, demonstrated using a chess piece dataset. The annotation process generates normalized coordinates that can be used for fine-tuning.
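One plausible way to serialize those normalized annotations into fine-tuning targets is label text followed by four location tokens per box (the exact target format is an assumption based on Florence 2's detection output convention; `to_target_string` is a hypothetical helper, not the annotation tool's actual code):

```python
def to_target_string(annotations):
    """Serialize a list of annotations into a label + <loc_N> token string
    suitable as a training target. Boxes are assumed to already be on the
    0-999 normalized grid."""
    parts = []
    for ann in annotations:
        x1, y1, x2, y2 = ann["box"]
        parts.append(f"{ann['label']}<loc_{x1}><loc_{y1}><loc_{x2}><loc_{y2}>")
    return "".join(parts)
```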
Fine-Tuning Methodology
The fine-tuning process employs Low-Rank Adapters (LoRA) rather than updating all model parameters. This approach significantly reduces the number of trainable parameters (to about 4% of the total) while maintaining effectiveness. The tutorial recommends:
- Using a rank of 32 for adapters
- Setting LoRA alpha based on model size
- Applying adapters to all linear layers
- Training with a constant learning rate followed by an annealing phase
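Rough arithmetic shows why rank-32 adapters on every linear layer leave only a few percent of weights trainable: a frozen (d_out, d_in) linear layer gains just r*(d_in + d_out) adapter parameters. The layer shapes below are illustrative, not Florence 2's actual dimensions:

```python
def lora_param_fraction(linear_shapes, rank=32):
    """Fraction of parameters trained when rank-r LoRA adapters are
    attached to every linear layer: adapter params over frozen base params."""
    base = sum(o * i for o, i in linear_shapes)          # frozen weights
    adapter = sum(rank * (o + i) for o, i in linear_shapes)  # A and B matrices
    return adapter / base

# Hypothetical transformer block: four 1024x1024 attention projections
# plus an up/down MLP pair.
shapes = [(1024, 1024)] * 4 + [(4096, 1024), (1024, 4096)]
frac = lora_param_fraction(shapes, rank=32)
```

With these shapes the fraction works out to roughly 4.7%, in the same ballpark as the ~4% figure quoted above.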
Implementation Details
The training process utilizes a custom data collator that handles both images and bounding box annotations. The tutorial demonstrates how to:
- Process input prompts and images
- Generate appropriate label formats
- Handle normalized coordinate systems
- Implement effective training loops
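The collator's core job, padding sequences to a common length and masking everything except the target tokens out of the loss, can be sketched in plain Python (an illustrative sketch, not the repository's actual implementation; real code would return tensors and also batch the pixel values):

```python
def collate(batch, pad_id=0):
    """Pad a batch of (prompt_ids, target_ids) pairs to a common length.
    Prompt and padding positions get label -100, a common convention for
    'ignore this position', so the loss is computed only on the
    bounding-box target tokens."""
    max_len = max(len(prompt) + len(target) for prompt, target in batch)
    input_ids, labels = [], []
    for prompt, target in batch:
        ids = prompt + target
        pad = [pad_id] * (max_len - len(ids))
        input_ids.append(ids + pad)
        labels.append([-100] * len(prompt) + target + [-100] * len(pad))
    return input_ids, labels
```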
Results and Performance
With just 48 training samples, the fine-tuned model shows significant improvement in chess piece detection and classification. The tutorial demonstrates how the model progresses from generic object detection to specific piece identification with accurate bounding boxes.
Technical Considerations
The guide emphasizes several important technical aspects:
- Batch size considerations for small datasets
- Learning rate selection based on model size
- Gradient checkpointing for VRAM optimization
- Proper model merging and deployment
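The constant-then-anneal learning rate schedule recommended in the fine-tuning section can be sketched as follows (the base rate and anneal fraction are illustrative assumptions, not the tutorial's exact values):

```python
def lr_at(step, total_steps, base_lr=1e-5, anneal_frac=0.3):
    """Hold the learning rate constant for the first (1 - anneal_frac)
    of training, then anneal linearly to zero."""
    anneal_start = int(total_steps * (1 - anneal_frac))
    if step < anneal_start:
        return base_lr
    remaining = total_steps - anneal_start
    return base_lr * (total_steps - step) / remaining
```

Halfway through the anneal phase the rate is half of `base_lr`, reaching zero at the final step.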
Practical Applications
This approach can be applied to various domains requiring object detection and localization, such as:
- Sports analytics (player tracking)
- Medical imaging
- Waste sorting
- Industrial inspection
I am amazed this works with such a small set. One way of reducing the tedium of manual annotation is to use a better object detection algorithm to do it for you. Grounding DINO is one of these. You can get this on Hugging Face Spaces, e.g. https://huggingface.co/spaces/merve/Grounding_DINO_demo
It is slow to use on Hugging Face but can be downloaded and used on your PC with a GPU, or on RunPod/Vast.ai/Colab.
I just tried it with "chess piece" and it did a pretty good job. It is not smart enough to know what a black king is though, so you would need to manually change the annotation class for Ronan's sample chess piece demo. This is an example of a zero-shot detector, so it can work with objects outside of standard class sets like COCO. With IDEs like Cursor or (my fave) Cline you can also generate Python code to quickly cycle through a dataset and change/add/delete bounding boxes and class names without too much fuss. Grounding DINO is by no means perfect, so it will make errors, but it can dramatically reduce the amount of manual annotation needed.