Faster Embeddings that Generalise Well
I cover the ModernBERT family of encoder models that underpins a new suite of fine-tunes, including an embedding model from Nomic and a Contextual Document Embeddings (CDE) approach that generalises very well.
BTW, I should have said this in the video, but ModernBERT is also trained on code, so it should be better for that application than previous embedding models!
Cheers, Ronan
ModernBERT and Contextual Document Embeddings: Latest Advances in RAG Systems
Recent developments in embedding models have introduced significant improvements for retrieval augmented generation (RAG) systems. This article examines three key advances: ModernBERT base models, Nomic's fine-tuned embeddings, and Contextual Document Embeddings (CDE).
ModernBERT Architecture
ModernBERT, released by AnswerAI, achieves better performance and faster inference compared to previous BERT variants through two key optimizations:
1. Alternating Attention: Full (global) attention is applied only in select transformer layers, with cheaper local sliding-window attention used elsewhere, reducing computational overhead
2. Unpadding: Multiple sequences are packed into a single longer sequence instead of being padded to a fixed length, improving compute efficiency
Performance testing shows the ModernBERT models clustering in the upper-left region when plotting GLUE score (quality) against runtime, i.e. a better speed-quality tradeoff than earlier encoder models.
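For orientation, here is a minimal sketch of loading the base checkpoint with Hugging Face transformers. The `answerdotai/ModernBERT-base` model id is the published one on the Hub; a recent transformers release is assumed, since ModernBERT support is new.

```python
# pip install -U transformers torch  (ModernBERT needs a recent transformers release)
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Quick fill-mask smoke test; for retrieval you'd use an embedding fine-tune
# such as the Nomic models covered below, not the raw masked-LM head.
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode(predicted_id))  # expected: " Paris"
```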
Nomic's ModernBERT Implementations
Nomic has fine-tuned ModernBERT specifically for embedding tasks, producing models such as modernbert-embed-base. These models support Matryoshka embeddings, a technique in which embedding dimensions are ordered by importance so that vectors can be truncated to save storage space while preserving most of the semantic signal.
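As a rough sketch of how Matryoshka truncation works in practice, the snippet below encodes with Nomic's model and then truncates to 256 dimensions, re-normalising afterwards. The `nomic-ai/modernbert-embed-base` model id and the `search_document:` / `search_query:` prefixes follow Nomic's conventions for its embed models; confirm both against the model card.

```python
# pip install -U sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Model id and the search_document:/search_query: prefixes follow Nomic's
# conventions for its embed models; confirm both against the model card.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

docs = ["search_document: A touch is any contact between the player in possession and a defender."]
query = "search_query: What counts as a touch?"

doc_vec = model.encode(docs, normalize_embeddings=True)[0]   # full 768-dim vector
query_vec = model.encode(query, normalize_embeddings=True)

def truncate(v: np.ndarray, dims: int = 256) -> np.ndarray:
    """Matryoshka truncation: keep the leading dims, then re-normalise."""
    v = v[:dims]
    return v / np.linalg.norm(v)

# Cosine similarity at full vs truncated dimensionality; the scores should stay
# close, while storage per vector drops roughly 3x (768 -> 256 floats).
print(float(doc_vec @ query_vec), float(truncate(doc_vec) @ truncate(query_vec)))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument on the `SentenceTransformer` constructor that does the same thing at encode time; if in doubt, the manual slice-and-renormalise above is the safe fallback.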
Testing on a Touch Rugby rules dataset showed:
- Nomic's modernbert-embed-base, off the shelf: 72% accuracy
- The same model with Matryoshka truncation to 256 dimensions: 69% accuracy
- The same model fine-tuned on the dataset: 81% accuracy
- CDE model, off the shelf (no fine-tuning): 81% accuracy
- CDE model combined with BM25: 90% accuracy
Contextual Document Embeddings
CDE takes a novel two-stage approach:
1. First stage embeds a reference set of background documents (512 by default) drawn from the target corpus
2. Second stage embeds queries and documents conditioned on those reference embeddings
This lets the model generalize to new domains without fine-tuning, since domain context is injected through the reference documents. In testing, CDE achieved 81% accuracy out of the box on the Touch Rugby dataset.
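Below is a sketch of what the two-stage flow looks like in code. The `jxm/cde-small-v1` model id, the prompt names, and the `dataset_embeddings` keyword reflect my reading of the CDE model card rather than anything stated above, so verify them against the card before relying on this.

```python
# pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

# Model id, prompt names, and the dataset_embeddings keyword are assumptions
# taken from the cde-small-v1 model card; verify before relying on them.
model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

# Stage 1: embed a reference ("contextual") set of background documents.
# In practice this would be ~512 chunks sampled from your own corpus.
reference_docs = [
    "A rollball is used to restart play after a touch is made.",
    "Each team has six touches before possession changes over.",
]
dataset_embeddings = model.encode(
    reference_docs, prompt_name="document", convert_to_tensor=True
)

# Stage 2: embed the actual corpus chunks and queries, conditioned on the
# reference embeddings from stage 1.
corpus_chunks = reference_docs  # whatever you actually want to index
doc_embeddings = model.encode(
    corpus_chunks, prompt_name="document",
    dataset_embeddings=dataset_embeddings, convert_to_tensor=True,
)
query_embeddings = model.encode(
    ["How many touches does each team get?"], prompt_name="query",
    dataset_embeddings=dataset_embeddings, convert_to_tensor=True,
)
print(doc_embeddings.shape, query_embeddings.shape)
```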
Implementation Considerations
For production deployments:
- Matryoshka truncation reduces vector storage needs but doesn't reduce memory usage at inference time
- GPU quantization often doesn't improve speed, since GPUs are already heavily optimized for FP16/FP32 compute
- CPU quantization can cut model memory roughly 4x by storing weights in INT8 (see the sketch after this list)
- When combining retrievers, fetch at least around 6 chunks via embeddings and 6 via BM25
- CDE requires additional storage for the reference-set embeddings
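On the CPU quantization point, here is a minimal sketch using PyTorch dynamic quantization, which stores linear-layer weights in INT8 and dequantizes them on the fly. The ModernBERT checkpoint is only an illustration; whether accuracy holds up for a given model and task needs to be tested.

```python
# pip install -U transformers torch
import io
import torch
from transformers import AutoModel

# Any encoder checkpoint works the same way; ModernBERT-base is used for illustration.
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

# Dynamic INT8 quantization for CPU inference: linear-layer weights are stored
# in 8 bits and dequantized on the fly, cutting their memory roughly 4x vs FP32.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: torch.nn.Module) -> float:
    """Rough serialized size of the model's state dict, in MB."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {serialized_mb(model):.0f} MB -> INT8: {serialized_mb(quantized):.0f} MB")
```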
Recommendations
1. Use BM25 alongside embeddings for better generalization to unseen domains
2. Consider ModernBERT-based models for a better speed-quality balance
3. Evaluate CDE if fine-tuning isn't feasible
4. Combine dense and BM25 retrieval when you need the best accuracy (90% in the tests above; see the sketch below)
5. Use Matryoshka truncation if vector storage is constrained
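On recommendations 1 and 4, here is a minimal sketch of combining BM25 with dense embeddings by taking the union of each retriever's top-k results. The `rank_bm25` package and the Nomic model id are my choices, and the article doesn't specify how the two result sets were merged, so the simple union shown here is just one reasonable option.

```python
# pip install -U rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "Each team has six touches before possession changes over.",
    "A rollball is used to restart play after a touch is made.",
    "The attacking team scores a try by placing the ball on or over the scoreline.",
]
query = "How many touches does a team get?"

# BM25 scores over whitespace-tokenised chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
bm25_scores = bm25.get_scores(query.lower().split())

# Dense scores with an embedding model (Nomic's ModernBERT embedder assumed;
# the search_document:/search_query: prefixes follow Nomic's conventions).
model = SentenceTransformer("nomic-ai/modernbert-embed-base")
doc_vecs = model.encode([f"search_document: {c}" for c in chunks], normalize_embeddings=True)
query_vec = model.encode(f"search_query: {query}", normalize_embeddings=True)
dense_scores = doc_vecs @ query_vec

# Take the union of top-k from each retriever (roughly 6 per method in production,
# per the guidance above; k=2 here only because the toy corpus is tiny).
k = 2
top_bm25 = set(np.argsort(bm25_scores)[::-1][:k])
top_dense = set(np.argsort(dense_scores)[::-1][:k])
retrieved = sorted(top_bm25 | top_dense)
print([chunks[i] for i in retrieved])
```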