Faster Embeddings that Generalise Well
I cover the ModernBERT family of encoder models that underpins a new suite of fine-tunes, including an embedding model from Nomic and a Contextual Document Embeddings (CDE) approach that generalises very well.
BTW, I should have said this in the video, but ModernBERT is also trained on code, so it should be better for that application than previous embedding models!
Cheers, Ronan
ModernBERT and Contextual Document Embeddings: Latest Advances in RAG Systems
Recent developments in embedding models have introduced significant improvements for retrieval augmented generation (RAG) systems. This article examines three key advances: ModernBERT base models, Nomic's fine-tuned embeddings, and Contextual Document Embeddings (CDE).
ModernBERT Architecture
ModernBERT, released by AnswerAI, achieves better performance and faster inference compared to previous BERT variants through two key optimizations:
1. Alternating Attention: Full (global) attention is applied only in select transformer layers, with cheaper local sliding-window attention used elsewhere, reducing computational overhead
2. Unpadding: Multiple sequences are packed into a single longer sequence instead of being padded to a fixed length, improving compute efficiency
Performance testing shows the ModernBERT models clustering in the upper-left region when plotting GLUE score (quality) against runtime, i.e. a better speed-quality tradeoff than earlier encoder models.
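For orientation, here is a minimal sketch of loading the base checkpoint with Hugging Face transformers. The `answerdotai/ModernBERT-base` model id is the published one on the Hub; a recent transformers release is assumed, since ModernBERT support is new.

```python
# pip install -U transformers torch  (ModernBERT needs a recent transformers release)
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Quick fill-mask smoke test; for retrieval you'd use an embedding fine-tune
# such as the Nomic models covered below, not the raw masked-LM head.
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode(predicted_id))  # expected: " Paris"
```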
Nomic's ModernBERT Implementations
Nomic has fine-tuned ModernBERT specifically for embedding tasks, producing models such as modernbert-embed-base. These models support Matryoshka embeddings, a technique in which embedding dimensions are ordered by importance so that vectors can be truncated to save storage space while preserving most of the semantic signal.
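As a rough sketch of how Matryoshka truncation works in practice, the snippet below encodes with Nomic's model and then truncates to 256 dimensions, re-normalising afterwards. The `nomic-ai/modernbert-embed-base` model id and the `search_document:` / `search_query:` prefixes follow Nomic's conventions for its embed models; confirm both against the model card.

```python
# pip install -U sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# Model id and the search_document:/search_query: prefixes follow Nomic's
# conventions for its embed models; confirm both against the model card.
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

docs = ["search_document: A touch is any contact between the player in possession and a defender."]
query = "search_query: What counts as a touch?"

doc_vec = model.encode(docs, normalize_embeddings=True)[0]   # full 768-dim vector
query_vec = model.encode(query, normalize_embeddings=True)

def truncate(v: np.ndarray, dims: int = 256) -> np.ndarray:
    """Matryoshka truncation: keep the leading dims, then re-normalise."""
    v = v[:dims]
    return v / np.linalg.norm(v)

# Cosine similarity at full vs truncated dimensionality; the scores should stay
# close, while storage per vector drops roughly 3x (768 -> 256 floats).
print(float(doc_vec @ query_vec), float(truncate(doc_vec) @ truncate(query_vec)))
```

Recent sentence-transformers releases also accept a `truncate_dim` argument on the `SentenceTransformer` constructor that does the same thing at encode time; if in doubt, the manual slice-and-renormalise above is the safe fallback.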
Testing on a Touch Rugby rules dataset showed:
- Nomic's modernbert-embed-base, off the shelf: 72% accuracy
- The same model with Matryoshka truncation to 256 dimensions: 69% accuracy
- The same model fine-tuned on the dataset: 81% accuracy
- CDE model, off the shelf (no fine-tuning): 81% accuracy
- CDE model combined with BM25: 90% accuracy
Contextual Document Embeddings
CDE takes a novel two-stage approach:
1. First stage embeds a reference set of background documents (512 by default) drawn from the target corpus
2. Second stage embeds queries and documents conditioned on those reference embeddings
This lets the model generalize to new domains without fine-tuning, since domain context is injected through the reference documents. In testing, CDE achieved 81% accuracy out of the box on the Touch Rugby dataset.
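Below is a sketch of what the two-stage flow looks like in code. The `jxm/cde-small-v1` model id, the prompt names, and the `dataset_embeddings` keyword reflect my reading of the CDE model card rather than anything stated above, so verify them against the card before relying on this.

```python
# pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

# Model id, prompt names, and the dataset_embeddings keyword are assumptions
# taken from the cde-small-v1 model card; verify before relying on them.
model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

# Stage 1: embed a reference ("contextual") set of background documents.
# In practice this would be ~512 chunks sampled from your own corpus.
reference_docs = [
    "A rollball is used to restart play after a touch is made.",
    "Each team has six touches before possession changes over.",
]
dataset_embeddings = model.encode(
    reference_docs, prompt_name="document", convert_to_tensor=True
)

# Stage 2: embed the actual corpus chunks and queries, conditioned on the
# reference embeddings from stage 1.
corpus_chunks = reference_docs  # whatever you actually want to index
doc_embeddings = model.encode(
    corpus_chunks, prompt_name="document",
    dataset_embeddings=dataset_embeddings, convert_to_tensor=True,
)
query_embeddings = model.encode(
    ["How many touches does each team get?"], prompt_name="query",
    dataset_embeddings=dataset_embeddings, convert_to_tensor=True,
)
print(doc_embeddings.shape, query_embeddings.shape)
```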
Implementation Considerations
For production deployments:
- Matryoshka truncation reduces vector storage needs but doesn't reduce memory usage at inference time
- GPU quantization often doesn't improve speed, since GPUs are already heavily optimized for FP16/FP32 compute
- CPU quantization can cut model memory roughly 4x by storing weights in INT8 (see the sketch after this list)
- When combining retrievers, fetch at least around 6 chunks via embeddings and 6 via BM25
- CDE requires additional storage for the reference-set embeddings
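On the CPU quantization point, here is a minimal sketch using PyTorch dynamic quantization, which stores linear-layer weights in INT8 and dequantizes them on the fly. The ModernBERT checkpoint is only an illustration; whether accuracy holds up for a given model and task needs to be tested.

```python
# pip install -U transformers torch
import io
import torch
from transformers import AutoModel

# Any encoder checkpoint works the same way; ModernBERT-base is used for illustration.
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

# Dynamic INT8 quantization for CPU inference: linear-layer weights are stored
# in 8 bits and dequantized on the fly, cutting their memory roughly 4x vs FP32.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def serialized_mb(m: torch.nn.Module) -> float:
    """Rough serialized size of the model's state dict, in MB."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {serialized_mb(model):.0f} MB -> INT8: {serialized_mb(quantized):.0f} MB")
```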
Recommendations
1. Use BM25 alongside embeddings for better generalization to unseen domains
2. Consider ModernBERT-based models for a better speed-quality balance
3. Evaluate CDE if fine-tuning isn't feasible
4. Combine dense and BM25 retrieval when you need the best accuracy (90% in the tests above; see the sketch below)
5. Use Matryoshka truncation if vector storage is constrained
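On recommendations 1 and 4, here is a minimal sketch of combining BM25 with dense embeddings by taking the union of each retriever's top-k results. The `rank_bm25` package and the Nomic model id are my choices, and the article doesn't specify how the two result sets were merged, so the simple union shown here is just one reasonable option.

```python
# pip install -U rank_bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

chunks = [
    "Each team has six touches before possession changes over.",
    "A rollball is used to restart play after a touch is made.",
    "The attacking team scores a try by placing the ball on or over the scoreline.",
]
query = "How many touches does a team get?"

# BM25 scores over whitespace-tokenised chunks.
bm25 = BM25Okapi([c.lower().split() for c in chunks])
bm25_scores = bm25.get_scores(query.lower().split())

# Dense scores with an embedding model (Nomic's ModernBERT embedder assumed;
# the search_document:/search_query: prefixes follow Nomic's conventions).
model = SentenceTransformer("nomic-ai/modernbert-embed-base")
doc_vecs = model.encode([f"search_document: {c}" for c in chunks], normalize_embeddings=True)
query_vec = model.encode(f"search_query: {query}", normalize_embeddings=True)
dense_scores = doc_vecs @ query_vec

# Take the union of top-k from each retriever (roughly 6 per method in production,
# per the guidance above; k=2 here only because the toy corpus is tiny).
k = 2
top_bm25 = set(np.argsort(bm25_scores)[::-1][:k])
top_dense = set(np.argsort(dense_scores)[::-1][:k])
retrieved = sorted(top_bm25 | top_dense)
print([chunks[i] for i in retrieved])
```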