Why use keyword versus Vector Search?
and an update on Postgres libraries, incl. vectorchord-bm25
A few things I think are underappreciated:
Using BM25 (keyword) search instead of, or combined with, vector search. I see a lot of demos using only embeddings, but the best commercial implementations will nearly always use hybrid search.
Pagination. It’s common to use top-k results, but adding the ability to paginate allows for a more exhaustive search than just re-searching with a new query.
Strengths of BM25 for new or unfamiliar terms. Embeddings don’t generalise as well.
Strengths of BM25 for holistic document comprehension/search.
I explain in detail in this video.
Enjoyed the video or have Qs? Drop a comment below or reply to this email with a 🤙.
Cheers, Ronan
🛠 Explore Fine-tuning, Inference, Vision, Audio, and Evaluation Tools
💡 Consulting (Technical Assistance OR Market Insights)
Vector vs Keyword Search
And their pros, cons and combos.
How Vector Search Works
Vector search represents documents and queries as high-dimensional arrows (vectors) that capture semantic meaning. The process involves:
Converting text into vectors using embedding models
Comparing query vectors to document vectors using similarity metrics (cosine similarity or dot product)
Finding documents whose vectors point in similar directions to the query vector
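For concreteness, here’s a minimal sketch of that pipeline in Python, assuming the sentence-transformers library; the model name, documents, and query are placeholders rather than recommendations:

```python
# Minimal vector search sketch: embed documents and a query, rank by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

docs = [
    "BM25 is a keyword ranking function used in search engines.",
    "Embeddings map text to dense vectors that capture meaning.",
    "PostgreSQL supports vector search via the pgvector extension.",
]

doc_vecs = model.encode(docs, normalize_embeddings=True)        # one vector per document
query_vec = model.encode(["how do dense retrieval embeddings work?"],
                         normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query_vec
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {docs[idx]}")
```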
Key limitations:
Maximum input length constrained by the embedding model (typically around 8,000 tokens)
Performance depends on embedding model's training data
May struggle with new or rare terms not seen during training
How Keyword Search (BM25) Works
BM25 is an advanced keyword matching algorithm that:
Tokenizes text into subwords
Applies stemming to normalize word variations
Counts term frequencies in documents
Downweights common terms across the corpus
Prioritizes rare query terms that appear in documents
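A compact sketch of the scoring behind this, using the standard Okapi BM25 formula; naive whitespace tokenization stands in here for the real tokenization and stemming steps:

```python
# Toy BM25 scorer: IDF downweights common terms, k1 saturates term frequency,
# and b normalizes for document length.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # Document frequency: how many documents contain each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms across the corpus get a higher IDF weight.
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["the quick brown fox",
        "bm25 ranks documents by term frequency and rarity",
        "vector search uses embeddings"]
print(bm25_scores("bm25 term frequency", docs))
```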
Advantages:
Works with documents of any length
Handles new terminology effectively
More predictable matching behavior
Fast and memory-efficient
Language Support Considerations
Vector search can work across languages since semantically similar concepts in different languages tend to have similar vector representations. However, this can lead to mixed-language results that may not be desirable.
Keyword search is language-specific by default, only matching exact terms. This provides cleaner single-language results but requires explicit multi-language query support.
Document Length Handling
BM25 has built-in length normalization and works with any document length. Vector search requires chunking long documents to fit embedding model limits, which can lose document-level context.
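To illustrate the vector-search side, here’s a naive fixed-size chunker with overlap; the sizes are placeholders, and real systems usually chunk on token counts and semantic boundaries rather than raw word counts:

```python
# Naive chunking sketch: split a long document into overlapping word windows
# so each chunk fits within an embedding model's input limit.
def chunk_words(text, chunk_size=512, overlap=64):
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks = chunk_words("some very long document " * 500, chunk_size=200, overlap=20)
print(len(chunks), "chunks; each chunk is embedded separately for vector search")
```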
Performance on New Terms
BM25 excels at finding new or rare terms through exact matching. Vector search may fail completely on terms outside its training vocabulary. This makes BM25 more robust for specialized domains with unique terminology.
LLM Query Optimization
When LLMs generate search queries, they can:
Avoid spelling mistakes that hurt keyword matching
Use query expansion to include synonyms
Optimize query structure for the search method
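As a hedged sketch of what that query-rewriting step could look like with the OpenAI Python client; the model name and prompt are illustrative, not a specific recommendation:

```python
# Sketch of LLM query rewriting before keyword search: fix spelling, add synonyms.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

user_query = "vectorchord bm25 postgress setup"  # note the typo in "postgres"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system",
         "content": "Rewrite the user's search query for a keyword (BM25) search engine: "
                    "fix spelling and append a few synonyms or related terms."},
        {"role": "user", "content": user_query},
    ],
)
print(response.choices[0].message.content)
```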
Implementation in PostgreSQL
Modern PostgreSQL implementations can use:
pgvector for vector search with HNSW indexing
vectorchord-bm25 for native BM25 with Block-WeakAnd indexing
Rank fusion (reciprocal rank fusion, RRF) to combine results from both approaches
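For the pgvector half, a rough sketch using psycopg; the connection string, table schema, and toy 3-dimensional vectors are placeholders (real embeddings are 384+ dimensions), and the vectorchord-bm25 DDL and operators are version-specific, so follow that extension’s own docs for the BM25 side:

```python
# pgvector sketch: create a vector column, add an HNSW index, run a cosine-distance query.
import psycopg

with psycopg.connect("dbname=search") as conn:  # placeholder connection string
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            content text,
            embedding vector(3)   -- toy dimension for the sketch
        )
    """)
    # HNSW index for approximate nearest-neighbour search over cosine distance.
    conn.execute("CREATE INDEX IF NOT EXISTS docs_embedding_idx "
                 "ON docs USING hnsw (embedding vector_cosine_ops)")
    conn.execute("INSERT INTO docs (content, embedding) VALUES (%s, %s::vector)",
                 ("hello pgvector", "[0.1, 0.2, 0.3]"))

    # <=> is pgvector's cosine distance operator; smaller is closer.
    rows = conn.execute(
        "SELECT id, content FROM docs ORDER BY embedding <=> %s::vector LIMIT 10",
        ("[0.05, 0.21, 0.29]",),
    ).fetchall()
    print(rows)
```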
The optimal approach often combines both methods:
Use BM25 as primary search layer
Add vector search for semantic matching
Merge results using rank fusion with constant K (typically 60-200)
This hybrid approach leverages the strengths of both exact and semantic matching while compensating for their individual weaknesses.
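For reference, a minimal sketch of reciprocal rank fusion; the document IDs are made up and K=60 is just a common default within the range above:

```python
# Reciprocal rank fusion: each ranked list contributes 1 / (K + rank) per document.
def rrf(result_lists, K=60):
    fused = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (K + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_hits = ["doc7", "doc2", "doc9", "doc4"]     # ranked IDs from keyword search
vector_hits = ["doc2", "doc7", "doc5", "doc1"]   # ranked IDs from vector search
print(rrf([bm25_hits, vector_hits]))             # doc2 and doc7 rise to the top
```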