You can use a third-party service to store embeddings, but making those third-party calls adds latency. Maybe that isn't an issue if your LLM calls are the bottleneck for overall speed.
But if you want sub-second response times, the speed of your vector retrieval system matters - and hosting the database on the same server gives you significant latency savings.
With a PostgreSQL build, you have the benefits of:
- Lower latency (run the database on a server you own or rent)
- Lower cost (no third-party service to pay for)
- More capability (run custom embedding and advanced keyword methods for the best retrieval performance)
It turns out that Postgres has tools for doing RAG - fairly out of the box. I go through the setup in great detail in this video:
Cheers, Ronan
Learn more at Trelis.com (updated website!)
Building a High-Performance RAG System with Postgres (AI Summary)
Trelis presents a comprehensive approach to implementing Retrieval Augmented Generation (RAG) using Postgres, offering both basic and advanced implementations that prioritize performance and customization.
Why Postgres for RAG?
The tutorial begins by addressing a common challenge in RAG implementations: the latency introduced by third-party vector databases. While many developers default to specialized vector databases, Postgres offers several compelling advantages: it's cheaper (running on your own infrastructure), faster (eliminating third-party API latency), and highly customizable.
The Dual Search Approach
A key insight presented is the combination of both dense (vector) and sparse (text) search methods. Vector search excels at capturing semantic meaning, while text search (specifically BM25) performs better for exact matches and specific terms. Through detailed performance comparisons, Trelis demonstrates how combining these approaches yields superior results compared to using either method alone.
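One common way to merge the two ranked lists is reciprocal rank fusion (RRF); the video may combine scores differently, so treat this as an illustrative sketch rather than the exact method used:

```python
from collections import defaultdict

def reciprocal_rank_fusion(dense_ranked, sparse_ranked, k=60):
    """Merge two ranked lists of chunk IDs into a single ranking.

    dense_ranked / sparse_ranked: chunk IDs ordered best-first by each method.
    k: damping constant; 60 is a common default from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in (dense_ranked, sparse_ranked):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: chunk 7 ranks highly in both lists, so it wins overall.
print(reciprocal_rank_fusion([7, 2, 9], [4, 7, 2]))
```

The appeal of RRF is that it only needs ranks, not raw scores, so the dense cosine distances and sparse BM25 scores never have to be put on a common scale.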
Technical Implementation
The implementation leverages two crucial Postgres extensions:
- pgvector for handling dense embeddings
- pg_bestmatch for optimized keyword search
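As a rough, hedged sketch of what that setup can look like (the table name, connection string, and embedding dimension are my own choices, and the exact extension name and helper functions for pg_bestmatch should be checked against its README):

```python
import psycopg  # psycopg 3

# Placeholder connection string; point it at your own Postgres instance.
with psycopg.connect("postgresql://localhost/rag_demo") as conn:
    # Enable the two extensions (they must already be installed on the server).
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # Extension name assumed from the pg_bestmatch.rs project; confirm in its docs.
    conn.execute("CREATE EXTENSION IF NOT EXISTS pg_bestmatch")

    # A minimal chunks table: raw text plus a 768-dim embedding column
    # (768 matches Nomic's embedding size; adjust for your model).
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id bigserial PRIMARY KEY,
            source text,
            content text NOT NULL,
            embedding vector(768)
        )
    """)
    conn.commit()
```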
The basic implementation demonstrates how to:
1. Set up the database structure
2. Create and store embeddings using a Nomic AI embedding model (see the sketch after this list)
3. Implement both vector and text-based search
4. Combine results effectively
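A hedged sketch of steps 2 and 3 on the dense side, assuming the `chunks` table above, sentence-transformers for loading the Nomic model, and pgvector's cosine-distance operator (the repo's actual code will differ in the details):

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Nomic's embedding model; it needs trust_remote_code for its custom pooling.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def embed(texts, prefix):
    # Nomic models expect a task prefix such as "search_document: " or "search_query: ".
    return model.encode([prefix + t for t in texts])

with psycopg.connect("postgresql://localhost/rag_demo") as conn:
    register_vector(conn)  # lets psycopg send/receive numpy arrays as vectors

    # Store a couple of chunks with their embeddings.
    chunks = ["Postgres supports HNSW indexes via pgvector.",
              "BM25 scores exact keyword matches."]
    for text, vec in zip(chunks, embed(chunks, "search_document: ")):
        conn.execute(
            "INSERT INTO chunks (content, embedding) VALUES (%s, %s)", (text, vec)
        )

    # Dense retrieval: <=> is pgvector's cosine-distance operator (smaller = closer).
    query_vec = embed(["How do I index vectors in Postgres?"], "search_query: ")[0]
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 3", (query_vec,)
    ).fetchall()
    print([r[0] for r in rows])
    conn.commit()
```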
Advanced Optimizations
The advanced implementation introduces several sophisticated features:
- Asynchronous database calls for improved throughput (see the sketch below)
- Optimized BM25 implementation with custom indexing
- Sophisticated text processing including stemming and stop word handling
- Binary search for efficient document chunking
- Parallel processing for document uploads
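For the asynchronous calls, here is a minimal sketch using asyncpg; the pool sizes, DSN, and query are placeholders rather than the advanced repo's actual code:

```python
import asyncio
import asyncpg

async def search_many(query_vectors, dsn="postgresql://localhost/rag_demo"):
    """Run several dense searches concurrently over a shared connection pool."""
    pool = await asyncpg.create_pool(dsn, min_size=2, max_size=10)

    async def one_search(vec):
        # pgvector accepts a text literal like '[0.1,0.2,...]', cast to vector.
        literal = "[" + ",".join(str(x) for x in vec) + "]"
        async with pool.acquire() as conn:
            return await conn.fetch(
                "SELECT content FROM chunks ORDER BY embedding <=> $1::vector LIMIT 3",
                literal,
            )

    # Issue all searches concurrently instead of awaiting them one by one.
    results = await asyncio.gather(*(one_search(v) for v in query_vectors))
    await pool.close()
    return results

# Example (vectors must match the table's embedding dimension):
# asyncio.run(search_many([[0.0] * 768, [0.1] * 768]))
```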
Document Processing/Chunking with Binary Search
For PDFs, the system maintains page-level chunks when possible (utilizing Nomic's 8000-token limit), making it easier to reference source material. For text documents, it employs binary search to optimize chunk creation, significantly improving processing speed.
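Roughly, the binary-search idea is: instead of growing a chunk token by token, bisect on the split point to find the largest prefix that still fits the token budget. A sketch under that assumption (the `count_tokens` callable stands in for whatever tokenizer the real implementation uses):

```python
def largest_fitting_split(words, max_tokens, count_tokens):
    """Binary-search the largest word-prefix whose token count fits the budget.

    words: the document split on whitespace.
    count_tokens: callable returning the token count of a string.
    Returns the index at which to cut the chunk.
    """
    lo, hi = 1, len(words)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if count_tokens(" ".join(words[:mid])) <= max_tokens:
            lo = mid
        else:
            hi = mid - 1
    return lo

def chunk_text(text, max_tokens, count_tokens):
    words = text.split()
    chunks = []
    while words:
        cut = largest_fitting_split(words, max_tokens, count_tokens)
        chunks.append(" ".join(words[:cut]))
        words = words[cut:]
    return chunks

# Toy tokenizer: one token per word. Real code would use the embedding model's tokenizer.
print(chunk_text("a b c d e f g", max_tokens=3, count_tokens=lambda s: len(s.split())))
```

Each bisection costs only O(log n) tokenizer calls per chunk, which is where the speed-up over a token-by-token scan comes from.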
Performance Considerations
The implementation addresses several critical performance aspects:
- Hierarchical Navigable Small World (HNSW) indexing for vector search (index creation sketched after this list)
- B-tree and generalized inverted (GIN) indexes for BM25
- Batch processing for document uploads
- Parallel embedding generation
- Optimized text normalization
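The index creation itself is short. The HNSW statement below is pgvector's standard syntax; the sparse-side index is a generic placeholder built on Postgres full-text search, since pg_bestmatch's own recommended indexing may differ:

```python
import psycopg

with psycopg.connect("postgresql://localhost/rag_demo") as conn:
    # HNSW index for dense search; m and ef_construction trade build time for recall.
    conn.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)

    # Sparse side: shown here as a stored tsvector column with a GIN index,
    # as a stand-in for however pg_bestmatch materialises its BM25 statistics.
    conn.execute(
        "ALTER TABLE chunks ADD COLUMN IF NOT EXISTS content_tsv tsvector "
        "GENERATED ALWAYS AS (to_tsvector('english', content)) STORED"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS chunks_content_gin ON chunks USING gin (content_tsv)"
    )
    conn.commit()
```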
Practical Application
The tutorial demonstrates the system's capabilities through a real-world application (assistant.trelis.com), showing sub-second response times for sparse searches and approximately half-second responses for dense searches. When combined with fast LLMs like Cerebras or optimized Llama models, this enables complete RAG responses within seconds.
Conclusion
This Postgres-based RAG implementation offers a compelling alternative to third-party vector databases, providing superior control, performance, and cost-effectiveness. The combination of dense and sparse search methods, along with numerous optimizations, results in a production-ready system suitable for enterprise deployment.
The complete implementation is available in two repositories: a basic version (publicly available on GitHub under Trelis Research) and an advanced version with enterprise-grade features (available through trelis.com).
Interesting post! But in general, I was wondering: since we can host Chroma, Faiss, or Qdrant on our own infrastructure without relying on APIs, wouldn't they offer better performance and scalability compared to Postgres for large-scale vector search? Is my understanding correct, or am I missing something?