🔍 Finding the Most Relevant Context with Language Models 🔍
The issue with traditional vector-based search (RAG, retrieval augmented generation) is that vector comparison can miss relevant snippets of information. This recent paper from Netflix [https://lnkd.in/dgw3WWyq] shows that cosine similarity comparisons can often be arbitrary and can poorly represent the underlying information.
So, I started playing around with small language models - as a preprocessing step - to decide which chunks of a document are most relevant to a query. However, this ends up being slow and expensive because you have to wait for the language model to generate the relevant text in full.
Then, instead of asking the LLM to summarise or extract, I thought of using it to simply rate the relevance of a chunk on a scale of 1-5. This requires the LLM to only respond with one character/token (i.e. 1, 2, 3, 4, or 5), which gives a big speedup.
Now, getting a consistent one-character response is tricky, because LLMs sometimes blab on. However, there's now a technique called regex forcing that constrains the model to output a specified format (see Outlines on GitHub).
This method is not as quick or scalable as vector search, but it improves on retrieval quality. Compared to stuffing everything into a long-context model like Claude or GPT-4-Turbo, it can also improve quality while significantly reducing costs.
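To give a feel for it, here's a minimal sketch of regex forcing with Outlines. The exact API (outlines.models.transformers, outlines.generate.regex) varies between Outlines versions, and the model and prompt are placeholders I've chosen for illustration:

```python
# Minimal sketch of regex forcing with the Outlines library.
# API names vary by Outlines version; the model choice is illustrative.
import outlines

model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")

# Constrain generation to exactly one character from 1-5.
rate = outlines.generate.regex(model, r"[1-5]")

prompt = (
    "Question: What did the report say about share buybacks?\n\n"
    "Text: <chunk goes here>\n\n"
    "Rate the relevance of the text to the question from 1 to 5. Rating: "
)
print(rate(prompt))  # e.g. "4"
```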
➡️ How it works:
1. Break your long input text into chunks (similar to vector database search)
2. Instead of using vector search, use a language model to rate the relevance of each chunk to the question on a 1-5 scale (using regex forcing)
3. Take the highest rated chunks (4s and 5s) and include those in the final prompt (with a minimum of 3 chunks) - see the sketch just below this list
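Here's what steps 1-3 look like as a minimal sketch in plain Python. The threshold of 4 and minimum of 3 chunks come from the steps above; the chunking approach, function names, and chunk size are illustrative assumptions, and score_chunk would be a regex-forced 1-5 rater like the Outlines example earlier:

```python
# Illustrative pipeline: chunk the document, score each chunk 1-5,
# keep the 4s and 5s, and always keep at least 3 chunks.
from typing import Callable

def chunk_text(text: str, chunk_chars: int = 2000) -> list[str]:
    # Naive fixed-size chunking; swap in your preferred splitter.
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def select_chunks(
    text: str,
    question: str,
    score_chunk: Callable[[str, str], int],  # e.g. a regex-forced 1-5 rater
    threshold: int = 4,
    minimum: int = 3,
) -> list[str]:
    chunks = chunk_text(text)
    scored = sorted(
        ((score_chunk(chunk, question), chunk) for chunk in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    kept = [chunk for score, chunk in scored if score >= threshold]
    if len(kept) < minimum:
        kept = [chunk for _, chunk in scored[:minimum]]
    return kept

# The kept chunks are then concatenated into the final prompt for the answering model.
```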
➡️ Why it's promising:
* Language models are very good at determining if a text snippet is relevant to a question
* Allows flexibly including more or less context based on relevance
* Can outperform both providing the full context or using standard retrieval augmented generation (RAG) with vector search
➡️ Some key implementation details:
* Use a grammar field (regex forcing) to restrict the language model to outputting only 1-5 relevance scores
* Hit the language model API in parallel for each chunk to utilize the GPU effectively (see the sketch after this list)
* Experiment with different base language models (instruction-tuned ones like Smaug 34B performed best)
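For the parallel calls, here's a rough sketch using asyncio and httpx. The endpoint URL and payload shape (including the grammar field) assume a text-generation-inference-style server, so treat them as assumptions to adapt to your own deployment:

```python
# Fire off one rating request per chunk concurrently so the inference
# server can batch them on the GPU.
# Assumptions: a local text-generation-inference-style endpoint and its
# payload/response shape; adjust for your own server.
import asyncio
import httpx

API_URL = "http://localhost:8080/generate"  # assumed local endpoint

async def rate_chunk(client: httpx.AsyncClient, question: str, chunk: str) -> int:
    prompt = (
        f"Question: {question}\n\nText: {chunk}\n\n"
        "Rate the relevance of the text to the question from 1 to 5. Rating: "
    )
    resp = await client.post(
        API_URL,
        json={
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 1,
                "grammar": {"type": "regex", "value": "[1-5]"},  # regex forcing
            },
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    text = resp.json()["generated_text"].strip()
    return int(text) if text in {"1", "2", "3", "4", "5"} else 1

async def rate_all(question: str, chunks: list[str]) -> list[int]:
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(rate_chunk(client, question, c) for c in chunks))

# Usage: scores = asyncio.run(rate_all("What was said about buybacks?", chunks))
```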
➡️ Potential advantages over other approaches:
* More accurate than vector search at identifying truly relevant snippets
* More focused than dumping in the full context, which can confuse the model
* Costs can be comparable to commercial APIs when the GPU is well utilized
I ran some initial experiments with this approach - which I'm calling ALL-SORT - on some tricky queries about Berkshire Hathaway annual meetings and reports. The results were promising, outperforming both full-context and RAG baselines.
Best LLM for large batches on 1xA100 80 GB
While working on ALL-SORT, I ran a few speed comparisons between what I thought were the strongest models.
If you are limited to one A100, the best option I see - for inference at large batch sizes - is to use a model that is 34B or smaller.
If you run Mixtral-8x7B models, they will only fit in 80 GB if run in 8-bit (e.g. EETQ) or AWQ, but these methods are slow at high batch sizes.
This leads to considering 34B models that can fit in bf16 (16-bit), such as CodeLlama (but that is for coding) or Yi-34B. In my ALL-SORT work, I found that there's a fine-tune of Yi-34B (albeit less guardrailed) that performs much better, and that's Smaug 34B. You can try it out on RunPod via the one-click API templates linked below.
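As a rough sanity check on what fits in 80 GB, here's the back-of-envelope arithmetic for weights only (ignoring KV cache and activations; parameter counts are approximate):

```python
# Back-of-envelope weight memory: GB ~= (billions of params) x (bytes per param).
# Ignores KV cache and activations; parameter counts are approximate.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * bytes_per_param

print(f"Yi-34B / Smaug 34B, bf16 (2 bytes): {weight_gb(34.4, 2):.0f} GB")  # ~69 GB -> fits in 80 GB
print(f"Mixtral-8x7B, bf16 (2 bytes):       {weight_gb(46.7, 2):.0f} GB")  # ~93 GB -> does not fit
print(f"Mixtral-8x7B, 8-bit (1 byte):       {weight_gb(46.7, 1):.0f} GB")  # ~47 GB -> fits, but slower at high batch sizes
```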
Cheers, Ronan
Ronan McGovern, Trelis Research
➡️ ADVANCED-inference Repo (and individual ALL-SORT scripts)
➡️ One-click API Templates (incl. Smaug 34B!)
➡️ ADVANCED-transcription Repo
➡️ Trelis Function-calling Models
➡️ Tip Jar