Techniques to Boost RAG Performance in Production

Retrieval-Augmented Generation (RAG) is a powerful technique in machine learning that can improve the quality of text generation in many applications. Optimizing its performance, however, can be challenging. For an introduction to RAG, see my other article. This article discusses several advanced techniques that can be applied at different stages of the RAG pipeline to improve its performance in a production setting.

Leveraging Hybrid Search
Utilizing Summaries for Data Chunks
Applying Query Transformations
Query Compression
Optimal Chunking Strategy
Fine-tuning Embedding Models
Enriching Metadata
Employing Re-ranking
Addressing the 'Lost in the Middle' Problem
Meta-data Filtering
Query Routing
References

Leveraging Hybrid Search

Hybrid search, a fusion of semantic search and keyword search, can be used to retrieve relevant data from a vector store. It often yields better results than either method alone across a range of use cases, because it combines the strength of keyword search (precision) with that of semantic search (recall).
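One common way to merge a keyword result list with a semantic result list is reciprocal rank fusion (RRF). The sketch below is a minimal illustration under stated assumptions, not a particular vector store's API: the function name, the document ids, and the constant k=60 are chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from a BM25 index
semantic_hits = ["doc_c", "doc_a", "doc_d"]  # e.g. from a vector store
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists (doc_a and doc_c here) rise above documents found by only one method. Because RRF works on ranks rather than raw scores, the two retrievers do not need comparable score scales.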

Utilizing Summaries for Data Chunks

An efficient way to improve generation quality and reduce the number of input tokens is to summarize the chunks of data and store these summaries in the vector store. This technique is especially useful when the data contains many filler words. Summarizing the chunks removes these superfluous elements and so refines the input passed to the LLM.
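As a rough sketch of how such records might be laid out, the example below stores a summary next to each original chunk. Everything here is hypothetical: the record fields, the function names, and especially strip_filler, which is only a placeholder for a real LLM summarization call.

```python
FILLER = {"basically", "actually", "really", "very", "just", "literally"}


def strip_filler(text):
    """Placeholder summarizer: drops filler words.

    A real pipeline would call an LLM here; this stand-in keeps the
    example self-contained.
    """
    kept = [w for w in text.split() if w.lower().strip(",.") not in FILLER]
    return " ".join(kept)


def build_summary_records(chunks, summarize=strip_filler):
    """Pair each chunk with its summary; the summary is what gets embedded."""
    return [
        {"id": i, "summary": summarize(chunk), "original": chunk}
        for i, chunk in enumerate(chunks)
    ]


records = build_summary_records(
    ["So, basically, the cache is really just a dictionary."]
)
print(records[0]["summary"])
```

Keeping the original text alongside the summary makes it possible to show the full source when a summary alone is not enough.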

Applying Query Transformations

Query transformations can significantly improve the quality of responses. For instance, if the system does not find relevant context for a query, the LLM can rephrase the query and try again. See "RAG-Fusion - Enhancing Information Retrieval in Large Language Models".

Similarly, the HyDE (Hypothetical Document Embeddings) strategy generates a hypothetical response to a query and uses both the query and that response for the embedding lookup, which has been reported to improve retrieval performance substantially.
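The HyDE idea can be sketched in a few lines. This is a toy illustration under stated assumptions: embed is a bag-of-words stand-in for a real embedding model, fake_llm stands in for an LLM call, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(query, docs, generate_hypothetical, top_k=1):
    """Embed the query together with a hypothetical answer, then rank docs."""
    hypothetical = generate_hypothetical(query)
    probe = embed(query + " " + hypothetical)
    ranked = sorted(docs, key=lambda d: cosine(probe, embed(d)), reverse=True)
    return ranked[:top_k]


def fake_llm(query):
    # Stand-in for an LLM call; the answer may be imprecise, which is fine,
    # because it is only used to steer retrieval.
    return "The Eiffel Tower is about 300 metres tall"


docs = [
    "Paris is the capital city of France",
    "The Eiffel Tower is 330 metres tall",
    "Berlin is the capital city of Germany",
]
best = hyde_search("How tall is the Eiffel Tower?", docs, fake_llm)
print(best)
```

The hypothetical answer does not have to be correct. It only needs to resemble the kind of document that contains the real answer.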

Another technique is to break complex queries down into sub-queries, since LLMs tend to handle simpler questions better. This approach can be integrated into the RAG system so that a query is decomposed into multiple simpler questions, each of which is retrieved for separately.
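A minimal sketch of sub-query decomposition might look like the following. The decomposition step would normally be an LLM call; here fake_decompose returns a hard-coded result, and keyword_retrieve is a toy retriever over an illustrative corpus. All names are assumptions made for the example.

```python
CORPUS = [
    "Paris has about 2.1 million residents.",
    "Berlin has about 3.7 million residents.",
    "Rome has about 2.8 million residents.",
]


def _words(text):
    return {w.strip("?.,").lower() for w in text.split()}


def keyword_retrieve(sub_query):
    """Toy retriever: rank corpus entries by word overlap with the query."""
    scored = [(len(_words(sub_query) & _words(doc)), doc) for doc in CORPUS]
    scored.sort(key=lambda pair: -pair[0])
    return [doc for score, doc in scored if score > 0]


def fake_decompose(query):
    # Stand-in for an LLM call that splits a complex question.
    return [
        "What is the population of Paris?",
        "What is the population of Berlin?",
    ]


def retrieve_with_decomposition(query, decompose, retrieve, top_k=1):
    """Retrieve per sub-query and merge results, dropping duplicates."""
    context, seen = [], set()
    for sub_query in decompose(query):
        for chunk in retrieve(sub_query)[:top_k]:
            if chunk not in seen:
                seen.add(chunk)
                context.append(chunk)
    return context


context = retrieve_with_decomposition(
    "Which city has more residents, Paris or Berlin?",
    fake_decompose,
    keyword_retrieve,
)
print(context)
```

Each sub-query retrieves its own context, so the final prompt contains the facts for both cities even though no single chunk answers the comparison.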

Query Compression

Query compression (see a tool such as LongLLMLingua) improves RAG performance in long-context scenarios, where large language models often face higher computational and financial costs, longer latency, and inferior performance. By increasing the density of key information in the input prompt and optimizing its position, LongLLMLingua improves the LLM's perception of that information, which in turn reduces computational load, decreases latency, and improves performance. This strategy keeps vital information from being lost or diluted in lengthy contexts, thereby improving the relevance and quality of the generated output.
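To make the idea of "increasing density" concrete, here is a deliberately simplified sketch. It is not LongLLMLingua's algorithm or API; it only illustrates the coarse idea of keeping the highest-scoring passages that fit within a token budget. The relevance scores, the whitespace token counter, and the function name are assumptions.

```python
def compress_context(passages, budget, count_tokens=None):
    """Keep the most relevant passages that fit within a token budget.

    passages: list of (relevance_score, text) pairs.
    budget: maximum total number of tokens to keep.
    count_tokens: token counter; defaults to a crude whitespace split.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())
    kept, used = [], 0
    for _score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept


passages = [
    (0.9, "alpha beta gamma"),
    (0.2, "filler words here and there"),
    (0.7, "delta epsilon"),
]
kept = compress_context(passages, budget=5)
print(kept)
```

With a budget of five tokens, the two relevant passages are kept and the low-scoring filler passage is dropped.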

Optimal Chunking Strategy

There are multiple strategies that can be applied to chunking; see Chunking strategies. One aspect to control is the chunk overlap. Semantic retrieval can fall short when a selected chunk depends on meaningful context in adjacent chunks, which may be missed. To mitigate this, chunks can be made to overlap, or neighboring chunks can also be passed to the LLM for generation. This helps ensure that the surrounding context is included, thus improving the quality of the output.
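Overlapping chunks can be produced with a simple sliding window. The sketch below splits on words for readability; a production pipeline would more likely count tokens, and the function name and parameter values are assumptions for the example.

```python
def chunk_with_overlap(words, chunk_size, overlap):
    """Split a list of words into chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


text = "one two three four five six seven"
chunks = chunk_with_overlap(text.split(), chunk_size=3, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary is still partly visible from both sides. Larger overlaps preserve more context at the cost of storing and embedding more text.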

Fine-tuning Embedding Models

While off-the-shelf embedding models such as BERT and Ada may suffice for many use cases, they might not adequately represent specific domains in the vector space, leading to suboptimal retrieval quality. In such cases, fine-tuning an embedding model on domain-specific data can noticeably improve retrieval quality.

Enriching Metadata

Providing metadata about the chunks being processed, such as source information, can improve the LLM's comprehension of the context and lead to better output. This additional layer of information gives the LLM a more complete picture of the data, enabling it to generate more accurate and relevant responses.
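One simple way to expose metadata to the LLM is to prefix each chunk with a short header when building the prompt context. The field names (source, date, text) and the header layout below are assumptions chosen for illustration.

```python
def format_context(chunks):
    """Render chunks as numbered blocks with a metadata header each."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] source: {chunk['source']} | date: {chunk['date']}"
        blocks.append(header + "\n" + chunk["text"])
    return "\n\n".join(blocks)


chunks = [
    {
        "source": "handbook.pdf",
        "date": "2023-05-02",
        "text": "Refunds are issued within 14 days.",
    },
    {
        "source": "faq.html",
        "date": "2023-08-17",
        "text": "Refund requests are submitted through the support portal.",
    },
]
context = format_context(chunks)
print(context)
```

Numbering the blocks also lets the LLM cite which source a statement came from.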

Employing Re-ranking

Semantic search may yield top-k results that are too similar to each other. To obtain a wider array of snippets, it is beneficial to re-rank the results based on other factors, such as metadata and keyword matches. This diversification gives the LLM a more nuanced and comprehensive context from which to generate responses. A re-ranker can also be based on a cross-encoder.
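To illustrate re-ranking for diversity, the sketch below uses maximal marginal relevance (MMR). This is a swapped-in technique, not the cross-encoder approach mentioned above and not something the cited sources prescribe. The vectors, ids, and the trade-off weight lam=0.5 are made up for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query_vec, candidates, top_n, lam=0.5):
    """Pick results that are relevant to the query but not redundant.

    candidates: list of (doc_id, vector) pairs.
    lam: 1.0 means pure relevance; lower values favor diversity.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def mmr_score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Similarity to the closest already-selected result.
            redundancy = max(
                (cosine(vec, chosen) for _, chosen in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]


query = [1.0, 0.0]
candidates = [
    ("a", [0.8, 0.6]),
    ("b", [0.79, 0.61]),  # near-duplicate of "a"
    ("c", [0.6, -0.8]),   # less similar to the query, but different
]
reranked = mmr_rerank(query, candidates, top_n=2)
print(reranked)
```

Ranking by similarity alone would return the two near-duplicates "a" and "b". With MMR, the second slot goes to "c", which adds information that "a" does not already cover.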

Addressing the 'Lost in the Middle' Problem

LLMs tend not to assign equal weight to all tokens in the input, often overlooking tokens located in the middle. This phenomenon, known as the 'lost in the middle' problem, can be addressed by reordering the context snippets to place the most vital snippets at the beginning and end of the input, with less important snippets situated in the middle.
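The reordering described above can be sketched as follows, assuming the snippets arrive sorted from most to least relevant. The function name is hypothetical.

```python
def reorder_for_long_context(snippets_by_relevance):
    """Place the most relevant snippets at the edges of the context.

    Input is ordered from most to least relevant. Snippets are dealt
    alternately to the front and the back, so the least relevant ones
    end up in the middle.
    """
    front, back = [], []
    for i, snippet in enumerate(snippets_by_relevance):
        if i % 2 == 0:
            front.append(snippet)
        else:
            back.append(snippet)
    return front + back[::-1]


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
reordered = reorder_for_long_context(ranked)
print(reordered)
```

The two strongest snippets land in the first and last positions, and the weakest one sits in the middle, where it is most likely to be overlooked anyway.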

Meta-data Filtering

Meta-data, such as date tags, can be added to your chunks to improve retrieval. For example, filtering by recency can be beneficial when querying email history. Recent emails may not necessarily be the most similar from an embedding standpoint, but they are more likely to be relevant.
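A minimal sketch of the email example: filter by date first, then rank what remains by similarity. The record fields, the precomputed similarity scores, and the cutoff date are assumptions made for illustration; many vector stores offer their own filter syntax for this.

```python
from datetime import date


def filter_then_rank(chunks, not_before, top_k=2):
    """Drop chunks older than `not_before`, then rank by similarity."""
    recent = [c for c in chunks if c["date"] >= not_before]
    recent.sort(key=lambda c: c["similarity"], reverse=True)
    return recent[:top_k]


emails = [
    {"id": "old_thread", "date": date(2021, 3, 1), "similarity": 0.91},
    {"id": "last_month", "date": date(2023, 9, 20), "similarity": 0.78},
    {"id": "last_week", "date": date(2023, 10, 12), "similarity": 0.74},
]
results = filter_then_rank(emails, not_before=date(2023, 9, 1))
print([r["id"] for r in results])
```

The old thread has the highest similarity score, yet it is excluded because it falls outside the date window.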

Query Routing

Having multiple indexes and routing queries to the appropriate index can be beneficial. For instance, different indexes could handle summarization questions, pointed questions, and date-sensitive questions. Trying to optimize one index for all these behaviors may compromise its effectiveness.
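A router can be as simple as a few rules, although in practice an LLM or a small classifier often makes the decision. The sketch below uses keyword heuristics; the index names and trigger words are hypothetical.

```python
import re

SUMMARY_PATTERN = re.compile(r"\b(summarize|summary|overview)\b")
DATE_PATTERN = re.compile(r"\b(latest|recent|today|yesterday|\d{4})\b")


def route_query(query):
    """Return the name of the index that should serve this query."""
    q = query.lower()
    if SUMMARY_PATTERN.search(q):
        return "summary_index"
    if DATE_PATTERN.search(q):
        return "date_index"
    return "fact_index"  # default: pointed, factual questions


for question in [
    "Summarize the Q3 report",
    "What changed in the latest release?",
    "Who wrote the onboarding guide?",
]:
    print(question, "->", route_query(question))
```

Keeping the routing decision in one small function makes it easy to replace the rules with a model later without touching the indexes themselves.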

The performance of RAG in production can be significantly improved by applying a range of techniques: hybrid search, chunk summarization, query transformations, query compression, overlapping chunks, fine-tuned embedding models, metadata enrichment, re-ranking, reordering context to address the 'lost in the middle' problem, meta-data filtering, and query routing. Together, these strategies help optimize the RAG pipeline, yielding higher-quality output and better overall performance.

References

Retrieval Augmented Generation (RAG): What, Why and How? | LLMStack
[2307.03172] Lost in the Middle: How Language Models Use Long Contexts
10 Ways to Improve the Performance of Retrieval Augmented Generation Systems | by Matt Ambrogi | Sep, 2023 | Towards Data Science
Hypothetical Document Embeddings (HyDE) - Precise Zero-Shot Dense Retrieval without Relevance Labels
Retrieve & Re-Rank - Sentence-Transformers documentation
Improving RAG effectiveness with Retrieval-Augmented Dual Instruction Tuning (RA-DIT) | by Emanuel Ferreira | Oct, 2023 | LlamaIndex Blog
Improving RAG (Retrieval Augmented Generation) Answer Quality with Re-ranker | by Shivam Solanki | Towards Generative AI | Medium
Secrets to Optimizing RAG LLM Apps for Better Performance, Accuracy and Lower Costs! | by Madhukar Kumar | madhukarkumar | Sep, 2023 | Medium (SingleStore (db), finetuning embeddings model, CacheGPT, Nemo-Guardrails)
run-llama/finetune-embedding: Fine-Tuning Embedding for RAG with Synthetic Data
zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index.
NVIDIA/NeMo-Guardrails: NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
explodinggradients/ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines (library to evaluate the context retrieved from your enterprise corpus of data, i.e. how you know whether the retrieved context is accurate)
LangSmith, introduced by LangChain - a highly effective tool for monitoring and examining the responses between the app and the LLM.
[2310.15123] Branch-Solve-Merge Improves Large Language Model Evaluation and Generation