Advanced Retrieval Augmented Generation

January 27, 2024

This write-up focuses on advanced Retrieval Augmented Generation (RAG) techniques and assumes you understand the basics of RAG. We also concentrate on LlamaIndex.

Many of my colleagues have asked me how we built a RAG (Retrieval Augmented Generation) system, i.e. a chatbot that talks to your private data. You can build one with a few lines of code using libraries like LangChain and LlamaIndex. While this is really easy, it will only get you about 70% of the way. The completions may still not be production ready: you will encounter inaccuracies and undesired outputs, some of which may not be in line with your product objectives or corporate identity.

So let's look at some ways to tackle this. But first, what are some of the main benefits of RAG?

Benefits

Let's also look at some of the issues that remain after implementing a basic RAG system.

Issues

The Process

Here is the RAG pipeline process we cover.

[Figure: the RAG pipeline]

Of course, before we dive into this, let's look at prompting.

Prompting

So, first off in our armoury we have the prompt. Let's look at some techniques that can help us steer the LLM into focusing on our required use case.

Few-Shot Prompting is a technique where the LLM is provided with a limited number of examples or shots to learn from. This approach is useful in situations where large datasets are not available and the model needs to adapt and respond accurately with minimal training data.
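
As a minimal sketch (the reviews and labels below are invented for illustration), a few-shot prompt simply prepends worked examples to the actual input:

```python
# Few-shot prompt: two labelled examples teach the model the task and the
# expected output format; the final review is the one we want classified.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and it just works."
Sentiment:"""
```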

Chain of Thought (CoT) is an advanced prompt engineering technique for LLMs that involves guiding the model through a step-by-step reasoning process to arrive at a conclusion. By doing this, Chain of Thought improves the model's ability to tackle complex tasks that require logical reasoning, deep understanding or multi-step calculations. CoT has been particularly effective in improving performance on arithmetic, logic puzzles, and comprehension tasks.
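
A minimal zero-shot CoT sketch: the trailing instruction is often all it takes to make the model show its working before answering.

```python
# The "Let's think step by step" suffix nudges the model to reason out loud:
# 12 pens = 4 packs of 3, so 4 x $2 = $8.
cot_prompt = (
    "A shop sells pens at 3 for $2. How much do 12 pens cost?\n"
    "Let's think step by step."
)
```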

Tree of Thoughts (ToT) is a framework that extends the Chain of Thought approach by creating a tree-like structure of thoughts or reasoning steps. This method allows for more extensive exploration and branching in the thought process, enabling LLMs to consider multiple pathways and scenarios before arriving at a conclusion.

Multimodal CoT (Chain of Thought) extends step-by-step reasoning to inputs that span more than one modality, such as text and images. The model grounds its rationale in the combined input, typically generating the reasoning first and then inferring the answer from it, which produces more comprehensive and well-rounded responses.

Prompt Chaining involves linking multiple prompts in a sequential manner, where the output of one prompt serves as the input for the next. This method is particularly effective for complex tasks that can be broken down into smaller, more manageable sub-tasks, leading to more accurate and refined outputs.
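
A minimal sketch of a two-step chain using LlamaIndex's OpenAI wrapper (the prompts and model choice are illustrative):

```python
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

def summarise_then_extract(document: str) -> str:
    # Step 1: condense the document.
    summary = llm.complete(
        f"Summarise the following document in three sentences:\n{document}"
    ).text
    # Step 2: the first prompt's output becomes the second prompt's input.
    return llm.complete(
        f"List the three most important action items in this summary:\n{summary}"
    ).text
```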

Self-Consistency aims to ensure that responses from LLMs are not only accurate but also consistent. In practice, several reasoning paths are sampled for the same prompt and the most common final answer is kept (a majority vote), thereby enhancing reliability and trustworthiness.
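
A minimal sketch of self-consistency via majority voting (the prompt format and the "Answer:" extraction convention are assumptions):

```python
from collections import Counter

from llama_index.llms import OpenAI

# A non-zero temperature makes each sampled reasoning path different.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.8)

def self_consistent_answer(question: str, n: int = 5) -> str:
    answers = []
    for _ in range(n):
        text = llm.complete(
            f"{question}\nLet's think step by step, "
            "then finish with 'Answer: <answer>'."
        ).text
        # Keep only the final answer from each reasoning path.
        answers.append(text.rsplit("Answer:", 1)[-1].strip())
    # Majority vote across the sampled paths.
    return Counter(answers).most_common(1)[0][0]
```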

As you can see there are plenty of techniques, with more emerging every day. Check out the prompt engineering guide for more ways to enhance your prompting here.

Indexing

Choosing the index: The search index is one of the main components of a RAG pipeline, as it holds your vectorised unstructured content. Efficient retrieval is achieved using Approximate Nearest Neighbour (ANN) algorithms, which include clustering- and tree-based methods as well as graph-based approaches such as HNSW.

Some managed solutions like ElasticSearch, Pinecone, Weaviate, or Chroma will handle your data ingestion pipelines out of the box. Also make sure your chosen vector store supports storing metadata, as this is vital for supporting citations.

LlamaIndex is a data framework for LLM-based applications to ingest, structure, and access private or domain-specific data. It's available in Python and TypeScript.
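
As a minimal sketch using the pre-0.10 llama_index imports (assuming your documents live in a local ./data folder and an OpenAI API key is configured):

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Load and chunk the documents, embed the chunks, and store them in an
# in-memory vector index; a StorageContext lets you swap in Pinecone,
# Weaviate, Chroma, etc.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What does the handbook say about annual leave?"))
```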

Querying

There are many ways to enhance this part of the process. Methods like query construction, query expansion and query transformations can each play a part in enhancing the search process.

Query Construction: Transforms natural-language queries into the format the data source expects. It embeds questions into vectors for unstructured data, or rewrites queries (for example into SQL or metadata filters) for structured data sources.

Query Expansion: Generates multiple variations or sub-questions from the original query and retrieves against each one, improving recall when a single phrasing would miss relevant chunks. See the sketch below.
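
A minimal sketch: ask the LLM for paraphrases of the user's question and retrieve for each variant (the prompt wording and variant count are assumptions):

```python
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

def expand_query(query: str, n: int = 3) -> list[str]:
    # Ask the LLM to rephrase the query; each variant is retrieved
    # separately and the result sets are merged downstream.
    prompt = (
        f"Generate {n} alternative phrasings of the following search query, "
        f"one per line, with no numbering:\n{query}"
    )
    variants = llm.complete(prompt).text.strip().splitlines()
    return [query] + [v.strip() for v in variants if v.strip()]
```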

Retrieval

Top-k Retrieval:

Playing with your top-k and using a similarity cutoff in a postprocessor will only get you so far. So let's look at some more advanced ways to retrieve the right documents.

Small-to-Big Retrieval involves retrieving smaller chunks for better search quality and then adding surrounding context for the LLM to reason over. Here are two methods:

Sentence Window Retrieval: Embed and retrieve individual sentences, then replace each retrieved sentence with a window of surrounding sentences before synthesis (see the sketch below).

Parent-Child Chunk Retrieval: Index small child chunks that link back to larger parent chunks; matching happens on the child, but the parent is returned as context.

Other approaches include hybrid search, which fuses keyword-based (e.g. BM25) and vector retrieval, and re-ranking the retrieved candidates with a cross-encoder or an LLM.
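
Here is a sketch of the sentence window variant in pre-0.10 LlamaIndex, reusing the documents loaded earlier:

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor

# Parse documents into single-sentence nodes; each node carries a window
# of surrounding sentences in its metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# At query time, swap each retrieved sentence for its wider window so the
# LLM sees the surrounding context.
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
```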

Post Processing

A node postprocessor takes in a set of retrieved nodes and applies transformations, filtering or re-ranking logic to them.

SimilarityPostprocessor:

This module is designed to remove nodes that fall below a set similarity score threshold. It ensures that only nodes with a high degree of relevance or similarity are added to the context.
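
A minimal sketch, reusing the index from earlier (the 0.75 cutoff is illustrative and should be tuned on your own data):

```python
from llama_index.postprocessor import SimilarityPostprocessor

# Retrieve generously, then drop any node scoring below the cutoff so only
# highly similar chunks reach the LLM's context window.
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)],
)
```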

Response Synthesiser

A Response Synthesiser generates a response from an LLM, using a user query and a given set of retrieved text chunks. The method for doing this can take many forms, from iterating over text chunks to something more complex like building a tree. The main idea here is to simplify the process of generating a response using an LLM across your data. In LlamaIndex you can set modes.

Modes:

Refine Mode:

Processes each retrieved text chunk sequentially, making a separate LLM call per node. This is the default mode for a list index. The list of nodes is traversed in order; at each step, the query, the response so far and the context of the current node are embedded in a prompt template that asks the LLM to refine the response to the query according to the new information in the current node. This mode is ideal for detailed answers.
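
A minimal sketch of selecting the mode in pre-0.10 LlamaIndex, reusing the index from earlier:

```python
from llama_index.response_synthesizers import get_response_synthesizer

# "refine" walks the retrieved nodes one by one, refining the answer at
# each step; "compact" and "tree_summarize" are alternative modes.
synth = get_response_synthesizer(response_mode="refine")
query_engine = index.as_query_engine(response_synthesizer=synth)
```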

Evaluate

Evaluation and benchmarking are crucial concepts in LLM development. To improve the performance of an LLM RAG app you must have a way to measure it. There are many techniques emerging but here are the main points to focus on.

Component-Wise Evaluation: Focuses on assessing each part of the RAG system separately, particularly retrieval accuracy and relevance.

End-to-End Evaluation: Examines the entire RAG system, including the final response's correctness and relevance. This is the best place to start, to see how your whole pipeline is faring.

Create Ground Truth Datasets: This involves generating question-context pairs, either manually or synthetically. These will be ground truths specifically designed for testing and assessing the performance of the RAG system.
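
For the synthetic route, LlamaIndex can write the questions for you; a minimal sketch (reusing the nodes from earlier, with GPT-4 as the question writer):

```python
from llama_index.evaluation import generate_question_context_pairs
from llama_index.llms import OpenAI

# Each node gets a couple of LLM-written questions it should be able to
# answer, giving (question, context) ground-truth pairs for retrieval eval.
qa_dataset = generate_question_context_pairs(
    nodes, llm=OpenAI(model="gpt-4"), num_questions_per_chunk=2
)
```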

Metrics for Evaluation:

Correctness: Checks if the generated answer matches the ground truth answer.

Semantic Similarity: Assesses semantic similarity to the ground truth answer.

Faithfulness: Evaluates the answer's faithfulness to the retrieved contexts, i.e. checks it against the factual content of the source material.

Groundedness: Similar to faithfulness, but assesses the extent to which the generated content can be traced back to specific information in the source material.

Context Relevancy: Determines the relevancy of retrieved context in relation to the query.

Answer Relevancy: Assesses the generated answer’s relevance in relation to the query.

Guideline Adherence: Evaluates adherence to specific guidelines, like corporate identity, tone and behaviour.
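
A minimal sketch of scoring faithfulness and relevancy with LlamaIndex's built-in evaluators (pre-0.10 imports; GPT-4 acts as the judge, the query is illustrative, and the query_engine is the one built earlier):

```python
from llama_index import ServiceContext
from llama_index.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms import OpenAI

# Use GPT-4 as the judge model for both evaluators.
judge = ServiceContext.from_defaults(llm=OpenAI(model="gpt-4"))
faithfulness = FaithfulnessEvaluator(service_context=judge)
relevancy = RelevancyEvaluator(service_context=judge)

query = "What does the handbook say about annual leave?"
response = query_engine.query(query)

# Faithfulness: is the answer supported by the retrieved contexts?
print(faithfulness.evaluate_response(response=response).passing)
# Relevancy: do the answer and contexts actually address the query?
print(relevancy.evaluate_response(query=query, response=response).passing)
```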

RAG Triad Framework (TruLens): This is one of many frameworks appearing. TruLens focuses on context relevance, groundedness and answer relevance, and emphasises 'honest, harmless, helpful' criteria for evaluating LLM applications. It's worth a look here.

[Figure: the RAG Triad. Credit: TruLens]

LlamaIndex offers key modules to measure the quality of generated results and your RAG pipeline. You can learn more about how evaluation works in LlamaIndex in the module guides.

Fine-tuning and Agents: it would be wrong to leave these out without a mention. However, they warrant their own articles, so they will be covered in future instalments.