Building a RAG Pipeline for Summarization and Q&A with LlamaIndex and OpenAI
In this post, I will build a basic RAG pipeline using LlamaIndex, featuring both a Q&A Query Engine and a Summarization Query Engine. A Router (`RouterQueryEngine`) will dynamically select the most appropriate query engine to process each query. Here, I’ll walk through how to build a custom RAG pipeline using Python, LlamaIndex, and OpenAI models.
January 10, 2025 · 9 min · 1722 words · Prakash Bhandari
In this post, I will first define what LlamaIndex and RAG (Retrieval-Augmented Generation) are, describe the basic architecture of the pipeline I am building, and then implement the concept in Python code.
LlamaIndex is a powerful data framework designed to connect custom data sources (text, HTML, PDF, etc.) to Large Language Models.
It acts as an interface that manages the interaction with the LLMs: loading data from a source and creating an index from the input data, which is then used to respond
to user queries.
LLMs are trained on publicly available data but do not have access to your private data,
meaning they were never trained on it. RAG (Retrieval-Augmented Generation) bridges this gap by supplying your private data to the LLM at query time,
so that we can build LLM applications on top of private data (both structured and unstructured).
In short, RAG is a technique for querying over structured and unstructured documents using a large language model (LLM).
Here, I will explain the architecture of a basic RAG (Retrieval-Augmented Generation) pipeline designed for summarization and Q&A tasks using query engines, with the help of the architecture diagram below.
Here’s an explanation of the components and the flow:
- **Vector Index**: An index created from the document for Q&A tasks. It enables fast similarity searches using vector embeddings of the document content, and is connected to the Q&A Query Engine for retrieving context-relevant information.
- **Summary Index**: An index generated specifically for summarization tasks. It processes and organizes the document content to facilitate quick and accurate summarization.
- **Router Query Engine**: The central component that receives the query from the user. It dynamically selects the appropriate query engine (Q&A or Summarization) based on the nature of the query, then combines the response from the chosen engine and sends it back to the user.
The first step is to install the necessary dependencies. Here, I am using pip to install them: `pip install llama-index llama-index-core python-dotenv`. The `python-dotenv` package is used to load environment variables (such as API keys) from a .env file.
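Here is a minimal sketch of the environment setup, assuming your OpenAI API key is stored in the .env file as `OPENAI_API_KEY` (the variable the OpenAI integrations read by default):

```python
# Step 1: Load environment variables (e.g. OPENAI_API_KEY) from a .env file
from dotenv import load_dotenv

load_dotenv()
```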
Use SimpleDirectoryReader to load documents from a directory. Here, we specify a folder named `pdf/` where the documents are stored. I have downloaded The Google PageRank Algorithm PDF file and placed it inside the `pdf/` folder. I also split the loaded documents into smaller chunks (nodes) with SentenceSplitter, since the indices in the next steps are built from these nodes; see the sketch below.
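A short sketch of these two steps, mirroring the full listing at the end of the post (the `pdf/` folder and the 1024-token chunk size are the values used there):

```python
# Step 2: Load documents from the specified directory
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("pdf/").load_data()

# Step 3: Split documents into smaller chunks (nodes) for efficient processing
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
```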
Here I configure OpenAI’s gpt-3.5-turbo as the primary language model and text-embedding-ada-002 for generating embeddings.
These settings are applied globally using the Settings object.
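The configuration, taken from the full listing below, looks like this:

```python
# Step 4: Configure the LLM and embedding models globally
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-3.5-turbo")  # Language model for Q&A and summarization
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")  # Embedding model for vector search
```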
```python
# Step 5: Build indices for summarization and vector-based retrieval
from llama_index.core import SummaryIndex, VectorStoreIndex

summary_index = SummaryIndex(nodes)     # Creates a summarization index
vector_index = VectorStoreIndex(nodes)  # Creates a vector-based index for semantic search
```
Query engines enable us to interact with the indices. The summarization engine uses a tree summarization approach, while the vector engine supports general queries.
```python
# Step 6: Create query engines from the indices
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",  # Use tree summarization mode
    use_async=True,  # Enable faster query generation by leveraging asynchronous processing
)
vector_query_engine = vector_index.as_query_engine()  # Standard query engine for the vector index
```
```python
# Step 7: Define tools for summarization and specific queries
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions related to The Google PageRank Algorithm",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Get the important concepts from the paper",
)
```
A RouterQueryEngine selects the appropriate tool for each query using a selector, such as LLMSingleSelector. This setup keeps the pipeline flexible: new tools can be added later without changing how queries are made.
```python
# Step 8: Combine tools into a router query engine
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),  # Selector to route queries to the appropriate tool
    query_engine_tools=[
        summary_tool,  # Tool for summarization
        vector_tool,   # Tool for specific questions
    ],
    verbose=True,  # Enable verbose output for debugging
)
```
Finally, I query the documents through the router query engine:

```python
# Step 9: Query the documents using the router query engine
response = query_engine.query("What is the summary of the document?")
print("Summary Response:", str(response))

response = query_engine.query("Who is the author of the paper and when was it published?")
print("Author and Date Response:", str(response))

response = query_engine.query("What is the paper about?")
print("Paper is about:", str(response))
```
Putting it all together, here is the complete code for the pipeline:

```python
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, SummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

# Step 1: Load environment variables from a .env file
load_dotenv()

# Step 2: Load documents from the specified directory
documents = SimpleDirectoryReader("pdf/").load_data()

# Step 3: Split documents into smaller chunks (nodes) for efficient processing
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

# Step 4: Configure the LLM and embedding models
Settings.llm = OpenAI(model="gpt-3.5-turbo")  # Language model for NLP tasks
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")  # Embedding model for vector representation

# Step 5: Build indices for summarization and vector-based retrieval
summary_index = SummaryIndex(nodes)     # Creates a summarization index
vector_index = VectorStoreIndex(nodes)  # Creates a vector-based index for semantic search

# Step 6: Create query engines from the indices
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",  # Use tree summarization mode
    use_async=True,  # Enable faster query generation by leveraging asynchronous processing
)
vector_query_engine = vector_index.as_query_engine()  # Standard query engine for the vector index

# Step 7: Define tools for summarization and specific queries
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions related to The Google PageRank Algorithm",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Get the important concepts from the paper",
)

# Step 8: Combine tools into a router query engine
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),  # Selector to route queries to the appropriate tool
    query_engine_tools=[
        summary_tool,  # Tool for summarization
        vector_tool,   # Tool for specific questions
    ],
    verbose=True,  # Enable verbose output for debugging
)

# Step 9: Query the documents using the router query engine
response = query_engine.query("What is the summary of the document?")
print("Summary Response:", str(response))

response = query_engine.query("Who is the author of the paper and when was it published?")
print("Author and Date Response:", str(response))

response = query_engine.query("What is the paper about?")
print("Paper is about:", str(response))
```
In this post, I built a basic Retrieval-Augmented Generation (RAG) pipeline using LlamaIndex to handle both summarization and Q&A tasks.
I explored the concept of RAG through a step-by-step guide and demonstrated how to load documents, split them into chunks,
create vector and summary indexes, and set up query engines.
I then implemented a dynamic query router that selects the appropriate engine based on the user’s query type.