Building a RAG Pipeline for Summarization and Q&A with LlamaIndex and OpenAI
In this post, I will build a basic RAG pipeline using LlamaIndex, featuring both a Q&A Query Engine and a Summarization Query Engine. A Router (`RouterQueryEngine`) will dynamically select the most appropriate query engine to process each query. Here, I’ll walk through how to build a custom RAG pipeline using Python, LlamaIndex, and OpenAI models.
January 10, 2025 · 9 min · 1722 words · Prakash Bhandari
In this post, I will first define what LlamaIndex and RAG (Retrieval-Augmented Generation) are, describe the basic architecture of the pipeline I am building, and then implement the concept in Python code.
LlamaIndex is a powerful data framework designed to connect custom data sources (text, HTML, PDF, etc.) to Large Language Models.
It acts as an interface that manages the interaction with the LLMs: loading data from a source and creating an index from the input data, which is then used to respond
to user queries.
LLMs are trained on publicly available data but do not have access to your private data,
meaning they were never trained on it. RAG (Retrieval-Augmented Generation) bridges this gap by supplying your private data to the LLM at query time,
so that we can build LLM applications on top of private data (both structured and unstructured).
In short, RAG is a technique for querying over structured and unstructured documents using a large language model (LLM).
Here, I will explain the architecture of a basic RAG (Retrieval-Augmented Generation) pipeline designed for summarization and Q&A tasks using query engines, with the help of the architecture diagram below.
Here’s an explanation of the components and the flow:
- **Vector Index**: An index created from the document for Q&A tasks. It enables fast similarity searches using vector embeddings of the document content, and is connected to the Q&A Query Engine for retrieving context-relevant information.
- **Summary Index**: An index generated specifically for summarization tasks. It processes and organizes the document content to facilitate quick and accurate summarization.
- **Router Query Engine**: The central component that receives the query from the user. It dynamically selects the appropriate query engine (Q&A or Summarization) based on the nature of the query, then combines the response from the chosen engine and sends it back to the user.
The first step is to install the necessary dependencies. Here, I am using pip to install them: `pip install llama-index llama-index-core python-dotenv`. The `python-dotenv` package is used to load environment variables (such as API keys) from a .env file.
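Here is a minimal sketch of the environment setup, assuming your OpenAI API key is stored in the .env file as `OPENAI_API_KEY` (the variable the OpenAI integrations read by default):

```python
# Step 1: Load environment variables (e.g. OPENAI_API_KEY) from a .env file
from dotenv import load_dotenv

load_dotenv()
```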
Use SimpleDirectoryReader to load documents from a directory. Here, we specify a folder named `pdf/` where the documents are stored. I have downloaded The Google PageRank Algorithm PDF file and placed it inside the `pdf/` folder. I also split the loaded documents into smaller chunks (nodes) with SentenceSplitter, since the indices in the next steps are built from these nodes; see the sketch below.
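A short sketch of these two steps, mirroring the full listing at the end of the post (the `pdf/` folder and the 1024-token chunk size are the values used there):

```python
# Step 2: Load documents from the specified directory
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("pdf/").load_data()

# Step 3: Split documents into smaller chunks (nodes) for efficient processing
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)
```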
Here I configure OpenAI’s gpt-3.5-turbo as the primary language model and text-embedding-ada-002 for generating embeddings.
These settings are applied globally using the Settings object.
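The configuration, taken from the full listing below, looks like this:

```python
# Step 4: Configure the LLM and embedding models globally
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = OpenAI(model="gpt-3.5-turbo")  # Language model for Q&A and summarization
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")  # Embedding model for vector search
```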
```python
# Step 5: Build indices for summarization and vector-based retrieval
from llama_index.core import SummaryIndex, VectorStoreIndex

summary_index = SummaryIndex(nodes)     # Creates a summarization index
vector_index = VectorStoreIndex(nodes)  # Creates a vector-based index for semantic search
```
Query engines enable us to interact with the indices. The summarization engine uses a tree summarization approach, while the vector engine supports general queries.
```python
# Step 6: Create query engines from the indices
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",  # Use tree summarization mode
    use_async=True,  # Enable faster query generation by leveraging asynchronous processing
)
vector_query_engine = vector_index.as_query_engine()  # Standard query engine for the vector index
```
```python
# Step 7: Define tools for summarization and specific queries
from llama_index.core.tools import QueryEngineTool

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions related to The Google PageRank Algorithm",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Get the important concepts from the paper",
)
```
A RouterQueryEngine selects the appropriate tool for each query using a selector, such as LLMSingleSelector. This setup keeps the pipeline flexible: new tools can be added later without changing how queries are made.
```python
# Step 8: Combine tools into a router query engine
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),  # Selector to route queries to the appropriate tool
    query_engine_tools=[
        summary_tool,  # Tool for summarization
        vector_tool,   # Tool for specific questions
    ],
    verbose=True,  # Enable verbose output for debugging
)
```
Finally, I query the documents through the router query engine:

```python
# Step 9: Query the documents using the router query engine
response = query_engine.query("What is the summary of the document?")
print("Summary Response:", str(response))

response = query_engine.query("Who is the author of the paper and when was it published?")
print("Author and Date Response:", str(response))

response = query_engine.query("What is the paper about?")
print("Paper is about:", str(response))
```
Putting it all together, here is the complete code for the pipeline:

```python
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, SummaryIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.tools import QueryEngineTool
from llama_index.core.query_engine.router_query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector

# Step 1: Load environment variables from a .env file
load_dotenv()

# Step 2: Load documents from the specified directory
documents = SimpleDirectoryReader("pdf/").load_data()

# Step 3: Split documents into smaller chunks (nodes) for efficient processing
splitter = SentenceSplitter(chunk_size=1024)
nodes = splitter.get_nodes_from_documents(documents)

# Step 4: Configure the LLM and embedding models
Settings.llm = OpenAI(model="gpt-3.5-turbo")  # Language model for NLP tasks
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")  # Embedding model for vector representation

# Step 5: Build indices for summarization and vector-based retrieval
summary_index = SummaryIndex(nodes)     # Creates a summarization index
vector_index = VectorStoreIndex(nodes)  # Creates a vector-based index for semantic search

# Step 6: Create query engines from the indices
summary_query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",  # Use tree summarization mode
    use_async=True,  # Enable faster query generation by leveraging asynchronous processing
)
vector_query_engine = vector_index.as_query_engine()  # Standard query engine for the vector index

# Step 7: Define tools for summarization and specific queries
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_query_engine,
    description="Useful for summarization questions related to The Google PageRank Algorithm",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_query_engine,
    description="Get the important concepts from the paper",
)

# Step 8: Combine tools into a router query engine
query_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),  # Selector to route queries to the appropriate tool
    query_engine_tools=[
        summary_tool,  # Tool for summarization
        vector_tool,   # Tool for specific questions
    ],
    verbose=True,  # Enable verbose output for debugging
)

# Step 9: Query the documents using the router query engine
response = query_engine.query("What is the summary of the document?")
print("Summary Response:", str(response))

response = query_engine.query("Who is the author of the paper and when was it published?")
print("Author and Date Response:", str(response))

response = query_engine.query("What is the paper about?")
print("Paper is about:", str(response))
```
In this post, I built a basic Retrieval-Augmented Generation (RAG) pipeline using LlamaIndex to handle both summarization and Q&A tasks.
I explored the concept of RAG through a step-by-step guide and demonstrated how to load documents, split them into chunks,
create vector and summary indexes, and set up query engines.
I then implemented a dynamic query router that selects the appropriate engine based on the user’s query type.