TL;DR
Traditional RAG follows a rigid pipeline: chunk documents, embed them, retrieve the top-K, stuff them into a prompt, and generate. This works well for single-source factoid questions, but it collapses under real-world complexity -- ambiguous queries that need routing across multiple data sources, multi-hop questions requiring iterative retrieval, or situations where the first retrieval attempt simply returns irrelevant noise.
Agentic RAG solves this by placing an AI agent at the center of the retrieval pipeline. The agent does not just retrieve; it reasons about what to retrieve, evaluates the results, and decides the next action -- whether that is refining the query, switching data sources, or generating the final answer. This article dissects the architecture, walks through four production-tested design patterns, and provides working code you can adapt today.
📋 Table of Contents
- What is Agentic RAG?
- Four Design Patterns
- Architecture Comparison
- Implementation with LangGraph
- Production Best Practices
- FAQ
- Summary
✨ Key Takeaways
- Paradigm Shift: Agentic RAG transforms retrieval from "passive execution" to "active reasoning" using agent-controlled loops.
- Core Patterns: Routing, Multi-step, Corrective (CRAG), and Adaptive RAG are the essential design patterns for production.
- Self-Correction: By introducing a "grading" step, Agentic RAG effectively mitigates hallucinations and improves accuracy for complex queries.
- Deployment Strategy: Success in production requires strict max-iteration guards, tiered model costs, and deep observability.
💡 Quick Tool: Awesome Prompt Directory — Discover and optimize your agent prompts to significantly improve routing and grading accuracy in Agentic RAG.
Why Naive RAG Hits a Wall
If you have built a RAG system following the standard pipeline described in our RAG fundamentals guide, you have likely encountered these failure modes:
Single-source blindness. The pipeline retrieves from one vector database, but the answer requires joining information from a SQL database, an API, and a document store.
One-shot retrieval failure. The user asks "Compare the 2025 and 2024 annual reports," but the single retrieval pass only surfaces chunks from one year.
Irrelevant context poisoning. The top-K retrieval returns tangentially related documents. The LLM dutifully generates an answer grounded in the wrong context, producing a confident but incorrect response -- the textbook definition of hallucination.
No self-correction. Once the pipeline commits to its retrieved context, there is no mechanism to evaluate whether that context actually answers the question.
These are not edge cases. In production RAG systems serving thousands of queries per day, 15-30% of failures trace back to retrieval quality issues that a static pipeline cannot self-correct. The solution is to give the retrieval pipeline the ability to think.
What Makes RAG "Agentic"
An agentic workflow introduces a reasoning layer that sits between the user query and the retrieval-generation pipeline. Instead of a fixed DAG (directed acyclic graph), the system becomes a stateful graph with conditional edges, loops, and decision nodes.
The key capabilities that distinguish Agentic RAG from traditional RAG:
| Capability | Naive RAG | Agentic RAG |
|---|---|---|
| Retrieval strategy | Fixed (single vector search) | Dynamic (agent selects sources, refines queries) |
| Number of retrieval rounds | Always 1 | Variable (1 to N based on need) |
| Quality evaluation | None | Agent grades retrieved documents |
| Error recovery | None | Agent detects failure and retries with new strategy |
| Tool usage | None | Agent can call APIs, SQL, web search as tools |
| Query decomposition | None | Agent breaks complex queries into sub-questions |
The agent itself typically follows the ReAct pattern (Reason + Act) or a more structured state-machine approach. It maintains a scratchpad of its reasoning, tracks which sources it has queried, and makes explicit decisions at each step.
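As a mental model, the ReAct loop reduces to a few lines of control flow. The helpers below are illustrative stubs, not a real agent -- in practice `reason` would be an LLM call and `act` would dispatch to real tools:

```python
# Minimal ReAct-style control loop: reason -> act -> observe, with a scratchpad.
# All helpers here are stand-ins for LLM calls and tool invocations.

def reason(question: str, scratchpad: list[str]) -> str:
    """Decide the next action from the question and prior observations."""
    return "retrieve" if not scratchpad else "answer"

def act(action: str, question: str) -> str:
    """Execute the chosen action and return an observation."""
    if action == "retrieve":
        return f"docs for: {question}"
    return f"answer to: {question}"

def react_loop(question: str, max_steps: int = 4) -> str:
    scratchpad: list[str] = []
    for _ in range(max_steps):
        action = reason(question, scratchpad)
        observation = act(action, question)
        scratchpad.append(f"{action} -> {observation}")
        if action == "answer":
            return observation
    return "max steps reached"
```

The scratchpad is what makes the loop agentic: each decision is conditioned on everything the agent has already tried.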
Four Design Patterns of Agentic RAG
Production Agentic RAG systems generally follow one of four patterns, each addressing a specific class of retrieval challenge. These patterns can be composed -- a single system might use Routing at the top level and Corrective RAG within each route.
Pattern 1: Routing RAG
Problem: Your knowledge is distributed across multiple backends -- a vector store for unstructured documents, a SQL database for structured data, a graph database for entity relationships, and live APIs for real-time information.
Solution: The agent classifies the incoming query and routes it to the appropriate retrieval backend (or combination of backends) before generation.
User Query
|
v
[Router Agent] -- "product specs" --> Vector Store
|-- "revenue Q3 2025" --> SQL Database
|-- "how is X related to Y" --> Knowledge Graph
|-- "latest stock price" --> Live API
|
v
[Merge & Generate]
The routing decision can be implemented via function calling, where each data source is exposed as a tool:
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# `vectorstore`, `db`, and `kg` below are assumed to be pre-initialized clients.
@tool
def search_product_docs(query: str) -> str:
"""Search the product documentation vector store for technical specs,
feature descriptions, and user guides."""
results = vectorstore.similarity_search(query, k=5)
return "\n\n".join([doc.page_content for doc in results])
@tool
def query_analytics_db(sql_query: str) -> str:
"""Execute a read-only SQL query against the analytics database.
Use for revenue, user metrics, and quantitative business data."""
return db.run(sql_query)
@tool
def search_knowledge_graph(entity: str, relation: str) -> str:
"""Traverse the knowledge graph to find relationships between entities.
Use for org-structure, dependency, and causal-chain questions."""
return kg.query(entity, relation)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = llm.bind_tools([search_product_docs, query_analytics_db, search_knowledge_graph])
The LLM inspects the tool descriptions and selects the appropriate one based on the query intent. For reliability, add a fallback: if the selected tool returns no results, the agent should try the next most likely source.
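That fallback can be sketched as a simple loop over sources in the router's preference order. Both search functions below are stand-ins for real backends:

```python
# Try sources in ranked order until one returns results.
# The two search functions simulate a primary-source miss and a fallback hit.

def search_vector_store(query: str) -> list[str]:
    return []  # simulate no hits in the primary source

def search_web(query: str) -> list[str]:
    return [f"web result for: {query}"]

def retrieve_with_fallback(query: str, ranked_sources) -> tuple[str, list[str]]:
    """Return (source_name, results) from the first source with any hits."""
    for name, search in ranked_sources:
        results = search(query)
        if results:
            return name, results
    return "none", []

source, docs = retrieve_with_fallback(
    "latest pricing tiers",
    [("vector_store", search_vector_store), ("web_search", search_web)],
)
```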
Pattern 2: Multi-step RAG
Problem: Complex queries require information that cannot be retrieved in a single pass. "How did our customer churn rate change after we launched the new pricing tier, and what did customers say about it?" requires retrieving pricing launch dates, churn metrics, and customer feedback separately.
Solution: The agent decomposes the query into sub-questions, executes retrieval for each, and synthesizes the results.
This pattern extends the chain-of-thought reasoning approach to the retrieval phase itself. The agent generates a retrieval plan before executing any searches:
from langgraph.graph import StateGraph, START, END
from typing import TypedDict, Annotated
import operator
class AgentState(TypedDict):
question: str
sub_questions: list[str]
retrieved_contexts: Annotated[list[str], operator.add]
final_answer: str
def decompose_query(state: AgentState) -> AgentState:
prompt = f"""Break this complex question into 2-4 independent sub-questions
that can each be answered by a single retrieval pass.
Question: {state['question']}
Return a JSON array of sub-questions."""
response = llm.invoke(prompt)
sub_questions = parse_json_array(response.content)
return {"sub_questions": sub_questions}
def retrieve_for_subquestion(state: AgentState) -> AgentState:
contexts = []
for sq in state["sub_questions"]:
docs = vectorstore.similarity_search(sq, k=3)
context = f"[Sub-Q: {sq}]\n" + "\n".join([d.page_content for d in docs])
contexts.append(context)
return {"retrieved_contexts": contexts}
def synthesize(state: AgentState) -> AgentState:
all_context = "\n\n---\n\n".join(state["retrieved_contexts"])
prompt = f"""Based on the following retrieved information, provide a
comprehensive answer to the original question.
Original question: {state['question']}
Retrieved information:
{all_context}"""
answer = llm.invoke(prompt)
return {"final_answer": answer.content}
graph = StateGraph(AgentState)
graph.add_node("decompose", decompose_query)
graph.add_node("retrieve", retrieve_for_subquestion)
graph.add_node("synthesize", synthesize)
graph.add_edge(START, "decompose")
graph.add_edge("decompose", "retrieve")
graph.add_edge("retrieve", "synthesize")
graph.add_edge("synthesize", END)
app = graph.compile()
Multi-step RAG is particularly effective when combined with hybrid search at each retrieval step, as described in our guide on hybrid search and rerank optimization.
Pattern 3: Corrective RAG (CRAG)
Problem: The retrieval step returns documents, but they may not actually be relevant to the query. Without a quality gate, the LLM generates answers grounded in irrelevant context.
Solution: After retrieval, a grader evaluates each document for relevance. If the retrieved set fails the quality check, the agent falls back to an alternative strategy -- typically web search.
This pattern was formalized in the CRAG paper (Yan et al., 2024) and has become one of the most impactful additions to production RAG systems. The architecture introduces an explicit evaluation node:
def grade_documents(state: AgentState) -> AgentState:
    """Evaluate whether each retrieved document is relevant to the question."""
    question = state["question"]
    documents = state["documents"]
    grading_prompt = f"""You are a relevance grader. Determine whether the
document below contains information relevant to answering: "{question}"
Return JSON: {{"score": "relevant"}} or {{"score": "irrelevant"}}."""
    graded = []
    for doc in documents:
        result = llm.invoke(f"{grading_prompt}\n\nDocument: {doc.page_content}")
        score = parse_score(result.content)
        if score == "relevant":
            graded.append(doc)
    return {
        "documents": graded,
        "all_irrelevant": len(graded) == 0
    }
def decide_next_step(state: AgentState) -> str:
"""Conditional edge: route based on grading results."""
if state.get("all_irrelevant"):
return "web_search"
return "generate"
def web_search_fallback(state: AgentState) -> AgentState:
"""Fall back to web search when vector retrieval fails."""
query = state["question"]
web_results = tavily_client.search(query, max_results=5)
web_docs = [Document(page_content=r["content"]) for r in web_results]
return {"documents": web_docs}
graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve_documents)
graph.add_node("grade", grade_documents)
graph.add_node("web_search", web_search_fallback)
graph.add_node("generate", generate_answer)
graph.add_edge(START, "retrieve")
graph.add_edge("retrieve", "grade")
graph.add_conditional_edges("grade", decide_next_step, {
"web_search": "web_search",
"generate": "generate"
})
graph.add_edge("web_search", "generate")
graph.add_edge("generate", END)
The grading step adds one LLM call per query, but the accuracy gains are substantial. In our benchmarks, adding CRAG to a naive pipeline reduced hallucination rates by 35-45% on adversarial datasets where retrieval noise was intentionally high.
Pattern 4: Adaptive RAG
Problem: Not all queries are equal. "What is RAG?" needs a simple lookup. "Compare the trade-offs of HNSW vs IVF indexing for billion-scale datasets across latency, recall, and memory footprint" requires multi-step retrieval with reranking. Using the same heavyweight pipeline for both wastes resources and adds unnecessary latency.
Solution: The agent first classifies the query complexity, then selects the appropriate retrieval strategy -- from a simple vector lookup to full multi-step corrective retrieval.
def classify_query(state: AgentState) -> AgentState:
prompt = f"""Classify this query into one of three complexity levels:
- SIMPLE: Factoid question answerable from a single document chunk
- MODERATE: Requires retrieval + some reasoning or comparison
- COMPLEX: Multi-hop, multi-source, or requires iterative retrieval
Query: {state['question']}
Return JSON: {{"complexity": "SIMPLE|MODERATE|COMPLEX"}}"""
result = llm.invoke(prompt)
complexity = parse_complexity(result.content)
return {"complexity": complexity}
def route_by_complexity(state: AgentState) -> str:
complexity = state["complexity"]
if complexity == "SIMPLE":
return "simple_rag"
elif complexity == "MODERATE":
return "corrective_rag"
else:
return "multi_step_rag"
Adaptive RAG composes the other three patterns. It is the strategy you want at the top of a production system that handles diverse query types:
+--> [Simple RAG: retrieve + generate]
|
[Query] --> [Classify] -+--> [Corrective RAG: retrieve + grade + fallback + generate]
|
+--> [Multi-step RAG: decompose + retrieve_N + synthesize]
This mirrors the Self-RAG approach (Asai et al., 2023), which trains the LLM to emit special reflection tokens that control retrieval behavior. In practice, the prompt-based classification shown above achieves similar routing accuracy without requiring model fine-tuning.
Architecture: Naive RAG vs. Agentic RAG
To make the structural differences concrete, here is a side-by-side comparison of the execution flow:
Naive RAG:
Query --> Embed --> Vector Search (top-K) --> Stuff into Prompt --> Generate --> Done
- Linear, single-pass
- No evaluation or feedback loops
- Fast but brittle
Agentic RAG:
Query --> [Agent: Classify & Plan]
|
+--> [Route to Source(s)]
| |
| v
| [Retrieve] --> [Grade Relevance]
| | |
| | (pass) (fail)
| v v
| [Generate] [Refine Query / Web Search]
| | |
| v v
| [Validate] [Retrieve Again] --> [Grade] --> ...
| |
| (grounded) (not grounded)
| v v
| [Return] [Retry with new strategy]
- Stateful graph with conditional edges
- Self-correcting via evaluation loops
- Higher latency, but dramatically better accuracy on hard queries
The key insight: Agentic RAG trades fixed compute per query for variable compute. Simple queries take one hop. Complex queries take as many hops as needed, up to a configurable maximum.
Implementation with LangGraph
LangGraph is the most mature framework for building Agentic RAG as a stateful graph. Below is a complete implementation that combines the Adaptive (complexity classification) and Corrective (grading plus web-search fallback) patterns into a single production-ready pipeline.
TypeScript Implementation
For teams running TypeScript backends, LangGraph.js provides the same graph primitives:
import { StateGraph, START, END } from "@langchain/langgraph";
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";
interface RAGState {
question: string;
complexity: "SIMPLE" | "MODERATE" | "COMPLEX";
documents: Array<{ content: string; score: number }>;
needsWebSearch: boolean;
answer: string;
iterations: number;
}
const MAX_ITERATIONS = 3;
const llm = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
async function classifyQuery(state: RAGState): Promise<Partial<RAGState>> {
const response = await llm.invoke([
{
role: "system",
content: `Classify the query complexity as SIMPLE, MODERATE, or COMPLEX.
SIMPLE: single fact lookup. MODERATE: needs reasoning or comparison.
COMPLEX: multi-hop or multi-source required.`,
},
{ role: "user", content: state.question },
]);
const complexity = parseComplexity(response.content);
return { complexity, iterations: 0 };
}
async function retrieveDocuments(state: RAGState): Promise<Partial<RAGState>> {
const results = await vectorStore.similaritySearchWithScore(
state.question,
state.complexity === "SIMPLE" ? 3 : 6
);
const documents = results.map(([doc, score]) => ({
content: doc.pageContent,
score,
}));
return { documents, iterations: state.iterations + 1 };
}
async function gradeDocuments(state: RAGState): Promise<Partial<RAGState>> {
  const relevant: RAGState["documents"] = [];
for (const doc of state.documents) {
const grade = await llm.invoke([
{
role: "system",
content: "Is this document relevant to answering the question? Reply 'yes' or 'no'.",
},
{
role: "user",
content: `Question: ${state.question}\nDocument: ${doc.content}`,
},
]);
    if (String(grade.content).toLowerCase().includes("yes")) {
relevant.push(doc);
}
}
return {
documents: relevant,
needsWebSearch: relevant.length === 0,
};
}
async function webSearchFallback(state: RAGState): Promise<Partial<RAGState>> {
const webResults = await tavilySearch(state.question, 5);
const documents = webResults.map((r) => ({ content: r.content, score: 1.0 }));
return { documents };
}
function shouldRetryOrGenerate(state: RAGState): string {
if (state.needsWebSearch && state.iterations < MAX_ITERATIONS) {
return "web_search";
}
return "generate";
}
const workflow = new StateGraph<RAGState>({
channels: {
question: { value: (a: string, b: string) => b ?? a },
complexity: { value: (a, b) => b ?? a },
documents: { value: (a, b) => b ?? a },
needsWebSearch: { value: (a, b) => b ?? a },
answer: { value: (a, b) => b ?? a },
iterations: { value: (a, b) => b ?? a },
},
})
.addNode("classify", classifyQuery)
.addNode("retrieve", retrieveDocuments)
.addNode("grade", gradeDocuments)
.addNode("web_search", webSearchFallback)
.addNode("generate", generateAnswer)
.addEdge(START, "classify")
.addEdge("classify", "retrieve")
.addEdge("retrieve", "grade")
.addConditionalEdges("grade", shouldRetryOrGenerate, {
web_search: "web_search",
generate: "generate",
})
.addEdge("web_search", "generate")
.addEdge("generate", END);
const app = workflow.compile();
Key Implementation Details
Max-iteration guards. Without a hard limit on retrieval loops, an agent chasing irrelevant results can loop indefinitely. Always cap iterations (3 is a reasonable default) and fall back to generating with whatever context is available.
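A guard like this is a one-line conditional edge. Here is a framework-free sketch; the state keys `all_irrelevant` and `iterations` mirror the earlier examples:

```python
# Iteration guard as a conditional-edge function: retry retrieval only while
# the graded documents look bad AND the iteration budget remains.

MAX_ITERATIONS = 3

def should_continue(state: dict) -> str:
    if state.get("all_irrelevant") and state.get("iterations", 0) < MAX_ITERATIONS:
        return "rewrite_query"  # one more retrieval round
    return "generate"  # budget spent, or documents passed grading
```

When the budget is exhausted, the pipeline generates from whatever context survived grading rather than looping again.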
State management. The AgentState must carry the full history of retrieval attempts. This enables the generate step to cite which sources were consulted and which were rejected, improving answer transparency.
Streaming. In production, users should not stare at a blank screen while the agent reasons through multiple retrieval hops. LangGraph supports streaming intermediate steps:
async for event in app.astream({"question": user_query}):
    # Each streamed event is keyed by the node that just finished.
    if "retrieve" in event:
        yield f"Searching {len(event['retrieve']['documents'])} documents..."
    elif "grade" in event:
        yield "Evaluating relevance of retrieved documents..."
    elif "generate" in event:
        yield event["generate"]["final_answer"]
Production Best Practices
Observability is Non-negotiable
Every agent decision -- routing choice, grading result, query rewrite, fallback trigger -- must be logged with structured metadata. Without this, debugging a failing query in production is nearly impossible.
import structlog
logger = structlog.get_logger()
def grade_documents_with_logging(state):
for doc in state["documents"]:
        score = grade(doc, state["question"])  # `grade` wraps your relevance-grading LLM call
logger.info(
"document_graded",
question=state["question"],
doc_id=doc.metadata.get("id"),
relevance=score,
retrieval_round=state.get("iterations", 1),
)
Integrate with LangSmith, Phoenix, or your existing observability stack. At minimum, track: (1) which pattern was selected, (2) how many retrieval rounds occurred, (3) whether web search fallback was triggered, and (4) end-to-end latency.
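Those four minimums can be captured in a single per-query trace record; the field names below are illustrative:

```python
# One trace record per query, emitted at the end of the agent loop.
from dataclasses import dataclass, asdict

@dataclass
class QueryTrace:
    pattern: str            # which pattern was selected (e.g. "corrective")
    retrieval_rounds: int   # how many retrieval rounds occurred
    web_fallback: bool      # whether web search fallback was triggered
    latency_ms: float       # end-to-end latency

trace = QueryTrace(pattern="corrective", retrieval_rounds=2,
                   web_fallback=False, latency_ms=842.0)
record = asdict(trace)  # ready for structlog / your log pipeline
```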
Cost Control
Agentic RAG makes multiple LLM calls per query. A complex query might invoke the LLM 5-8 times (classify + decompose + N grades + synthesize). To manage costs:
- Use smaller models for grading. Document relevance grading does not require GPT-4o. A fine-tuned GPT-4o-mini or even a cross-encoder rerank model handles binary relevance classification at a fraction of the cost.
- Cache aggressively. If the same document appears in multiple queries, cache its grading result keyed on (doc_id, query_embedding_bucket).
- Set complexity-based budgets. SIMPLE queries get at most 2 LLM calls. COMPLEX queries get up to 8. This prevents runaway costs on edge cases.
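A grading cache along these lines can be sketched with `functools.lru_cache`. The `bucket_query` helper here is a hypothetical placeholder -- a real implementation would bucket by embedding similarity, not a hash:

```python
# Cache grading results keyed on (doc_id, query_bucket) so repeated
# document/query pairs skip the grading model entirely.
from functools import lru_cache

def bucket_query(query: str) -> int:
    """Placeholder bucket; swap in an embedding-based bucketing scheme."""
    return hash(query) % 1024

@lru_cache(maxsize=100_000)
def cached_grade(doc_id: str, query_bucket: int) -> str:
    # In production this calls the small grading model once per unique key.
    return "relevant"

score = cached_grade("doc-42", bucket_query("How did churn change in Q3?"))
```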
Latency Optimization
Multi-step retrieval adds latency. Mitigation strategies:
- Parallel sub-query retrieval. When the decompose step yields independent sub-questions, execute all retrieval calls concurrently rather than sequentially.
- Speculative retrieval. Start retrieving from the most likely source while the classifier is still running. If the classification confirms the guess, you save one round-trip.
- Semantic search index warm-up. Pre-compute and cache embeddings for common query patterns to reduce vector search latency.
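The first of these, parallel sub-query retrieval, is essentially one `asyncio.gather` call. The retriever below is a stub standing in for an async vector-store client:

```python
# Fan out retrieval for independent sub-questions concurrently
# instead of awaiting each one in sequence.
import asyncio

async def retrieve_one(sub_question: str) -> str:
    await asyncio.sleep(0)  # stand-in for real async retriever I/O
    return f"context for: {sub_question}"

async def retrieve_all(sub_questions: list[str]) -> list[str]:
    return list(await asyncio.gather(*(retrieve_one(sq) for sq in sub_questions)))

contexts = asyncio.run(
    retrieve_all(["When did the pricing tier launch?", "How did churn move?"])
)
```

With N independent sub-questions, wall-clock retrieval time drops from the sum of the individual latencies to roughly the slowest single call.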
Guardrails and Safety
Agentic RAG systems with web search fallback are particularly susceptible to prompt injection via retrieved web content. Apply guardrails at multiple layers:
- Input guardrails: Validate the user query before it enters the agent loop.
- Retrieval guardrails: Sanitize web search results before they enter the context window.
- Output guardrails: Validate that the final answer does not contain injected instructions or harmful content.
def sanitize_web_results(results: list[Document]) -> list[Document]:
sanitized = []
for doc in results:
content = doc.page_content
if contains_injection_patterns(content):
logger.warning("injection_detected", content_preview=content[:200])
continue
sanitized.append(doc)
return sanitized
When Not to Use Agentic RAG
Agentic RAG is not universally superior. The added complexity is unjustified when:
- Queries are homogeneous. If 95% of queries are simple factoid lookups against a single source, naive RAG with good chunking strategies and a reranker is sufficient and much faster.
- Latency SLAs are strict. Sub-200ms response requirements are difficult to meet when the agent needs 3-5 LLM round-trips. Consider pre-computing answers for known query patterns instead.
- Budget is constrained. If you are processing millions of queries daily, the cost multiplier of 3-8x LLM calls per query adds up fast. Start with naive RAG + hybrid search and add agentic capabilities only for the query classes where retrieval quality is demonstrably poor.
- Your data is simple. A single-table FAQ with 200 entries does not need an agent to decide retrieval strategy.
The decision framework is straightforward: measure your naive RAG system's failure rate. If it is below 5% and failures are random rather than systematic, optimize the retrieval pipeline itself (better chunking, hybrid search, reranking). If failures are systematic and cluster around specific query types (multi-hop, multi-source, ambiguous), those query types are candidates for agentic patterns.
Agentic RAG and the Broader Agent Ecosystem
Agentic RAG sits at the intersection of two major trends in AI engineering: the maturation of RAG as a retrieval paradigm, and the rise of AI agents as a deployment pattern.
Looking at the multi-agent landscape, Agentic RAG is often a component within a larger agent system rather than a standalone application. For example:
- A research agent might use Adaptive RAG to gather information, then pass its findings to a writing agent for report generation.
- A customer support agent might route billing questions through SQL-backed RAG and product questions through document-backed RAG, with CRAG ensuring neither path hallucinates.
- A code generation agent might use multi-step RAG to retrieve relevant API documentation, code examples, and issue discussions before writing code.
The integration with GraphRAG is particularly powerful. By combining the entity-relationship awareness of knowledge graphs with the dynamic retrieval strategies of Agentic RAG, systems can handle questions that require both structural reasoning ("Which teams depend on Service X?") and semantic retrieval ("What are the known issues with Service X?").
For multimodal RAG scenarios -- where the retrieval corpus includes images, tables, and diagrams alongside text -- the agent layer becomes even more critical. The agent must decide not just which source to query, but what modality of retrieval to use and how to fuse results across modalities.
Summary
Agentic RAG represents a fundamental shift from static retrieval pipelines to dynamic, self-correcting systems. The four patterns -- Routing, Multi-step, Corrective, and Adaptive -- provide a toolkit for handling the full spectrum of query complexity in production.
The implementation path is clear: start with naive RAG, measure systematic failure modes, and introduce agentic patterns incrementally where they deliver measurable accuracy gains. Use LangGraph or equivalent frameworks to encode the agent logic as an explicit, debuggable graph rather than opaque prompt chains.
The cost of Agentic RAG is higher per query, but the cost of wrong answers in enterprise applications -- lost trust, incorrect decisions, compliance violations -- is higher still. For systems where retrieval accuracy is a hard requirement, Agentic RAG is no longer optional; it is the engineering standard.
FAQ
Q1: What is Agentic RAG and how does it differ from traditional RAG?
Traditional (naive) RAG follows a fixed retrieve-then-generate sequence. Agentic RAG wraps this pipeline inside an AI agent that can reason about what to retrieve, evaluate the quality of retrieved documents, and dynamically decide its next action, introducing loops and self-correction.
Q2: What are the four main design patterns of Agentic RAG?
The four main patterns are: (1) Routing RAG (selecting the best data source); (2) Multi-step RAG (decomposing complex questions); (3) Corrective RAG (CRAG, evaluating relevance with fallback); and (4) Adaptive RAG (dynamically selecting strategy based on complexity).
Q3: When should I use Agentic RAG instead of naive RAG?
Use Agentic RAG when your application requires multi-source retrieval, handles queries of varying complexity, needs to guarantee answer grounding, or must self-correct when initial retrieval fails. For simple single-source FAQ bots, naive RAG remains sufficient.
Q4: What frameworks support building Agentic RAG systems?
LangGraph (from LangChain) is the most mature framework for building Agentic RAG as stateful graphs. Other options include LlamaIndex Workflows, CrewAI for multi-agent RAG, and Haystack pipelines.
Q5: What are the main challenges of deploying Agentic RAG in production?
Key challenges include increased latency from multi-step loops, higher LLM API costs, the need for robust observability to debug agent decisions, handling infinite loops with guards, and balancing thoroughness against response time.
Related Resources
- AI Agent Development Complete Guide — Deep dive into AI agent architecture and implementation.
- RAG Fundamentals Guide — Master the core principles of Retrieval-Augmented Generation.
- Hybrid Search & Rerank Optimization — Practical tips for improving retrieval recall and accuracy.
- GraphRAG Engineering Guide — Explore the intersection of Knowledge Graphs and RAG.
- Awesome Prompt Directory — Optimize your agent instructions online.