The Definitive Guide to Self-Reflective RAG (Self-RAG): Building “System 2” Thinking for AI
Date: January 3, 2026
Category: Artificial Intelligence / Advanced NLP
Reading Time: 35 Minutes
Author: Rahul Kolekar
1. The Problem: LLMs are Sycophants
In 2024, we discovered a fatal flaw in standard RAG (Retrieval Augmented Generation) systems: Sycophancy. If you provide an LLM with a retrieved document that contains false information, the LLM will happily lie to the user to “honor” the context.
Even worse, if the retrieval fails and fetches irrelevant documents about “Apple Pie” when the user asked about “Apple Inc,” the LLM will often try to hallucinate a bridge between the two to be “helpful.”
Standard RAG is “System 1” thinking: Fast, intuitive, and prone to error.
Self-RAG is “System 2” thinking: Slow, deliberative, and self-correcting.
In this comprehensive guide, we will implement the Self-RAG framework originally proposed by Akari Asai and colleagues (University of Washington, Allen Institute for AI, and IBM Research AI). We won’t replicate their training method; instead, we will approximate their inference logic using LangGraph to create a pipeline that checks its own work before responding.
2. The Theory: The Four Control Tokens
The core innovation of Self-RAG is that it doesn’t just generate text; it generates “reflection tokens” (internal monologue) to evaluate four distinct steps. In our implementation, we will model these as four distinct Graph Nodes.
- Retrieve? (Decision): “Does this query actually require external knowledge, or can I answer it from memory?”
- IsRel (Relevance Check): “Is the document I just fetched actually relevant to the query?”
- IsSup (Supported/Grounding Check): “Is the answer I just wrote fully supported by the document, or did I make stuff up?”
- IsUse (Utility Check): “Is the answer actually useful to the user?”
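One way to picture the four checks is as a small record of verdicts plus a routing rule. The sketch below is a plain-Python illustration under simplifying assumptions, not the paper's token vocabulary: `Reflection` and `next_action` are hypothetical names, and each verdict is reduced to a yes/no value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reflection:
    """Verdicts for the four checks; None means 'not evaluated yet'."""
    needs_retrieval: Optional[bool] = None  # Retrieve?
    is_relevant: Optional[bool] = None      # IsRel
    is_supported: Optional[bool] = None     # IsSup
    is_useful: Optional[bool] = None        # IsUse


def next_action(r: Reflection) -> str:
    """Map the verdicts collected so far to the next pipeline step."""
    if r.needs_retrieval is None:
        return "decide_retrieval"
    if r.needs_retrieval:
        # Relevance and support only make sense when documents were fetched.
        if r.is_relevant is None:
            return "grade_relevance"
        if not r.is_relevant:
            return "fallback_search"
        if r.is_supported is None:
            return "grade_support"
        if not r.is_supported:
            return "regenerate"
    if r.is_useful is None:
        return "grade_utility"
    return "finish" if r.is_useful else "rewrite_query"
```

The ordering matters: utility is judged last, because an answer that is not grounded should be regenerated before anyone asks whether it is helpful.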
3. The Architecture
We are building a cyclic graph (Loop). This is not a straight line. The data can flow backwards.
- Node 1: Retriever. Fetches top-k docs.
- Node 2: Grader. Filters out garbage docs. If documents are rejected, it flags a Web Search (fallback). In the code below, a single rejected document is enough to set the flag.
- Node 3: Generator. Drafts an initial answer.
- Node 4: Hallucination Grader. Checks the draft against the docs.
  - Failure: Loop back to Generator (Retry).
- Node 5: Answer Grader. Checks if the answer solves the user’s problem.
  - Failure: Loop back to Query Rewriter.
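The loop structure can be read as a routing table. The sketch below is a dependency-free illustration, assuming the two graders run as routing decisions taken after the generator (which is how Step D wires them); `ROUTES` and `walk` are hypothetical names and the outcome labels are chosen for this example.

```python
# Hypothetical routing table: (current node, outcome) -> next node.
ROUTES = {
    ("retrieve", "done"): "grade_documents",
    ("grade_documents", "ok"): "generate",
    ("grade_documents", "fallback"): "web_search",
    ("web_search", "done"): "generate",
    ("generate", "useful"): "END",
    ("generate", "not supported"): "generate",       # retry the draft
    ("generate", "not useful"): "transform_query",   # rewrite the query
    ("transform_query", "done"): "retrieve",         # the backward edge
}


def walk(outcomes, start="retrieve"):
    """Follow ROUTES for a scripted list of outcomes and return the visited nodes."""
    node, path = start, [start]
    for outcome in outcomes:
        node = ROUTES[(node, outcome)]
        path.append(node)
        if node == "END":
            break
    return path
```

For instance, `walk(["done", "ok", "not supported", "useful"])` visits the generator twice before reaching `END`, which is the retry loop described above.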
4. Step-by-Step Implementation
We will use LangGraph and Pydantic to enforce strict structure on our “Thought Process.”
Step A: Environment and Imports
### PREREQUISITES ###
# pip install langgraph langchain langchain-openai tavily-python tiktoken
import os
from typing import Annotated, List, Dict, TypedDict
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field  # langchain_core.pydantic_v1 is deprecated in recent LangChain releases
from langgraph.graph import END, StateGraph
# Set your API keys
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["TAVILY_API_KEY"] = "tvly-..." # For web search fallback
# We use GPT-4o for the reasoning (Controller)
llm = ChatOpenAI(model="gpt-4o", temperature=0)
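Hardcoding keys in source is fine for a tutorial but risky in a shared repository. A minimal sketch of the alternative, assuming the keys are already exported in your shell; `require_env` is a hypothetical helper, not part of LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at startup makes a missing key fail immediately instead of partway through the first graph run.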
Step B: The State Definition
The state is the shared memory of our graph. Every node can read and write to this.
class GraphState(TypedDict):
    """
    Represents the state of our graph.
    """
    question: str
    generation: str
    web_search: str  # "Yes" or "No" flag
    documents: list  # retrieved document objects; the graders read .page_content
    loop_count: int  # Safety valve to prevent infinite loops
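Each node returns only the keys it wants to change, and for keys without a custom reducer the new value replaces the old one. The sketch below emulates that merge with a plain dict so the behavior is easy to see; `DemoState` and `apply_update` are hypothetical stand-ins, not LangGraph APIs, and documents are plain strings here for brevity.

```python
from typing import List, TypedDict


class DemoState(TypedDict, total=False):
    question: str
    generation: str
    web_search: str
    documents: List[str]
    loop_count: int


def apply_update(state: DemoState, update: dict) -> DemoState:
    """Return a new state where keys present in `update` replace the old values."""
    merged = dict(state)
    merged.update(update)
    return merged


state = {"question": "Who runs Apple Inc?", "documents": [], "loop_count": 0}
# A retriever-style update touches only `documents`.
state = apply_update(state, {"documents": ["doc A", "doc B"]})
# A generator-style update writes the draft and bumps the counter.
state = apply_update(state, {"generation": "draft answer", "loop_count": state["loop_count"] + 1})
```

Note that the question survives both updates untouched, which is why nodes do not need to echo back keys they did not modify.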
Step C: The “Self-Correction” Nodes
This is the most critical part. We need to prompt the LLM to act as a harsh critic, not a helpful assistant.
1. The Retrieval Grader (IsRel)
This node reads a document and decides if it is worth keeping. We use structured output (JSON) to force a binary decision.
class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""
    binary_score: str = Field(description="Documents are relevant to the question, 'yes' or 'no'")

structured_llm_grader = llm.with_structured_output(GradeDocuments)

system_prompt = """You are a grader assessing relevance of a retrieved document to a user question. \n
If the document contains keyword(s) or semantic meaning related to the question, grade it as relevant. \n
Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question."""

grade_prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt), ("human", "Retrieved document: \n\n {document} \n\n User question: {question}")]
)
retrieval_grader = grade_prompt | structured_llm_grader

def grade_documents(state):
    """
    Determines whether the retrieved documents are relevant to the question.
    If any document is not relevant, we will set a flag to run web search.
    """
    print("---CHECK DOCUMENT RELEVANCE---")
    question = state["question"]
    documents = state["documents"]
    filtered_docs = []
    web_search = "No"
    for d in documents:
        # One grader call per document
        score = retrieval_grader.invoke({"question": question, "document": d.page_content})
        grade = score.binary_score
        if grade == "yes":
            print("---GRADE: DOCUMENT RELEVANT---")
            filtered_docs.append(d)
        else:
            print("---GRADE: DOCUMENT NOT RELEVANT---")
            # If we are filtering out documents, we might need web search
            web_search = "Yes"
    return {"documents": filtered_docs, "question": question, "web_search": web_search}
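The function above requests a web search as soon as one document is rejected. If you would rather fall back only when nothing relevant survives, a stricter rule could look like the sketch below. It is an assumption-laden illustration: `filter_and_flag` is a hypothetical name, documents are plain strings, and `keyword_grader` is a toy stand-in for the LLM grader so the example runs without an API key.

```python
from typing import Callable, Dict, List


def filter_and_flag(
    question: str,
    documents: List[str],
    is_relevant: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Keep relevant documents; request web search only if nothing survives."""
    kept = [d for d in documents if is_relevant(question, d)]
    return {"documents": kept, "web_search": "No" if kept else "Yes"}


def keyword_grader(question: str, document: str) -> bool:
    """Toy stand-in for the LLM grader: true if the texts share at least one word."""
    q_words = set(question.lower().split())
    return any(w in q_words for w in document.lower().split())
```

With this toy grader, a question about "apple inc revenue" still accepts a document about "apple pie recipe" because of the shared word, which is exactly the failure that motivates using an LLM grader instead of keyword overlap.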
2. The Hallucination Grader (IsSup)
This node checks if the generation is grounded in the documents. It prevents the model from making up facts.
class GradeHallucinations(BaseModel):
    """Binary score for whether the generated answer is grounded in the documents."""
    binary_score: str = Field(description="Answer is grounded in the facts, 'yes' or 'no'")

structured_llm_hallucination = llm.with_structured_output(GradeHallucinations)

system_prompt_hallucination = """You are a grader assessing whether an LLM generation is grounded in / supported by a set of retrieved facts. \n
Give a binary score 'yes' or 'no'. 'Yes' means the answer is fully supported by the facts. 'No' means there is information in the answer that is not in the documents."""

hallucination_prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt_hallucination), ("human", "Set of facts: \n\n {documents} \n\n LLM generation: {generation}")]
)
hallucination_grader = hallucination_prompt | structured_llm_hallucination

# Standalone helper; the graph in Step D repeats this check inside its conditional edge.
def check_hallucinations(state):
    print("---CHECK FOR HALLUCINATIONS---")
    documents = state["documents"]
    generation = state["generation"]
    score = hallucination_grader.invoke({"documents": documents, "generation": generation})
    grade = score.binary_score
    if grade == "yes":
        print("---DECISION: GENERATION IS GROUNDED---")
        return "grounded"
    else:
        print("---DECISION: GENERATION IS HALLUCINATED---")
        return "not grounded"
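The grader prompt receives the `documents` list as-is, so the template ends up rendering whatever the list's string form happens to be. Flattening the documents into clean text first tends to give the grader a less noisy prompt. A minimal sketch, assuming documents expose a `page_content` attribute; `FakeDoc` and `format_docs` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document exposing `page_content`."""
    page_content: str


def format_docs(docs: Iterable[object]) -> str:
    """Join document bodies into one numbered block of text for the grader prompt."""
    parts = []
    for i, d in enumerate(docs, start=1):
        # Fall back to str(d) so plain strings work too.
        text = getattr(d, "page_content", str(d))
        parts.append(f"[{i}] {text.strip()}")
    return "\n\n".join(parts)
```

The result could then be passed as the `documents` value when invoking the grader, and the numbering gives you something to cite if you later extend the grader to explain its verdict.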
3. The Answer Grader (IsUse)
Even if the answer is factual, it might not answer the question. This node checks for utility.
class GradeAnswer(BaseModel):
    """Binary score to assess whether the answer addresses the question."""
    binary_score: str = Field(description="Answer addresses the question, 'yes' or 'no'")

structured_llm_answer = llm.with_structured_output(GradeAnswer)

system_prompt_answer = """You are a grader assessing whether an answer addresses / resolves a question. \n
Give a binary score 'yes' or 'no'. 'Yes' means the answer resolves the question."""

answer_prompt = ChatPromptTemplate.from_messages(
    [("system", system_prompt_answer), ("human", "User question: \n\n {question} \n\n LLM generation: {generation}")]
)
answer_grader = answer_prompt | structured_llm_answer
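All three graders are compared against the exact string "yes", yet two of the prompts spell the verdict as 'Yes'. If the model echoes that capitalization, a strict comparison would treat a passing grade as a failure. A small normalizer reduces that risk; `is_yes` is a hypothetical helper, shown here as one possible guard rather than a required part of the pipeline.

```python
def is_yes(score: str) -> bool:
    """Interpret a grader verdict leniently: 'yes', 'Yes', and ' YES. ' all count as yes."""
    cleaned = score.strip().strip(".!'\"").lower()
    return cleaned == "yes"
```

Anything other than a bare yes, including hedged replies such as "yes, mostly", is treated as a failure, which keeps the critic strict.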
Step D: The Graph Construction
Now we wire it all together. Note the conditional edges—this is where the “Looping” logic lives.
workflow = StateGraph(GraphState)

# Define Nodes
# retrieve_node, generate_node, and transform_query_node are assumed to be defined elsewhere
workflow.add_node("web_search", web_search_node)  # Assumed defined via Tavily
workflow.add_node("retrieve", retrieve_node)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate_node)
workflow.add_node("transform_query", transform_query_node)

# Entry Point
workflow.set_entry_point("retrieve")

# Edge 1: Retrieve -> Grade
workflow.add_edge("retrieve", "grade_documents")

# Conditional Edge 2: Grade -> (Web Search OR Generate)
def decide_to_generate(state):
    print("---ASSESS GRADED DOCUMENTS---")
    web_search = state["web_search"]
    if web_search == "Yes":
        # If the local vector store failed, go to Google/Tavily
        return "web_search"
    else:
        return "generate"

workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "web_search": "web_search",
        "generate": "generate",
    },
)

# Conditional Edge 3: Generate -> (Self-Reflection Loop)
def grade_generation_v_documents_and_question(state):
    print("---CHECK HALLUCINATIONS---")
    question = state["question"]
    documents = state["documents"]
    generation = state["generation"]
    # Check 1: Is it Grounded? (IsSup)
    score = hallucination_grader.invoke({"documents": documents, "generation": generation})
    if score.binary_score == "yes":
        print("---DECISION: GENERATION IS GROUNDED---")
        # Check 2: Is it Useful? (IsUse)
        print("---GRADE GENERATION vs QUESTION---")
        score = answer_grader.invoke({"question": question, "generation": generation})
        if score.binary_score == "yes":
            print("---DECISION: GENERATION ADDRESSES QUESTION---")
            return "useful"
        else:
            print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
            # If not useful, rewrite the query and try again
            return "not useful"
    else:
        print("---DECISION: GENERATION IS HALLUCINATION, RETRY---")
        return "not supported"

# NOTE: both loops are unbounded as written; see "Cycle Limits" in section 6.
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "useful": END,  # Success!
        "not useful": "transform_query",  # Loop back to query rewriting
        "not supported": "generate",  # Loop back to generation (Retry)
    },
)

# Edge 4: Web Search -> Generate
workflow.add_edge("web_search", "generate")

# Edge 5: Transform Query -> Retrieve
workflow.add_edge("transform_query", "retrieve")

# Compile
app = workflow.compile()
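The graph references `retrieve_node`, `generate_node`, and `transform_query_node` without defining them. The sketch below shows one possible shape for these nodes, with the retriever and LLM injected as plain callables so the logic can be exercised without network access. Everything here is an assumption for illustration: the factory names, the callable signatures, and the choice to increment `loop_count` inside the generator are not taken from the original code, and documents are plain strings rather than document objects.

```python
from typing import Callable, Dict, List


def make_retrieve_node(search_fn: Callable[[str], List[str]]):
    """Build a node that fetches documents for the current question."""
    def retrieve_node(state: Dict) -> Dict:
        return {"documents": search_fn(state["question"])}
    return retrieve_node


def make_generate_node(answer_fn: Callable[[str, List[str]], str]):
    """Build a node that drafts an answer and counts the attempt."""
    def generate_node(state: Dict) -> Dict:
        draft = answer_fn(state["question"], state.get("documents", []))
        return {"generation": draft, "loop_count": state.get("loop_count", 0) + 1}
    return generate_node


def make_transform_query_node(rewrite_fn: Callable[[str], str]):
    """Build a node that replaces the question with a rewritten search query."""
    def transform_query_node(state: Dict) -> Dict:
        return {"question": rewrite_fn(state["question"])}
    return transform_query_node


# Usage with toy callables in place of a vector store and an LLM.
retrieve_node = make_retrieve_node(lambda q: [f"doc about {q}"])
generate_node = make_generate_node(lambda q, docs: f"{q}: {len(docs)} source(s)")
transform_query_node = make_transform_query_node(lambda q: q.replace("Report", "Earnings"))
```

In a real pipeline the lambdas would wrap your retriever and prompt chains; keeping them injectable also makes each node straightforward to unit test.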
5. Why This Changes Everything for 2026
Implementing this architecture provides three distinct competitive advantages:
1. The “I Don’t Know” Fallback
Standard RAG hates admitting ignorance. Self-RAG, through the IsRel node, can determine: “None of these documents match. I will not generate an answer. I will perform a web search instead.” This hybrid approach (Local RAG -> Fallback -> Web RAG) is the holy grail of reliability.
2. Resistance to Poisoned Data
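If your vector database contains outdated documents (e.g., “The CEO is Steve Jobs”) and the user asks about “Tim Cook,” a standard RAG might get confused. The Hallucination Grader helps by rejecting drafts that assert anything the retrieved documents do not support, so the final output is less likely to blend retrieved text with the model’s own guesses. It cannot tell whether a retrieved document is itself stale or false, so keeping the index fresh still matters.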
3. Self-Healing Loops
The transform_query edge is powerful. If the model realizes it failed to answer the question, it doesn’t just error out. It thinks: “Maybe I phrased the search wrong. Let me try searching for ‘Q3 Earnings’ instead of ‘Q3 Report’.” This mimics human research behavior.
6. Production Considerations
While this code works for a tutorial, here is what you need to tune for production in Jan 2026:
- Latency: This loop adds time. Each “Grading” step is an LLM call. To speed this up, consider not using GPT-4o for grading. A smaller model such as a fine-tuned Llama-3-8B or Phi-3.5, trained specifically as a classifier, is typically much faster and cheaper per call; benchmark it against your own traffic before committing.
- Cycle Limits: Always add a loop_count check in your conditional edges. If the model loops 3 times without success, force it to exit with a polite apology (“I am having trouble finding that info”) rather than looping forever and burning your API credits.
- Streaming: Since the chain takes longer, you must stream the “thoughts” to the UI. Show the user: “Searching… Checking relevance… Rewriting query…” This psychological transparency makes the wait tolerable.
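The cycle limit can be added without touching the grading logic by wrapping the routing function. The sketch below is one way to do it, assuming some node increments `loop_count` on each attempt; `with_cycle_limit`, `MAX_LOOPS`, and the `"give_up"` label are hypothetical names chosen for this example.

```python
MAX_LOOPS = 3  # assumed budget; tune for your latency and cost limits


def with_cycle_limit(router, max_loops=MAX_LOOPS, success="useful", exit_label="give_up"):
    """Wrap a routing function so failed attempts stop once the budget is spent."""
    def guarded(state: dict) -> str:
        outcome = router(state)
        # A successful answer is always let through, even on the last attempt.
        if outcome != success and state.get("loop_count", 0) >= max_loops:
            return exit_label
        return outcome
    return guarded
```

To use it, pass the wrapped function to the conditional edge in place of the original router and add a `"give_up"` entry to the edge mapping, pointing either at the end of the graph or at a node that writes the apology message.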
7. Conclusion
Self-RAG is the difference between a demo and a product. In a demo, you cherry-pick the questions. In a product, users ask messy, vague, or impossible questions.
By implementing a Cognitive Architecture using LangGraph, you give your AI the ability to pause, think, critique, and correct itself. That is the definition of Agentic AI.
Author update
I will expand this with real retrieval metrics and failure cases from production. If you want sample eval sets or a reference pipeline, let me know.

