Retrieval-Augmented Generation (RAG) is often treated as the silver bullet for the "hallucination" and "knowledge update lag" problems of Large Language Models (LLMs). Its logic is straightforward: since the model makes things up, feed it the reference answer (document snippets) and have it read from that.

However, in actual enterprise-level implementations, developers quickly discover a frustrating fact: Even with RAG, models still hallucinate.

Why does this happen? How can we systematically mitigate this problem from an engineering perspective? This article will uncover the causes of RAG hallucinations and provide 5 effective engineering defense strategies.

1. Why Does RAG Still Hallucinate?

In a RAG architecture, hallucinations have three main causes:

  1. Retrieval Failure: The most common reason. Due to improper Chunking strategies or vague Query semantics, the vector database fails to recall the correct context. At this point, the model can only rely on its pre-trained knowledge to "blindly guess."
  2. Context Conflict: Multiple related document snippets are recalled, but the information in these snippets contradicts each other (for example, an old version and a new version of a document are recalled simultaneously). The model makes faulty inferences during synthesis.
  3. Instruction Ignorance: The model "ignores" the constraint in the Prompt to "answer based ONLY on the provided information," forcibly using pre-trained knowledge to answer content not mentioned in the document.

2. Strategy 1: High-Quality Chunking and Metadata Enhancement

Garbage In, Garbage Out. The first step to solving hallucinations is to put effort into the data ingestion phase.

2.1 Avoid "One-Size-Fits-All" Chunking

Do not simply chunk documents by a fixed length (e.g., 500 characters). This easily severs complete semantic units (e.g., forcibly cutting a table or a block of code in half).

Improvement Plan:

  • Semantic Chunking: Use Markdown headers, HTML tags, or paragraph breaks for chunking (such as LangChain's MarkdownHeaderTextSplitter).
  • Overlap: Maintain a 10%-20% overlap between adjacent Chunks to preserve contextual continuity.
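The header-based strategy above can be sketched in plain Python. This is a minimal illustration of splitting on Markdown headers and carrying an overlap between adjacent chunks, not the LangChain `MarkdownHeaderTextSplitter` implementation itself:

```python
import re

def chunk_by_headers(markdown_text: str, overlap_chars: int = 100) -> list[str]:
    """Split a Markdown document at headers, prepending a small tail of the
    previous section to each chunk to preserve contextual continuity."""
    # Split immediately before every line that starts with 1-6 '#' characters.
    sections = [s.strip()
                for s in re.split(r"\n(?=#{1,6} )", markdown_text)
                if s.strip()]
    chunks = []
    for i, section in enumerate(sections):
        if i > 0 and overlap_chars > 0:
            # Overlap: carry the end of the previous section into this chunk.
            tail = sections[i - 1][-overlap_chars:]
            chunks.append(tail + "\n" + section)
        else:
            chunks.append(section)
    return chunks
```

In practice you would tune `overlap_chars` to roughly 10%-20% of your average chunk size, as suggested above.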

2.2 Inject Metadata to Provide Global Perspective

A single Chunk often lacks global context. For example, a Chunk's content is "The QPS limit for this interface is 100." The model doesn't know which API "this interface" refers to.

Improvement Plan: Before vectorization, inject Metadata (such as document title, chapter name, update time) into each Chunk:

```json
{
  "content": "The QPS limit for this interface is 100",
  "metadata": {
    "source": "api_v2_docs.md",
    "section": "User Authentication API",
    "last_updated": "2026-03-01"
  }
}
```
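One common way to make this metadata actually influence retrieval is to prepend it to the text that gets embedded. A minimal sketch, with field names following the JSON above (`build_embedding_text` is a hypothetical helper, not a library function):

```python
def build_embedding_text(chunk: dict) -> str:
    """Prepend metadata to the chunk body so the embedding carries
    global context (source document, section, recency)."""
    meta = chunk["metadata"]
    header = f'[{meta["source"]} > {meta["section"]} | updated {meta["last_updated"]}]'
    return header + "\n" + chunk["content"]

chunk = {
    "content": "The QPS limit for this interface is 100",
    "metadata": {
        "source": "api_v2_docs.md",
        "section": "User Authentication API",
        "last_updated": "2026-03-01",
    },
}
```

With this, a Chunk that only says "this interface" still embeds (and later displays) which API and document it belongs to.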

3. Strategy 2: Query Rewrite and Expansion

Users' questions are often very brief or vague (e.g., "How do I get a refund?"). Directly using this sentence to search the vector database usually yields poor results.

Improvement Plan: Use an LLM to rewrite queries. Before retrieval, have the LLM "translate" the user's Query into a form more suitable for retrieval.

```python
# Query Rewrite Prompt Example
prompt = f"""
You are a Search Engine Optimization expert. The user's question is: "{user_query}"
Please combine context to rewrite this question into 3 search keywords from different angles to improve the recall rate in the knowledge base.
Output only the keywords, do not include extra explanations.
"""
# Output Example:
# 1. Refund process
# 2. How to apply for return and refund
# 3. After-sales policy
```

Run retrieval separately with these 3 expanded queries, then deduplicate the results and fuse them using Reciprocal Rank Fusion (RRF).
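The fusion step can be sketched as follows. This is a minimal illustrative RRF implementation; `k=60` is the constant commonly used in the original RRF formulation:

```python
from collections import defaultdict

def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each document by the sum of
    1 / (k + rank) over all result lists, then sort by descending score.
    Duplicates across lists merge automatically, because scores
    accumulate per document ID."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks highly in several of the rewritten queries rises to the top, while one-off noisy hits sink.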

4. Strategy 3: Strict Prompt Constraints and the "I Don't Know" Mechanism

This is the most effective means to prevent models from "forcibly answering."

Improvement Plan: Set firm boundaries. In the System Prompt, clearly and firmly define the model's code of conduct.

```text
You are a rigorous customer service robot.
You must answer user questions [STRICTLY] and [ONLY] based on the provided <context>.

Constraints:
1. If the <context> does not contain the information needed to answer the question, you must reply: "Sorry, there is no detailed information regarding this issue in the current knowledge base."
2. Absolutely do not use your pre-trained knowledge to fabricate answers.
3. Your answer must be objective and neutral; do not add personal speculation.

<context>
{retrieved_chunks}
</context>

User Question: {user_query}
```
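Assembling this template programmatically has a side benefit: because the refusal sentence is a fixed string, "I don't know" answers become trivially machine-detectable downstream. A minimal sketch (function and variable names are ours, not from any library):

```python
REFUSAL = ("Sorry, there is no detailed information regarding "
           "this issue in the current knowledge base.")

def build_rag_prompt(retrieved_chunks: list[str], user_query: str) -> str:
    """Assemble the constrained RAG prompt from retrieved chunks."""
    context = "\n---\n".join(retrieved_chunks)
    return f"""You are a rigorous customer service robot.
You must answer user questions [STRICTLY] and [ONLY] based on the provided <context>.

Constraints:
1. If the <context> does not contain the information needed to answer the question, you must reply: "{REFUSAL}"
2. Absolutely do not use your pre-trained knowledge to fabricate answers.
3. Your answer must be objective and neutral; do not add personal speculation.

<context>
{context}
</context>

User Question: {user_query}"""
```

Checking whether a model reply equals (or contains) `REFUSAL` then gives you a cheap "knowledge gap" metric for the retrieval pipeline.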

5. Strategy 4: Introduce a Citation Mechanism

To make the model's answers Explainable and force it to "think twice" when generating content, you can require the model to attach source citations to its answers.

Improvement Plan: Require source identifiers in the output. Add the following requirement to the Prompt:

```text
In your answer, every key fact must be accompanied by a source citation, formatted as [Source: filename].
For example: According to the latest policy, refunds will be returned via the original payment method within 3 working days [Source: after-sales-policy.pdf].
```
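Because the citation format is fixed, you can validate it mechanically: extract every `[Source: filename]` tag and flag any filename that was not among the retrieved documents. A lightweight sketch (the function name is ours):

```python
import re

def check_citations(answer: str, allowed_sources: set[str]) -> list[str]:
    """Extract every [Source: filename] citation from the answer and
    return those that do not match any retrieved document -- a cheap
    signal that the model cited (or invented) out-of-context material."""
    cited = re.findall(r"\[Source:\s*([^\]]+)\]", answer)
    return [s.strip() for s in cited if s.strip() not in allowed_sources]
```

An empty return list does not prove the answer is faithful, but a non-empty one is a strong hallucination signal you can act on before the answer reaches the user.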

This not only greatly reduces hallucinations but also allows developers to quickly pinpoint whether the problem occurred during Retrieval or Generation when a hallucination does happen.

6. Strategy 5: Self-Correction / Verification

This is the most heavyweight step, usually reserved for scenarios requiring extremely high accuracy (such as medical, legal, or financial QA).

Improvement Plan: Introduce a "Referee" Model. After generating a preliminary answer, do not return it directly to the user. Instead, call another (or the same) LLM for a Self-Check.

```python
verification_prompt = f"""
You are a fact-checker.
Please check if the following [Answer] is completely supported by the [Reference Materials], and does not contain any fabricated information outside the [Reference Materials].

[Reference Materials]: {context}
[Answer]: {generated_answer}

Please output in JSON format:
{{
  "is_hallucinated": true/false,
  "reason": "If there is a hallucination, please explain the reason"
}}
"""
```

If `is_hallucinated` is true, refuse to return the answer, or require the generation model to regenerate it.
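The full generate-verify-retry loop can be sketched as follows. Here `call_llm(prompt) -> str` is a hypothetical wrapper around your model API, and the refusal fallback reuses the fixed sentence from Strategy 3:

```python
import json

def verify_and_answer(call_llm, context: str, user_query: str,
                      max_retries: int = 2) -> str:
    """Generate an answer, fact-check it with a 'referee' pass, and
    regenerate on failure. Falls back to a refusal after max_retries."""
    for _ in range(max_retries + 1):
        answer = call_llm(
            f"Answer based ONLY on:\n{context}\n\nQuestion: {user_query}")
        verdict_raw = call_llm(
            "You are a fact-checker. Check whether the [Answer] is fully "
            "supported by the [Reference Materials].\n"
            f"[Reference Materials]: {context}\n[Answer]: {answer}\n"
            'Output JSON: {"is_hallucinated": true/false, "reason": "..."}')
        try:
            verdict = json.loads(verdict_raw)
        except json.JSONDecodeError:
            continue  # unparseable verdict: treat as a failed check, retry
        if not verdict.get("is_hallucinated", True):
            return answer
    return ("Sorry, there is no detailed information regarding "
            "this issue in the current knowledge base.")
```

Note the conservative defaults: an unparseable or missing verdict counts as a failure, so the loop only releases an answer the referee explicitly cleared.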

Conclusion

The hallucination problem in RAG cannot be solved by a single algorithmic trick; it is a systems engineering problem. From high-quality Chunking and Metadata injection (Data Layer), to Query Rewrite (Retrieval Layer), to strict Prompt constraints, Citation attribution, and Self-Correction (Generation Layer), these 5 strategies must work together to build a truly reliable and trustworthy enterprise-grade AI knowledge base system.